Some low-level networking guys (Lennert Buytenhek, Robert Olsson, ...) have figured out yet another reason why network performance at high pps (packets per second) rates sucks so much on commodity hardware (all PCI / PCI-X / PCI Express based systems).
The 'new' culprit is MMIO read latency. When you're inside a network driver's interrupt handler (the same is true for just about any such handler), the first thing you usually do is read the device's "Interrupt Status Register(s)" to find out whether the device really originated that interrupt, and which condition (TX completion, RX completion, ...) caused it.
Depending on the NIC and driver design, you do multiple reads (and writes, but writes are not that bad) within the IRQ handler.
Lennert has hacked up a tool called mmio_test to benchmark the number of CPU cycles spent on a single MMIO read. Robert improved it a bit, and I've now added support for multiple network adapters, scheduling on multiple CPUs, and other bits.
In case you're interested, it is (as usual) available from my svn server. If you want to send me some numbers, please always include your /proc/cpuinfo and "lspci -v -t" output; otherwise the numbers are useless.