After recent discussions with Robert Olsson at the netfilter workshop, I decided to investigate a bit further why the Intel e1000 gigabit MACs are quite limited when it comes to TX performance at high packet rates (pps).
My first assumption was that the in-kernel pktgen.c code might not keep the transmitter busy at all times, resulting in only 760kpps (out of the theoretical maximum of about 1488kpps for minimum-sized frames).
So I hacked the e1000 driver to hardcode a refill of the Tx queue with the same skb over and over again. Using a ring of 2048 Tx descriptors, I was able to keep the transmitter busy at all times (checked via E1000_ICR_TXQE interrupts).
Unfortunately, I still didn't get more than 760kpps in this setup (PCI-X, 66MHz, Dual-Opteron 1.4GHz, DDR-333 (PC-2700) RAM). So we're seeing a limitation of either the 82546 chip or the PCI-X bus / memory latency / whatever.
I'll try the same experiments on a different machine with PCI-X 100 / 133MHz in order to find out what exactly is causing this limit.