Reworking the Linux neighbour cache
Since I've lately had some customer issues with regard to neighbour cache overflows, I studied the current code quite a bit. From my point of view, it has a couple of shortcomings.
The general problem goes like this: What do we do, if we're attached to let's say a /16 (formerly 'Class B') network that has a theoretical limit of 65535 neighbours at layer 2, and somebody sends us a single packet for every one of those neighbours. We now start to send ARP requests for all those neighbours, and until those time out (1sec default), thus flooding our neigbour table. The current Linux strategy is to configure a static limit (default: 1024), and as soon as we reach the limit, we start deleting old entries. 'old' entries are those for real hosts to which we've recently had connectivity... We do not expire any of the incomplete neighbour entries in order to avoid ARP-floods.
So if you want to avoid that, you always have to set the gc_thresh3 value to at least the theoretical number of total machines that could be directly reachable at layer 2. While this is not a problem with /16, it suddenly becomes one with /8, or with the extremely large IPv6 prefixes.
The problem is further increased, since the number of hash buckets is very low (static number of 32), and the used hash algorithm apparently has a bad distribution. So either we increase the hash table, increase the number of buckets and improve the hash algorithm, or we change the expiration scheme to also drop incomplete entries. But the current situation is definitely not good.
So I picked up some old 2.4.x patches from Tim Gardner, ported them to 2.6.x and brushed them up. The number of hash buckets is now a kernel boot parameter (if not specified, the hash is dynamically sized, like the TCP syn-queue, fragment queue or ip_conntrack hash). The hashing algorithm now uses a Jenkins hash, just like all other parts of the kernel use, too. The patch is in testing at my machines at the moment, but I think I'll push it soon.