Bloom filter vs. CPU cache

I was playing around with a bloom filter today and drawing some graphs on performance vs. various metrics. While for my real use case the filter bit array is larger, just for fun I wanted to look at how the performance changes as the array size exceeds various CPU caches.

I’m running on an AMD Phenom II X4 925 Processor which has 64K L1 cache (per core), 512K L2 cache (per core) and 6MB L3 cache (shared). The bloom filter code is single threaded.

The following graph shows the time (yellow line) taken to insert 10 million entries into the bloom filter as the size of the bit array (red line) increases linearly (the two lines are not on the same y-axis scale). For the first two-thirds of the graph the time taken is just over 2 seconds, or just under 5 million elements per second. The sudden change in the slope of the line is near 512K at which point the array no longer fits the L2 cache.

The next graph zooms out to show the size of the bit array increasing all the way to 18MB. The second slope change (about a third of the way from the left)  is in the neighborhood of 6MB, corresponding to the L3 cache size.