C++ Benchmarks: Container Lookups

Sep 5, 2018 00:36 · 1116 words · 6 minute read Tags:

UPDATE: An erroneous version of this post was published which used the wrong code to generate the results. While the data itself was accurate, the conclusions were misleading because of how they were presented in the post. This post has now been updated with the correct data and results. Huge thank you to @ForrestTheWoods for catching this.

If you write code, you have probably used a set before and are familliar with big-O notation to talk about performance. Big-O notation, however, oversimplifies the analysis by getting rid of constants and generally applies for large values of n. I was curious about how sets stack up against a simple array scan for checking membership of a value. In this post, I benchmark and explore the performance difference between BSTs, associative hash sets, and arrays just for lookups.

Methodology

This benchmark uses Google’s benchmarking library. In this post I present the results and my intuition behind why the results are the way they are. However, intuition is not scientific and I might post subsequently with a more in depth analysis (perhaps using perf) to explain why we see some of the results we do.

In one setup, I used std::strings of length 64 and in the other I used 64 bit numbers. The main motivation behind measuring both numbers and strings was that comparing 64 bit numbers takes a constant number of clock cycles (rather than comparing strings which requires O(n) cycles and memory reads). In each experiment, I inserted n (from 32 to 32768) randomly generated objects, and then did 32768 lookups on randomly generated objects. Note that this would cause almost every object looked up to not be present.

Before getting into the results, let’s briefly discuss the data structures. The set is a balanced BST (usually a red black tree). The unordered_set is an unordered associative container. It’s basically a vector of buckets that map to a linked list of key value pairs containing items that hash to that bucket. Finally, a vector is an unbounded list with dynamic sizing allocated on the heap.

Container Complexity To Check Membership
set O(lg n)
unordered_set O(1)
vector O(n)

From this analysis, one would expect the best data structure to use for the task would be an unordered_set.

Strings (len 64)

The code for this experiment can be found here.

Let’s start off simply comparing the set and the unordered_set:

LookupStrResults2

As we can see here, the unordered_set pretty significantly outperforms the set. Obviously the Big-O analysis above supports this, but at the small sizes we’ve benchmarked, even lg n is small (no bigger than 15). So it’s important to look at the constant factor each data structure adds.

In the case of the unordered_set, each element must be hashed. Once it is hashed, it must do a (potentially) random read into memory to get the data in the bucket. Assuming no collisions, we can stop here, but in the case of collisions, the linked list must be traversed and the keys must be compared. On average though, we do one hash (O(n) size of key), 2 remote memory fetches (one to the vector and one to the linked list node) and one key comparison.

In the case of the set, we do far more (potentially) random reads into memory because each tree node can be anywhere. Furthermore, we actually do O(lg n) string comparisons to check for equality which are all linear in the size of the string (as opposed to hashing once and eliminating many comparisons).

So, the moral of this story is, unless you need the ordered aspect of set, just use an unordered_set.

Next, let’s look at the performance of binary searching a sorted vector. This trades off the random memory accesses of the set with contiguous memory accesses in a vector. The downside of this approach of course is the vector has to be pre-populated; it’s not very practical to dynamically build up a sorted vector.

LookupStrResults4

As can be seen, the sorted vector outperforms the set, but the unordered_set beats both. This is most likely due to the fact that when looking up strings, the cost to compare the strings itself is expensive and requires fetching the string data from remote memory.

Finally, if we compare all 4 we can see, as expected, that the vector scan is linear and far worse than the rest:

LookupStrResults5

What surprises me is the fact that even at small values, the vector scan performs worse. What I suspect, however, is that the cost of string comparisons is high and probably dominates the lookup. I will have to use perf to dig deeper. But, I was curious to see if the results differ with numbers which are in-situ and can be compared in one clock cycle.

64 Bit Numbers

This experiment was the same as above except the objects were 64 bit numbers. The code can be found here.

In this case, I am particularly interested in looking at how this performs for smaller values:

LookupIntResults1

As we can see, the vector actually outperforms both sets up to ~125 elements. Until 125 elements, it would seem that the cache locality of doing successive memory lookups is less than the constant factor introduced through hashing, jumping to a bucket, and doing the head lookup on the linked list chain.

Conclusion

I hope this post piqued your interest and demonstrated that big-O theory does not always tell the full story. Usually in real software, constants matter, and the only way to tell what will be better is to benchmark and profile it. Obviously this post just scratched the surface, but maybe this intuition will help save you some clock cycles. It has certainly piqued my interest to dive deeper using something like perf and seeing where the clock cycles are truly spent.

Appendix

Here are the raw results for the string benchmark (time in ms):

Size vector set unordered_set vector (binary search)
32 8.21 4.85 4.41 5.11
64 12.72 5.53 4.53 5.66
128 22.43 5.93 4.29 5.94
256 40.50 6.39 4.35 6.54
512 79.58 6.97 4.52 6.69
1024 153.00 7.80 5.15 7.21
2048 307.01 8.47 5.05 8.49
4096 598.65 10.80 5.65 10.04
8192 1,212.08 12.65 6.77 11.53
16384 2,508.33 18.00 10.75 13.74
32768 5,420.63 31.31 16.99 19.75

And here are the results for the uint64_t benchmark (again, time in ms):

Size vector set unordered_set vector (binary search)
32 0.26 0.80 0.88 0.79
64 0.50 1.00 0.86 0.94
128 0.92 1.33 0.84 1.16
256 1.78 1.59 0.85 1.25
512 3.80 1.83 0.82 1.64
1024 7.10 2.24 0.87 1.80
2048 15.01 2.60 1.00 2.05
4096 27.94 3.25 1.20 2.14
8192 71.69 4.13 1.59 2.55
16384 170.13 5.86 2.34 2.81
32768 393.57 9.24 3.97 3.20
tweet Share