C++ Benchmarks: Container Lookups
Sep 5, 2018 00:36
UPDATE: An erroneous version of this post was published which used the wrong code to generate the results. While the data itself was accurate, the conclusions were misleading because of how they were presented in the post. This post has now been updated with the correct data and results. Huge thank you to @ForrestTheWoods for catching this.
If you write code, you have probably used a set before and are familiar with using big-O notation to talk about performance. Big-O notation, however, oversimplifies the analysis by dropping constant factors, and it only describes behavior for large values of n. I was curious about how sets stack up against a simple array scan for checking membership of a value. In this post, I benchmark and explore the lookup performance of BSTs, hash sets, and arrays.
Methodology
This benchmark uses Google’s benchmarking library. In this post I present the results and my intuition for why they look the way they do. However, intuition is not scientific, and I might follow up with a more in-depth analysis (perhaps using perf) to explain why we see some of the results we do.
In one setup, I used std::strings of length 64, and in the other I used 64-bit numbers. The main motivation for measuring both was that comparing two 64-bit numbers takes a constant number of clock cycles, whereas comparing two strings requires cycles and memory reads proportional to the string length. In each experiment, I inserted n (from 32 to 32768) randomly generated objects and then did 32768 lookups on randomly generated objects. Note that this means almost every object looked up is not present.
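The workload described above can be sketched roughly as follows. This is not the post's benchmark code: the helper names (random_string, random_strings, count_hits), the lowercase alphabet, and the seeding are assumptions made for illustration, and the sketch only builds the workload without timing it.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: a random string of `len` lowercase letters.
std::string random_string(std::mt19937_64& rng, std::size_t len) {
    std::uniform_int_distribution<int> letter('a', 'z');
    std::string s(len, ' ');
    for (char& c : s) c = static_cast<char>(letter(rng));
    return s;
}

// Hypothetical helper: `n` random strings of length 64.
std::vector<std::string> random_strings(std::mt19937_64& rng, std::size_t n) {
    std::vector<std::string> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) out.push_back(random_string(rng, 64));
    return out;
}

// Inserts `n` random keys, then probes with `lookups` fresh random keys
// and returns how many of the probes were found.
std::size_t count_hits(std::size_t n, std::size_t lookups, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    const std::vector<std::string> keys = random_strings(rng, n);
    const std::unordered_set<std::string> container(keys.begin(), keys.end());

    std::size_t hits = 0;
    for (std::size_t i = 0; i < lookups; ++i) {
        hits += container.count(random_string(rng, 64));
    }
    return hits;
}
```

Because the probes are drawn independently from a huge key space, count_hits should return 0 in practice, which is why the benchmark mostly measures unsuccessful lookups.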
Before getting into the results, let’s briefly discuss the data structures. The set is a balanced BST (usually a red-black tree). The unordered_set is an unordered associative container. It’s basically a vector of buckets, each of which maps to a linked list of the keys that hash to that bucket. Finally, a vector is a dynamically sized array whose elements are stored contiguously on the heap.
Container | Complexity To Check Membership |
---|---|
set | O(lg n) |
unordered_set | O(1) on average |
vector | O(n) |
From this analysis, one would expect the best data structure for the task to be an unordered_set.
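For reference, here is a minimal sketch of what a membership check looks like on each of the three containers. The overloaded name is_member is a hypothetical helper chosen for this illustration; the calls inside it are standard library functions.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <unordered_set>
#include <vector>

// Tree lookup: O(lg n) comparisons while walking down from the root.
bool is_member(const std::set<std::uint64_t>& s, std::uint64_t x) {
    return s.find(x) != s.end();
}

// Hash lookup: O(1) on average (hash, jump to a bucket, compare).
bool is_member(const std::unordered_set<std::uint64_t>& s, std::uint64_t x) {
    return s.find(x) != s.end();
}

// Linear scan: O(n) comparisons over contiguous memory.
bool is_member(const std::vector<std::uint64_t>& v, std::uint64_t x) {
    return std::find(v.begin(), v.end(), x) != v.end();
}
```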
Strings (len 64)
The code for this experiment can be found here.
Let’s start off by simply comparing the set and the unordered_set:
As we can see here, the unordered_set significantly outperforms the set. The big-O analysis above supports this, but at the sizes we’ve benchmarked, even lg n is small (no bigger than 15), so it’s important to look at the constant factor each data structure adds.
In the case of the unordered_set, the key being looked up must first be hashed. Once it is hashed, the lookup does a (potentially) random read into memory to get the data in the bucket. Assuming no collisions, we can stop here, but in the case of collisions, the linked list must be traversed and the keys compared. On average, though, we do one hash (linear in the size of the key), two memory fetches that are likely cache misses (one into the bucket vector and one to the linked list node), and one key comparison.
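The steps in that paragraph can be spelled out using the standard bucket interface of unordered_set. This is only an illustrative sketch: lookup_by_hand is a hypothetical name, and real implementations of find are free to organize the work differently.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hand-rolled equivalent of a membership check, for illustration only.
bool lookup_by_hand(const std::unordered_set<std::string>& s,
                    const std::string& key) {
    // bucket() requires at least one bucket to exist.
    if (s.bucket_count() == 0) return false;

    // Step 1: hash the key and map the hash to a bucket index.
    const std::size_t b = s.bucket(key);

    // Step 2: walk that bucket's chain, comparing keys until a match.
    for (auto it = s.begin(b); it != s.end(b); ++it) {
        if (*it == key) return true;
    }
    return false;
}
```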
In the case of the set, we do far more (potentially) random reads into memory because each tree node can be anywhere. Furthermore, we do O(lg n) string comparisons to find the key, each of which is linear in the size of the string in the worst case (as opposed to hashing once and eliminating many comparisons).
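One way to see the O(lg n) comparisons is to give the set a comparator that counts how often it is called. The names CountingLess, make_key, and comparisons_for_find are hypothetical, and the zero-padded keys are an assumption chosen so that keys share a long common prefix, which forces each comparison to scan most of the string.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Comparator that counts its calls through a pointer to an external counter.
struct CountingLess {
    std::size_t* calls;
    bool operator()(const std::string& a, const std::string& b) const {
        ++*calls;
        return a < b;
    }
};

// Hypothetical key generator: a 64-character, zero-padded decimal number.
std::string make_key(std::size_t i) {
    const std::string digits = std::to_string(i);
    return std::string(64 - digits.size(), '0') + digits;
}

// Builds a set with keys 0..n-1 and returns the number of string
// comparisons performed by a single find() for make_key(probe).
std::size_t comparisons_for_find(std::size_t n, std::size_t probe) {
    std::size_t calls = 0;
    std::set<std::string, CountingLess> s(CountingLess{&calls});
    for (std::size_t i = 0; i < n; ++i) s.insert(make_key(i));

    calls = 0;  // ignore the comparisons made during insertion
    (void)s.find(make_key(probe));
    return calls;
}
```

The exact count depends on the standard library's tree implementation, but it should grow with the height of the tree rather than with n.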
So, the moral of this story is, unless you need the ordered aspect of set, just use an unordered_set.
Next, let’s look at the performance of binary searching a sorted vector. This trades the random memory accesses of the set for contiguous memory accesses in a vector. The downside of this approach, of course, is that the vector has to be pre-populated; it’s not very practical to dynamically build up a sorted vector.
As can be seen, the sorted vector is roughly on par with the set at the smaller sizes and pulls ahead at the larger ones, but the unordered_set beats both. This is most likely because, when looking up strings, the comparisons themselves are expensive and require fetching the string data from memory outside the vector itself.
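A minimal sketch of the sorted-vector approach, assuming all keys are known up front: sort once, then answer each lookup with std::binary_search. The helper names make_sorted and sorted_contains are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sort once up front (O(n lg n)) so that every later lookup is O(lg n).
std::vector<std::string> make_sorted(std::vector<std::string> keys) {
    std::sort(keys.begin(), keys.end());
    return keys;
}

// Binary search over contiguous storage. The input must already be sorted;
// the assert checks that in debug builds only (it is itself an O(n) pass).
bool sorted_contains(const std::vector<std::string>& sorted,
                     const std::string& key) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    return std::binary_search(sorted.begin(), sorted.end(), key);
}
```

Inserting into the middle of a sorted vector later would shift every following element, which is the reason this layout suits data that is populated once and then only read.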
Finally, if we compare all four, we can see, as expected, that the vector scan is linear and far worse than the rest:
What surprises me is that the vector scan performs worse even at small sizes. I suspect that the cost of the string comparisons is high and probably dominates the lookup, but I will have to use perf to dig deeper. I was also curious to see whether the results differ with numbers, which are stored in place and can be compared in one clock cycle.
64 Bit Numbers
This experiment was the same as above except the objects were 64 bit numbers. The code can be found here.
In this case, I am particularly interested in looking at how this performs for smaller values:
As we can see, the vector actually outperforms both sets up to ~125 elements. Until then, it would seem that the cost of scanning contiguous, cache-friendly memory is lower than the constant factor introduced by hashing, jumping to a bucket, and reading the head of the linked list chain.
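The crossover point will vary by machine and standard library, so it is worth measuring locally. Below is a rough sketch of such a measurement using std::chrono rather than the benchmarking library used for this post; the Timing struct and time_lookups function are hypothetical names, and a single untuned run like this is far noisier than a proper benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

struct Timing {
    double vector_ns;  // average nanoseconds per lookup, linear scan
    double hash_ns;    // average nanoseconds per lookup, unordered_set
    std::size_t hits;  // returned so the lookups cannot be optimized away
};

Timing time_lookups(std::size_t n, std::size_t lookups, std::uint64_t seed) {
    assert(lookups > 0);
    std::mt19937_64 rng(seed);

    // n random 64-bit keys, stored both ways.
    std::vector<std::uint64_t> vec(n);
    for (auto& x : vec) x = rng();
    const std::unordered_set<std::uint64_t> hashed(vec.begin(), vec.end());

    // Probes are generated before timing starts.
    std::vector<std::uint64_t> probes(lookups);
    for (auto& x : probes) x = rng();

    using clock = std::chrono::steady_clock;
    std::size_t hits = 0;

    const auto t0 = clock::now();
    for (const auto p : probes) {
        hits += (std::find(vec.begin(), vec.end(), p) != vec.end());
    }
    const auto t1 = clock::now();
    for (const auto p : probes) hits += hashed.count(p);
    const auto t2 = clock::now();

    const std::chrono::duration<double, std::nano> scan = t1 - t0;
    const std::chrono::duration<double, std::nano> hash = t2 - t1;
    return {scan.count() / lookups, hash.count() / lookups, hits};
}
```

Calling time_lookups for n = 32, 64, 128, 256 and comparing vector_ns against hash_ns gives a quick, approximate view of where the two curves cross.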
Conclusion
I hope this post piqued your interest and demonstrated that big-O theory does not always tell the full story. In real software, constants usually matter, and the only way to tell what will be better is to benchmark and profile it. Obviously this post just scratched the surface, but maybe this intuition will help save you some clock cycles. It has certainly motivated me to dive deeper using something like perf and see where the clock cycles are truly spent.
Appendix
Here are the raw results for the string benchmark (time in ms):
Size | vector | set | unordered_set | vector (binary search) |
---|---|---|---|---|
32 | 8.21 | 4.85 | 4.41 | 5.11 |
64 | 12.72 | 5.53 | 4.53 | 5.66 |
128 | 22.43 | 5.93 | 4.29 | 5.94 |
256 | 40.50 | 6.39 | 4.35 | 6.54 |
512 | 79.58 | 6.97 | 4.52 | 6.69 |
1024 | 153.00 | 7.80 | 5.15 | 7.21 |
2048 | 307.01 | 8.47 | 5.05 | 8.49 |
4096 | 598.65 | 10.80 | 5.65 | 10.04 |
8192 | 1,212.08 | 12.65 | 6.77 | 11.53 |
16384 | 2,508.33 | 18.00 | 10.75 | 13.74 |
32768 | 5,420.63 | 31.31 | 16.99 | 19.75 |
And here are the results for the uint64_t benchmark (again, time in ms):
Size | vector | set | unordered_set | vector (binary search) |
---|---|---|---|---|
32 | 0.26 | 0.80 | 0.88 | 0.79 |
64 | 0.50 | 1.00 | 0.86 | 0.94 |
128 | 0.92 | 1.33 | 0.84 | 1.16 |
256 | 1.78 | 1.59 | 0.85 | 1.25 |
512 | 3.80 | 1.83 | 0.82 | 1.64 |
1024 | 7.10 | 2.24 | 0.87 | 1.80 |
2048 | 15.01 | 2.60 | 1.00 | 2.05 |
4096 | 27.94 | 3.25 | 1.20 | 2.14 |
8192 | 71.69 | 4.13 | 1.59 | 2.55 |
16384 | 170.13 | 5.86 | 2.34 | 2.81 |
32768 | 393.57 | 9.24 | 3.97 | 3.20 |