thamer on March 26, 2021 | parent | context | favorite | on: How to implement a hash table in C


> When the hash table gets too full, we need to allocate a larger array and move the items over. This is absolutely required when the number of items in the hash table has reached the size of the array, but usually you want to do it when the table is half or three-quarters full.

Yes, but how you resize is important too: if you have a threshold size (like 3/4 full) at which you block and re-distribute all elements into a new array, you will incur a significant pause when this happens, e.g. https://log.kv.io/post/2009/05/15/leaky-hashtables

Instead, when you reach the threshold amount you can create the new array and then gradually migrate the keys in small batches either with each operation or with a background thread. So on `get` if we're in the process of resizing: first check the new table, then the old one, and before returning migrate N keys from the old table to the new one. Free the old array once all the keys are migrated.

I wrote a small hash table implementation with gradual re-hashing a while back, search for DICT_MAX_LOAD and dict_rehash here: https://github.com/nicolasff/ht/blob/master/dict.c#L193
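
For the curious, here is a minimal sketch of that scheme in C. It's my own illustration, not the code from the dict.c linked above: the struct layout, the FNV-1a hash, and the `MIGRATE_PER_OP` batch size are all made up for the example. Inserts during a rehash would go into `new_tab` only, so the old array drains monotonically:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MIGRATE_PER_OP 4   /* old buckets migrated per operation (made-up knob) */

typedef struct entry {
    char *key;
    void *val;
    struct entry *next;
} entry;

typedef struct {
    entry **old_tab;            /* current table */
    entry **new_tab;            /* non-NULL only while rehashing */
    size_t old_cap, new_cap;
    size_t migrated;            /* index of the next old bucket to migrate */
} dict;

static uint64_t hash64(const char *s) {              /* FNV-1a */
    uint64_t h = 14695981039346656037ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

/* Move up to MIGRATE_PER_OP buckets from the old array to the new one,
 * and free the old array once every bucket has been moved over. */
static void migrate_step(dict *d) {
    if (!d->new_tab) return;
    for (int n = 0; n < MIGRATE_PER_OP && d->migrated < d->old_cap; n++, d->migrated++) {
        entry *e = d->old_tab[d->migrated];
        d->old_tab[d->migrated] = NULL;
        while (e) {                                  /* relink into new table */
            entry *next = e->next;
            size_t i = hash64(e->key) % d->new_cap;
            e->next = d->new_tab[i];
            d->new_tab[i] = e;
            e = next;
        }
    }
    if (d->migrated == d->old_cap) {                 /* rehash complete */
        free(d->old_tab);
        d->old_tab = d->new_tab;
        d->old_cap = d->new_cap;
        d->new_tab = NULL;
    }
}

/* Check the new table first, then the old one, then migrate a few buckets. */
void *dict_get(dict *d, const char *key) {
    entry *e = NULL;
    if (d->new_tab)
        for (e = d->new_tab[hash64(key) % d->new_cap]; e; e = e->next)
            if (strcmp(e->key, key) == 0) break;
    if (!e)
        for (e = d->old_tab[hash64(key) % d->old_cap]; e; e = e->next)
            if (strcmp(e->key, key) == 0) break;
    migrate_step(d);
    return e ? e->val : NULL;
}
```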


OskarS on March 26, 2021 | next [–]


It’s usually not the right call to use a hash table like you’re describing. You’re not saving any work (all keys have to be migrated sooner or later) and by doing it like this you incur a non-trivial amount of overhead on most operations.

Hash tables generally work on the assumption that operations are O(1) in the amortized sense: individual inserts might be expensive because of rehashing, but in aggregate they are very cheap. In practice, this is almost always a fine assumption, and it is how the vast majority of hash tables in all programming languages work.

The kinds of hash tables you are talking about have their uses in places where you have soft real-time constraints, but that is honestly a pretty niche use case.


pjam on March 26, 2021 | parent | next [–]


The progressive rehashing described in the article is very similar to what Redis does [1].

Just thought I'd share an example of a use case where the incremental rehashing logic makes sense.

[1]: https://github.com/redis/redis/blob/unstable/src/dict.c


jeffffff on March 26, 2021 | parent | prev | next [–]


iirc memcached either uses or used to use https://en.wikipedia.org/wiki/Linear_hashing to avoid rehashing-induced pauses. it definitely makes sense for that use case, but you're correct that it's not the right call for a general-purpose hash table implementation.
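
for reference, the heart of linear hashing is a split pointer that splits one bucket at a time instead of rehashing everything at once. a rough sketch of the addressing rule (my own illustration, not memcached's actual code; `lh_bucket` and its parameters are made-up names):

```c
#include <stdint.h>
#include <stddef.h>

/* Linear hashing address rule (illustrative only). The table starts with
 * n0 buckets; `level` counts completed rounds of splitting, and `split`
 * is the next bucket due to be split in the current round. */
size_t lh_bucket(uint64_t h, size_t n0, unsigned level, size_t split) {
    size_t b = h % (n0 << level);        /* coarse address for this round */
    if (b < split)                       /* bucket already split this round: */
        b = h % (n0 << (level + 1));     /* use the finer-grained address */
    return b;
}

/* When the load factor crosses the threshold, split bucket `split`:
 * rehash its entries between `split` and `split + (n0 << level)`, then
 *     split++;
 *     if (split == n0 << level) { level++; split = 0; }
 * so each insert triggers at most one small, bounded split. */
```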


legulere on March 26, 2021 | parent | prev | next [–]


If your application is not negatively affected by such jitter, why not use a garbage-collected language?


theamk on March 29, 2021 | root | parent | next [–]


Programs have different phases and jitter might be relevant only in some of them.

Imagine hash tables used to store configuration data. You don't care about the performance of the "reload config" operation, but once the config is loaded and the worker is handling traffic, read operations must be fast.


MauranKilom on March 27, 2021 | root | parent | prev | next [–]


Maybe you still want fine-grained control over code generation and memory allocation. But it's a valid question to ask.


jbluepolarbear on March 27, 2021 | parent | prev | next [–]


I agree. I'm rarely using hash tables for growing data sets. In my experience, I've been more inclined to redesign the algorithm or use a different data structure at the point where a migrating hash table makes sense.


sgerenser on March 26, 2021 | parent | prev | next [–]


Agreed this seems like major overkill, but then again so does writing your own hash table from scratch in C. :shrug:


ModernMech on March 26, 2021 | parent | prev | next [–]


> Hash tables generally work on the assumption that operations are O(1) in the amortized sense: individual inserts might be expensive because of rehashing, but in aggregate they are very cheap.

Hash tables aren't O(1) in an amortized sense; they are O(1) in a statistical sense. The O(1) complexity of hash tables depends entirely on the size of the underlying array, the method of collision resolution, and the quality of the (pre)hash function in relation to the distribution of the data you're storing in the table. Whether you insert 1 item or 1 million items into the table, you're still getting O(1) computational complexity if the table is designed well. If not, you could be getting O(n) complexity in the worst case.

While the fixed cost of hashing is irrelevant to the O(1) nature of hash table operations, it does mean that a hash table in practice could be slower (in terms of wall time) than a linked list for insert/find/remove operations. Despite being O(n) for those operations, a linked list doesn't have to pay the fixed cost of hashing, so for low values of n the linked list might be the better choice. I think this is what you mean by "O(1) in the amortized sense", but it's not what we mean when we say the computational complexity of these operations on a hash table is O(1).

What the OP is talking about is an example of amortization of resize(), essentially adding a fixed cost to insert() and remove() so you don't have to pay a huge sum when you call resize(). But that doesn't mean insert() and remove() are no longer O(1).


justinpombrio on March 26, 2021 | root | parent | next [–]


Amortized cost is a technical term in complexity analysis: https://en.wikipedia.org/wiki/Amortized_analysis

If you use the terminology precisely, the worst case cost of a regular hash table insertion is O(n) because of the potential resize, but its amortized cost is O(1) because that's what it "averages out to".
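
Concretely, for n insertions into a table that doubles when full: the resizes copy 1 + 2 + 4 + ... + n/2 ≈ n elements in total, so the whole sequence of n insertions does O(n) work, which is O(1) per insertion once it's "averaged out".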


vladf on March 26, 2021 | prev | next [–]


What happens if you have a sequence of puts that triggers an expansion and some keys get migrated, but all of a sudden you get a series of removes, such that you need to perform a shrink to keep memory usage linear?

Do you now end up with three live arrays? You can probably make things even more pathological...


coliveira on March 26, 2021 | parent | next [–]


Hash tables are not great for data structures that change dramatically in size. If you have this situation, the best thing to do is to copy one hash table into another and discard the original data structure.


vladf on March 28, 2021 | root | parent | next [–]


I think the parent's construction works just fine if you disallow removes, so that the table grows monotonically: no matter how dramatically the number of keys increases, it's enough to move over 2 keys for every update to your map.

My comment was a description of a precise “drastic change” where incrementality becomes sophisticated. (Though in part I was hoping for a response which identifies an elegant approach without the complication I mentioned.)


rurban on March 27, 2021 | root | parent | prev | next [–]


Hash tables do that automatically.

Rehashing is controlled by various tunable strategies: load factor (0.5-1), growth policy (prime-sized or power-of-two tables), and growth factor (1.5, the golden ratio, or 2).
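
As a concrete illustration of those knobs (the values and names here are just common defaults I picked for the example, not from any particular implementation):

```c
#include <stddef.h>

#define MAX_LOAD_NUM  3     /* grow when count/capacity > 3/4 ...           */
#define MAX_LOAD_DEN  4
#define GROWTH_FACTOR 2     /* ...into a power-of-two-sized table, so the   */
                            /* bucket index is just h & (capacity - 1)      */

static int needs_grow(size_t count, size_t capacity) {
    /* integer form of count / capacity > 3/4, avoiding floating point */
    return count * MAX_LOAD_DEN > capacity * MAX_LOAD_NUM;
}
```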

Linked-list tables don't need to throw away nodes, they can just be relinked. But compaction is always better, because cache behavior dominates. Most rehash implementations do exactly that, especially multithreaded ones.

The best thing to do is to take a good one and not touch it, unless you want a monster as in perl5. E.g. the recent SipHash security theater hurt performance in all dynamic languages, even the Linux kernel. Everything is perl now. You don't want that.


_dax6 on March 26, 2021 | prev | next [–]


I'm not a particularly good programmer, so I may be missing something here, but I'd think that moving the items should be its own function that happens in between gets and sets, no? Besides adding a decidedly non-read-only operation to what should be read-only, shouldn't the movement happen when you're not trying to do anything with the hash table?


jeffffff on March 26, 2021 | parent | next [–]


typically rehashing is done during insert. if you were going to do it while not doing anything else with the hash table you'd have to use a background thread, and then you'd have to do some form of synchronization, so there would be significant additional complexity and overhead.


nyanpasu64 on March 27, 2021 | prev | next [–]


If get() mutates the array internally, does that mean even "read-only" operations aren't thread-safe?


kbenson on March 26, 2021 | prev | next [–]


> So on `get` if we're in the process of resizing

Not just get, but set also, right? Otherwise I would think a large group of set operations that overflowed more than once might be problematic to deal with. Or maybe I'm missing something that makes that a non-issue?


tkhattra on March 26, 2021 | parent | next [–]


yes, set also. go's map implementation uses this technique - allocate a new array of buckets and gradually migrate from old to new buckets (i think it does the migration during set and delete only, not get).


stephc_int13 on March 26, 2021 | prev | next [–]


You can also allocate a small hash table at first, and move it into a large reserved virtual memory region once it's about the size of a page (4 KB), so you won't have to reallocate it later.
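
A sketch of that idea on Linux (the MAP_NORESERVE flag is Linux-specific, and the 4 GiB reservation is a number I made up for illustration): reserve a big chunk of address space up front; physical pages are only committed when first touched, so the table can grow in place without ever being moved, though entries still need rehashing when the capacity changes.

```c
#include <sys/mman.h>
#include <stdio.h>

#define RESERVED (1UL << 32)   /* reserve 4 GiB of address space (assumes 64-bit) */

int main(void) {
    /* Reserve address space only; physical pages are committed lazily,
     * on first touch, so the table never has to be memcpy'd elsewhere. */
    void *tab = mmap(NULL, RESERVED, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (tab == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... use the low part of `tab` as the bucket array; to grow, start
     * using more of the region and rehash entries in place (the bucket
     * indices change with the capacity, even though the memory doesn't) ... */

    munmap(tab, RESERVED);
    return 0;
}
```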


chasd00 on March 26, 2021 | prev [–]


can you do hashtable sharding? like have a hashtable point to another hashtable by using something appended to the key? just a thought on a procrastination-filled Friday afternoon.


mateo411 on March 26, 2021 | parent [–]


Of course! This is a distributed hash table. You can have one hashing function determine the shard and another determine the index in the hash table.
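
In a single process this amounts to two-level hashing. A toy sketch (assuming some ordinary single-table type `dict` with a `dict_get` lookup and a `hash64` string hash, as in the sketches earlier in the thread; `NSHARDS` and all names here are made up):

```c
#include <stdint.h>

#define NSHARDS 16                        /* made-up shard count */

typedef struct dict dict;                 /* any ordinary hash table type */
void *dict_get(dict *d, const char *key); /* with its own internal hash   */
uint64_t hash64(const char *key);

typedef struct { dict *shards[NSHARDS]; } sharded_dict;

/* One hash picks the shard; the inner table's hash picks the bucket.
 * Shards resize independently, which bounds the size of any one rehash
 * and gives a natural unit for per-shard locking. */
void *sharded_get(sharded_dict *s, const char *key) {
    return dict_get(s->shards[hash64(key) % NSHARDS], key);
}
```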
