D3-Module 03 - Hash Tables II
Number Bases & Characters
Original:
Objective 01 - Understand hash collisions and use a linked list for collision resolution in a user-defined Hashable class
Overview
Remember when we wondered what would happen if multiple keys hashed to the same index, and we said that we would worry about it later? Whelp, it's later 🤪.
Let's say we were given the key-value pair ("Ryan", 10)
. Our hash code then maps "Ryan" to index 3. Excellent, that works!Now let's say after we inserted ("Ryan", 10)
, we have to insert ("Parth", 12)
. Our hash code maps "Parth" to index 3. Uh oh! Ryan is already there! What do we do?? 😱
Ok, let's stop freaking out, and let's think about this. If we don't do anything, the value stored at index 3 will just get overwritten. Meaning if we try to retrieve the value associated with "Ryan"
, 12 will be returned instead of 10. That might not seem like a big deal, but what if we were returning passwords based on a user ID, and we returned someone else's password. That would be horrible.
Let's fix this problem. The most common way to solve this is with chaining. If we see multiple values hashed to an index, we will chain them in a some data structure that can hold multiple items. In our case, we'll use Python's list
type, but a more typical solution would use a linked list. We'll cover linked lists in a future module.
Ok, sounds ideal? But how does this work in code? Let's write some of it together.
Follow Along
Below is a partially filled out hash table class where we will be using HashTableEntry
as our chain entries.
Take a look at the code below.
Let's implement the put
method with collision resolution by chaining. What are the two cases we need to handle?
There are no entries at the index. Great! We can initialize the entry to a list with the new
HashTableEntry
in it.There are multiple entries at the index. We need to check every entry in the chain. If the key in one of the entries is equal to the key we are passing in, we need to replace it. For instance, let's say we pass in
("Ryan", 12),
and then we later pass in("Ryan", 15)
. We would need to replace "Ryan"'s old value with 15. If there are no entries that match, we create a new entry at the end of the chain.
Ok, that might sound confusing. Let's start breaking it down into code.
First, we need to hash the key and start with the first entry at that index.
Next, we need to go through the chain. We need to check two conditions:
The current entry is not empty.
The key or the current entry is not equal to the key we are passing in.
Sweet! Now we need to check what happens when the loop breaks. It would only break for two reasons:
We reached an entry with the same key and need to replace the value.
We reached the end of the chain and need to create a new entry.
Let's write that in code!
Great! We created the put
method.
Challenge
https://replit.com/@bgoonz/cs-unit-1-sprint-4-module-2-hash-table-collision-resolution#main.py
Objective 02 - Define and compute the load factor of a hash table and implement a hash table that automatically resizes based on load factor
Overview
What does runtime look like with linked list chaining?
The performance of hash tables for search, insertion, and deletion is constant time (O(1)
) in the average case. However, as the chains get longer and longer, in the worst case, those same operations are done in linear time (O(n)
). The more collisions that your hash table has, the less performant the hash table is. To avoid collisions, a proper hash function and maintaining a low load factor is crucial. What is a load factor?
Load Factor
The load factor of a hash table is trivial to calculate. You take the number of items stored in the hash table divided by the number of slots.
Hash tables use an array for storage. So, the load factor is the number of occupied slots divided by the length of the array. So, an array of length 10 with three items in it has a load factor of 0.3, and an array of length 20 with twenty items has a load factor of 1. If you use linear probing for collision resolution, then the maximum load factor is 1. If you use chaining for collision resolution, then the load factor can be greater than 1.
As the load factor of your hash table increases, so does the likelihood of a collision, which reduces your hash table's performance. Therefore, you need to monitor the load factor and resize your hash table when the load factor gets too large. The general rule of thumb is to resize your hash table when your load factor is greater than 0.7. Also, when you resize, it is common to double the size of the hash table. When you resize the array, you need to re-insert all of the items into this new hash table. You cannot simply copy the old items into the new hash table. Each item has to be rerun through the hashing function because the hashing function considers the size of the hash table when determining the index that it returns.
You can see that resizing is an expensive operation, so you don’t want to resize too often. However, when we average it out, hash tables are constant time (O(1)
) even with resizing.
The load factor can also be too small. If the hash table is too large for the data that it is storing, then memory is being wasted. So, in addition to resizing, when the load factor gets too high, you should also resize when the load factor gets too low.
One way to know when to resize your hash table is to compute the load factor whenever an item is inserted or deleted into the hash table. If the load factor is too high or too low, then you need to resize.
We added a get_load_factor
and resize
method to calculate the load factor and resize the hash table with a new capacity when necessary.
Follow Along
Let's change our put
method to resize when the load factor gets too high. Here's how our current put
method looks:
To know when to resize, we need to correctly increment the count whenever we insert something new into the hash table. Let's go ahead and add that.
Next, we need to check if the load factor is greater than or equal to 0.7. If it is, we need to double our capacity and resize.
Fantastic, we did it!
Objective 02 - Define and compute the load factor of a hash table and implement a hash table that automatically resizes based on load factor
Overview
What does runtime look like with linked list chaining?
The performance of hash tables for search, insertion, and deletion is constant time (O(1)
) in the average case. However, as the chains get longer and longer, in the worst case, those same operations are done in linear time (O(n)
). The more collisions that your hash table has, the less performant the hash table is. To avoid collisions, a proper hash function and maintaining a low load factor is crucial. What is a load factor?
Load Factor
The load factor of a hash table is trivial to calculate. You take the number of items stored in the hash table divided by the number of slots.
Hash tables use an array for storage. So, the load factor is the number of occupied slots divided by the length of the array. So, an array of length 10 with three items in it has a load factor of 0.3, and an array of length 20 with twenty items has a load factor of 1. If you use linear probing for collision resolution, then the maximum load factor is 1. If you use chaining for collision resolution, then the load factor can be greater than 1.
As the load factor of your hash table increases, so does the likelihood of a collision, which reduces your hash table's performance. Therefore, you need to monitor the load factor and resize your hash table when the load factor gets too large. The general rule of thumb is to resize your hash table when your load factor is greater than 0.7. Also, when you resize, it is common to double the size of the hash table. When you resize the array, you need to re-insert all of the items into this new hash table. You cannot simply copy the old items into the new hash table. Each item has to be rerun through the hashing function because the hashing function considers the size of the hash table when determining the index that it returns.
You can see that resizing is an expensive operation, so you don’t want to resize too often. However, when we average it out, hash tables are constant time (O(1)
) even with resizing.
The load factor can also be too small. If the hash table is too large for the data that it is storing, then memory is being wasted. So, in addition to resizing, when the load factor gets too high, you should also resize when the load factor gets too low.
One way to know when to resize your hash table is to compute the load factor whenever an item is inserted or deleted into the hash table. If the load factor is too high or too low, then you need to resize.
We added a get_load_factor
and resize
method to calculate the load factor and resize the hash table with a new capacity when necessary.
Follow Along
Let's change our put
method to resize when the load factor gets too high. Here's how our current put
method looks:
To know when to resize, we need to correctly increment the count whenever we insert something new into the hash table. Let's go ahead and add that.
Next, we need to check if the load factor is greater than or equal to 0.7. If it is, we need to double our capacity and resize.
Fantastic, we did it!
Challenge
Do we need to modify our
delete
andget
methods to account for the newget_load_factor
andresize
methods? Why or why not?
Additional Resources
Homework:
Given a string text, you need to use the characters of text to form as many instances of the word "lambda" as possible.
You can use each character in text at most once.
Write a function that returns the maximum number of instances of "lambda" that can be formed.
Input: text = "mbxcdatlas"
Output: 1
Example 2:
Input: text = "lalaaxcmbdtsumbdav"
Output: 2
Example 3:
Input: text = "sctlamb"
Output: 0
Notes:
text consists of lowercase English characters only
[execution time limit] 4 seconds (py3)
[input] string text
[output] integer
Last updated