home | Hashing |
HashingV 0.1Hashing is a method to store data in an array so that storing, searching, inserting and deleting data is fast (in theory it's O(1)). For this every record needs an unique key. The basic idea is not to search for the correct position of a record with comparisons but to compute the position within the array. The function that returns the position is called the 'hash function' and the array is called a 'hash table'. In our examples our key is an integer value as is the actual data. [note that I use pascal syntax since this is easily readable by everybody I asume] type THash_Record = record key : integer; data : integer; end; the hash table now looks like this: const HASH_LENGTH = 100; { the length of the hash table and so the } { maximum number of records that can be stored } type THash_Table = array[0..HASH_LENGTH-1] of THash_Record; If we know that the key is in a small range we could use the key itself as an index (also called hash address) for our array. However this is very rarely the case so we have to find some kind of hash function. A very common and not so bad function is a simple MODulo function: function Hash_Function(key : integer) : integer; begin Hash_Function := key MOD HASH_LENGTH; end; If we now want to insert a record into the hash we could do it this way: procedure Insert_Into_Hash( VAR hash : THash_table; rec : THash_record); begin hash[ Hash_Function(rec.key) ] := rec; end; But wait! What happens if two different keys return the same hash address from the hash function? Well if you have a good hash function this happens very rarely but it can and will happen. There are two ways to handle a so called 'hash collision'.
COLLISION HANDLING OUTSIDE OF THE HASH TABLE ...... means that our hash table no longer is an array of records but instead it is an array of pointers. Each pointer points to a linked list of the hash records.type PHash_Record = ^THash_Record; {a pointer type to the type THash_record } THash_Record = record key : integer; data : integer; next : PHash_Record; { the pointer to the next item in the list } end; THash_Table = array[0..HASH_LENGTH-1] of PHash_Record; If now a key return the same 'hash address' like an already stored item we simply insert the record into the list of this hash entry. procedure Insert_Into_Hash( VAR hash : THash_Table; rec : THash_Record); VAR hash_address : integer; item : PHash_record; begin hash_address := Hash_Function(rec.key); new(item); item^ := rec; item^.next = hash[ hash_address ]; hash[ hash_address ] := item; end; To get an item out of the hash you can first determine it's hash address with the hash function and then use a classical linear search to find the item you want. If you keep the list sorted you could speed read access up for some cost at inserting items. procedure Get_From_Hash( hash : THash_Table; key : integer; VAR rec : THash_Record); VAR hash_address : integer; item : PHash_record; begin hash_address := Hash_Function(rec.key); item := hash[ hash_address ]; while (item <> nil) and (item^.key <> key) do { make sure you have short } item := item^.next; { curcuit bool evaluation, so } { that the second comparison } { is never made if item is nil} if (item<>nil) then rec := item^ else { ITEM NOT IN LIST! } end; If you want to delete an item, you simply remove it from the linked list: procedure Remove_From_Hash( VAR hash : THash_Table; key : integer); VAR hash_address : integer; item : PHash_record; tmp : PHash_record; begin hash_address := Hash_Function(rec.key); item := hash[ hash_address ]; if (item<>nil) and (item^.key = key) then { make sure you have short } begin { curcuit bool evaluation, so } hash[ hash_address ] := item^.next { that the second comparison } dispose(item); { deallocate memory } { is never made if item is nil} end else begin while (item^.next <> nil) and (item^.next^.key <> key) do item := item^.next; if (item^.next <> nil) then begin tmp := item^.next; item^.next := tmp^.next; dispose(tmp); { deallocate memory } end else { ITEM NOT IN LIST! } end; end; COLLISION HANDLING WITHIN THE HASH TABLE...... is totally different. Instead of having a list which can grow up to an arbitrary size all the data is stored in the hash table itself. This means that we can never under any circumstance store more records in the hash as our hash is long. This is obvious but I wanted to point it out.
Let's first solve the latter question since it is quite simply. We have to store wheter a certain hash element is occupied. We do this by changing the THash_Record record: type THash_Occupied = (ho_empty, ho_occupied); THash_Record = record occupied : THash_Occupied; key : integer; data : integer; end; So now to the former question: What to do if an address is already occupied. The answer is straightforward: The have to store it at a different address :) There are several methods to find a new, empty address. Note that this address should have something to do with the key since we want to be able to find it easily later. linear probing:We add a constant number (usually 1) to the position the hash_function returns until we have found an empty place:pos0 := Hash_Function(key); i := 0; repeat hash_address := (pos0 + i) MOD HASH_LENGTH; inc(i); until Hash[ hash_address].occupied <> ho_occupied; sqared probing:is very similar to linear probing but we do not simply add the increasing value but we add it's square. This is usually better than linear probing:pos0 := Hash_Function(key); i := 0; repeat hash_address := (pos0 + i*i) MOD HASH_LENGTH; inc(i); until Hash[ hash_address].occupied <> ho_occupied; One disadvantage of all those probing methods is that they tend to build clusters aroung already occupied areas of the hash table. A soluton to this problem is double hashing:The idea is to add a value to the hash address that depends on the key of the item, so that the probing order is spezific to the key:hash_address := Hash_Function(key); while Hash[ hash_address].occupied = ho_occupied do hash_address := (hash_address + Hash_Function(key)) MOD HASH_LENGTH; double_hashing := hash_address; Hash_Function(rec.key) and HASH_LENGTH should have a greatest common factor of 1 so that the complete hash table is probed. This is guaranteed if HASH_LENGTH is a prime number.
So inserting an item looks like this: procedure Insert_Into_Hash( VAR hash : THash_Table; rec : THash_Record); VAR hash_address; begin <get the hash address, probing for an empty entry> hash[ hash_address ] := rec; rec.occupied := ho_occupied; end; To search for an item, you have to first get the hash_address and then probe until you either find an empty entry or the searched entry. Deleting an item is rather simple: Search for the address like when you search the item and set the occupied value to ho_occupied. But wait! This would break our searching, since we search until we either find the object or an empty entry. The solution is to modify THash_Occupied a bit so that it reads: THash_Occupied = (ho_empty, ho_occupied, ho_deleted); If we delete an item we set the occupied state to ho_deleted. When inserting Items we treat ho_deleted like it was ho_empty and when searching for an item we treat as if it was ho_occupied.
If you have questions, feel free to contact me! |
|
Last
modified: Sun Aug 1 20:26:14 CEST 1999 by Peter Palfrader
|