[Shield] Free Daemon Consulting, LLC [Todd]
/home     /rates     /goals     /tech     /news     /contact     /hosting     /links     

Hash Zip

What is 'Hash Zip' many people ask? It is a compression algorithm developed by myself with my own ideas as part of the ongoing research I do in the free software arena when I am not billing hours for my company and other company related activities.

The basic concept of hzip is to compress and decompress a set of data, trading CPU time for compression ratio (emphasis on a _lot_ of CPU time) using a hash and a few other `interesting' bits of data.

Currently, sha1 is the hash algorithm. I intend to make this a command line flag in the future, but for now it is fixed on sha1. For every 256 bytes of input, 30 bytes of output result, guaranteed. The big challenge that lies ahead is to 'verify' this works for all possibilities of bits (2^256 verifications). I acknowledge that on today's CPU, this is not possible. In the future, however, I expect someone or someone(s) to prove or disprove that this algorithm is valid.


Mon Jul 14 07:49:42 CDT 2003

Thanks to everyone for their well thought input regarding number theory and the reasons why my algorithm doesn't work for all possible input values. While I conced that it is a true statement, I'm wanting to determine where my algorithm fails (specifically which input values), so I can make it work for the general case, and perhaps expand a data structure or two in the specific cases where it does break down.

Hash Zip release 0.2 (Thu Jul 10 16:41:05 CDT 2003)

Major features of this new release:

  • header block added

    Major/Minor version numbers are checked, 'HZIP' will from this point on always be the 1st 5 bytes of the file, and decompressing non hzip files no longer is attempted.

  • hash algorithm type is now stored in the header, shaved 1 byte off the per-block data as a result
  • use the shaved byte to store a 'valid chars count'; see below

`Valid Chars Count' is a (I believe) useful addition to hzip. However, the implementation is slow causing no speedup to occur (yet) on most decompression runs. I must detail a bit about how the `interesting information' is used to understand `Valid Chars Count'. There are three different types of masks that are used to attempt to limit the sha1 hash calls, and also limit the size of the possibilities being brute forced. An optimization in 0.1 was to create a list of `valid chars' in a table, and iterate through it instead of applying the masks to each possible combination of bits. It is my experience that this list of 'valid chars' is a superset of the set of u_int8_t values in the data stream. So now on compression, I 'count' the number of 'valid chars' that are indeed used to generate the output data set. And on decompression, I (once again) count the number of valid chars that are in the potential output data set. If the numbers do not match, a sha1 hash is not computed. I simply need to tune the counting routine a bit to speed up hzip more.

An example of the 'valid chars count' effect on decompression can be illustrated by the comparison of decompressing three strings with a trailing carriage return between release 0.1 and 0.2:

0.1	Total hash calls for block 1: 5746167
	85.98s real    41.78s user     0.01s system
0.2	Total hash calls for block 1: 4460925
        105.59s real    49.71s user     0.00s system

0.1	Total hash calls for block 1: 37777
	0.52s real     0.50s user     0.01s system
0.2	Total hash calls for block 1: 4
	0.43s real     0.41s user     0.01s system

0.1	Total hash calls for block 1: 377777
	22.03s real    10.81s user     0.00s system
0.2	Total hash calls for block 1: 4
	18.16s real     9.35s user     0.00s system

Hash Zip release 0.1 (Wed Jul 9 12:45:35 CDT 2003)

WARNING: This is pre-alpha in-flux unstable proof-of-concept software. Do not expect anything but conveyance of concept. Perhaps a usable program for now, but I _will_ be breaking binary compatibility with the output next revision. So don't compress anything you want to keep just yet.

I've made a few cleanups since the last version you saw, but the big news is that it now takes 25% of the original time to decompress 'Hello '. Sure, 46 seconds is a lot, but compared to 183 seconds, this is very good.

Two reasons for the speedup:

I added a 'u_int8_t' value cset to the hash block data. The upper 4 bits and the lower 4 bits are counters. The upper is the max number of bits set per char, the lower is the min number of bits set per char.

This 'cset' added information reduced the time from 183 to 66 seconds.

I also sped up 'incr()' quite a bit by computing a buffer of 'valid' characters based on all the information known (currently from cset, mset, and uset). Now, incr() simply iterates through this buffer for every character in the data being computed, rather than applying the logic of the masks to each character being considered.

This is what reduced the time from 66 to 43 seconds.

I expect the next release (0.2) to come shortly, with the addition of a header field, to version control the data stream, do some sanity checks, store a hash of the whole file, and further reduce the hash block size by fixing the hash algorithm for the whole file, thereby mentioning it once in the header vs one char for the type of hash algorithm per hash block.

I also discovered I was using the wrong '#define' so instead of 36 bytes for a hash I am using 20.

Currently, the 'hash block' weighs in at 30 bytes. I intend to remove the hash algorithm byte, but add another byte that will help determine what values should be used to initialize the computation, and which direction (up or down) incr() should head.

I've also removed the headache of the 'len' byte being 'size = 2 ^ len'. I didn't want to add a byte to the data stream for the final block, and also I've decided that 256 bytes is large enough for one block of data, instead of the former maximum size of 2^16 (64k).

Besides, guaranteed compression to 11% the original size still isn't bad, eh?

I'm still not checking for collisions (haven't run across any yet, but I intend to). When I do, this will dramatically slow down the compression to be equal to the decompression speed. Sorry folks, but if there is to be guaranteed 1:1 input to output ratio, I can see no way of avoiding this.


Valid HTML 4.01! vipower Valid CSS!