Compressing arrays of integers while keeping fast indexing

While adding support for editing and viewing text encoded in UTF-8 to HxD’s hex editor control itself, it turns out I have to query Unicode property tables, that go beyond the basic ones included with Delphi (and most other languages / default libraries).

Parsing the structured text files, provided by the Unicode consortium, at each startup is too inefficient, and merely storing the parsed text into a simple integer array wastes too much memory.

A more efficient storage uses a dictionary-like approach, to compress the needed data using a few layers of indirections, while still giving array-like performance with constant (and negligible) overhead.

In the following, I’ll briefly present the solution I found.

Simple lookup table (wasteful on memory)

The easiest way to store a property table for Unicode code points, is to define an array which indexes over all the possible Unicode code points (from 0 to 1114111) and which assigns each entry of the array a property value. The type of the array values will be an integer (an enumeration), a bit field, or a similarly simple type.

Take for example the Unicode general category values. UnicodeData.txt assigns each code point (one line per code point) a general category in the 3rd column (each column is separated by a semicolon). The code point 000A for example has the general category Cc.

When assigning each category a number (a byte), we can express the assignment of each of the 1114112 code points to a category as an array of 1114112 bytes. This results in a file of a little over 1 MiB. Compressing it to a ZIP file, we can reduce it to 4 KiB (and depending on the settings and used algorithms an even larger compression is possible).

Although, this is a considerable reduction in size, such compression algorithms rely (also) on summarizing sequences of identical values (think of Run-length encoding). Indeed, if we look at the byte array representation of the general category table, we notice that it contains many identical values over larger index ranges, which suggests we could benefit from RLE. However there are many gaps with different values in between, as well.

The real issue however is that RLEs will force us to scan the array sequentially, while we would need random access, as provided by a dictionary data structure, so we can efficiently query properties for each code point.

As mentioned above, a code point is essentially a key (=index) into an array of integer values. However, due to the possible range of keys (0 to 1114111), we would end up with at least 1088 KiB (as long as we do not use less than 8 bits per value, which would make array access complicated).

Therefore, we have to reduce the range of keys using hash functions, and use several levels of has functions/hash tables to resolve the likely collisions.

Compressed dictionary

Most hash tables are created during run-time, and use fixed hash functions that cannot minimize the range of output values (which are indices into the next hash table, and therefore affect the size of the subsequent hash table).

Since the Unicode tables do not change (for a given version), we can encode each table as a set of hash tables at compile time, and iterate over various hash tables trying to minimize the total size of the hash table set.

To that end I wrote a class TCompressedDictionary, that can be passed a set of parameters, to define how many hash functions and hash tables to build, and how to specialize them. These hash tables and hash functions are applied in a chained fashion to lookup a value for a key.

Each hash function is given as generic function, that uses shift, multiplication, addition and bit masking, and can be adapted by setting parameters (s1, s2, s3, and Bias), where Bias is the result from a previous hash table lookup:

Bias * s1 + ((Key shr s2) and s3);

1	Bias * s1 + ((Key shr s2) and s3);

This has the effect that each hash function extracts some of the bits of the Key, to subdivide it into several keys, that can be used individually on the smaller hash tables making up the entire TCompressedDictionary. Since the partial keys are smaller, they effectively reduce the size of each hash table index range, and therefore the hash table size itself.

The chaining of hash tables and hash functions is sketched in the following pseudo code:

Bias = 0
Bias = hashtable1[Bias * a1 + ((Key shr a2) and a3)]
Bias = hashtable2[Bias * b1 + ((Key shr b2) and b3)]
.
.
.
FinalLookupValue = simple_lookuptable[Bias * z1 + ((Key shr z2) and z3)]

Bias = 0

Bias = hashtable1[Bias * a1 + ((Key shr a2) and a3)]

Bias = hashtable2[Bias * b1 + ((Key shr b2) and b3)]

FinalLookupValue = simple_lookuptable[Bias * z1 + ((Key shr z2) and z3)]

Optimization

I first set out to use an optimization algorithm, such as stochastic gradient descent or simulated annealing, to find the optimal parameters (ai, bi, …, zi). But it proves difficult to define an error function that is sufficiently smooth or “differentiable” to make it likely for the optimization algorithms to find a minimum.

One of the reasons was that there are many invalid parameter configurations, some of which cause out of memory errors, while others still produce collisions even after chaining 3 hash tables. For these no size can be determined and therefore no proper distance from a perfect or good solution can be calculated (which means the optimization algorithm cannot determine in which direction the largest size reduction can be expected).

Finally, I just wrote an exhaustive search over reasonable ranges of the parameters, and found there are only few candidates that even provide a valid solution (while ignoring size). As such it seems likely, that using an optimization method would not work, since there are so few solution candidates, that there is no error gradient to descend. The solutions are spread too “randomly” over the solution space to form regions (and therefore error gradients) of reasonable size. Random sampling, as used in many optimization algorithms, would be likely to miss those small regions and gradients.

This may also be one of the reasons why neural nets have issues with symbolic reasoning: a sparse solution space (especially common in logic inference), with only one or few correct solutions, and hard to define measures of distance (error) to a correct solution. It may also not be possible to gradually improve solutions by mere parameter tuning relative to the magnitude of prediction error (as done in backpropagation).

A partial answer can be found here: Why is it difficult to integrate neural networks with symbolic reasoning? I found another somewhat related article: Scalable Deep Symbolic Reinforcement Learning.

Conclusion

I will investigate this problem in more depth in future, since it seems machine learning should be able to help here. If anyone has a good idea how to define an error function for minimizing hash table sizes (while also considering the amount of collisions), be sure to let me know in the comments.

For now, the non-optimal but reasonable reduction (from 1 MiB) to 16 KiB is satisfactory, while providing very efficient random-access lookups, with three levels of chained hash functions and hash tables.

Tuning in to growing words

Compressing arrays of integers while keeping fast indexing

Simple lookup table (wasteful on memory)

Compressed dictionary

Optimization

Conclusion

Leave a Reply Cancel reply