
Discussion


The improvements to running time and memory use of the merge with cutoff version seem significant enough to more than offset the slight reduction in compression ratio for most purposes. For some types of input it is on par with the cutoff version in terms of compression, and only for the sources and English files does the compression ratio fall below that of basic Re-Pair.

Figure 12.3: Memory usage for all versions

If time and memory are of no concern then cutoff without merge gives the best compression ratio, outperforming both basic Re-Pair and the merge versions across the board, but it requires almost as much memory as basic Re-Pair and offers only a moderate improvement in running time.

Inserting new pairs at the back of the priority queue lists is clearly very important for the number of generations in the dictionary, and given how the dictionary is compressed it can have a not insignificant impact on the size of the compressed dictionary. This is in contrast to the statement by [1] that "the order in which pairs should be scheduled for replacement [...] appears to be of minor importance"¹, but that may be a result of encoding the dictionary in a different way.

¹ [1], page 3.
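As a minimal sketch of what this difference amounts to, assuming the priority queue is kept as one list per frequency (the record layout and names below are illustrative, not taken from our prototype):

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical pair record; the prototype's actual layout may differ.
struct PairRecord {
    uint32_t left, right;  // the two symbols forming the pair
    uint32_t frequency;    // current number of occurrences
};

// One list per frequency bucket, indexed by frequency.
using PriorityQueue = std::vector<std::deque<PairRecord>>;

// FIFO scheduling: a newly formed pair goes to the BACK of its bucket,
// so pairs already waiting at that frequency are replaced before pairs
// that contain newly created symbols, which tends to keep the number of
// generations low.
void insert_fifo(PriorityQueue& pq, const PairRecord& p) {
    pq[p.frequency].push_back(p);
}

// LIFO scheduling: a newly formed pair goes to the FRONT of its bucket,
// so it is replaced first and tends to spawn additional generations.
void insert_lifo(PriorityQueue& pq, const PairRecord& p) {
    pq[p.frequency].push_front(p);
}
```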

Chapter 13

Conclusion

We have implemented several versions of our prototype program of the Re-Pair algorithm, and seen improvements in running time, memory use, and compression effectiveness.

In testing whether there was any difference between using a FIFO ordering and a LIFO ordering in the priority queue lists, we saw that it has a major impact on the number of generations of phrases that the algorithm creates.

Depending on the data, lowering the number of generations can greatly improve the compression ratio that can be achieved due to the overhead involved in storing each generation.

An important result is seen in the version which merges symbols together two and two before doing phrase derivation. For a comparable block size this halves the size of the sequence array during compression, leading to several improvements of the algorithm compared to other versions. It nearly halves the running time, showing the greatest reduction of any version we have tested, and as a bonus it also significantly reduces the memory use. The downside is that it puts harsher requirements on the formation of pairs, which can lead to fewer pair replacements and thus a somewhat worse compression effectiveness.
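To illustrate the idea, here is a hedged sketch of merging two and two by packing adjacent input bytes into wider symbols before phrase derivation starts; the function name and the handling of an odd trailing byte are our own assumptions, not the prototype's code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack every two adjacent 8-bit input symbols into one 16-bit symbol so the
// sequence array handed to phrase derivation is half as long. Pairs can then
// only be formed on two-byte boundaries of the original input, which is the
// harsher requirement on pair formation mentioned above. An odd trailing
// byte is kept as its own symbol here; the prototype may handle it differently.
std::vector<uint32_t> merge_two_and_two(const std::vector<uint8_t>& input) {
    std::vector<uint32_t> sequence;
    sequence.reserve((input.size() + 1) / 2);
    std::size_t i = 0;
    for (; i + 1 < input.size(); i += 2) {
        sequence.push_back((uint32_t(input[i]) << 8) | input[i + 1]);
    }
    if (i < input.size()) {
        sequence.push_back(input[i]);  // odd-length input
    }
    return sequence;
}
```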

We learned from our cutoff version that the number of pairs which occur 2-3 times is significantly higher than the number of pairs with greater frequencies.

More importantly, they take up very little disk space in the compressed file relative to the dictionary entries created when they are compressed. Avoiding the replacement of these pairs by using an earlier cutoff reduces the running time by a moderate amount, but since dictionary entries take up so much space compared to Huffman codes for symbols, we also see an increase in compression ratio from stopping the replacements early. In terms of improving compression effectiveness the cutoff version shows the best results.
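A hedged sketch of where the cutoff fits into the replacement loop is given below; the naive pair counting stands in for the prototype's hash table and priority queue, all names are our own, and only the early-exit condition is the point:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Symbol = uint32_t;
using Sequence = std::vector<Symbol>;
using Pair = std::pair<Symbol, Symbol>;

// Naive scan for the most frequent adjacent pair and its frequency.
static std::pair<Pair, std::size_t> most_frequent_pair(const Sequence& s) {
    std::map<Pair, std::size_t> counts;
    for (std::size_t i = 0; i + 1 < s.size(); ++i)
        ++counts[{s[i], s[i + 1]}];
    std::pair<Pair, std::size_t> best{{0, 0}, 0};
    for (const auto& [p, c] : counts)
        if (c > best.second) best = {p, c};
    return best;
}

// Replace every non-overlapping occurrence of `p` with `fresh`.
static void replace_all(Sequence& s, Pair p, Symbol fresh) {
    Sequence out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i + 1 < s.size() && s[i] == p.first && s[i + 1] == p.second) {
            out.push_back(fresh);
            ++i;  // skip the second symbol of the replaced pair
        } else {
            out.push_back(s[i]);
        }
    }
    s.swap(out);
}

// Replacement loop with an early cutoff. A cutoff of 1 gives basic Re-Pair;
// a cutoff of 3, for example, leaves the pairs occurring only 2-3 times
// unreplaced, because a dictionary entry for such a pair costs more space
// than the Huffman codes for its remaining occurrences in the sequence.
void compress_block(Sequence& seq, std::vector<Pair>& dictionary,
                    std::size_t cutoff, Symbol next_symbol) {
    while (true) {
        auto [best, freq] = most_frequent_pair(seq);
        if (freq <= cutoff) break;
        dictionary.push_back(best);
        replace_all(seq, best, next_symbol++);
    }
}
```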

The best result we have seen in regard to running time is from the combination of the merge and cutoff versions. This version benefits from the reduction in running time of both the other versions, and the improvement to compression from applying cutoff to merge means that it achieves compression ratios close to the basic version.

13.1 Future work

There are many more things to examine in regard to the Re-Pair algorithm, and in this section we mention some that we find particularly interesting.

It would be interesting to look into creating a version that uses more than one thread to compress blocks concurrently, taking advantage of the fact that we divide files into multiple blocks. Blocks do not share any data other than the files they are read from and written to, so it should be relatively simple to implement without introducing a lot of race conditions. As we showed earlier, most of Re-Pair's running time is spent on phrase derivation, so this is where we want to save time.

Even if initialization and output are done sequentially to preserve the order of the file, a lot of time should still be saved. The biggest concern with a multi-threaded version is increasing the already substantial memory requirements of Re-Pair. The available hardware will dictate the possible block size that can be used, but to minimize this problem a multi-threaded implementation would likely be based on some variant of the merge version, as it uses the least amount of memory.
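A minimal sketch of how such a version might dispatch blocks to threads is shown below; the `Block` type and the `compress_block` declaration are placeholders of our own, and initialization and output are assumed to stay sequential so the block order is preserved:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Minimal stand-ins; the prototype's block and compression types differ.
struct Block {
    std::vector<uint8_t> data;        // raw input for this block
    std::vector<uint8_t> compressed;  // filled in by compress_block()
};

// Placeholder for the per-block Re-Pair compression routine.
void compress_block(Block& b);

// Compress every block on its own thread. Blocks share no mutable state,
// so no locking is needed beyond joining the threads; the compressed
// blocks are written out sequentially afterwards, in their original order.
void compress_blocks_concurrently(std::vector<Block>& blocks) {
    std::vector<std::thread> workers;
    workers.reserve(blocks.size());
    for (Block& b : blocks) {
        workers.emplace_back([&b] { compress_block(b); });
    }
    for (std::thread& t : workers) {
        t.join();
    }
}
```

In practice the number of simultaneously live blocks would have to be capped, for instance with a thread pool sized to the hardware, since each in-flight block needs its own sequence array and pair structures; this is exactly the memory concern mentioned above.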

The reduction in the number of generations when using a FIFO approach in the priority queue lists had a greater impact on the file with the greatest number of phrases and generations. These numbers are not only based on the data in the file, but also on how much of the file is read at once, i.e. the block size. It could be interesting to test the effects of that version on more data and various block sizes.


We did not have the chance to implement a version that merges more than two symbols, but the promising results of merging suggest that it would be interesting to rewrite the program to overcome the problems we found and test multi-merge. This would involve finding a way to differentiate terminal symbols from the phrases created during compression.

The dictionary for the compressed file takes up a lot of space, and the only reason cutoff has a positive effect on compression is the poor relation between the space cost of symbols and of dictionary entries. It would be interesting to find a more elegant dictionary scheme that would reduce the overall size after compression, and thus improve the ratio to the original file. It would also be interesting to look at how much of the success of the cutoff versions can be attributed to how the dictionary is stored. The reason cutoff can improve the compression ratio is that entries in the dictionary take up more space than low-frequency pairs, so a different method of storing the dictionary might change that. In particular it would be interesting to examine the performance of cutoff together with the chiastic slide method described by [1] (pages 7-8).

Bibliography

[1] N. Jesper Larsson and A. Moffat, "Off-line dictionary-based compression," Proceedings of the IEEE, vol. 88, no. 11, pp. 1722-1732, 2000.

[2] Google, "Sparsehash project," 2012. [Online]. Available: https://code.google.com/p/sparsehash/

[3] N. Welch, "Hash table benchmarks." [Online]. Available: http://incise.org/hash-table-benchmarks.html

[4] B. Dawes, D. Abrahams, and R. Rivera, "Boost Library Project." [Online]. Available: http://www.boost.org/

[5] I. H. Witten, A. Moffat, and T. C. Bell, Managing gigabytes (2nd ed.): compressing and indexing documents and images, Dec. 1999. [Online]. Available: http://dl.acm.org/citation.cfm?id=323905

[6] J. Kleinberg and E. Tardos, Algorithm Design, ser. Pearson international edition. Pearson/Addison-Wesley, 2006. [Online]. Available: http://books.google.dk/books?id=c57CQgAACAAJ

[7] P. Ferragina and G. Navarro, "Pizza & Chili Corpus," 2005. [Online]. Available: http://pizzachili.dcc.uchile.cl/

[8] P. Cegielski and D. Richard, "Decidability of the theory of the natural integers with the Cantor pairing function and the successor," Theoretical Computer Science, vol. 257, no. 1-2, pp. 51-77, Apr. 2001. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0304397500001092
