Good compression algorithms for URIs/URLs

What's the best compression algorithm for URIs that takes the URI syntax into account? I remember that the WAP specifications had a compression algorithm with tokens for "http://", "www", ".com", etc. I'm looking for something that achieves better results than simply gzipping each URI.

One of the best papers I've ever read on this topic is:

The paper describes how to use an AVL tree to achieve ~50% compression. A quote from the paper:

"Like the delta-encoding, our method compresses the URLs by only keeping the differences of URLs tails. To find the URLs different, the sorting of URLs is required. However, a new URL is on the fly discovered and it is impractical to sort out the URLs list every time a new URL is discovered. Instead of sorting all URLs, we incrementally construct an AVL tree of URLs.

[...]

Each node in the tree contains five fields: RefID is a URL identification used to reference to its predecessor, CommonPrefix is a number of common characters referenced to its predecessor, diffURL is the tail of the URL that does not common to its predecessor, Lchild and Rchild are the pointers to the node’s left subtree and right subtree respectively."

There is an experimental AVL tree implementation in Andy's scratch area in the Jena repository.
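
To make the idea concrete, here is a minimal sketch of the delta-encoding described in the quote, using the previous URL in sorted order as the predecessor rather than an AVL-tree predecessor; the field names (ref_id, common_prefix, diff_url) follow the paper's description, but the code itself is illustrative and not the paper's or Jena's implementation:

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def encode(urls):
    """Encode each URL as (ref_id, common_prefix, diff_url) against its predecessor."""
    encoded, prev = [], ""
    for i, url in enumerate(urls):
        k = common_prefix_len(prev, url)
        encoded.append((i - 1, k, url[k:]))  # ref_id, common_prefix, diff_url
        prev = url
    return encoded

def decode(encoded):
    """Rebuild the URL list from the delta encoding."""
    urls, prev = [], ""
    for _, k, tail in encoded:
        url = prev[:k] + tail
        urls.append(url)
        prev = url
    return urls

urls = sorted([
    "http://example.org/products/1",
    "http://example.org/products/2",
    "http://example.org/about",
])
assert decode(encode(urls)) == urls
```

The point of the AVL tree in the paper is to find such a predecessor incrementally as new URLs arrive, without re-sorting the whole list each time.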

See also "how to compress small strings" on StackOverflow.

Some kind of difference encoding would work quite well. AVL trees aren't necessarily very compact in memory (it depends on the ratio of the stored object size to the pointer size).

But if there is advance knowledge of the URIs to be stored, then prefix/namespace-based compression (i.e. choosing the compression tokens from a preset dictionary) can achieve higher compression.

A mixture of compressing by prefix and using a general mechanism for URIs that don't match the supplied prefixes (or expanding the prefix set dynamically) would be interesting; a sketch follows.
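
Here is a hedged sketch of the dictionary/prefix-token idea: a preset table of common URI prefixes is replaced by one-byte tokens, and unmatched URIs pass through as literals (in practice you might fall back to a general compressor such as zlib for those). The prefix table and token values are made up for illustration:

```python
# Preset dictionary: token byte -> URI prefix. Longer prefixes listed first so
# "http://www." is tried before "http://".
PREFIXES = {
    0x01: "http://www.",
    0x02: "https://www.",
    0x03: "http://",
    0x04: "https://",
}

def compress(uri: str) -> bytes:
    for token, prefix in PREFIXES.items():
        if uri.startswith(prefix):
            return bytes([token]) + uri[len(prefix):].encode("utf-8")
    return b"\x00" + uri.encode("utf-8")  # 0x00 = literal, no prefix matched

def decompress(data: bytes) -> str:
    token, rest = data[0], data[1:].decode("utf-8")
    return PREFIXES.get(token, "") + rest

uri = "https://www.example.com/index.html"
assert decompress(compress(uri)) == uri
```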

Take a look at this poster paper:

Fernández, J. D., Gutierrez, C., and Martínez-Prieto, M. A. 2010. RDF compression: basic approaches. In Proceedings of the 19th international Conference on World Wide Web (Raleigh, North Carolina, USA, April 26 - 30, 2010). WWW '10. ACM, New York, NY, 1091-1092. DOI= http://doi.acm.org/10.1145/1772690.1772819

Available at http://www.dcc.uchile.cl/~cgutierr/papers/www2010.pdf

This paper is also relevant:

URL Forwarding and Compression in Adaptive Web Caching. B. Scott Michel, Konstantinos Nikoloudakis, Peter Reiher, Lixia Zhang. INFOCOM 2000.

They build trees from URLs based on scheme/domain/path tokens. Hashes are used to encode the tree.

They achieve ~70% compression.
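
A rough sketch of the token idea, assuming the split into scheme/host/path components described above (the hash size and the way tokens would be arranged into a tree are illustrative, not taken from the paper):

```python
import hashlib
from urllib.parse import urlsplit

def token_ids(url: str, digest_bytes: int = 4):
    """Split a URL into scheme, host, and path tokens and hash each to a short id."""
    parts = urlsplit(url)
    tokens = [parts.scheme, parts.netloc] + [p for p in parts.path.split("/") if p]
    return [hashlib.sha1(t.encode("utf-8")).digest()[:digest_bytes] for t in tokens]

# URLs sharing a scheme/host/path prefix share the same leading token ids, so a
# trie keyed by these ids stores the common components only once.
print(token_ids("http://example.org/docs/index.html"))
print(token_ids("http://example.org/docs/intro.html"))
```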