Hzip

High Quality Compression for your Files on the Cloud

Welcome to the home of hzip, the high-quality compression program for Hadoop and other cloud computing platforms. hzip is an archiver, which means it can take in a large number of files and output just one file. hzip is also a compression program: it reduces the amount of disk space your files use. In that respect it is similar to gzip, bzip2, PeaZip, or tar, but it operates natively on cloud-computing distributed file systems.

Here is an example of what we want to make possible:

andrew%> hadoop dfs -dus /data/crawl
hdfs://andrew.yagn.us:54321/data/crawl     769671879329

andrew%> time hzip --compress /data/crawl.hz /data/crawl
real     10m20.325s
user      0m30.012s
sys       0m25.102s

andrew%> hadoop dfs -dus /data/crawl.hz /data/crawl
hdfs://andrew.yagn.us:54321/data/crawl     769671879329
hdfs://andrew.yagn.us:54321/data/crawl.hz  250193829421

andrew%> hadoop dfs -rmr /data/crawl

andrew%> hadoop dfs -dus /data/backup/*
hdfs://andrew.yagn.us:54321/data/backup/a.gz     769671879329
hdfs://andrew.yagn.us:54321/data/backup/b.bz2    250193829421
hdfs://andrew.yagn.us:54321/data/backup/c.zip    150193829421

andrew%> hzip --decompress /data/backup/*

andrew%> hadoop dfs -dus /data/backup/*
hdfs://andrew.yagn.us:54321/data/backup/a.gz     769671879329
hdfs://andrew.yagn.us:54321/data/backup/a        869671879329
hdfs://andrew.yagn.us:54321/data/backup/b.bz2    250193829421
hdfs://andrew.yagn.us:54321/data/backup/b        350193829421
hdfs://andrew.yagn.us:54321/data/backup/c.zip    150193829421
hdfs://andrew.yagn.us:54321/data/backup/c        250193829421

andrew%>




Software Technology

hzip is implemented on top of Apache's Hadoop platform. The algorithm is written in Scala, a functional language that compiles to Java byte code and is fully interoperable with Java. We also leverage several other libraries: Apache Commons, slf4j, specs, and ScalaCheck.



Algorithms:

We implement the Burrows-Wheeler Transform (BWT) followed by Jones' splayed Huffman encoding. We write our own code and use only algorithms that have been given to the public, to avoid patent violations and other legal conflicts. One of the reasons we began implementing hzip on Hadoop is that the most time-consuming part of the BWT, sorting the rotations of the input, is also an integral part of Hadoop's implementation of Google's map-reduce distributed computing paradigm. Because the Hadoop platform is freely and easily available, we are able to leverage its existing infrastructure to perform the BWT calculation and the rest of the computation.
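For readers unfamiliar with the BWT, here is a minimal single-machine sketch in Java (the class and method names are ours for illustration, not part of hzip). It builds the transform the naive way, by sorting all rotations of the input and emitting the last column; this sorting step is exactly the work that hzip hands off to Hadoop's distributed sort at scale:

```java
import java.util.Arrays;

public class BwtDemo {
    // Naive BWT: append a sentinel, sort all rotations of the string,
    // and return the last character of each rotation in sorted order.
    public static String bwt(String input) {
        final String s = input + "\u0000"; // sentinel, assumed smaller than all input chars
        final int n = s.length();
        Integer[] idx = new Integer[n];    // idx[i] = starting offset of rotation i
        for (int i = 0; i < n; i++) idx[i] = i;
        // Compare two rotations character by character without materializing them.
        Arrays.sort(idx, (a, b) -> {
            for (int k = 0; k < n; k++) {
                char ca = s.charAt((a + k) % n);
                char cb = s.charAt((b + k) % n);
                if (ca != cb) return ca - cb;
            }
            return 0;
        });
        // The BWT output is the character preceding each sorted rotation,
        // i.e. the last column of the sorted rotation matrix.
        StringBuilder out = new StringBuilder(n);
        for (int i = 0; i < n; i++) out.append(s.charAt((idx[i] + n - 1) % n));
        return out.toString();
    }

    public static void main(String[] args) {
        // Classic textbook example; '$' stands in for the sentinel when printing.
        System.out.println(bwt("banana").replace('\u0000', '$')); // prints annb$aa
    }
}
```

The transform groups equal characters with similar contexts next to each other, which is what makes the subsequent entropy coding stage effective; the sort dominates the running time, which is why it maps so naturally onto Hadoop's sort-based shuffle.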



Licensing:

The alpha jar release is freeware for you to use. We would appreciate it if you could tell us of any suggestions or code that you come up with while using hzip. Please email the dev team.



Contact:

To contact the team, please email the dev team. For support, email the support hotline. For non-technical and non-support issues, please contact me.



Legal:
Disclaimer: Yagnus Software and its team members claim no responsibility for losses of any kind resulting from use of our software. We express and imply no guarantee about the ability of our software to work in your cloud computing environment, nor do we guarantee that our software will compress your data. In fact, hzip will necessarily expand some inputs to be larger than their original size.

Copyright: This website, the Yagnus Software source code, and the Yagnus Software compiled binaries belong to Yagnus Software.