I am working on a highly secure application and need to compress data such as string and byte arrays. I am using the java.util.zip.* classes, but I am having some problems.
First, when using the Deflater and Inflater classes, I get DataFormatExceptions when the string is less than 30 characters.
Second, I have a question about the compression itself. I am using ByteArrayOutputStream and DeflaterOutputStream. I noticed that compressdata.length() > OriginalData.length(), where OriginalData is the uncompressed data. It doesn't seem to make sense that the compressed length is longer than the uncompressed length.
Can this be right?
Before I directly answer your question, there are a few things that I think I need to mention in order to avoid any confusion
or misconceptions. Specifically, you say that you are working on a highly secure system. Compressing your data does nothing
to make it more secure. You may know this already, but for everyone's benefit I thought it worth mentioning. Anyone can decompress
your compressed data. True, it will take a little work, but once someone knows how you have achieved your compression, decoding
it becomes trivial. To be truly secure, you'll need to encrypt your data as well. Compress first and then encrypt the result
(encrypted data looks random and barely compresses), or send the compressed data over an encrypted channel such as SSL.
In order to answer the first part of your question, I tested a string less than 30 characters and one greater than 30 characters.
The only time that I could get a DataFormatException was when Inflater and Deflater were constructed with different nowrap values. Be sure that the Inflater and Deflater specify nowrap the same way. If the Deflater sets nowrap to false, the Inflater must do the same. Likewise, if the Deflater sets it to true, the Inflater must set it to true.
Whether to set nowrap to true or false depends on your needs. Setting nowrap to true omits the ZLIB header and checksum
from the compressed data. Setting it to false keeps them. Either way, the Inflater's nowrap must be set to match the compressed input. Otherwise, as we have seen, you will get a DataFormatException.
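To make the nowrap rule concrete, here is a minimal sketch that round-trips a string of fewer than 30 characters with matching and mismatched settings. The class name NowrapRoundTrip and its helper methods are made up for illustration; only Deflater, Inflater, and DataFormatException come from java.util.zip.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; the name and helpers are illustrative only.
class NowrapRoundTrip {

    // Compress the input with a Deflater built with the given nowrap flag.
    static byte[] compress(byte[] input, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    // Decompress with an Inflater built with the given nowrap flag.
    static byte[] decompress(byte[] compressed, boolean nowrap) throws DataFormatException {
        Inflater inflater = new Inflater(nowrap);
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // Avoid looping forever on incomplete input.
                    throw new DataFormatException("input ended before the stream was complete");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    // Returns true only if the text survives a compress/decompress round trip.
    static boolean roundTripSucceeds(String text, boolean deflaterNowrap, boolean inflaterNowrap) {
        byte[] original = text.getBytes(StandardCharsets.UTF_8);
        try {
            byte[] restored = decompress(compress(original, deflaterNowrap), inflaterNowrap);
            return Arrays.equals(original, restored);
        } catch (DataFormatException e) {
            return false; // the mismatch case described above
        }
    }

    public static void main(String[] args) {
        String shortText = "under thirty characters";
        System.out.println("false/false: " + roundTripSucceeds(shortText, false, false));
        System.out.println("true/true:   " + roundTripSucceeds(shortText, true, true));
        System.out.println("false/true:  " + roundTripSucceeds(shortText, false, true));
        System.out.println("true/false:  " + roundTripSucceeds(shortText, true, false));
    }
}
```

In this sketch the two matching combinations should round-trip cleanly, while the two mismatched ones are caught as DataFormatExceptions and reported as failures. The length of the string plays no part in it.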
Your second question raises an important fact about data compression. As strange as it may seem, the compressed data size
can be larger than the uncompressed size. Depending on your Deflater settings, the Deflater may wrap the compressed data in a header and a trailing checksum (in the default ZLIB format, a 2-byte header and a 4-byte Adler-32 checksum). These are used to decode the information and check it for errors. If you
deal with very small strings, it is likely that not much real compression has gone on, because there is little repetition for
the algorithm to exploit. Even cutting a string of 30 characters to 15, while a 50 percent reduction, saves only 15 characters.
When the savings are smaller than the added size of the header, the compressed string ends up longer than the original. You
will not see the benefits of compression until your data reaches a certain larger, precompression size. It's hard to say what
this size is, but generally it is where: (compressed size + header size) < uncompressed size. If your data is not large enough, you're wasting time using compression.
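You can check where that break-even point falls for your own data by measuring it. The following is a minimal sketch using the same ByteArrayOutputStream and DeflaterOutputStream combination mentioned in the question. The class name CompressedSizeDemo, its helper methods, and the sample strings are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

// Hypothetical demo class; the name, helpers, and sample data are illustrative only.
class CompressedSizeDemo {

    // Returns the number of bytes produced by a default DeflaterOutputStream,
    // which writes the ZLIB header and checksum around the compressed data.
    static int compressedSize(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the stream above flushes the remaining compressed bytes.
        return bytes.size();
    }

    // Builds a repetitive byte array by repeating a unit string.
    static byte[] repeated(String unit, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] small = "hello world".getBytes(StandardCharsets.UTF_8);
        byte[] large = repeated("hello world ", 100);

        System.out.println("small: " + small.length + " -> " + compressedSize(small));
        System.out.println("large: " + large.length + " -> " + compressedSize(large));
    }
}
```

With input like this, the short string should come out larger than it went in, while the long, repetitive one should shrink considerably. Running the same measurement on samples of your real data is the simplest way to decide whether compression is worth it.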
You may also want to consider some of the other compression settings. The Deflater lets you choose a compression level: some levels are optimized for speed, while others achieve better compression but take longer to compress. So the settings that you choose go a long way in determining the final size of your compressed data.
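As a rough illustration of how these settings affect the result, the sketch below compresses the same input at several of the level constants defined on Deflater and reports the output size. The class name CompressionLevelDemo, its helper method, and the sample text are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical demo class; the name, helper, and sample data are illustrative only.
class CompressionLevelDemo {

    // Returns the compressed size of the data at the given Deflater level.
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[1024];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer); // count bytes, discard contents
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("the quick brown fox jumps over the lazy dog ");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println("original:         " + data.length);
        System.out.println("NO_COMPRESSION:   " + compressedSize(data, Deflater.NO_COMPRESSION));
        System.out.println("BEST_SPEED:       " + compressedSize(data, Deflater.BEST_SPEED));
        System.out.println("BEST_COMPRESSION: " + compressedSize(data, Deflater.BEST_COMPRESSION));
    }
}
```

Note that NO_COMPRESSION still adds the format overhead, so its output is slightly larger than the input. How much the other levels differ from each other depends heavily on the data, so it is worth trying them on your own input.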