How to enable compression in Apache Hive?
In this post, I’m going to walk you through enabling compression in Apache Hive. Enabling compression in Hive can significantly reduce the storage space required and can also improve throughput and performance. This may seem counter-intuitive, because compressing and decompressing data incurs extra CPU overhead. However, the I/O savings from moving fewer bytes into memory can result in a net performance gain.
Hadoop jobs tend to be I/O bound rather than CPU bound, so compressing the data usually improves performance. However, if a job is CPU bound, compression will probably lower performance. The only way to really know is to experiment with different options and measure the results.
First of all, how do we identify the installed codecs?
Open the Hive console and run the command below:
hive> set io.compression.codecs;
The command prints the available codecs as a comma-separated list.
In this case, the available compression codecs are:
io.compression.codecs= org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.DeflateCodec, org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.Lz4Codec
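Before changing anything, it can help to check what the current session is using. Here is a minimal sketch that relies on the same behavior as the command above: set followed by only a property name prints that property's current value. The values printed depend on your cluster configuration, and a property that has never been set is typically reported as undefined.

```sql
-- Print the current value of each compression-related property.
-- With no "=value" part, set only reads the property; it changes nothing.
set hive.exec.compress.intermediate;
set hive.exec.compress.output;
set mapreduce.output.fileoutputformat.compress.codec;
```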
We can compress both the intermediate data and the final output, either through the configuration file or from the Hive shell.
How to enable Compression on Intermediate Data?
Compression of Hive intermediate data is enabled by setting the property hive.exec.compress.intermediate to true, either from the Hive shell using the set command or in the hive-site.xml file.
This property controls whether the intermediate files that Hive produces between multiple map-reduce jobs are compressed. The compression codec and other options are determined from the Hadoop config variables mapred.output.compress*.
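As a sketch of what this could look like from the Hive shell: the first command sets the property named above. The codec and type lines use hive.intermediate.compression.codec and hive.intermediate.compression.type, which are present in many Hive releases but are not covered in this post, so treat them as an assumption and check them against your Hive version. Snappy is used here only as an example of a fast codec from the list shown earlier.

```sql
-- Compress the files Hive writes between the stages of a multi-job query.
set hive.exec.compress.intermediate=true;

-- Assumption: these two properties are honored by your Hive version.
-- SnappyCodec appears in the io.compression.codecs list shown earlier.
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
```

Settings made with set last only for the current session. To make them permanent, the same property names and values can go in hive-site.xml.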
How to enable Compression on the final output?
Open the Hive shell and execute the commands below:
hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress=true;
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
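To check that the settings took effect, and to follow the earlier advice about measuring, one option is to write the same query result with two codecs and compare the sizes. This is only a sketch: the table name my_table and the /tmp/hive_out_* directories are hypothetical placeholders, it assumes the commands above have already been run in the same session, and Snappy will only work if the native Snappy libraries are available on your cluster.

```sql
-- Assumes hive.exec.compress.output=true is already set in this session.
-- my_table and the /tmp paths are placeholders; substitute your own.

-- Write the query result with Gzip.
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
insert overwrite directory '/tmp/hive_out_gzip' select * from my_table;

-- Write the same result with Snappy.
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
insert overwrite directory '/tmp/hive_out_snappy' select * from my_table;

-- List the output files, then compare the total size of each directory.
dfs -ls /tmp/hive_out_gzip;
dfs -du -s -h /tmp/hive_out_gzip;
dfs -du -s -h /tmp/hive_out_snappy;
```

With text output, the Gzip files would normally carry a .gz extension and the Snappy files a .snappy extension. Size is only half the picture, so it is worth noting how long each query takes as well.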
DONE. Happy Learning 🙂