How to enable compression in Apache Hive



In this post, I’m going to walk you through enabling compression in Apache Hive. Compression can significantly reduce the storage space required and can also improve throughput and performance. This may seem counter-intuitive, because compressing and decompressing data incurs extra CPU overhead; however, the I/O savings from moving fewer bytes into memory can result in a net performance gain.

Hadoop jobs tend to be I/O bound rather than CPU bound, so compressing the data usually improves performance. However, if a job is CPU bound, compression will probably lower performance. The only way to really know is to experiment with different options and measure the results.

First of all, how do we identify the installed codecs?

Open the Hive console and type the command below:

hive> set io.compression.codecs;

The command prints the available codecs as a comma-separated list. In this example, the available compression codecs are:

io.compression.codecs=
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.DeflateCodec,
org.apache.hadoop.io.compress.SnappyCodec,
org.apache.hadoop.io.compress.Lz4Codec

We can compress both the intermediate data and the final output, either through the configuration file or from the Hive shell.

How to enable Compression on Intermediate Data?


Compression of Hive intermediate data is enabled by setting the property hive.exec.compress.intermediate to true, either from the Hive shell with the set command or in the hive-site.xml file.

This property controls whether the intermediate files that Hive produces between multiple map-reduce jobs are compressed. The compression codec and other options are taken from the Hadoop configuration variables mapred.output.compress* (named mapreduce.output.fileoutputformat.compress* in newer Hadoop releases).

hive-site.xml
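As a minimal sketch of the hive-site.xml route, the entries below turn on intermediate compression. Only hive.exec.compress.intermediate comes from the text above. The two hive.intermediate.compression.* properties and the choice of Snappy are assumptions for illustration: check that your Hive version honors them and that the codec appears in your io.compression.codecs list.

```xml
<!-- hive-site.xml: merge these properties into your existing <configuration> element -->
<configuration>
  <!-- Compress files passed between the map-reduce jobs of one query -->
  <property>
    <name>hive.exec.compress.intermediate</name>
    <value>true</value>
  </property>
  <!-- Assumed codec choice; substitute any codec installed on your cluster -->
  <property>
    <name>hive.intermediate.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <!-- Assumed setting; BLOCK compresses groups of records together -->
  <property>
    <name>hive.intermediate.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
```

Values placed in hive-site.xml become the default for every session, whereas a set command in the Hive shell only affects the current session.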

How to enable Compression on the final output?

Open the Hive shell and run the commands below:

hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress=true;
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
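The same four settings can also be kept in the configuration file so that they apply to every session. The fragment below is a sketch that simply mirrors the shell commands above; it assumes your deployment picks up Hadoop job properties from hive-site.xml (some clusters keep the mapreduce.* entries in mapred-site.xml instead).

```xml
<!-- hive-site.xml: merge these properties into your existing <configuration> element -->
<configuration>
  <!-- Compress the final output of Hive queries -->
  <property>
    <name>hive.exec.compress.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <!-- Gzip, matching the shell example; any installed codec could be used -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <!-- The compression type applies to SequenceFile output -->
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
```

A session can still override any of these defaults with its own set command.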

DONE. Happy Learning 🙂