How to create an empty RDD in Apache Spark
In this post we will learn how to create an empty RDD in Apache Spark. We can create an empty RDD in multiple ways, as listed below:
- SparkContext emptyRDD method
- SparkContext parallelize method
- SparkContext with Empty Pair RDD
Using SparkContext emptyRDD method
Let’s look at the first way of creating an empty RDD. We can use the emptyRDD method available on SparkContext.
import org.apache.spark.sql.SparkSession

object EmptyRDDExample extends App {

  // Create a SparkContext object via SparkSession.
  val sparkContext = SparkSession.builder()
    .master("local[*]")
    .appName("Proedu.co examples")
    .getOrCreate()
    .sparkContext

  // Create an empty RDD of type String.
  val emptyRDD = sparkContext.emptyRDD[String]

  emptyRDD.saveAsTextFile("src/main/output/EmptyRDD")
}
Now, if you check the output folder, you will see only the _SUCCESS and _SUCCESS.crc files and no part files, as shown below:
src/main/output/EmptyRDD
  _SUCCESS.crc
  _SUCCESS
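If you want to confirm this behaviour before saving, a quick check along the lines below (a small sketch reusing the sparkContext and emptyRDD from the example above) shows that the RDD has no partitions at all, which is why nothing is written to disk.

// Sketch: verify the RDD created by emptyRDD has no partitions and no records.
println(emptyRDD.getNumPartitions) // 0 -> nothing for saveAsTextFile to write
println(emptyRDD.isEmpty())        // true
println(emptyRDD.count())          // 0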
Using SparkContext parallelize method
The second way of creating an empty RDD is the parallelize method. We will create an RDD of String, but keep it empty.
// Create an RDD of String, but keep it empty.
val rdd = sparkContext.parallelize(Seq.empty[String])
When we save the above RDD, it creates multiple part files, all of which are empty. In the first scenario the RDD was also empty, but no part files were written to disk.
// Save the RDD.
rdd.saveAsTextFile("src/main/output/EmptyRDDNew")

// Output directory contents. Eight empty part files are created.
src/main/output/EmptyRDDNew
  ._SUCCESS
  .part-00000
  .part-00001
  .part-00002
  .part-00003
  .part-00004
  .part-00005
  .part-00006
  .part-00007
  _SUCCESS
  part-00000
  part-00001
  part-00002
  part-00003
  part-00004
  part-00005
  part-00006
  part-00007
Why did Spark create empty partition files?
Let’s understand what happened in the above scenarios. In both cases the RDD is empty, but the real difference comes from the number of partitions, which is determined by the method def getPartitions: Array[Partition]. In the implementation of EmptyRDD (the first approach) it returns Array.empty, which means that any loop over the partitions yields an empty result, so no partition files are created. In the case of an RDD with no records (the second approach), a set of partitions is still defined, so an empty part file is written for each partition.
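To see this difference concretely, here is a minimal sketch (reusing the sparkContext from the earlier examples) that compares the partition counts of the two empty RDDs. The exact count for the parallelized RDD depends on your master setting and defaultParallelism, so eight is just an example.

// emptyRDD: getPartitions returns Array.empty, so there are no partitions
// and nothing is written to disk.
val viaEmptyRDD = sparkContext.emptyRDD[String]
println(viaEmptyRDD.getNumPartitions)    // 0

// parallelize(Seq.empty): partitions exist but hold no records,
// so Spark writes one empty part file per partition.
val viaParallelize = sparkContext.parallelize(Seq.empty[String])
println(viaParallelize.getNumPartitions) // sparkContext.defaultParallelism, e.g. 8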
Using SparkContext with Empty Pair RDD
Finally, let’s look at the third way of creating an empty RDD using SparkContext. First create a type alias for the pair/tuple type and then use SparkContext to create an empty RDD, as shown below.
scala> type pairRDD = (String,Int)
defined type alias pairRDD

scala> sc.emptyRDD[pairRDD]
res1: org.apache.spark.rdd.RDD[pairRDD] = EmptyRDD[0] at emptyRDD at <console>:27
When we save the pair RDD, the output path will only contain the _SUCCESS and _SUCCESS.crc files and no partition files.
src/main/output/EmptyPairRDD
  ._SUCCESS
  _SUCCESS
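As a side note, one place an empty pair RDD comes in handy is as a neutral starting value when combining a variable number of pair RDDs. The sketch below is only illustrative (the input RDDs and their contents are made up), but it shows the idea: folding from an empty RDD keeps the code valid even when the input list is empty.

import org.apache.spark.rdd.RDD

type pairRDD = (String, Int)

// Illustrative inputs; in real code these would come from elsewhere.
val first  = sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
val second = sparkContext.parallelize(Seq(("a", 3)))
val inputs: Seq[RDD[pairRDD]] = Seq(first, second) // may also be empty

// Fold starting from the empty pair RDD; the result is a valid RDD either way.
val empty: RDD[pairRDD] = sparkContext.emptyRDD[pairRDD]
val combined = inputs.foldLeft(empty)(_ union _)

combined.reduceByKey(_ + _).collect().foreach(println) // e.g. (a,4) and (b,2)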
Complete Example
import org.apache.spark.sql.SparkSession

object EmptyRDDExample extends App {

  // Create a SparkContext object via SparkSession.
  val sparkContext = SparkSession.builder()
    .master("local[*]")
    .appName("Proedu.co examples")
    .getOrCreate()
    .sparkContext

  // Create an empty RDD of type String.
  val emptyRDD = sparkContext.emptyRDD[String]
  emptyRDD.saveAsTextFile("src/main/output/EmptyRDD")

  // Create an RDD of String, but keep it empty.
  val rdd = sparkContext.parallelize(Seq.empty[String])
  rdd.saveAsTextFile("src/main/output/EmptyRDDNew")

  // Empty Pair RDD.
  type pairRDD = (String, Int)
  val pairRDD = sparkContext.emptyRDD[pairRDD]
  pairRDD.saveAsTextFile("src/main/output/EmptyPairRDD")
}
Happy Learning 🙂