How to create an empty RDD in Apache Spark

In this post, we will learn how to create an empty RDD in Apache Spark. We can create an empty RDD in several ways, as listed below:

  • SparkContext emptyRDD method
  • SparkContext parallelize method
  • SparkContext with Empty Pair RDD

Using the SparkContext emptyRDD method

Let’s look at the first way of creating an empty RDD: the emptyRDD method available on SparkContext.

import org.apache.spark.sql.SparkSession
object EmptyRDDExample extends App {
  // Build a SparkSession and grab its SparkContext.
  val sparkContext = SparkSession.builder()
    .master("local[*]")
    .appName("Proedu.co examples")
    .getOrCreate()
    .sparkContext
  // Create an empty RDD of type String.
  val emptyRDD = sparkContext.emptyRDD[String]
  emptyRDD.saveAsTextFile("src/main/output/EmptyRDD")
}

Now, if you check the output folder, you will see only the _SUCCESS marker file and its hidden ._SUCCESS.crc checksum file, and no part files, as shown below:

src/main/output/EmptyRDD
._SUCCESS.crc
_SUCCESS
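
We can also confirm this without writing anything to disk. A quick sanity check, added to the same program, shows the RDD has no records and, notably, no partitions at all:

// Sanity checks on the empty RDD (no save required).
println(emptyRDD.isEmpty())        // true - no records
println(emptyRDD.getNumPartitions) // 0    - EmptyRDD has no partitions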

Using the SparkContext parallelize method

The second way to create an empty RDD is the parallelize method. We will create an RDD of String, but keep it empty.

// Create an RDD of String, but keep it empty.
val rdd = sparkContext.parallelize(Seq.empty[String])

When we save the above RDD, Spark creates multiple part files, all of which are empty. In the first scenario the RDD was also empty, but no part files were written to disk.

// Save RDD 
rdd.saveAsTextFile("src/main/output/EmptyRDDNew")

// Output directory contents. Eight empty part files are created, each with a hidden .crc checksum file.
src/main/output/EmptyRDDNew
._SUCCESS.crc
.part-00000.crc
.part-00001.crc
.part-00002.crc
.part-00003.crc
.part-00004.crc
.part-00005.crc
.part-00006.crc
.part-00007.crc
_SUCCESS
part-00000
part-00001
part-00002
part-00003
part-00004
part-00005
part-00006
part-00007

Why did Spark create empty partition files?

Let’s understand what happened in the two scenarios above. In both cases the RDD is empty, but the real difference is the number of partitions, which every RDD reports through def getPartitions: Array[Partition]. EmptyRDD (the first approach) returns Array.empty, so there are no partitions to iterate over and no part files are created. An RDD built with parallelize, on the other hand, still has a set of partitions defined (by default, one per slice of spark.default.parallelism) even though none of them contain any records, so the save writes one empty part file per partition.
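
We can observe this difference directly by printing the partition counts, and we can pass an explicit numSlices argument to parallelize to control how many (empty) part files a save produces. A minimal sketch:

// EmptyRDD reports zero partitions, so a save writes no part files.
println(sparkContext.emptyRDD[String].getNumPartitions)               // 0

// parallelize defaults to spark.default.parallelism partitions,
// e.g. one per core with master local[*].
println(sparkContext.parallelize(Seq.empty[String]).getNumPartitions) // e.g. 8

// An explicit numSlices of 1 means a save writes a single empty part file.
val onePartition = sparkContext.parallelize(Seq.empty[String], numSlices = 1)
println(onePartition.getNumPartitions)                                // 1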

Using SparkContext with an Empty Pair RDD

Finally, let’s look at the third way of creating an empty RDD using SparkContext. First define a type alias for a pair (tuple), then use SparkContext to create an empty RDD of that type, as shown below.

scala> type pairRDD = (String,Int)
defined type alias pairRDD

scala> sc.emptyRDD[pairRDD]
res1: org.apache.spark.rdd.RDD[pairRDD] = EmptyRDD[0] at emptyRDD at <console>:27
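
We can then save it from the shell using the res1 reference Spark printed above (the output path matches the complete example at the end of the post):

scala> res1.saveAsTextFile("src/main/output/EmptyPairRDD")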

When we save the pair RDD, the output path again contains only the _SUCCESS marker and its checksum file, and no part files.

src/main/output/EmptyPairRDD
._SUCCESS.crc
_SUCCESS
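
Why create an empty pair RDD at all? One common use is as a neutral starting value when merging RDDs in a loop, since a union with an empty RDD changes nothing. A minimal sketch (the batches collection below is a made-up example):

// Hypothetical keyed batches arriving one at a time.
val batches: Seq[Seq[(String, Int)]] = Seq(Seq(("a", 1), ("b", 2)), Seq(("a", 3)))

// Start from an empty pair RDD and union each batch into it.
var merged = sparkContext.emptyRDD[(String, Int)]
for (batch <- batches) {
  merged = merged.union(sparkContext.parallelize(batch))
}

// Pair-RDD operations such as reduceByKey work as usual.
println(merged.reduceByKey(_ + _).collect().toList) // e.g. List((a,4), (b,2))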

Complete Example

import org.apache.spark.sql.SparkSession
object EmptyRDDExample extends App {
  // Build a SparkSession and grab its SparkContext.
  val sparkContext = SparkSession.builder()
    .master("local[*]")
    .appName("Proedu.co examples")
    .getOrCreate()
    .sparkContext
  // Create an empty RDD of type String.
  val emptyRDD = sparkContext.emptyRDD[String]
  emptyRDD.saveAsTextFile("src/main/output/EmptyRDD")
  // Create an RDD of String, but keep it empty.
  val rdd = sparkContext.parallelize(Seq.empty[String])
  rdd.saveAsTextFile("src/main/output/EmptyRDDNew")
  // Create an empty pair RDD using a type alias.
  type pairRDD = (String,Int)
  val pairRDD = sparkContext.emptyRDD[pairRDD]
  pairRDD.saveAsTextFile("src/main/output/EmptyPairRDD")
}

Happy Learning 🙂