How To Create RDD Using Spark Context Parallelize Method


In this post we will learn how to create a Spark RDD using SparkContext's parallelize method. Resilient Distributed Dataset (RDD) is the most basic building block in Apache Spark. An RDD is a collection of objects that is partitioned and distributed across nodes in a cluster.

Creating RDD in Spark Shell

SparkContext is automatically created with the name “sc” in the Spark Shell. The SparkContext class has a method “parallelize” which accepts a scala.collection.Seq[T] as an input argument. This method distributes a local Scala collection to form an RDD.

// Creating a local Scala collection.
scala> val days = List("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
days: List[String] = List(Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)

// Using sc (SparkContext) to create RDD.
scala> val daysRDD = sc.parallelize(days)
daysRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:26

// Collect and show RDD contents.
scala> daysRDD.collect.foreach(println)
Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday

We can also specify the number of partitions while creating an RDD with the sc.parallelize method.

// Providing the number of partitions to divide the collection into.
scala> val daysRDD = sc.parallelize(days,3)
daysRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:26

// Validate the number of partitions.
scala> daysRDD.getNumPartitions
res2: Int = 3
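
To check how the elements are spread across those three partitions, we can use the RDD's glom method, which groups the contents of each partition into an array. The sketch below simply prints one line per partition; the exact placement of elements is decided by Spark.

// Print the contents of each partition on its own line.
scala> daysRDD.glom().collect().foreach(partition => println(partition.mkString(", ")))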

Creating RDD in IDE

In this example we will use an IDE to create a Spark RDD. First, we will create a SparkContext object as shown below.

import org.apache.spark.{SparkConf, SparkContext}

// Create SparkConf object.
val sparkConf = new SparkConf()
sparkConf.setAppName("RDDCreation")
sparkConf.setMaster("local[*]")

// Create SparkContext object using SparkConf.
val sc = new SparkContext(sparkConf)

We can also obtain a SparkContext from a SparkSession object:

import org.apache.spark.sql.SparkSession

val sparkContext = SparkSession.builder().master("local[*]")
    .appName("RDDCreation")
    .getOrCreate()
    .sparkContext
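
The SparkContext obtained this way behaves exactly like one built from a SparkConf. As a minimal sketch (assuming the spark-sql dependency is on the classpath), the same kind of RDD can be created through it:

// Build an RDD through the SparkContext exposed by SparkSession.
val daysRDD = sparkContext.parallelize(List("Sunday", "Monday", "Tuesday"))
daysRDD.collect().foreach(println)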

Complete Example

import org.apache.spark.{SparkConf, SparkContext}
object RDDCreation extends App {
  // Creating a local Scala collection.
  val days = List("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
  // Create SparkConf object.
  val sparkConf = new SparkConf()
  sparkConf.setAppName("RDDCreation")
  sparkConf.setMaster("local[*]")
  // Create SparkContext from SparkConf object.
  val sc = new SparkContext(sparkConf)
  // Using sc (SparkContext) to create RDD.
  val daysRDD = sc.parallelize(days)
  // Collect and show RDD contents.
  daysRDD.collect.foreach(println)
}
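
In a standalone application it is good practice to stop the context once the work is done, so the object above would typically end with:

  // Stop the SparkContext and release its resources.
  sc.stop()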

How to create an empty RDD

scala> sc.parallelize(Seq.empty)
res4: org.apache.spark.rdd.RDD[Nothing] = ParallelCollectionRDD[3] at parallelize at <console>:25
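
The RDD above has element type Nothing, which is rarely useful in practice. A minimal sketch for creating an empty RDD with a concrete element type, using either an explicitly typed empty sequence or SparkContext's emptyRDD method:

// Empty RDD with a concrete element type.
scala> val emptyStrings = sc.parallelize(Seq.empty[String])
scala> val alsoEmpty = sc.emptyRDD[String]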

Happy Learning 🙂