How to create an RDD using the SparkContext parallelize method
In this post we will learn how to create a Spark RDD using SparkContext’s parallelize method. The Resilient Distributed Dataset (RDD) is the most basic building block in Apache Spark: a collection of objects that is partitioned and distributed across the nodes of a cluster.
Creating an RDD in the Spark Shell
A SparkContext is automatically created with the name “sc” in the Spark shell. The SparkContext class has a method called parallelize, which accepts a scala.collection.Seq[T] as its input argument and distributes that local Scala collection to form an RDD.
// Creating a local Scala collection.
scala> val days = List("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
days: List[String] = List(Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)

// Using sc (SparkContext) to create an RDD.
scala> val daysRDD = sc.parallelize(days)
daysRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:26

// Collect and show the RDD contents.
scala> daysRDD.collect.foreach(println)
Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
We can also specify the number of partitions while creating an RDD with the sc.parallelize method.
// Providing the number of partitions to divide the collection into.
scala> val daysRDD = sc.parallelize(days, 3)
daysRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:26

// Validate the number of partitions.
scala> daysRDD.getNumPartitions
res2: Int = 3
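If you want to see how the elements were distributed, the glom transformation (part of the standard RDD API) gathers the contents of each partition into an array. The exact grouping depends on how Spark slices the collection, so treat the command below as a sketch rather than a guaranteed layout.

// Inspect the contents of each partition (the grouping Spark produces may vary).
scala> daysRDD.glom().collect().foreach(partition => println(partition.mkString(", ")))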
Creating an RDD in an IDE
In this example we will use an IDE to create a Spark RDD. First we create a SparkContext object as shown below.
// Imports for SparkConf and SparkContext.
import org.apache.spark.{SparkConf, SparkContext}

// Create SparkConf object.
val sparkConf = new SparkConf()
sparkConf.setAppName("RDDCreation")
sparkConf.setMaster("local[*]")

// Create SparkContext object using SparkConf.
val sc = new SparkContext(sparkConf)
We can also obtain a SparkContext from a SparkSession object.
// Import for SparkSession (provided by the spark-sql module).
import org.apache.spark.sql.SparkSession

val sparkContext = SparkSession.builder()
  .master("local[*]")
  .appName("RDDCreation")
  .getOrCreate()
  .sparkContext
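The SparkContext obtained this way is used exactly like the one built from SparkConf. Below is a minimal self-contained sketch of the same RDD creation via SparkSession; the object and variable names are illustrative and not from the original example.

import org.apache.spark.sql.SparkSession

object RDDCreationWithSession extends App {

  // Build a local SparkSession and take its SparkContext.
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("RDDCreationWithSession")
    .getOrCreate()
  val sc = spark.sparkContext

  // Create an RDD from a local collection.
  val days = List("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
  val daysRDD = sc.parallelize(days)
  daysRDD.collect.foreach(println)

  // Stop the session when done.
  spark.stop()
}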
Complete Example
import org.apache.spark.{SparkConf, SparkContext}

object RDDCreation extends App {

  // Creating a local Scala collection.
  val days = List("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

  // Create SparkConf object.
  val sparkConf = new SparkConf()
  sparkConf.setAppName("RDDCreation")
  sparkConf.setMaster("local[*]")

  // Create SparkContext from SparkConf object.
  val sc = new SparkContext(sparkConf)

  // Using sc (SparkContext) to create RDD.
  val daysRDD = sc.parallelize(days)

  // Collect and show RDD contents.
  daysRDD.collect.foreach(println)
}
How to create an empty RDD
scala> sc.parallelize(Seq.empty)
res4: org.apache.spark.rdd.RDD[Nothing] = ParallelCollectionRDD[3] at parallelize at <console>:25
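Because no element type is given, the resulting RDD has type RDD[Nothing], which is rarely what you want downstream. You can pin the element type either by typing the empty Seq or by using SparkContext's emptyRDD method; both calls below are standard Spark API, and the variable names are only for illustration.

// Empty RDD with an explicit element type (instead of RDD[Nothing]).
val emptyStringsRDD = sc.parallelize(Seq.empty[String])

// Equivalent, using SparkContext's emptyRDD method.
val alsoEmptyRDD = sc.emptyRDD[String]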
Happy Learning 🙂