textFile and wholeTextFiles methods in Apache Spark
In this post we will learn how to use the textFile and wholeTextFiles methods in Apache Spark to read one or more text files into a single Spark RDD.
Reading Multiple text files from a directory
Let’s see how we can use the textFile method to read multiple text files from a directory. Below is the signature of the textFile method.
def textFile(path: String, minPartitions: Int): org.apache.spark.rdd.RDD[String]

Where:
path = path containing the text (partition) files.
minPartitions = minimum number of partitions for the resulting RDD.
Here we will use the textFile method and provide a path containing multiple part files as the argument.
// Reading a path containing multiple text files.
val file = "src/main/resources/retail_db/categories-multipart"

// Creating a SparkContext object.
val sparkContext = SparkSession.builder()
  .master("local[*]")
  .appName("Proedu.co examples")
  .getOrCreate()
  .sparkContext

// Read a directory containing text files.
val data = sparkContext.textFile(file)

// Collecting and printing the data.
data.collect().foreach(println)
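The minPartitions parameter from the signature above can also be passed explicitly, as a hint for the minimum number of partitions in the resulting RDD. A minimal sketch, reusing the file path and sparkContext from the snippet above (the partition count 4 is purely illustrative):

```scala
// Ask Spark for at least 4 partitions when reading the directory.
// (4 is an arbitrary illustrative value, not a recommendation.)
val partitionedData = sparkContext.textFile(file, minPartitions = 4)

// getNumPartitions reports how many partitions Spark actually created.
println(s"Number of partitions: ${partitionedData.getNumPartitions}")
```

Spark may create more partitions than requested (for example, one per file split), so treat minPartitions as a lower bound rather than an exact count.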
Now let’s read a directory of text files using the wholeTextFiles method. The files can be present in HDFS, a local file system , or any Hadoop-supported file system URI. In this scenario, Spark reads each file as a single record and returns it in a key-value pair, where the key is the path of each file, and the value is the content of each file. The encoding of the text files must be UTF-8.
For example : Our input path contains below files
src/main/resources/retail_db/categories-multipart
  part-m-00000
  part-m-00001
Then val rdd = sparkContext.wholeTextFiles("src/main/resources/retail_db/categories-multipart") will return the below RDD.
(src/main/resources/retail_db/categories-multipart/part-m-00000, contents of part-m-00000)
(src/main/resources/retail_db/categories-multipart/part-m-00001, contents of part-m-00001)
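As a runnable sketch of the key-value behavior described above, reusing the sparkContext and file values from the earlier snippet, we can print each file path together with the first line of its content:

```scala
// Each element of the RDD is a (filePath, fileContent) pair.
val wholeFilesRDD = sparkContext.wholeTextFiles(file)

wholeFilesRDD.collect().foreach { case (path, content) =>
  // The key is the full path of the file.
  println(s"File: $path")
  // The value is the entire file content as a single String.
  println(s"First line: ${content.split("\n").headOption.getOrElse("")}")
}
```

Because each file becomes a single record, wholeTextFiles is best suited to many small files; a single very large file would have to fit in memory as one String.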
Reading text files with a matching pattern
In this scenario we will again use the textFile method, but instead of passing a directory path, we will pass a glob pattern.
For example : Our input path contains below files
src/main/resources/retail_db/categories-multipart
  part-m-00000
  part-m-00001
Here we will read all files whose names contain the pattern 00001.
// Reading text files with a matching pattern.
val patternRDD = sparkContext.textFile("src/main/resources/retail_db/categories-multipart/*00001")
patternRDD.collect().foreach(println)
Reading files from multiple directories
The textFile method also supports reading from multiple locations. We can pass a comma-separated list of file and directory paths as an argument to the textFile method.
For example : Our input path contains below files
src/main/resources/retail_db/categories
  part-m-00000
src/main/resources/retail_db/categories-multipart
  part-m-00000
  part-m-00001
Let’s read from the two directories mentioned above.
// Reading files from multiple directories.
val multiDirRDD = sparkContext.textFile("src/main/resources/retail_db/categories-multipart/part-m-00001,src/main/resources/retail_db/categories/")
multiDirRDD.collect().foreach(println)
Scala Example
import org.apache.spark.sql.SparkSession

object WholeTextFilesExample extends App {

  val file = "src/main/resources/retail_db/categories-multipart"

  // Creating a SparkContext object.
  val sparkContext = SparkSession.builder()
    .master("local[*]")
    .appName("Proedu.co examples")
    .getOrCreate()
    .sparkContext

  // Set log level to ERROR.
  sparkContext.setLogLevel("ERROR")

  // Read a directory containing text files.
  val data = sparkContext.textFile(file)
  // data.collect().foreach(println)

  // Reading a directory of text files using the wholeTextFiles method.
  val dataRDD = sparkContext.wholeTextFiles(file)
  // dataRDD.take(10).foreach(println)

  // Reading text files with a matching pattern.
  val patternRDD = sparkContext.textFile("src/main/resources/retail_db/categories-multipart/*00001")
  // patternRDD.collect().foreach(println)

  // Reading files from multiple directories.
  val multiDirRDD = sparkContext.textFile("src/main/resources/retail_db/categories-multipart/part-m-00001,src/main/resources/retail_db/categories/")
  multiDirRDD.collect().foreach(println)
}
Happy Learning 🙂