How to read a file using textFile and wholeTextFiles methods in Apache Spark



In this post we will learn how to use the textFile and wholeTextFiles methods in Apache Spark to read one or more text files into a single Spark RDD.

Reading Multiple text files from a directory

Let’s see how we can use the textFile method to read multiple text files from a directory. Below is the signature of the textFile method.

def textFile(path: String, minPartitions: Int = defaultMinPartitions): org.apache.spark.rdd.RDD[String]

Where,
path = Path to the text file(s) or directory to read.
minPartitions = Suggested minimum number of partitions for the resulting RDD (optional).

Here we will use the textFile method, passing a path that contains multiple part files as the argument.

  // Required import for SparkSession.
  import org.apache.spark.sql.SparkSession

  // Path containing multiple text files.
  val file = "src/main/resources/retail_db/categories-multipart"

  // Obtaining a SparkContext from a SparkSession.
  val sparkContext = SparkSession.builder()
    .master("local[*]")
    .appName("Proedu.co examples")
    .getOrCreate()
    .sparkContext

  // Read a directory containing text files.
  val data = sparkContext.textFile(file)

  // Collecting and printing the data.
  data.collect().foreach(println)
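The minPartitions argument from the signature above can also be passed explicitly. Below is a minimal sketch using the same sample path; the partition count of 4 is illustrative, not a recommendation.

```scala
// Sketch: requesting at least 4 partitions when reading the same path.
val partitionedData = sparkContext.textFile(file, minPartitions = 4)

// Spark may create more partitions than requested, but not fewer.
println(s"Number of partitions: ${partitionedData.getNumPartitions}")
```

A higher minPartitions value can increase parallelism when the input is small relative to the cluster.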

Now let’s read a directory of text files using the wholeTextFiles method. The files can be present in HDFS, a local file system, or any Hadoop-supported file system URI. In this scenario, Spark reads each file as a single record and returns it as a key-value pair, where the key is the path of the file and the value is its content. The encoding of the text files must be UTF-8.

For example, our input path contains the files below:

src/main/resources/retail_db/categories-multipart
part-m-00000
part-m-00001

Then val rdd = sparkContext.wholeTextFiles("src/main/resources/retail_db/categories-multipart") will return an RDD like the one below.

(src/main/resources/retail_db/categories-multipart/part-m-00000, its content)
(src/main/resources/retail_db/categories-multipart/part-m-00001, its content)
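As a runnable sketch of the key-value structure (reusing the sample directory above), each record can be destructured into its path and content:

```scala
// Read each file in the directory as one (path, content) record.
val filesRDD = sparkContext.wholeTextFiles("src/main/resources/retail_db/categories-multipart")

// Print each file's path and the number of lines in its content.
filesRDD.collect().foreach { case (path, content) =>
  println(s"$path -> ${content.split("\n").length} lines")
}
```

Because every file becomes a single record, wholeTextFiles is best suited to many small files rather than a few large ones.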

Reading text files with a matching pattern

In this scenario we will still use the textFile method, but instead of passing the directory path we will pass a glob pattern.

For example, our input path contains the files below:

src/main/resources/retail_db/categories-multipart
part-m-00000
part-m-00001

Here we will read all files whose names match the pattern 00001.

// Reading text files with a matching pattern.
val patternRDD = sparkContext.textFile("src/main/resources/retail_db/categories-multipart/*00001")
  
patternRDD.collect().foreach(println)
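Beyond `*`, the Hadoop-style glob syntax that textFile accepts also supports `?`, character classes such as `[0-9]`, and alternation with `{}`. A small sketch against the same sample directory:

```scala
// Alternation: read both part files explicitly via a {...} glob.
val bothParts = sparkContext.textFile(
  "src/main/resources/retail_db/categories-multipart/part-m-{00000,00001}")

bothParts.collect().foreach(println)
```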

Reading files from multiple directories

The textFile method also supports reading from multiple locations. We can pass comma-separated paths of files and directories as a single argument to the textFile method.

For example, our input path contains the files below:

src/main/resources/retail_db/categories
part-m-00000

src/main/resources/retail_db/categories-multipart
part-m-00000
part-m-00001

Let’s read from the two directories mentioned above.

// Reading files from multiple directories
val multiDirRDD = sparkContext.textFile("src/main/resources/retail_db/categories-multipart/part-m-00001,src/main/resources/retail_db/categories/")
  
multiDirRDD.collect().foreach(println)

Scala Example

import org.apache.spark.sql.SparkSession
object WholeTextFilesExample extends App {
  val file = "src/main/resources/retail_db/categories-multipart"
  // Obtaining a SparkContext from a SparkSession.
  val sparkContext = SparkSession.builder()
    .master("local[*]")
    .appName("Proedu.co examples")
    .getOrCreate()
    .sparkContext
  // Set Log level to ERROR
  sparkContext.setLogLevel("ERROR")
  // Read a directory containing text files.
  val data = sparkContext.textFile(file)
  //data.collect().foreach(println)
  // Reading a text file using wholeTextFiles method.
  val dataRDD = sparkContext.wholeTextFiles(file)
  //dataRDD.take(10).foreach(println)
  // Reading text files with a matching pattern.
  val patternRDD = sparkContext.textFile("src/main/resources/retail_db/categories-multipart/*00001")
  //patternRDD.collect().foreach(println)
  // Reading files from multiple directories
  val multiDirRDD = sparkContext.textFile("src/main/resources/retail_db/categories-multipart/part-m-00001,src/main/resources/retail_db/categories/")
  multiDirRDD.collect().foreach(println)
}

Github Code

https://github.com/proedu-organisation/spark-scala-examples/blob/main/src/main/scala/rdd/WholeTextFilesExample.scala

Happy Learning 🙂