Apache Spark RDD reduceByKey transformation
reduceByKey(func) converts a dataset of (K, V) pairs into a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function.
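A minimal sketch, assuming a SparkContext named sc is already available (as in spark-shell):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val sums = pairs.reduceByKey(_ + _)   // yields ("a", 4), ("b", 2)
    sums.collect().foreach(println)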
mapPartitions and mapPartitionsWithIndex perform a map operation on an entire partition at a time and return a new RDD.
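As a hedged illustration with the same sc: mapPartitions receives an Iterator over one partition and must return a new Iterator, while mapPartitionsWithIndex additionally receives the partition index:

    val nums = sc.parallelize(1 to 10, 2)   // 2 partitions
    val partitionSums = nums.mapPartitions(iter => Iterator(iter.sum))              // one sum per partition
    val tagged = nums.mapPartitionsWithIndex((idx, iter) => iter.map(n => (idx, n)))   // tag each element with its partition index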
As per the Apache Spark documentation, groupBy returns an RDD of grouped items where each group consists of a key and a sequence of elements.
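A small sketch grouping words by their first letter (the grouping function here is made up for illustration):

    val words = sc.parallelize(Seq("apple", "banana", "avocado"))
    val byFirstLetter = words.groupBy(word => word.head)
    // ('a', Iterable("apple", "avocado")), ('b', Iterable("banana"))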
groupByKey([numPartitions]) is called on a dataset of (K, V) pairs and returns a dataset of (K, Iterable<V>) pairs.
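For example, assuming sc is in scope:

    val scores = sc.parallelize(Seq(("math", 90), ("math", 80), ("art", 75)))
    val grouped = scores.groupByKey()   // ("math", Iterable(90, 80)), ("art", Iterable(75))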
As per the Apache Spark documentation, filter(func) returns a new dataset formed by selecting those elements of the source on which func returns true.
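A minimal sketch keeping only even numbers:

    val nums = sc.parallelize(1 to 10)
    val evens = nums.filter(n => n % 2 == 0)   // 2, 4, 6, 8, 10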
flatMap(func) is similar to map, but each input item can be mapped to 0 or more output items, so func should return a scala.collection.Seq rather than a single item.
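A small illustration, splitting lines into words so that one input line maps to several output items:

    val lines = sc.parallelize(Seq("hello world", "learn spark"))
    val words = lines.flatMap(line => line.split(" "))   // "hello", "world", "learn", "spark"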
The map(func) transformation returns a new distributed dataset formed by passing each element of the source through a function func.
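For example:

    val nums = sc.parallelize(Seq(1, 2, 3))
    val squared = nums.map(n => n * n)   // 1, 4, 9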
In this post we will use textFile and wholeTextFiles in Apache Spark to read a single text file and multiple text files into a single Spark RDD.
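A sketch with hypothetical paths, assuming sc is available:

    val lines = sc.textFile("data/notes.txt")   // RDD[String], one element per line
    val files = sc.wholeTextFiles("data/")      // RDD[(String, String)] of (filePath, fileContent)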
We can create an empty RDD in three ways: using SparkContext's emptyRDD method, using the parallelize method with an empty collection of type String, or as an empty pair RDD.
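The three approaches, sketched:

    val empty1 = sc.emptyRDD[String]                        // emptyRDD method
    val empty2 = sc.parallelize(Seq.empty[String])          // parallelize an empty String collection
    val empty3 = sc.parallelize(Seq.empty[(String, Int)])   // empty pair RDD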
A Spark RDD can be created from a local collection, from a text file, by reading from a database, or by calling a transformation on an existing RDD.
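For instance (reading from a database usually goes through the DataFrame API, so only the other three sources are sketched; the file path is hypothetical):

    val fromCollection = sc.parallelize(Seq(1, 2, 3))   // from a local collection
    val fromFile = sc.textFile("data/notes.txt")        // from a text file
    val doubled = fromCollection.map(n => n * 2)        // transformation on an existing RDD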
In this article we will dive into the basic concept of broadcast variables. At a very high level, a broadcast variable is a Spark shared variable.
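A minimal sketch: the broadcast value is shipped once to each executor and read via .value (the lookup map here is made up for illustration):

    val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))
    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val names = codes.map(code => countryNames.value.getOrElse(code, "Unknown"))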
Hi All, in this post I will tell you how to install Spark and PySpark on CentOS.
In this video we will understand how to work with Avro data in Apache Spark. For the demo we are using Spark 2.4 and the Scala language.
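A hedged sketch, assuming a SparkSession named spark and the external spark-avro module on the classpath (for Spark 2.4 it can be added with --packages org.apache.spark:spark-avro_2.11:2.4.0); the paths are hypothetical:

    val users = spark.read.format("avro").load("data/users.avro")
    users.write.format("avro").save("data/users_copy.avro")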
In this video we will understand how to work with DataFrame Columns in Apache Spark.
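A small sketch of column operations, assuming a SparkSession named spark:

    import org.apache.spark.sql.functions.col
    val df = spark.range(3).withColumn("doubled", col("id") * 2)   // add a derived column
    val renamed = df.withColumnRenamed("doubled", "twice_id")      // rename it
    renamed.show()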
In this video we will learn how to work with JSON data in Apache Spark.
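For example, with a hypothetical line-delimited JSON file and a SparkSession named spark:

    val people = spark.read.json("data/people.json")   // expects one JSON object per line by default
    people.printSchema()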
In this video we will understand DataFrame abstraction in Spark.
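A minimal sketch in spark-shell, where the SparkSession is bound to the val spark:

    import spark.implicits._
    val df = Seq(("alice", 34), ("bob", 25)).toDF("name", "age")
    df.filter($"age" > 30).show()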
How to set up a Spark 2.4 cluster on Google Cloud using Dataproc. Step 1 - Create a new project. Step 2 - Create a new cluster using Dataproc.
In this post I will tell you how to install Apache Spark on a Windows machine. By the end of this tutorial you’ll be able to use Spark with Scala on Windows.