How to create a Spark DataFrame in different ways
How to create a Spark DataFrame in different ways: toDF, createDataFrame, spark.read.csv, spark.read.json, spark.read.avro.
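A minimal sketch of those creation paths, assuming a local SparkSession and hypothetical file paths:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("create-df").master("local[*]").getOrCreate()
import spark.implicits._

// 1. toDF on a local collection
val df1 = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// 2. createDataFrame from an RDD of Rows plus an explicit schema
val rows   = spark.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
val schema = StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))
val df2 = spark.createDataFrame(rows, schema)

// 3. Reading files (paths are hypothetical)
val csvDF  = spark.read.option("header", "true").csv("/data/people.csv")
val jsonDF = spark.read.json("/data/people.json")
val avroDF = spark.read.format("avro").load("/data/people.avro") // needs the spark-avro package on 2.4
```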
reduceByKey(func) converts a dataset of (K, V) pairs into a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function.
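For example, summing the values per key, assuming a SparkContext named sc:

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _) // combine all values that share a key
summed.collect() // Array((a,4), (b,2))
```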
mapPartitions and mapPartitionsWithIndex perform a map operation on an entire partition and return a new RDD.
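A small sketch of both, again assuming a SparkContext named sc:

```scala
val rdd = sc.parallelize(1 to 10, numSlices = 2)
// mapPartitions receives one iterator per partition
val doubled = rdd.mapPartitions(iter => iter.map(_ * 2))
// mapPartitionsWithIndex also receives the partition index
val tagged = rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x)))
```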
As per Apache Spark documentation, groupBy returns an RDD of grouped items where each group consists of a key and a sequence of elements.
groupByKey([numPartitions]) is called on a dataset of (K, V) pairs, and returns a dataset of (K, Iterable&lt;V&gt;) pairs.
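The difference in a small sketch (assuming a SparkContext named sc): groupBy derives the key from each element, while groupByKey works on existing (K, V) pairs:

```scala
// groupBy: the key is computed from each element
val words = sc.parallelize(Seq("apple", "banana", "avocado"))
val byFirstLetter = words.groupBy(_.head) // ('a', [apple, avocado]), ('b', [banana])

// groupByKey: elements are already (K, V) pairs
val pairs   = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.groupByKey() // (a, Iterable(1, 2)), (b, Iterable(3))
```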
As per the Apache Spark documentation, filter(func) returns a new dataset formed by selecting those elements of the source on which func returns true.
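For example, keeping only the even numbers (sc assumed):

```scala
val nums  = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0) // keep elements where the predicate is true
evens.collect() // Array(2, 4, 6, 8, 10)
```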
flatMap(func) is similar to map, but each input item can be mapped to 0 or more output items, so func should return a scala.collection.Seq rather than a single item.
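A typical use is splitting lines into words, where one line yields many output elements (sc assumed):

```scala
val lines = sc.parallelize(Seq("hello world", "hi"))
val words = lines.flatMap(_.split(" ")) // each line maps to 0..n words
words.collect() // Array(hello, world, hi)
```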
map(func) transformation returns a new distributed dataset formed by passing each element of the source through a function func.
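For example, squaring every element (sc assumed):

```scala
val nums    = sc.parallelize(Seq(1, 2, 3))
val squares = nums.map(x => x * x) // exactly one output element per input element
squares.collect() // Array(1, 4, 9)
```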
How To Install Apache Cassandra on CentOS 7. In the first step, install Java on the machine, and in the second step install Apache Cassandra.
In this post we will use textFile and wholeTextFiles in Apache Spark to read single or multiple text files into a single Spark RDD.
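A short sketch with hypothetical paths (sc assumed):

```scala
// textFile: one RDD element per line; accepts a file, directory, or glob
val lines = sc.textFile("/data/logs/*.txt")
// wholeTextFiles: one (fileName, fileContent) pair per file
val files = sc.wholeTextFiles("/data/logs")
```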
We can create an empty RDD in three ways: with SparkContext's emptyRDD method, with the parallelize method on an empty collection of type String, and as an empty pair RDD.
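One reading of those three variants in code (sc assumed):

```scala
val e1 = sc.emptyRDD[String]               // 1. emptyRDD method
val e2 = sc.parallelize(Seq.empty[String]) // 2. parallelize with an empty String collection
val e3 = sc.emptyRDD[(String, Int)]        // 3. empty pair RDD
```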
A Spark RDD can be created from a local collection, from a text file, by reading from a database, or by calling a transformation on an existing RDD.
In this post we will learn how to create Spark Resilient Distributed Dataset (RDD) using SparkContext's parallelize method.
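A minimal parallelize example, assuming an existing SparkSession named spark:

```scala
val sc  = spark.sparkContext
// Distribute a local collection across 3 partitions
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 3)
rdd.getNumPartitions // 3
```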
RDD stands for Resilient Distributed Dataset. It's a distributed dataset that has the capability to recover from failures.
Accumulators are shared variables provided by Spark that can be mutated by multiple tasks running in different executors.
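For example, counting bad records across tasks with a long accumulator (sc assumed):

```scala
val errorCount = sc.longAccumulator("errorCount")
sc.parallelize(Seq("ok", "error", "ok", "error")).foreach { rec =>
  if (rec == "error") errorCount.add(1) // tasks may only add to it
}
println(errorCount.value) // 2, readable only on the driver
```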
In this article we will dive into the basic concept of broadcast variables. At a very high level, a broadcast variable is a Spark shared variable.
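A minimal sketch: ship a lookup table to every executor once instead of with every task (sc assumed):

```scala
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
val names  = sc.parallelize(Seq(1, 2, 1)).map(id => lookup.value.getOrElse(id, "unknown"))
names.collect() // Array(one, two, one)
```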
Repartition and coalesce are ways to reshuffle the data in an RDD to create either more or fewer partitions.
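The practical difference in a sketch (sc assumed):

```scala
val rdd   = sc.parallelize(1 to 100, numSlices = 8)
val more  = rdd.repartition(16) // full shuffle; can increase or decrease partitions
val fewer = rdd.coalesce(2)     // avoids a full shuffle; can only decrease partitions
```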
Hi all, in this post I will show you how to install Spark and PySpark on CentOS.
In this video we will understand how to manipulate String columns in a DataFrame. For the demo we are using Spark 2.4 and the Scala language.
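A few representative string functions, assuming a SparkSession named spark and a hypothetical "name" column:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df  = Seq("alice", "bob").toDF("name")
val out = df.withColumn("upper", upper(col("name")))    // ALICE, BOB
            .withColumn("len", length(col("name")))     // 5, 3
            .withColumn("greeting", concat(lit("Hi, "), col("name")))
```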
In this video we will understand how to work with Avro data in Apache Spark. For the demo we are using Spark 2.4 and the Scala language.
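A minimal read/write sketch; on Spark 2.4 Avro support comes from the external spark-avro package, and the paths here are hypothetical:

```scala
// e.g. spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 ...
val events = spark.read.format("avro").load("/data/events.avro")
events.write.format("avro").save("/data/events_out")
```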