How to create a Spark DataFrame in different ways
How to create a Spark DataFrame in different ways: toDF, createDataFrame, spark.read.csv, spark.read.json, spark.read.avro.
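A minimal sketch of those creation paths, assuming a local SparkSession and hypothetical file paths:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("create-df").master("local[*]").getOrCreate()
import spark.implicits._

// 1. toDF on a local collection
val df1 = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// 2. createDataFrame from an RDD of Rows plus an explicit schema
val rows   = spark.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
val schema = StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))
val df2 = spark.createDataFrame(rows, schema)

// 3. Reading files (paths are hypothetical)
val csvDF  = spark.read.option("header", "true").csv("/data/people.csv")
val jsonDF = spark.read.json("/data/people.json")
val avroDF = spark.read.format("avro").load("/data/people.avro") // needs the spark-avro package on 2.4
```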
reduceByKey(func) converts a dataset of (K, V) pairs into a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function.
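For example, summing the values per key, assuming a SparkContext named sc:

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _) // combine all values that share a key
summed.collect() // Array((a,4), (b,2))
```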
mapPartitions and mapPartitionsWithIndex perform a map operation on an entire partition and return a new RDD.
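A small sketch of both, again assuming a SparkContext named sc:

```scala
val rdd = sc.parallelize(1 to 10, numSlices = 2)
// mapPartitions receives one iterator per partition
val doubled = rdd.mapPartitions(iter => iter.map(_ * 2))
// mapPartitionsWithIndex also receives the partition index
val tagged = rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x)))
```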
As per Apache Spark documentation, groupBy returns an RDD of grouped items where each group consists of a key and a sequence of elements.
groupByKey([numPartitions]) is called on a dataset of (K, V) pairs, and returns a dataset of (K, Iterable&lt;V&gt;) pairs.
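The difference in a small sketch (assuming a SparkContext named sc): groupBy derives the key from each element, while groupByKey works on existing (K, V) pairs:

```scala
// groupBy: the key is computed from each element
val words = sc.parallelize(Seq("apple", "banana", "avocado"))
val byFirstLetter = words.groupBy(_.head) // ('a', [apple, avocado]), ('b', [banana])

// groupByKey: elements are already (K, V) pairs
val pairs   = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.groupByKey() // (a, Iterable(1, 2)), (b, Iterable(3))
```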
As per the Apache Spark documentation, filter(func) returns a new dataset formed by selecting those elements of the source on which func returns true.
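For example, keeping only the even numbers (sc assumed):

```scala
val nums  = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0) // keep elements where the predicate is true
evens.collect() // Array(2, 4, 6, 8, 10)
```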
flatMap(func) is similar to map, but each input item can be mapped to 0 or more output items, so func should return a scala.collection.Seq rather than a single item.
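A typical use is splitting lines into words, where one line yields many output elements (sc assumed):

```scala
val lines = sc.parallelize(Seq("hello world", "hi"))
val words = lines.flatMap(_.split(" ")) // each line maps to 0..n words
words.collect() // Array(hello, world, hi)
```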
map(func) transformation returns a new distributed dataset formed by passing each element of the source through a function func.
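For example, squaring every element (sc assumed):

```scala
val nums    = sc.parallelize(Seq(1, 2, 3))
val squares = nums.map(x => x * x) // exactly one output element per input element
squares.collect() // Array(1, 4, 9)
```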
How To Install Apache Cassandra on CentOS 7. In the first step, install Java on the machine, and in the second step install Apache Cassandra.
In this post we will use textFile and wholeTextFiles in Apache Spark to read single or multiple text files into a single Spark RDD.
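A short sketch with hypothetical paths (sc assumed):

```scala
// textFile: one RDD element per line; accepts a file, directory, or glob
val lines = sc.textFile("/data/logs/*.txt")
// wholeTextFiles: one (fileName, fileContent) pair per file
val files = sc.wholeTextFiles("/data/logs")
```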
We can create an empty RDD in three ways: with SparkContext's emptyRDD method, with the parallelize method on an empty collection of type String, and as an empty pair RDD.
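One reading of those three variants in code (sc assumed):

```scala
val e1 = sc.emptyRDD[String]               // 1. emptyRDD method
val e2 = sc.parallelize(Seq.empty[String]) // 2. parallelize with an empty String collection
val e3 = sc.emptyRDD[(String, Int)]        // 3. empty pair RDD
```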
A Spark RDD can be created from a local collection, from a text file, by reading from a database, or by calling a transformation on an existing RDD.
In this post we will learn how to create Spark Resilient Distributed Dataset (RDD) using SparkContext's parallelize method.
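A minimal parallelize example, assuming an existing SparkSession named spark:

```scala
val sc  = spark.sparkContext
// Distribute a local collection across 3 partitions
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 3)
rdd.getNumPartitions // 3
```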
RDD stands for Resilient Distributed Dataset. It's a distributed dataset that has the capability to recover from failures.
Accumulators are shared variables provided by Spark that can be mutated by multiple tasks running in different executors.
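For example, counting bad records across tasks with a long accumulator (sc assumed):

```scala
val errorCount = sc.longAccumulator("errorCount")
sc.parallelize(Seq("ok", "error", "ok", "error")).foreach { rec =>
  if (rec == "error") errorCount.add(1) // tasks may only add to it
}
println(errorCount.value) // 2, readable only on the driver
```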
In this article we will dive into the basic concept of broadcast variables. At a very high level, a broadcast variable is a Spark shared variable.
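A minimal sketch: ship a lookup table to every executor once instead of with every task (sc assumed):

```scala
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
val names  = sc.parallelize(Seq(1, 2, 1)).map(id => lookup.value.getOrElse(id, "unknown"))
names.collect() // Array(one, two, one)
```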
Repartition and coalesce are ways to reshuffle the data in an RDD to create either more or fewer partitions.
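The practical difference in a sketch (sc assumed):

```scala
val rdd   = sc.parallelize(1 to 100, numSlices = 8)
val more  = rdd.repartition(16) // full shuffle; can increase or decrease partitions
val fewer = rdd.coalesce(2)     // avoids a full shuffle; can only decrease partitions
```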
Hi all, in this post I will show you how to install Spark and PySpark on CentOS.
In this video we will understand how to manipulate String columns in a DataFrame. For the demo we are using Spark 2.4 and the Scala language.
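A few representative string functions, assuming a SparkSession named spark and a hypothetical "name" column:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df  = Seq("alice", "bob").toDF("name")
val out = df.withColumn("upper", upper(col("name")))    // ALICE, BOB
            .withColumn("len", length(col("name")))     // 5, 3
            .withColumn("greeting", concat(lit("Hi, "), col("name")))
```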
In this video we will understand how to work with Avro data in Apache Spark. For the demo we are using Spark 2.4 and the Scala language.
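A minimal read/write sketch; on Spark 2.4 Avro support comes from the external spark-avro package, and the paths here are hypothetical:

```scala
// e.g. spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 ...
val events = spark.read.format("avro").load("/data/events.avro")
events.write.format("avro").save("/data/events_out")
```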