Apache Spark RDD reduceByKey transformation
reduceByKey(func) converts a dataset of (K, V) pairs into a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
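A minimal sketch of reduceByKey, assuming an existing SparkContext named sc (the sample data is made up):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val summed = pairs.reduceByKey(_ + _)          // aggregates values per key: RDD[(String, Int)]
    summed.collect().foreach(println)              // (a,4), (b,2)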
mapPartitions and mapPartitionsWithIndex perform a map operation on an entire partition at a time and return a new RDD.
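A minimal sketch of both operations, assuming an existing SparkContext named sc:

    val nums = sc.parallelize(1 to 6, numSlices = 3)
    // mapPartitions: receives an Iterator over a whole partition and returns an Iterator.
    val partitionSums = nums.mapPartitions(iter => Iterator(iter.sum))
    // mapPartitionsWithIndex: additionally receives the partition index.
    val tagged = nums.mapPartitionsWithIndex((idx, iter) => iter.map(n => (idx, n)))
    println(partitionSums.collect().mkString(", "))   // e.g. 3, 7, 11
    println(tagged.collect().mkString(", "))          // e.g. (0,1), (0,2), (1,3), ...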
As per the Apache Spark documentation, groupBy returns an RDD of grouped items, where each group consists of a key and a sequence of elements.
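A minimal sketch of groupBy, assuming an existing SparkContext named sc:

    val nums = sc.parallelize(1 to 10)
    // groupBy derives the key from each element; the result is RDD[(String, Iterable[Int])]
    val byParity = nums.groupBy(n => if (n % 2 == 0) "even" else "odd")
    byParity.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }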
groupByKey([numPartitions]) is called on a dataset of (K, V) pairs and returns a dataset of (K, Iterable&lt;V&gt;) pairs.
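A minimal sketch of groupByKey, assuming an existing SparkContext named sc:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val grouped = pairs.groupByKey()               // RDD[(String, Iterable[Int])]
    grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.toList}") }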
As per Apache Spark, filter(func) returns a new dataset formed by selecting those elements of the source on which func returns true.
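A minimal sketch of filter, assuming an existing SparkContext named sc:

    val nums = sc.parallelize(1 to 10)
    val evens = nums.filter(_ % 2 == 0)            // keeps elements where the predicate is true
    println(evens.collect().mkString(", "))        // 2, 4, 6, 8, 10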
flatMap(func) is similar to map, but each input item can be mapped to 0 or more output items, so func should return a scala.collection.Seq rather than a single item.
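A minimal sketch of flatMap, assuming an existing SparkContext named sc:

    val lines = sc.parallelize(Seq("hello spark", "hello scala"))
    val words = lines.flatMap(line => line.split(" "))   // each line yields zero or more words
    println(words.collect().mkString(", "))              // hello, spark, hello, scala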
The map(func) transformation returns a new distributed dataset formed by passing each element of the source through the function func.
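A minimal sketch of map, assuming an existing SparkContext named sc:

    val nums = sc.parallelize(1 to 5)
    val squares = nums.map(n => n * n)             // one output element per input element
    println(squares.collect().mkString(", "))      // 1, 4, 9, 16, 25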
In this post we will use textFile and wholeTextFiles in Apache Spark to read one or more text files into a single Spark RDD.
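A minimal sketch, assuming an existing SparkContext named sc; the file paths are placeholders:

    val lines = sc.textFile("data/file1.txt")                  // RDD[String], one element per line
    val multi = sc.textFile("data/file1.txt,data/file2.txt")   // several files into one RDD of lines
    val files = sc.wholeTextFiles("data/*.txt")                // RDD[(path, entire file content)]
    println(lines.count())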
We can create an empty RDD in three ways: with SparkContext's emptyRDD method, with parallelize on an empty collection (for example of type String), or as an empty pair RDD.
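A minimal sketch of the three approaches, assuming an existing SparkContext named sc:

    val e1 = sc.emptyRDD                                   // emptyRDD method: RDD[Nothing]
    val e2 = sc.parallelize(Seq.empty[String])             // empty RDD of type String
    val e3 = sc.parallelize(Seq.empty[(String, Int)])      // empty pair RDD
    println(s"${e1.isEmpty()} ${e2.isEmpty()} ${e3.isEmpty()}")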
A Spark RDD can be created from a local collection, from a text file, by reading from a database, or by calling a transformation on an existing RDD.
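A minimal sketch of the collection, file, and transformation cases, assuming an existing SparkContext named sc (the database case needs a JDBC setup and is omitted; the file path is a placeholder):

    val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))   // from a local collection
    val fromFile = sc.textFile("data/input.txt")           // from a text file
    val fromExisting = fromCollection.map(_ * 10)          // from an existing RDD via a transformation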
Scala Lists are similar to arrays, but there are two important differences. First, lists are immutable, which means elements of a list cannot be changed by assignment. Second, lists represent a linked list, whereas arrays are flat.
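A minimal sketch contrasting the two:

    val xs = List(1, 2, 3)
    // xs(0) = 9                    // does not compile: List elements cannot be reassigned
    val ys = 0 :: xs                // prepending builds a new linked list: List(0, 1, 2, 3)
    val arr = Array(1, 2, 3)
    arr(0) = 9                      // Arrays are mutable, flat blocks of elements
    println(s"$ys vs ${arr.mkString(",")}")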
Why is Scala getting popular? Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way.
This article on Scala Environment Setup will guide you through setting up Scala for your system. Scala can be installed on any UNIX-flavored or Windows-based system.
Regular expressions are strings that can be used to find patterns (or the lack thereof) in data. Any string can be converted to a regular expression using the .r method.
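A minimal sketch of the .r method (the sample pattern and text are made up):

    val datePattern = "\\d{4}-\\d{2}-\\d{2}".r             // String => scala.util.matching.Regex
    val text = "released 2023-06-15, patched 2023-07-01"
    println(datePattern.findAllIn(text).toList)            // List(2023-06-15, 2023-07-01)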
Scala can use Java classes like PrintWriter to write files. The Console class can be used to read input parameters, and the Source class to read from files.
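A minimal sketch, using a placeholder file name and scala.io.StdIn for console input (the current replacement for the deprecated Console read methods); error handling is omitted:

    import java.io.PrintWriter
    import scala.io.{Source, StdIn}

    val writer = new PrintWriter("output.txt")     // write a file with java.io.PrintWriter
    writer.println("hello from scala")
    writer.close()

    val source = Source.fromFile("output.txt")     // read a file with scala.io.Source
    source.getLines().foreach(println)
    source.close()

    val name = StdIn.readLine("Your name: ")       // read console input
    println(s"Hi, $name")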
Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way.
Problem: You want to start a Scala application with a main method, or provide the entry point for a script. Solution: There are two ways to create a launching point…
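A minimal sketch of both launching points (the object names are arbitrary):

    // 1. An object with an explicit main method.
    object HelloMain {
      def main(args: Array[String]): Unit =
        println("Hello from main")
    }

    // 2. An object that extends App; its body becomes the program.
    object HelloApp extends App {
      println("Hello from App")
    }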