Hi All, in this post I will show you how to install Spark and PySpark on CentOS.
Spark Installation
Prerequisite
Make sure Java is installed. To check the Java version, use the command below:
java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
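If Java is not present, install OpenJDK 8 first. A minimal sketch using yum, assuming the java-1.8.0-openjdk package name (it can vary across CentOS releases):

# package name assumed; verify with: yum search openjdk
sudo yum install -y java-1.8.0-openjdk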
Download the Spark version
Let's download the latest Spark release (3.0.0-preview2 at the time of writing) from an Apache mirror:
wget http://mirrors.gigenet.com/apache/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
Untar the distribution
Extract the archive into /opt and point a stable /opt/spark symlink at it, so future upgrades only require repointing the link:

tar -xzf spark-3.0.0-preview2-bin-hadoop3.2.tgz -C /opt
ln -s /opt/spark-3.0.0-preview2-bin-hadoop3.2 /opt/spark
ls -l /opt/spark
lrwxrwxrwx 1 root root 39 Jan 01 16:40 /opt/spark -> /opt/spark-3.0.0-preview2-bin-hadoop3.2
Export the Spark path to the .bashrc file
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
Execute .bashrc using the source command
source ~/.bashrc
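As a quick sanity check, confirm that the new variables are visible in the current shell:

echo $SPARK_HOME
which spark-shell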
Test the installation
Execute the start-master.sh script from the sbin directory of the Spark distribution:
$SPARK_HOME/sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ns510700.out
If the master started successfully, the log file will contain an INFO-level message like the one below:

Successfully started service
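You can also confirm the master from its web UI, which listens on port 8080 by default, and optionally attach a worker to it. A minimal sketch, assuming the master is reachable on the local hostname (note that in Spark 3.0 the worker script is named start-slave.sh; it was renamed to start-worker.sh in later releases):

# the master web UI should return HTTP 200 on the default port
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080
# attach a worker; check the exact master URL on the web UI first
$SPARK_HOME/sbin/start-slave.sh spark://$(hostname):7077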
How To Install PySpark
Install pyspark using pip.
pip install pyspark
If the installation succeeds, you should see a message like the following, depending upon your pyspark version:
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.4
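If you prefer to keep pyspark isolated from the system Python, you can run the same pip install inside a virtual environment. A sketch assuming Python 3 with the built-in venv module (the environment name pyspark-env is arbitrary):

python3 -m venv pyspark-env
source pyspark-env/bin/activate
pip install pyspark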
Add py4j-0.10.8.1-src.zip to PYTHONPATH
If you face the below-mentioned error:
Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
This usually means the pyspark installed by pip does not match the Spark version on the machine. It can be fixed by adding a PYTHONPATH entry to the .bashrc file that points to the pyspark and py4j sources bundled with the Spark distribution, as mentioned below:
echo 'export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip' >> ~/.bashrc
source ~/.bashrc
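The py4j version bundled with your distribution may differ from 0.10.8.1; list the lib directory and adjust the zip file name in the export above if needed:

ls $SPARK_HOME/python/lib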
Now invoke ipython, import pyspark, and initialize a SparkContext.
ipython
In [1]: from pyspark import SparkContext

In [2]: sc = SparkContext("local")
20/01/17 20:41:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
If you see the output above, it means pyspark is working fine.
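To go one step further, run a small job in the same session to confirm that the context can actually execute tasks (the RDD below is just an illustrative example):

In [3]: sc.parallelize(range(10)).sum()
Out[3]: 45

In [4]: sc.stop()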
Guys, we have successfully installed Spark and PySpark on CentOS. Happy Learning 🙂