Hi All, in this post I will show you how to install Spark and PySpark on CentOS.
Spark Installation
Prerequisite
Make sure Java is installed. To check the Java version, use the command below:
java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
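If Java is not present, install OpenJDK 8 first. A minimal sketch using yum, assuming the java-1.8.0-openjdk package name (it can vary across CentOS releases):

# package name assumed; verify with: yum search openjdk
sudo yum install -y java-1.8.0-openjdk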
Download the Spark version
Let's download the latest Spark release (3.0.0-preview2 at the time of writing) from an Apache mirror:
wget http://mirrors.gigenet.com/apache/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
Untar the distribution
Extract the archive into /opt and point a stable /opt/spark symlink at it, so future upgrades only require repointing the link:

tar -xzf spark-3.0.0-preview2-bin-hadoop3.2.tgz -C /opt
ln -s /opt/spark-3.0.0-preview2-bin-hadoop3.2 /opt/spark
ls -l /opt/spark
lrwxrwxrwx 1 root root 39 Jan 01 16:40 /opt/spark -> /opt/spark-3.0.0-preview2-bin-hadoop3.2
Export the Spark path to the .bashrc file
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
Execute .bashrc using the source command
source ~/.bashrc
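As a quick sanity check, confirm that the new variables are visible in the current shell:

echo $SPARK_HOME
which spark-shell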
Test the installation
Execute the start-master.sh script from the sbin directory of the Spark distribution:
$SPARK_HOME/sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ns510700.out
If the master started successfully, the log file will contain an INFO-level message like the one below:

Successfully started service
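You can also confirm the master from its web UI, which listens on port 8080 by default, and optionally attach a worker to it. A minimal sketch, assuming the master is reachable on the local hostname (note that in Spark 3.0 the worker script is named start-slave.sh; it was renamed to start-worker.sh in later releases):

# the master web UI should return HTTP 200 on the default port
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080
# attach a worker; check the exact master URL on the web UI first
$SPARK_HOME/sbin/start-slave.sh spark://$(hostname):7077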
How To Install PySpark
Install pyspark using pip.
pip install pyspark
If the installation succeeds, you should see a message like the following, depending upon your pyspark version:
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.4
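If you prefer to keep pyspark isolated from the system Python, you can run the same pip install inside a virtual environment. A sketch assuming Python 3 with the built-in venv module (the environment name pyspark-env is arbitrary):

python3 -m venv pyspark-env
source pyspark-env/bin/activate
pip install pyspark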
Add py4j-0.10.8.1-src.zip to PYTHONPATH
If you face the below-mentioned error:
Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
This usually means the pyspark installed by pip does not match the Spark version on the machine. It can be fixed by adding a PYTHONPATH entry to the .bashrc file that points to the pyspark and py4j sources bundled with the Spark distribution, as mentioned below:
echo 'export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.8.1-src.zip' >> ~/.bashrc
source ~/.bashrc
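The py4j version bundled with your distribution may differ from 0.10.8.1; list the lib directory and adjust the zip file name in the export above if needed:

ls $SPARK_HOME/python/lib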
Now invoke ipython, import pyspark, and initialize a SparkContext.
ipython
In [1]: from pyspark import SparkContext

In [2]: sc = SparkContext("local")
20/01/17 20:41:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
If you see the output above, it means pyspark is working fine.
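To go one step further, run a small job in the same session to confirm that the context can actually execute tasks (the RDD below is just an illustrative example):

In [3]: sc.parallelize(range(10)).sum()
Out[3]: 45

In [4]: sc.stop()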
Guys, we have successfully installed Spark and PySpark on CentOS. Happy Learning 🙂