python example.py on the cluster. Alternatively, you can install Jupyter Notebook on the cluster using Anaconda Scale. See the Installation documentation for more information.

First, install Anaconda:

$ brew cask install anaconda
Please see the resources section in case you face any issues with that step. Next, install Apache Spark:

$ brew install apache-spark
If Java is missing, PySpark will fail with an error on startup. Run:

$ brew cask install caskroom/versions/java8

to install Java 8 (you will not see this error if you already have it installed). Now launch PySpark:

$ pyspark

If you see the output below, it means that it has been installed properly:
At this step, I present the steps you have to follow in order to create Jupyter Notebooks automatically initialised with a SparkContext. First, check whether you have a ~/.bash_profile:

$ ls -a

If you don't have one, create it with:

$ touch ~/.bash_profile
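The check-and-create step above can be collapsed into one guarded command — a minimal sketch that only creates ~/.bash_profile if it is missing, so an existing file is never touched:

```shell
# create ~/.bash_profile only if it does not already exist
[ -f "$HOME/.bash_profile" ] || touch "$HOME/.bash_profile"

# confirm it is there
ls -a "$HOME" | grep '^\.bash_profile$'
```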
Check the installed Spark version and its location with:

$ brew info apache-spark
Open the file:

$ vim ~/.bash_profile

Press I to enter insert mode, and paste the following lines anywhere in the file (do NOT delete anything already in it). Press ESC to exit insert mode, then type :wq to save and quit Vim. You can find more Vim commands here. Finally, reload your profile:

$ source ~/.bash_profile
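For reference, the pasted lines typically look like the following — a sketch assuming a Homebrew install of Apache Spark; the version in the path (here 2.4.0) is a placeholder that you should replace with whatever `brew info apache-spark` reports:

```shell
# Tell PySpark where Spark lives (adjust the version to your install)
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.0/libexec
export PATH="$SPARK_HOME/bin:$PATH"

# Launch Jupyter Notebook automatically when you run `pyspark`
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```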
A note on pip install pyspark: I tried using pip to install pyspark, but I couldn't get the pyspark cluster to start properly. Reading several answers on Stack Overflow and the official documentation, I came across this:

The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Use spark.stop() to end the SparkSession. Let's understand the various settings that we defined above:

master: Sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.

config: Sets a config option by specifying a (key, value) pair.

appName: Sets a name for the application; if no name is set, a randomly generated name will be used.

getOrCreate: Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder. If an existing SparkSession is returned, the config options specified in this builder that affect the SQLContext configuration will be applied. This is because the SparkContext configuration cannot be modified at runtime (you have to stop the existing context first), while the SQLContext configuration can.