The third in a series of blogs from Anandraj Jagadeesan talks us through downloading and setting up Apache Spark on Windows 10, using the new Ubuntu environment.
About Spark
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Spark is very popular because it carries out parallel execution without the developer having to code for it explicitly, offers DSL support for popular languages, and provides rich APIs for various data processing scenarios and special needs such as SQL, machine learning, graph processing and streaming. Spark can consume data in various formats and from various sources, such as flat files, streaming data, Parquet, Avro, JDBC, etc. Spark applications are I/O intensive, and I have seen Spark working at its best when processing files from HDFS or S3, as these are free from the I/O bottlenecks of databases and streaming systems.
This article is about setting up Spark on Windows 10, using the new Ubuntu environment. To follow the instructions below, the prerequisites are Windows 10 build 15063 and Ubuntu Bash for Windows, which you can get by following my previous blogs.
Downloading Spark
Any version of Apache Spark can be downloaded from the Apache Spark downloads page (https://spark.apache.org/downloads.html). On that page choose the Spark release, the Hadoop version and a mirror, then click the download link to get the contents packaged as a .tgz file.
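If you prefer to stay on the command line, the same archive can also be pulled straight from the Apache archive with wget; the URL below is for the 2.2.0 / Hadoop 2.7 build used in the rest of this post, so adjust it for the release you picked:

wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz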
Unpacking Contents:
If you followed my previous blog post and installed Ubuntu Bash for Windows, open the Ubuntu console, change directory to the Downloads folder, and execute the command below, replacing the .tgz file name.
tar -xvzf <.tgz file name>
Below is an example of unpacking “spark-2.2.0-bin-hadoop2.7.tgz”.
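Assuming the archive was saved under the Windows Downloads folder, the commands would look roughly like this (the path is illustrative; substitute your own Windows username):

cd /mnt/c/Users/<your-username>/Downloads
tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz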
Java:
One of the prerequisites for running Spark is a recent version of Java. Check that Java 8 is installed on Ubuntu for Windows by issuing the version check on the console, as shown below:
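java -version

If Java 8 is installed, the output reports a 1.8.x version; the exact wording varies between OpenJDK and Oracle builds.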
If Java is not available, or an older version is installed, then issue the following commands to upgrade:
sudo apt-get update
sudo apt-get install default-jdk
Running Spark-Shell:
All prerequisites are now complete. Go to the Linux console, change to the unpacked Spark folder, then run
bin/spark-shell
The example below shows creating a Spark RDD and performing a calculation on it.
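Here is a minimal sketch of such a session; sc is the SparkContext that spark-shell creates for you, and the data and calculation are purely illustrative:

scala> val data = sc.parallelize(1 to 100)                      // distribute a local collection as an RDD
scala> val sumOfSquares = data.map(x => x * x).reduce(_ + _)    // square each element and sum the results
scala> println(sumOfSquares)

The last line prints 338350, the sum of the squares of 1 to 100.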
Development environment:
To develop Spark applications in Scala, set up the tools below.
- Install SBT
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823 sudo apt-get update sudo apt-get install sbt
- Get the Eclipse Scala IDE from http://scala-ide.org/
- In my GitHub account, I have set up a sample Spark project with all mandatory dependencies. It also has sbt plug-ins to generate the Eclipse project and build a fat jar, so the application can run without Spark being installed in the runtime environment. The steps to set it up are:
- Clone the project from https://github.com/anandrajj/spark-example
- Run the command “sbt eclipse” from the project root to generate the Eclipse project
- Import the project into Eclipse
- To run the application, issue “sbt assembly” to build the fat jar, then use spark-submit to run the application (a sample invocation is sketched at the end of this post). Learn how to use spark-submit here.
- For a Windows environment, set the HADOOP_HOME environment variable to run Spark from Eclipse. Download winutils.exe here.
The spark-example project also includes the features below.
- SparkApp – similar to Scala’s App. When extended by a Spark job, it creates the Spark context and SQL context, validates arguments, etc.
- A Scala app to query Redshift and Postgres databases.
- Convert CSV to Parquet
- Convert Parquet to CSV
- A sample word count example showing SparkApp in action.
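As an illustration of the spark-submit step mentioned above, an invocation of the fat jar built by “sbt assembly” might look roughly like this; the main class, jar path and input argument are placeholders rather than the actual names from the spark-example project:

bin/spark-submit --class com.example.WordCount --master local[*] target/scala-2.11/spark-example-assembly-1.0.jar input.txt

Here --class names the application's main class, --master local[*] runs it locally on all available cores, and the remaining arguments are the assembled jar and its input.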