Anvil

Spark data

When learning to use a new piece of software, it helps to have some data to work with. Here is an example for Apache Spark. I did this on OS X (El Capitan), but apart from the first step, the steps below should also work on a Linux cluster. I had already set up Hadoop in pseudo-distributed mode.

The example below uses weather data from the USA's National Oceanic and Atmospheric Administration (NOAA). The data consist of thousands of gzipped text files grouped into year folders from 1901 onwards.
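Each year folder on the server holds one gzipped file per weather station for that year. The names below are purely illustrative, just to show the shape of what gets downloaded:

pub/data/noaa/1901/010010-99999-1901.gz
pub/data/noaa/1901/010014-99999-1901.gz
pub/data/noaa/1902/010010-99999-1902.gz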

Install wget

Ignore this step if you are using Linux. curl cannot download a whole directory without writing looping code, so wget is the better tool here. The easiest way to install wget is via Homebrew, which in turn requires the Xcode Command Line Tools; assuming those are already installed, enter the following at the command line:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install wget
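To confirm that wget installed and is on the PATH:

wget --version | head -n 1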

Download data

I would start with a single year's worth of data, at least on a development machine; uncompressed, a recent year takes up about 20 GB. Change directory to a suitable location and enter the following, where YYYY is the chosen year:

wget -r ftp://ftp.ncdc.noaa.gov/pub/data/noaa/YYYY/
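By default the recursive download recreates the full remote directory tree (ftp.ncdc.noaa.gov/pub/data/noaa/YYYY/) under the current directory. If you would rather end up with just a YYYY folder, a variant along these lines should work; the --cut-dirs value is an assumption about how many leading path components to drop, so adjust it if the layout differs:

wget -r -np -nH --cut-dirs=3 ftp://ftp.ncdc.noaa.gov/pub/data/noaa/YYYY/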

Create a single uncompressed annual file

Loading thousands of small files into Hadoop is hardly efficient. To concatenate all the downloaded .gz files into a single uncompressed text file:

gunzip -cr download-directory-name > target-file-name.txt
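A couple of quick sanity checks on the result, using the placeholder names above:

find download-directory-name -name '*.gz' | wc -l   # how many station files were downloaded
wc -l target-file-name.txt                          # one weather observation per line
du -h target-file-name.txt                          # expect roughly 20 GB for a recent year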

Load into HDFS

Create a target folder and load the file:

hadoop fs -mkdir /target-directory
hadoop fs -copyFromLocal /local-file-system-path/target-file-name.txt /target-directory/target-file-name.txt
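A quick check that the file arrived and how much space it occupies in HDFS:

hadoop fs -ls /target-directory
hadoop fs -du -h /target-directory/target-file-name.txt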

Load into Spark

If just experimenting using the Scala REPL:

spark-shell
val myRDD = sc.textFile("hdfs://localhost/target-directory/target-file-name.txt")
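A few simple actions confirm that Spark can read the data. The temperature extraction is only a sketch: the substring offset assumes the ISD fixed-width record layout (air temperature in characters 88-92, sign included, in tenths of a degree Celsius, with +9999 meaning missing), so verify it against the format documentation before relying on it.

myRDD.count()                    // total observations for the year
myRDD.take(2).foreach(println)   // eyeball a couple of raw fixed-width records

// Rough maximum air temperature for the year, assuming the offsets described above
val temps = myRDD.map(_.substring(87, 92)).filter(_ != "+9999").map(_.toInt)
println(temps.max() / 10.0)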

Finally

Keep in mind how much memory you have before trying this on a laptop. Spark does cope, but spilling to disk somewhat defeats the object of an in-memory engine.
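If memory is tight, it helps to be explicit about what Spark is allowed to keep where. A minimal sketch using the standard RDD persistence API; it takes effect the next time an action runs:

import org.apache.spark.storage.StorageLevel

myRDD.persist(StorageLevel.MEMORY_ONLY)         // cache what fits in RAM, recompute the rest
// myRDD.persist(StorageLevel.MEMORY_AND_DISK)  // or allow partitions that do not fit to spill to disk

// The Storage tab of the Spark web UI (http://localhost:4040 while the shell is running)
// shows how much of the RDD is actually cached.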