When learning to use a new piece of software, it helps to have some data to work with. Here is an example for Apache Spark. I did this on OS X (El Capitan), but apart from the first step, the steps below should work on a Linux cluster. I had already set up Hadoop in pseudo-distributed mode.
The example below uses weather data from the USA's National Oceanic and Atmospheric Administration. The data consist of thousands of zipped text files grouped into year folders from 1901 onwards.
Ignore this step if you are using Linux. Curl cannot download a whole directory without looping code, so we need wget. The easiest way to install wget on OS X is via Homebrew, which in turn requires the Command Line Tools. Assuming you have already installed those, enter the following at the command line:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install wget
I would start with a year's worth of data, at least on a development machine. The uncompressed files for a more recent year take up about 20 GB. Change directory to a suitable location and enter the following, where YYYY is the year selected:
wget -r ftp://ftp.ncdc.noaa.gov/pub/data/noaa/YYYY/
Create a single uncompressed annual file
Loading thousands of small files into Hadoop is hardly efficient. To concatenate all the downloaded .gz files into one text file:
gunzip -cr download-directory-name > target-file-name.txt
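Before running this on 20 GB of downloads, it is worth satisfying yourself that gunzip really will recurse and concatenate to a single file. The throwaway sketch below stands in for the NOAA download with two tiny gzipped files (all names here are illustrative):

```shell
# Create a stand-in download directory with two gzipped files.
mkdir -p demo-download
echo "record one" | gzip > demo-download/a.gz
echo "record two" | gzip > demo-download/b.gz

# -c writes decompressed output to stdout (leaving the .gz files in
# place), and -r recurses into the directory, so redirecting stdout
# concatenates everything into one text file.
gunzip -cr demo-download > demo-year.txt

cat demo-year.txt   # both records, one per line (order may vary)
```

The same pattern scales to the real download directory; the originals are kept, so you will briefly need disk space for both the compressed and uncompressed copies.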
Load into HDFS
Create a target folder and load the file:
hadoop fs -mkdir /target-directory
hadoop fs -copyFromLocal /local-file-system-path-and-name.txt /hdfs-path-and-name.txt
Load into Spark
If you are just experimenting, use the Scala REPL:

spark-shell
val myRDD = sc.textFile("hdfs://localhost/hdfs-path-and-name.txt")
Keep in mind how much memory you have before doing this on a laptop. Spark does cope with datasets larger than memory, but spilling to disk somewhat defeats the object of an in-memory engine.
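Each line of the loaded RDD is one fixed-width weather record. Before writing any Spark transformations, it is worth checking that you can pick fields out of a line. The sketch below assumes NOAA's Integrated Surface Data layout, with the air temperature (in tenths of a degree Celsius) at characters 88-92 and a quality code at character 93; those positions are my assumption and should be verified against the format documentation published alongside the data. The record itself is synthetic filler, not real data:

```shell
# Build a synthetic 93-character record: 87 filler characters, then the
# temperature field "+0123" (12.3 degrees C) and quality code "1".
# Field positions assume the ISD layout and are worth double-checking.
record="$(printf '0%.0s' $(seq 87))+01231"

echo "$record" | cut -c88-92   # temperature field: +0123
echo "$record" | cut -c93      # quality code: 1
```

The same substring logic translates directly into a map over the RDD once you are happy the positions are right.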