Anvil

Star Schema Benchmark dataset

You may be familiar with TPC benchmarks. These are provided by the eponymous organisation, "a non-profit corporation founded to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry". One of their benchmarks is TPC-H, criticism of which led to the development of the Star Schema Benchmark. The SSB is designed to measure data warehouse performance; the details are set out in a paper from the University of Massachusetts.

The SSB consists of a single fact table and four dimension tables. Generic table definitions are contained in the University of Massachusetts paper mentioned above. These must be adjusted for the database software to be evaluated; I will provide an example for MapD in a later post. The paper also contains the queries used to assess performance: again, adjust as required.

The first step, though, is to generate some data, and one of the great things about the benchmark is that it provides a mechanism to do so. This involves downloading some source files, building the data-generation program and running it with parameters that determine the desired data volumes. Doing this in the cloud seems sensible, as few of us have idle servers lying around.

Prerequisites

What follows is for AWS (Amazon Web Services). Any local commands are for macOS; the remote ones are for Red Hat derivatives. I assume that you have:

  1. An AWS account.
  2. Saved the private key (.pem file) for an Amazon EC2 key pair locally (see the note on file permissions after this list).
  3. Created an EC2 security group allowing inbound SSH access via TCP on port 22 for your local machine's IP address.
  4. An IAM (non AWS root) user to execute the steps below (optional but preferred).
  5. Saved the access key details for that user.
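A note on item 2: SSH will refuse to use a private key file that is readable by other users, so it is worth restricting its permissions before connecting. A minimal example, assuming the key was saved as ~/.ssh/my_key.pem (the path used when connecting later):

# owner read-only; ssh rejects keys with looser permissions
chmod 400 ~/.ssh/my_key.pem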

AWS CLI

Ensure that the AWS Command Line Interface is installed locally. I would ignore the AWS instructions and use Homebrew. Ensure that a version of Python is installed beforehand.

$ python --version
Python 2.7.10
$ python3 --version
Python 3.6.3
$ brew install awscli

Configure the AWS CLI for an AWS user. I have created IAM groups and users in the AWS Management Console (web portal) to avoid using my AWS root account (the one for paying the bills).

$ aws configure
AWS Access Key ID [None]: an_IAM_key_id
AWS Secret Access Key [None]: related_IAM_access_key
Default region name [None]: eu-west-2
Default output format [None]: json

Check it works (you should see a table of regions):

aws ec2 describe-regions --output table

Configure a server

Set up a server on which to build the program and generate the files. I used an m4.16xlarge, which has 64 vCPUs. Using Amazon Linux is convenient as the AWS command line tools are already installed. Do not include the ".pem" extension when specifying the key pair.

aws ec2 run-instances \
--image-id ami-e3051987 \
--instance-type m4.16xlarge \
--count 1 \
--security-group-ids my_securitygroup_id \
--key-name my_key_pair \
--region eu-west-2 --query 'Instances[0].InstanceId'
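
The command above returns the new instance ID. You will also need the server's public IP address in order to connect; it can be looked up once the instance is running. A sketch, using a hypothetical instance ID of i-0abc123 (substitute the one returned above):

aws ec2 describe-instances \
--instance-ids i-0abc123 \
--region eu-west-2 \
--query 'Reservations[0].Instances[0].PublicIpAddress' \
--output text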

Connect to the server as the default ec2-user account:

ssh -i ~/.ssh/my_key.pem ec2-user@public_ip

Check that the AWS CLI is present on the server and then configure it. It will prompt for four pieces of information: the AWS Access Key ID and AWS Secret Access Key are your account credentials. Access keys are created using the AWS Management Console. Amazon recommends using IAM access keys rather than AWS account root ones. The AWS CLI is required for saving data to S3 later.

$ aws --version
aws-cli/1.11.180 Python/3.6.3
$ aws configure
AWS Access Key ID [None]: an_IAM_key_id
AWS Secret Access Key [None]: related_IAM_access_key
Default region name [None]: eu-west-2
Default output format [None]: json

Install the GNU Compiler Collection and other tools:

sudo yum update
sudo yum install gcc
sudo yum install nano
sudo yum install wget
sudo yum install unzip

Check disk space

Ensure that you have enough disk space to create the SSB files (just attaching an EBS volume does not make it accessible). View your available disk devices and free disk space:

lsblk
df -h

In the list block devices example below, xvdb is attached but has no file system and is not mounted.

NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk 
└─xvda1 202:1    0    8G  0 part /
xvdb    202:16   0  1.5T  0 disk

To correct this, create a file system on the volume, a mount point directory and mount the former on the latter:

sudo mkfs -t ext4 /dev/xvdb
sudo mkdir /data
sudo mount /dev/xvdb /data
sudo chmod 777 /data

To mount this EBS volume automatically on every system reboot, add an entry for the device to the /etc/fstab file. The AWS documentation on making EBS volumes available covers this in more detail.
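
One possible entry, assuming the device and mount point used above:

# device     mount point  fs type  options          dump  fsck order
/dev/xvdb    /data        ext4     defaults,nofail  0     2

The nofail option stops the boot from hanging if the volume is ever missing.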

Build the SSB program

Download the SSB files and unzip them:

wget http://www.cs.umb.edu/~poneil/dbgen.zip -P ~/downloads
cd ~/downloads
unzip dbgen.zip

Check the default settings, which are "DATABASE=DB2 MACHINE=LINUX WORKLOAD=SSBM". The database setting does not seem to matter; leave the workload as it is.

cd dbgen
nano makefile
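
If you only want to inspect the defaults rather than edit them, something along these lines (searching for the variable names quoted above) will do:

grep -E 'DATABASE|MACHINE|WORKLOAD' makefile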

Build the dbgen program:

make

Generate the dataset

A "tbl" file is created for each table. One command ("dbgen -s 1000 -T a") can create the lot. "-s" is the database scale factor (where 1 represents 1 GB of data); "-T" refers to the table. I thought it prudent do so individually. This creates the customer, part, supplier, date and lineorder files respectively.

Copy the dbgen files to wherever you have sufficient space to generate the files, and execute the program from there specifying the full path:

cp /home/ec2-user/downloads/dbgen/* /data
cd /data
/data/dbgen -s 1000 -T c
/data/dbgen -s 1000 -T p
/data/dbgen -s 1000 -T s
/data/dbgen -s 1000 -T d
/data/dbgen -s 1000 -T l
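
At scale factor 1000 the lineorder file in particular is very large, so it is worth checking that all five files were produced and that the volume has not filled up:

ls -lh /data/*.tbl
df -h /data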

Save the dataset

If you have attached a separate EBS volume to your instance, it can be detached explicitly (unmount it and then detach the volume), or it will be detached when the instance is terminated. Provided the volume is not set to delete on termination, any data saved to it will be preserved after shutting down the server.
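
For the explicit route, a sketch assuming the mount point used earlier and a hypothetical volume ID of vol-0abc123:

sudo umount /data
aws ec2 detach-volume --volume-id vol-0abc123 --region eu-west-2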

To minimise costs, I will save the files to S3 instead. From the EC2 server:

aws s3 cp table_name.tbl s3://bucket_name/folder_name/table_name.tbl
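
This assumes the target bucket already exists; if not, create it first (bucket names must be globally unique):

aws s3 mb s3://bucket_name --region eu-west-2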

Terminate the server

Don't forget to terminate the instance.

aws ec2 terminate-instances --instance-ids instance_id
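
If you have lost track of the instance ID, it can be recovered (and the shutdown confirmed) with something like:

aws ec2 describe-instances \
--region eu-west-2 \
--query 'Reservations[].Instances[].[InstanceId,State.Name]' \
--output table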