Star Schema Benchmark dataset
You may be familiar with TPC benchmarks. These are provided by the eponymous organisation, "a non-profit corporation founded to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry". One of their benchmarks is TPC-H, criticism of which led to the development of the Star Schema Benchmark. The SSB is designed to measure data warehouse performance: details here.
The SSB consists of a single fact table and four dimension tables. Generic table definitions are contained in the University of Massachusetts paper linked to above. These must be adjusted for the database software to be evaluated. I will provide an example for MapD in a later post. The paper also contains the queries used to assess performance: again, adjust as required.
The first step though is to generate some data, which is one of the great things about the standard, as a mechanism is provided to do so. This involves downloading some program files, building the program and running it with parameters to determine the desired data volumes. Doing this in the Cloud seems sensible as not many of us have servers available lying idle.
What follows is for AWS (Amazon Web Services). Any local commands are for macOS; the remote ones are for Redhat derivatives. I assume that you have:
- An AWS account.
- Saved the private key (.pem file) for an Amazon EC2 key pair locally.
- Created an EC2 security group allowing inbound SSH access via TCP on port 22 for your local machine's IP address.
- An IAM (non AWS root) user to execute the steps below (optional but preferred).
- Saved the access key details for that user.
Ensure that the AWS Command Line Interface is installed locally. I would ignore the AWS instructions and use Homebrew. Ensure that a version of Python is installed beforehand.
$ python --version
Python 2.7.10
$ python3 --version
Python 3.6.3
$ brew install awscli
Configure the AWS CLI for an AWS user. I have created IAM groups and users in the AWS Management Console (web portal) to avoid using my AWS root account (the one for paying the bills).
$ aws configure
AWS Access Key ID [None]: an_IAM_key_id
AWS Secret Access Key [None]: related_IAM_access_key
Default region name [None]: eu-west-2
Default output format [None]: json
Check it works (you should see a table of regions):
aws ec2 describe-regions --output table
Configure a server
Set up a server to build the program and store the generated files. I used an m4.10xlarge with 40 vCPUs. Using Amazon Linux is convenient as the AWS command line tools are already installed. Do not include the ".pem" extension for the key pair name.
aws ec2 run-instances --image-id ami-e3051987 --instance-type m4.10xlarge --count 1 --security-group-ids my_securitygroup_id --key-name my_key_pair --region eu-west-2 --query 'Instances[0].InstanceId'
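The command above returns the new instance's id. As a convenience, the `wait` and `describe-instances` subcommands can be used to pause until the instance is running and to look up the public IP needed for the SSH step below. The instance id here is a placeholder; substitute the one returned by run-instances.

```shell
# Hypothetical instance id returned by run-instances
INSTANCE_ID=i-0123456789abcdef0

# Block until the instance reaches the "running" state
aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"

# Print the public IP address to connect to
aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
```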
Connect to the server using its default ec2-user account:
ssh -i ~/.ssh/my_key.pem ec2-user@public_ip
Check that the AWS CLI is present on the server, then configure it. It will prompt for four pieces of information: the AWS Access Key ID and AWS Secret Access Key are your account credentials. Access keys are created using the AWS Management Console. Amazon recommends using IAM access keys instead of AWS account root ones. The AWS CLI is required for saving data to S3 later.
$ aws --version
aws-cli/1.11.180 Python/3.6.3
$ aws configure
AWS Access Key ID [None]: an_IAM_key_id
AWS Secret Access Key [None]: related_IAM_access_key
Default region name [None]: eu-west-2
Default output format [None]: json
Install the GNU Compiler Collection and other tools:
sudo yum update
sudo yum install gcc
sudo yum install nano
sudo yum install wget
sudo yum install unzip
Check disk space
Ensure that you have enough disk space to create the SSB files (just attaching an EBS volume does not make it accessible). View your available disk devices and free disk space:
lsblk
df -h
In the lsblk (list block devices) example below, xvdb has no mount point:
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
xvdb    202:16   0  1.5T  0 disk
To correct this, create a file system on the volume, a mount point directory and mount the former on the latter:
sudo mkfs -t ext4 /dev/xvdb
sudo mkdir /data
sudo mount /dev/xvdb /data
sudo chmod 777 /data
To mount this EBS volume on every system reboot, add an entry for the device to the /etc/fstab file. Refer to this page for more information.
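For reference, a minimal /etc/fstab entry for the volume above might look like the following; this is an illustrative line, so adjust the device and mount point to match your own setup. The nofail option stops a missing volume from blocking the boot.

```
/dev/xvdb  /data  ext4  defaults,nofail  0  2
```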
Build the SSB program
Download the SSB files and unzip them:
wget http://www.cs.umb.edu/~poneil/dbgen.zip -P ~/downloads
cd ~/downloads
unzip dbgen.zip
Check the default settings, which are "DATABASE=DB2 MACHINE=LINUX WORKLOAD=SSBM". The database setting does not seem to matter; leave the workload as is.
cd dbgen
nano makefile
Build the dbgen program:
make
Generate the dataset
A ".tbl" file is created for each table. A single command ("dbgen -s 1000 -T a") can create the lot: "-s" is the database scale factor (where 1 represents 1 GB of data) and "-T" selects the table. I thought it prudent to do so individually. The commands below create the customer, part, supplier, date and lineorder files respectively.
Copy the dbgen files to wherever you have sufficient space to generate the files, and execute the program from there specifying the full path:
cp /home/ec2-user/downloads/dbgen/* /data
cd /data
/data/dbgen -s 1000 -T c
/data/dbgen -s 1000 -T p
/data/dbgen -s 1000 -T s
/data/dbgen -s 1000 -T d
/data/dbgen -s 1000 -T l
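dbgen gives little feedback while running, so a quick sanity check afterwards is worthwhile. The small helper below is my own addition, not part of the SSB kit: it reports the line count of each generated file, or flags any that are missing or empty.

```shell
# check_tbl: report the line count of each .tbl file passed in,
# or flag it if it is missing or empty.
check_tbl() {
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "$f: $(wc -l < "$f" | tr -d ' ') lines"
    else
      echo "$f: missing or empty"
    fi
  done
}

# Run against the five SSB tables in the current directory:
check_tbl customer.tbl part.tbl supplier.tbl date.tbl lineorder.tbl
```

At scale factor 1000 the lineorder file in particular should be by far the largest, so a tiny or missing file is an immediate sign that a run failed.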
Save the dataset
If you have attached an EBS volume to your instance, it can be detached explicitly (unmount it on the server first), or implicitly by terminating the instance. Either way, any data saved to it will be preserved after the server is shut down, provided the volume is not set to delete on termination.
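If you take the EBS route, the explicit detach can be done from your local machine with the AWS CLI once the volume has been unmounted on the server (`sudo umount /data`). The volume id below is a placeholder:

```shell
# Detach the (unmounted) EBS data volume from the instance
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
```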
To minimise costs, I will save the files to S3 instead. From the EC2 server:
aws s3 cp table_name.tbl s3://bucket_name/folder_name/table_name.tbl
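Rather than copying each table one at a time, all five files can be uploaded in a single command; the bucket and folder names here are placeholders, as above.

```shell
# Upload every .tbl file in /data to the bucket in one pass
aws s3 cp /data s3://bucket_name/folder_name/ --recursive --exclude "*" --include "*.tbl"
```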
Terminate the server
Don't forget to terminate the instance.
aws ec2 terminate-instances --instance-ids instance_id