Anvil

MapD performance using the Star Schema Benchmark

MapD, or more precisely MapD Core is "an in-memory, column store, SQL relational database that was designed from the ground up to run on GPUs. The parallel processing power of GPU hardware enables the MapD Core SQL engine to query billions of rows in milliseconds using standard SQL".1

A previous article discussed GPU databases in general; another how to generate and store Star Schema Benchmark data on Amazon S3. This one shows how to measure MapD database performance against that benchmark using Amazon Web Services.

Prerequisites

Any local commands are for macOS; the remote ones are for Ubuntu. I assume that you have:

  1. An AWS account.
  2. Saved the private key (.pem file) for an Amazon EC2 key pair locally.
  3. Remembered to restrict rights to it (chmod 400 ~/.ssh/my_key.pem)
  4. An IAM (non AWS root) user (optional but preferred).
  5. Saved the access key details for that user.

The private key is for SSH access to the MapD server. The IAM user and access key details will be for copying data from S3 to it.

Choose an AWS EC2 instance

MapD offers a free Community Edition, or a 21-day trial of the Enterprise Edition. The features are listed here, with links to the relevant AWS marketplace page. If choosing one of those, select a region and an EC2 instance type. A relatively cheap option is a p2.8xlarge, which has 8 GPUs and 488 GiB of RAM: P2s are described here.

The disadavantage of using one of the preconfigured instances is that the way they are configured may not be suitable. I found that the enterprise version had a RAID 0 array of four 25GiB disks, which was insufficient. Adding more disks of the same size to the array would not have been enough, and changing the MapD data directory to a new disk proved difficult. Instead I decided to start from scratch, using one of the new P3 instances with Tesla V100s:

Type GPUs GPU Memory vCPUs Memory Network Bandwidth EBS Bandwidth
p3.16xlarge 8 128 GB 64 488 GB 25 Gbps 14 Gbps

Avoid Amazon Linux when selecting an operating system; it is based on an old version of CentOS which is not supported by MapD. I chose an Ubuntu Deep Learning Base AMI (ami-9d10b9e4 in the Ireland region) with the CUDA 9 drivers already installed.

Configure the AWS EC2 instance

Launch you instance, and connect to it once it is up.

ssh -i ~/.ssh/my_key.pem ubuntu@public_ip

Check disk space

Ensure that you have enough disk space to (i) copy the SSB files to; and (ii) import into MapD. If not, create an EBS volume or two via the EC2 Dashboard or the AWS CLI and attach it/them to the MapD instance. Remember that attaching a volume does not make it available for use.

To view your available disk devices and free disk space:

lsblk
df -h

In the list block devices example below, xvdb and xvdc are not attached.

NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvdc    202:32   0  650G  0 disk 
xvda    202:0    0    8G  0 disk 
└─xvda1 202:1    0    8G  0 part /
xvdb    202:16   0 1000G  0 disk

One can check whether they even have a file system via (a return value of data means they do not):

sudo file -s /dev/xvdb

To correct this, create a file system on each volume and mount it to a directory:

sudo mkfs -t ext4 /dev/xvdb
sudo mkdir /data	
sudo mount /dev/xvdb /data
sudo chmod 777 /data
sudo mkfs -t ext4 /dev/xvdc
sudo mkdir /staging
sudo mount /dev/xvdc /staging
sudo chmod 777 /staging

Once done, lsblk should show something like:

NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvdc    202:32   0  650G  0 disk /staging
xvda    202:0    0    8G  0 disk 
└─xvda1 202:1    0    8G  0 part /
xvdb    202:16   0 1000G  0 disk /data

To mount the EBS volumes on every system reboot, add an entry for each device to the /etc/fstab file (refer to this page for more information). Remember to back up the existing file first.

sudo cp /etc/fstab /etc/fstab.orig
sudo file -s /dev/xvdb
944820fa-a5f2-429e-a05d-2c3c3d7b2e70
sudo file -s /dev/xvdc
5cf695e7-b886-4f44-9397-2fc68b24b12f
sudo nano /etc/fstab
UUID=944820fa-a5f2-429e-a05d-2c3c3d7b2e70  /data  ext4  fdefaults,nofail  0  2
UUID=5cf695e7-b886-4f44-9397-2fc68b24b12f  /staging  ext4  fdefaults,nofail  0  2
sudo mount -a

If you get errors after the last command, revert to the old file (sudo mv /etc/fstab.orig /etc/fstab)

Install prerequisites

If you used the same AMI, you will not need to install anything CUDA-related if you see something similar after running this command.

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   38C    P0    21W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   36C    P0    21W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   36C    P0    20W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   38C    P0    21W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   37C    P0    21W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   37C    P0    20W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   37C    P0    21W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P0    22W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Make sure that the server is otherwise up to date:

sudo apt update
sudo apt upgrade

If anything kernel-related has been updated, reboot from the EC2 portal, not from within Linux.

Install MapD

Create the mapd group and mapd user, who will be the owner of the MapD database.

sudo useradd -U mapd

Download the MapD software:

curl https://releases.mapd.com/ce/mapd-ce-cuda.list | sudo tee /etc/apt/sources.list.d/mapd.list
curl https://releases.mapd.com/GPG-KEY-mapd | sudo apt-key add -
sudo apt update
sudo apt install mapd

Set the environment variables:

nano ~/.bashrc
export MAPD_USER=mapd
export MAPD_GROUP=mapd
export MAPD_STORAGE=/data
export MAPD_PATH=/opt/mapd
source ~/.bashrc

Change storage directory ownership:

sudo chown -R $MAPD_USER $MAPD_STORAGE

Run the installer. The script creates a data directory in $MAPD_STORAGE with the directories mapd_catalogs, mapd_data, and mapd_export. The mapd_import and mapd_log directories are created when inserting data the first time.

cd $MAPD_PATH/systemd
sudo ./install_mapd_systemd.sh

Start MapD Core:

cd $MAPD_PATH
sudo systemctl start mapd_server
sudo systemctl start mapd_web_server

Enable MapD Core to start when the system reboots:

sudo systemctl enable mapd_server
sudo systemctl enable mapd_web_server

Check it works by importing some sample data:

sudo $MAPD_PATH/insert_sample_data

MapD provides a command-line interface called mapdql. By not specifying a database, one connects to the system database, also called mapd.

$MAPD_PATH/bin/mapdql -u mapd -p HyperInteractive

Run a query to get some data:

SELECT origin_city AS "Origin", dest_city AS "Destination", AVG(airtime) AS "Average Airtime" 
FROM flights_2008_10k 
WHERE distance < 175 GROUP BY origin_city, dest_city;

Create a MapD database

Create a new user and database, then disconnect from the mapd database:

create user ssb_tester (password = 'testing');
create database ssb_test (owner = 'ssb_tester');
\q

Connect to the new database:

$MAPD_PATH/bin/mapdql ssb_test -u ssb_tester -p testing

While connected to the new database, execute the SQL Statements below to create the tables.

CREATE TABLE customer(
c_custkey INT NOT NULL,
c_name VARCHAR(25),
c_address VARCHAR(25),
c_city VARCHAR(10),
c_nation_prefix INT,
c_nation CHAR(15),
c_region CHAR(12),
c_phone CHAR(15),
c_mktsegment CHAR(10));

CREATE TABLE dates(
d_datekey INT NOT NULL,
d_date CHAR(18),
d_dayofweek CHAR(8),
d_month CHAR(9),
d_year INT,
d_yearmonthnum INT,
d_yearmonth CHAR(7),
d_daynuminweek SMALLINT,
d_daynuminmonth SMALLINT,
d_daynuminyear SMALLINT,
d_monthnuminyear SMALLINT,
d_weeknuminyear SMALLINT,
d_sellingseason CHAR(12),
d_lastdayinweekfl BOOLEAN,
d_lastdayinmonthfl BOOLEAN,
d_holidayfl BOOLEAN,
d_weekdayfl BOOLEAN);

CREATE TABLE lineorder (
lo_orderkey BIGINT NOT NULL,
lo_linenumber INT,
lo_custkey INT,
lo_partkey INT,
lo_suppkey INT,
lo_orderdate INT,
lo_orderpriority CHAR(15),
lo_shippriority INT,
lo_quantity DOUBLE,
lo_extendedprice DOUBLE,
lo_ordtotalprice DOUBLE,
lo_discount DOUBLE,
lo_revenue INT,
lo_supplycost INT,
lo_tax DOUBLE,
lo_commitdate INT,
lo_shipmode CHAR(10));

CREATE TABLE part(
p_partkey INT NOT NULL,
p_name VARCHAR(22),
p_mfgr CHAR(6),
p_category CHAR(7),
p_brand1 CHAR(9),
p_color VARCHAR(11),
p_type VARCHAR(25),
p_size INT,
p_container CHAR(10));

CREATE TABLE supplier(
s_suppkey INT NOT NULL,
s_name CHAR(25),
s_address VARCHAR(25),
s_city VARCHAR(10),
s_nation_prefix SMALLINT,
s_nation CHAR(15),
s_region CHAR(12),
s_phone CHAR(15));

\t
\q

Import the SSB data

If you have followed a previous post, you will have five table files stored on S3. Copy these to the MapD server and import them into the database, which is the approach MapD recommends; one can import client-side files (\copy command in mapdql) but it is significantly slower.

First ensure the AWS CLI is installed; if not install it.

aws --version

Check that the AWS CLI is configured to work with an IAM user with sufficient rights to copy the S3 files:

aws configure
AWS Access Key ID [None]: an_IAM_key_id
AWS Secret Access Key [None]: related_IAM_access_key
Default region name [None]: eu-west-2
Default output format [None]: json

Now the files can be imported:

aws s3 cp s3://bucket_name/folder_name/table_name.tbl /folder_name/table_name.tbl
mapdql db_name -u username -p password
COPY table_name FROM 'table_name.tbl' WITH (delimiter = '|', header='false', max_reject=1);

In the event of errors, look in the logs.

ls -lh $MAPD_STORAGE/mapd_log

I did have problems with some of the tables due the pipe character at the end of each line. If you encounter the same issue, remove them using:

sed -i 's/.$//' filename.tbl

I managed to load:

TableRowsS3 size
customer2.7 GB
dates222 KB
lineorder6,000,000,000608 GB
part164 MB
supplier163 MB

Run the queries

Connect to the database and run the SSB queries.

# Q1.1

select sum(lo_extendedprice*lo_discount) as revenue 
from lineorder, dates
where lo_orderdate = d_datekey
and d_year = 1993
and lo_discount between 1 and 3 
and lo_quantity < 25;

# Q1.2

select sum(lo_extendedprice*lo_discount) as revenue 
from lineorder, dates
where lo_orderdate = d_datekey
and d_yearmonthnum = 199401 
and lo_discount between 4 and 6 
and lo_quantity between 26 and 35;

# Q1.3

select sum(lo_extendedprice*lo_discount) as revenue 
from lineorder, dates
where lo_orderdate = d_datekey
and d_weeknuminyear = 6
and d_year = 1994
and lo_discount between 5 and 7 
and lo_quantity between 26 and 35;

# Q2.1

select sum(lo_revenue), d_year, p_brand1 
from lineorder, dates, part, supplier
where lo_orderdate = d_datekey 
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey 
and p_category = 'MFGR#12' 
and s_region = 'AMERICA'
group by d_year, p_brand1 
order by d_year, p_brand1;

# Q2.2

select sum(lo_revenue), d_year, p_brand1 
from lineorder, dates, part, supplier 
where lo_orderdate = d_datekey
and lo_partkey = p_partkey 
and lo_suppkey = s_suppkey 
and p_brand1 between 'MFGR#2221' and 'MFGR#2228' 
and s_region = 'ASIA'
group by d_year, p_brand1 
order by d_year, p_brand1;

# Q2.3

select sum(lo_revenue), d_year, p_brand1 
from lineorder, dates, part, supplier where lo_orderdate = d_datekey
and lo_partkey = p_partkey 
and lo_suppkey = s_suppkey
and p_brand1 = 'MFGR#2221'
and s_region = 'EUROPE' 
group by d_year, p_brand1 
order by d_year, p_brand1;

# Q3.1

select c_nation, s_nation, d_year, sum(lo_revenue) as revenue 
from customer, lineorder, supplier, dates
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and c_region = 'ASIA' 
and s_region = 'ASIA' 
and d_year >= 1992 
and d_year <= 1997
group by c_nation, s_nation, d_year 
order by d_year asc, revenue desc;

# Q3.2

select c_city, s_city, d_year, sum(lo_revenue) as revenue 
from customer, lineorder, supplier, dates
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and c_nation = 'UNITED STATES'
and s_nation = 'UNITED STATES'
and d_year >= 1992 
and d_year <= 1997
group by c_city, s_city, d_year 
order by d_year asc, revenue desc;

# Q3.3

select c_city, s_city, d_year, sum(lo_revenue) as revenue 
from customer, lineorder, supplier, dates
where lo_custkey = c_custkey 
and lo_suppkey = s_suppkey 
and lo_orderdate = d_datekey 
and (c_city='UNITED KI1'or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and d_year >= 1992 
and d_year <= 1997 
group by c_city, s_city, d_year
order by d_year asc, revenue desc;

# Q3.4

select c_city, s_city, d_year, sum(lo_revenue) as revenue 
from customer, lineorder, supplier, dates
where lo_custkey = c_custkey 
and lo_suppkey = s_suppkey 
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and d_yearmonth = 'Dec1997' 
group by c_city, s_city, d_year order by d_year asc, revenue desc;

# Q4.1

select d_year, c_nation, sum(lo_revenue - lo_supplycost) as profit 
from dates, customer, supplier, part, lineorder
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_partkey = p_partkey
and lo_orderdate = d_datekey
and c_region = 'AMERICA'
and s_region = 'AMERICA'
and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by d_year, c_nation 
order by d_year, c_nation;

# Q4.2

select d_year, s_nation, p_category, sum(lo_revenue - lo_supplycost) as profit 
from dates, customer, supplier, part, lineorder
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_partkey = p_partkey
and lo_orderdate = d_datekey
and c_region = 'AMERICA'
and s_region = 'AMERICA'
and (d_year = 1997 or d_year = 1998) 
and (p_mfgr = 'MFGR#1'or p_mfgr = 'MFGR#2')
group by d_year, s_nation, p_category 
order by d_year, s_nation, p_category;

# Q4.3

select d_year, s_city, p_brand1, sum(lo_revenue - lo_supplycost) as profit
from dates, customer, supplier, part, lineorder 
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_partkey = p_partkey
and lo_orderdate = d_datekey
and c_region = 'AMERICA'
and s_nation = 'UNITED STATES' 
and (d_year = 1997 or d_year = 1998) 
and p_category = 'MFGR#14'
group by d_year, s_city, p_brand1 
order by d_year, s_city, p_brand1;