Actuarial Data Science

Bridging the Gap between Actuarial and Data Science

Tue 29 August 2017

How to share a database with your team mates

Posted by Pieter Marres in Articles   


Earlier this year, I had the opportunity to work with some data scientists on an algo trading project. Team members were scattered around the globe. In order to train and test several models, we wanted to share an existing local 20 GB database containing 10 years of historical data (1-minute bars) for more than 50 different currency pairs.

Objective

This post features a tutorial explaining how to migrate an existing local database to the Relational Database Service from Amazon (Amazon RDS):

  • each team member will be able to log in to the database with his/her own credentials;
  • each team member can run his/her own queries against the database;
  • each team member can access the database from both Python and R;
  • expenses for hosting the database will be modest;
  • the database administrator can grant access to new folks during the project, if needed.

About Amazon RDS

Amazon Relational Database Service (Amazon RDS) is a web service that makes it easier to set up, operate, and scale a relational database in the cloud. It provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks. Amazon RDS supports several database engines, such as Oracle, MySQL and PostgreSQL.

Approach

In this tutorial we will migrate a local PostgreSQL database named “currencies” to Amazon RDS. First, we will create an RDS instance on Amazon. Second, we will dump the local PostgreSQL database to a flat text file with SQL commands using the utility program pg_dump. Finally, we restore this dump file into the remote PostgreSQL database. Please keep in mind that dumping and restoring can take a while for large databases. For this tutorial we will migrate a simple two-table database:

(Figure: design of the two-table currency database)

The example database contains two years of historical prices (hourly bars) for seven currency pairs: AUDUSD (nicknamed “Aussie”), EURUSD (“Fiber”), GBPUSD (“Cable”), NZDUSD (“Kiwi”), USDCAD (“Loonie”), USDCHF (“Swissy”) and GBPJPY (“Gopher”).

Step #1: Create a DB Instance Running the PostgreSQL Database Engine

Keep in mind that you need to have an account with Amazon (AWS login username and AWS password). In order to sign up, you have to register your credit card number, but they won’t charge you until you exceed the Free Tier limits (among others, 20 GB of storage). You can create up to 40 (!) instances.

Now, go to Amazon and follow the instructions until you arrive at a screen like this for specifying the settings for your instance:

(Screenshot: specifying the settings for a new DB instance on Amazon)

In the screens that follow, you can simply select the default settings and hit “Launch DB Instance”. Please be patient! It can take a while until the DB Instance has been created, as the status moves from creating… to backing up… to available….
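If you prefer scripting this step over clicking through the console, the sketch below uses the boto3 SDK to the same effect; the region, instance class and identifiers are assumptions that should mirror whatever you would select in the console:

import boto3

# a minimal sketch, assuming the eu-west-1 region and a Free Tier
# eligible instance class; adjust the identifiers to your own setup
rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance(
    DBInstanceIdentifier="currency-db-instance",
    DBName="currencydb",
    Engine="postgres",
    DBInstanceClass="db.t2.micro",    # Free Tier eligible (assumption)
    AllocatedStorage=20,              # stays within the 20 GB Free Tier limit
    MasterUsername="masterusr",
    MasterUserPassword="your-password",
    PubliclyAccessible=True,          # required for TCP/IP access from outside AWS
)

# block until the status has moved from creating... to available...
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="currency-db-instance")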

Once the status of the DB instance has been updated to available, you can see the details of your instance by ticking its row in the Filter column and then hitting “Instance Actions” and “See Details”, resulting in a screen like this:

(Screenshot: DB instance details)

Take note of the following items as you’ll need them when uploading data to AWS in step #3:

  • endpoint (here: currency-db-instance.c1w9pflnbhj8.eu-west-1.rds.amazonaws.com);
  • port (here: 5432);
  • username (here: masterusr);
  • password (here: your-password);
  • database name (here: currencydb).

Technical note: to make this all work, the instance must be accessible from TCP/IP connections! This was, for me, the showstopper for using the (otherwise excellent) web host pythonanywhere.com, as they do not allow connecting to the database from TCP/IP. AWS, however, supports this feature if you choose the appropriate security group.
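Before moving on, you can quickly verify that the security group indeed accepts TCP/IP connections. The sketch below simply opens a raw socket to the endpoint and raises an error if port 5432 is unreachable:

import socket

# try to reach the RDS endpoint on the PostgreSQL port; a timeout or
# connection error here points at a misconfigured security group
host = "currency-db-instance.c1w9pflnbhj8.eu-west-1.rds.amazonaws.com"
with socket.create_connection((host, 5432), timeout=5):
    print("TCP connection to", host, "port 5432 succeeded")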

Step #2: Generate a dump from your local database

Fire up your terminal, cd to your PostgreSQL folder and run the pg_dump statement:

$ cd /var/lib/postgresql/9.3
$ pg_dump -U Pieter currencies > /dir/where/to/store/dump/dbexport.pgsql

Be patient. Making a dump from a large database can take a while!
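If you would rather script this step from Python (for example, to schedule regular dumps), a minimal sketch is shown below; it assumes pg_dump is on your PATH and that the password is supplied via ~/.pgpass, since pg_dump would otherwise prompt for it interactively:

import subprocess

# call pg_dump exactly as on the command line; check=True raises a
# CalledProcessError if the dump fails
subprocess.run(
    ["pg_dump", "-U", "Pieter",
     "-f", "/dir/where/to/store/dump/dbexport.pgsql",
     "currencies"],
    check=True,
)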

Step #3: Restore the dump file to the AWS remote database

From now on, we will use the psql command line (see this post for a survival kit). Open the terminal and log in to your AWS instance as the master user:

$ psql --host=currency-db-instance.c1w9pflnbhj8.eu-west-1.rds.amazonaws.com --port=5432 --username=masterusr --password --dbname=currencydb
Password for user masterusr: <your_password>

Now comes the exciting part. We need to restore the dump into AWS:

currencydb=> \i /home/pieter/Desktop/dbexport.pgsql

Again, be patient! Restoring a 20 GB dump file took about an hour on my laptop. I got some messages on the command line, like “role postgres does not exist”, but they seem to be harmless as the process completed successfully.
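To double-check that the restore went through, you can compare row counts between the local and the remote database. A minimal sketch for the remote side, using the tbl_currencies table from our example database:

import psycopg2

# connect to the remote database as master user and count the restored rows
con = psycopg2.connect(database="currencydb",
                       user="masterusr",
                       password="your-password",
                       host="currency-db-instance.c1w9pflnbhj8.eu-west-1.rds.amazonaws.com",
                       port="5432")
cur = con.cursor()
cur.execute("SELECT count(*) FROM tbl_currencies")
print("rows restored:", cur.fetchone()[0])
con.close()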

Step #4: Create roles for your team mates

You are the master user, with the credentials (username and password) defined in step #1. As master user, you have broad rights (create/drop tables, insert/delete records, create/alter users, etc.). Now you want to give your team mates – let’s call them John, Mary and Jennifer – access to the database. You want to grant each of them the same permission, namely selecting records from the database. From the psql command line, you can define three individual roles that inherit their permission (select only) from the group role named “myteam”, like so:

-- myteam is a group role, having by convention no login
CREATE ROLE myteam;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO myteam;

-- each team member gets an individual login role that inherits from myteam
CREATE ROLE john LOGIN INHERIT;
ALTER ROLE john PASSWORD 'pw_john';
GRANT myteam TO john;

CREATE ROLE mary LOGIN INHERIT;
ALTER ROLE mary PASSWORD 'pw_mary';
GRANT myteam TO mary;

CREATE ROLE jennifer LOGIN INHERIT;
ALTER ROLE jennifer PASSWORD 'pw_jennifer';
GRANT myteam TO jennifer;

Technical note: Why did we define a team role before granting permissions to our team mates? The advantage of having a team role in place is that you can now modify the permissions for all team members at once by just changing the permissions for “myteam”. For example, some day you may decide to upgrade access for all team members from just selecting records (GRANT SELECT) to also modifying records (GRANT INSERT).
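You can verify that the permissions behave as intended by connecting as one of the team members. In the sketch below (assuming the roles defined above), the SELECT succeeds while the INSERT is refused:

import psycopg2

# connect as a team member rather than as the master user
con = psycopg2.connect(database="currencydb",
                       user="john",
                       password="pw_john",
                       host="currency-db-instance.c1w9pflnbhj8.eu-west-1.rds.amazonaws.com",
                       port="5432")
cur = con.cursor()

# SELECT is inherited from the myteam group role, so this succeeds
cur.execute("SELECT count(*) FROM tbl_currencies")
print("john can read", cur.fetchone()[0], "rows")

# INSERT was never granted to myteam, so PostgreSQL refuses it
try:
    cur.execute("INSERT INTO tbl_currencies DEFAULT VALUES")
except psycopg2.Error as e:
    print("john cannot write:", e.pgerror)
    con.rollback()

con.close()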

Step #5: Connect to your database from Python and R

Each team member can now connect to the database from his/her favorite programming environment:

In Python:

import psycopg2
import pandas as pd

# open a connection to the remote database with your personal credentials
con = psycopg2.connect(database='currencydb',
                       user='TEAM_MEMBER_USERNAME',
                       password='TEAM_MEMBER_PASSWORD',
                       host='currency-db-instance.c1w9pflnbhj8.eu-west-1.rds.amazonaws.com',
                       port='5432')

# fetch the query result straight into a pandas DataFrame
df = pd.read_sql("SELECT * FROM tbl_currencies", con)
con.close()
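As a side note, pandas also accepts an SQLAlchemy engine, which it recommends for databases other than SQLite; an equivalent sketch with the same (placeholder) credentials:

import pandas as pd
from sqlalchemy import create_engine

# the connection URL bundles user, password, endpoint, port and database name
engine = create_engine(
    "postgresql+psycopg2://TEAM_MEMBER_USERNAME:TEAM_MEMBER_PASSWORD"
    "@currency-db-instance.c1w9pflnbhj8.eu-west-1.rds.amazonaws.com:5432/currencydb")
df = pd.read_sql("SELECT * FROM tbl_currencies", engine)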

In R:

require("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 dbname='currencydb',
                 user='TEAM_MEMEBER_USERNAME',
                 password='TEAM_MEMBER_PASSWORD',
                 host='currency-db-instance.c1w9pflnbhj8.eu-west-1.rds.amazonaws.com',
                 port='5432')

df <- dbGetQuery(con, "SELECT * FROM tbl_currencies")
dbDisconnect(con)

Try it yourself

Now you should be ready to distribute the credentials (username and password) among your team mates, thereby granting them access to all data in your database.

Do you prefer trying first by logging into the example database discussed in this post? Feel free to drop me a line and I’ll be happy to provide you with its credentials!


 
