How to import a CSV into a PostgreSQL database running in Docker
Why
There is a lot of data around us. We collect it from Google Analytics, from surveys, or we get it from Kaggle, thanks to the many generous people sharing datasets there.
We can use these datasets to understand the trends and patterns of the world around us, e.g. salaries in SF or the Happiness Report by country.
For my purposes, I want a database populated with data so that I can run a caching experiment: build a small app that makes requests to the database and compare the request counts with and without caching.
But first, I need a database populated with data.
This post is an account of how I populated a database with a CSV dataset downloaded from Kaggle.
Process
- Download the CSV file from Kaggle
- Create a Docker database with schema
- Import the CSV into the database
The process is pretty straightforward, with one small tricky part: the CSV needs to be placed in the mounted volume so that the database inside Docker can access it.
1. Download the CSV file from Kaggle
There are tens of thousands of datasets available on Kaggle, from images to neurolinguistics. To keep myself interested and engaged, I chose the MyAnimeList dataset so that I could also learn about popular anime.
2. Create a Docker database
Before importing the CSV file into the database, we need to create a schema that matches the columns in the CSV. To create the schema during database creation, we can write a .sql script and have it run in the initialization phase.
-- setup.sql
-- There are more columns in AnimeList.csv, but I am using a simplified dataset for now.
-- The data may need to be cleaned further before import.
CREATE TABLE anime (
  anime_id INT PRIMARY KEY,
  title TEXT
);
Then, in the Dockerfile, place setup.sql inside the /docker-entrypoint-initdb.d/ folder so that the schema is created during the initialization phase. Note that scripts in that folder only run when the data directory is empty, i.e. on the very first start of the database.
# Dockerfile
FROM postgres:alpine
COPY *.sql /docker-entrypoint-initdb.d/
ADD setup.sql /docker-entrypoint-initdb.d
RUN chmod a+r /docker-entrypoint-initdb.d/*
EXPOSE 6666
After we create the Dockerfile and setup.sql, we can build the Docker image:
- myrepo: your namespace on Docker Hub
- mypostgres: the name you give to the database image
- 2: the image tag
docker build -t myrepo/mypostgres:2 .
Then start an instance of the database:
- -v: map the persistent volume path between the host and the container
- -e: supply the environment variables
- -p: map container port 5432 to host port 6666
- --rm: remove the container when it is stopped (the data persists in the mounted volume)
docker run --rm --name mydb -v $HOME/Sync/animeDB/pgdata:/var/lib/postgresql/data -e POSTGRES_PASSWORD=1234 -e POSTGRES_DB=your_database -p 6666:5432 myrepo/mypostgres:2
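Before moving on, it is worth checking in another terminal that the container is up and that the init script ran. This is just an optional sanity check, assuming the container name mydb from the command above:
# The container should be listed with ports 0.0.0.0:6666->5432/tcp
docker ps
# On the very first start, the logs should mention running /docker-entrypoint-initdb.d/setup.sql
docker logs mydb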
Inside the local folder $HOME/Sync/animeDB, you can see that a pgdata folder gets created, full of auto-generated PostgreSQL data. We need to place the CSV file inside the pgdata folder so that your_database can read it from the /var/lib/postgresql/data folder inside Docker.
mv {PATH_TO}/AnimeList.csv pgdata/
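To make sure the tricky part worked, i.e. that the CSV placed in the mounted volume is actually visible inside the container, you could list it from within Docker (again assuming the container name mydb):
# The file should show up under the container's data directory
docker exec mydb ls /var/lib/postgresql/data/AnimeList.csv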
3. Import the CSV into the database
Then we can start the psql client on the Docker database:
- -U: the user to connect as
- -d: the database we are importing the CSV into. Since we created the schema in your_database, we will import the CSV there as well.
# Run psql inside the container
docker exec -it mydb psql -U postgres -d your_database
-- Import the CSV
\copy anime FROM '/var/lib/postgresql/data/AnimeList.csv' DELIMITER ',' CSV HEADER;
Now if you see a message like COPY 2 in the terminal (the number is how many rows were copied), you know the import was successful.
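As an extra check, you could count the imported rows directly through psql without opening an interactive session (still assuming the container name mydb and the database your_database):
# The count should match the number reported by the COPY message
docker exec mydb psql -U postgres -d your_database -c 'SELECT COUNT(*) FROM anime;'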
You can also connect to the database at postgres://postgres:1234@localhost:6666/your_database with a client such as Sqlectron, and confirm that all the data has been imported correctly!
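If you have psql installed on the host (which this post does not cover, so treat it as an optional extra), the same connection string works on the command line as well:
# Peek at a few imported rows from the host machine
psql postgres://postgres:1234@localhost:6666/your_database -c 'SELECT anime_id, title FROM anime LIMIT 5;'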