How to import a CSV into a PostgreSQL database running in Docker
Why
There is a lot of data around us. We collect it from Google Analytics, from surveys, or we get it from Kaggle, thanks to the many generous people sharing datasets there.
We can use these datasets to understand the trends and patterns of the world around us, e.g. salaries in SF or the Happiness Report by country.
For my purposes, I want a database populated with data so that I can run a caching experiment: build a small app that makes requests to the database and compare the request counts with and without caching.
But first, I need a database populated with data.
This post is an account of how I populated a database with a CSV dataset downloaded from Kaggle.
Process
- Download the CSV file from Kaggle
- Create a Docker database with schema
- Import the CSV into the database
The process is pretty straightforward, with one small tricky part: the CSV needs to be placed in the mounted volume so that the database inside Docker can access it.
1. Download the CSV file from Kaggle
There are tens of thousands of datasets available on Kaggle, from images to neurolinguistics. To keep myself interested and engaged, I chose the MyAnimeList dataset so that I could also learn about popular anime.
2. Create a Docker database
Before importing the CSV file into the database, we need to create a schema that matches the columns in the CSV. To create the schema during database creation, we can write a .sql script and have it run in the initialization phase.
-- setup.sql
-- There are more columns in AnimeList.csv, but I am using a simplified dataset for now.
-- The data may need to be cleaned further before import.
CREATE TABLE anime (
  anime_id INT PRIMARY KEY,
  title TEXT
);
Then, in the Dockerfile, place setup.sql inside the /docker-entrypoint-initdb.d/ folder so that the schema is created during the initialization phase. Note that scripts in that folder only run when the data directory is empty, i.e. on the very first start of the database.
# Dockerfile
FROM postgres:alpine
COPY *.sql /docker-entrypoint-initdb.d/
ADD setup.sql /docker-entrypoint-initdb.d
RUN chmod a+r /docker-entrypoint-initdb.d/*
EXPOSE 6666
After we create the Dockerfile and setup.sql, we can build the Docker image:
- myrepo: your namespace on Docker Hub
- mypostgres: the name you give to the database image
- 2: the image tag
docker build -t myrepo/mypostgres:2 .
Then start an instance of the database:
- -v: map the persistent volume path between the host and the container
- -e: supply the environment variables
- -p: map container port 5432 to host port 6666
- --rm: remove the container when it is stopped (the data persists in the mounted volume)
docker run --rm --name mydb -v $HOME/Sync/animeDB/pgdata:/var/lib/postgresql/data -e POSTGRES_PASSWORD=1234 -e POSTGRES_DB=your_database -p 6666:5432 myrepo/mypostgres:2
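Before moving on, it is worth checking in another terminal that the container is up and that the init script ran. This is just an optional sanity check, assuming the container name mydb from the command above:
# The container should be listed with ports 0.0.0.0:6666->5432/tcp
docker ps
# On the very first start, the logs should mention running /docker-entrypoint-initdb.d/setup.sql
docker logs mydb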
Inside the local folder $HOME/Sync/animeDB, you can see that a pgdata folder gets created, full of auto-generated PostgreSQL data. We need to place the CSV file inside the pgdata folder so that your_database can read it from the /var/lib/postgresql/data folder inside Docker.
mv {PATH_TO}/AnimeList.csv pgdata/
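To make sure the tricky part worked, i.e. that the CSV placed in the mounted volume is actually visible inside the container, you could list it from within Docker (again assuming the container name mydb):
# The file should show up under the container's data directory
docker exec mydb ls /var/lib/postgresql/data/AnimeList.csv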
3. Import the CSV into the database
Then we can start the psql client on the Docker database:
- -U: the user to connect as
- -d: the database we are importing the CSV into. Since we created the schema in your_database, we will import the CSV there as well.
# Run psql inside the container
docker exec -it mydb psql -U postgres -d your_database
-- Import the CSV
\copy anime FROM '/var/lib/postgresql/data/AnimeList.csv' DELIMITER ',' CSV HEADER;
Now if you see a message like COPY 2 in the terminal (the number is how many rows were copied), you know the import was successful.
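As an extra check, you could count the imported rows directly through psql without opening an interactive session (still assuming the container name mydb and the database your_database):
# The count should match the number reported by the COPY message
docker exec mydb psql -U postgres -d your_database -c 'SELECT COUNT(*) FROM anime;'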
You can also connect to the database at postgres://postgres:1234@localhost:6666/your_database with a client such as Sqlectron, and confirm that all the data has been imported correctly!
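If you have psql installed on the host (which this post does not cover, so treat it as an optional extra), the same connection string works on the command line as well:
# Peek at a few imported rows from the host machine
psql postgres://postgres:1234@localhost:6666/your_database -c 'SELECT anime_id, title FROM anime LIMIT 5;'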