Commit 033defa

Initial Commit
1 parent ebf2a79 commit 033defa

151 files changed, with 27528 additions and 0 deletions.

.gitignore

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

DataModelingWithPostgres/README.md

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# Data Modelling with Postgres

## Introduction
A startup called Sparkify wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The analytics team is particularly interested in understanding which songs users are listening to.

## Project Description

The goal of this project is to develop a data model and an ETL process for song play analysis.

The data model is based on the raw data, which is available in JSON format. Fact and dimension tables are defined as a star schema in a Postgres database.

ETL pipelines that transfer the data from the JSON files into the Postgres database are developed in Python.

## Datasets

The data is stored in two folders under the data directory: log_data and song_data.

### Log Data
The log_data folder contains activity logs in JSON format. The log files are partitioned by year and month.

- log_data/2018/11/2018-11-12-events.json
- log_data/2018/11/2018-11-13-events.json

Sample data:

{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}

### Song Data
Each file in the song_data folder contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID that follow the leading TR prefix.

- song_data/A/B/C/TRABCEI128F424C983.json
- song_data/A/A/B/TRAABJL12903CDCF1A.json
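
One way to collect the files from a partitioned layout like this is a recursive glob. The sketch below is only an assumption about how the inputs might be gathered; `get_files` is a hypothetical helper name and is not taken from the repository.

```python
import glob
import os


def get_files(root_dir):
    """Return the absolute paths of all .json files below root_dir, sorted."""
    # "**" with recursive=True matches any depth of sub-directories,
    # so it covers both song_data/A/B/C/... and log_data/2018/11/...
    pattern = os.path.join(root_dir, "**", "*.json")
    return sorted(os.path.abspath(path) for path in glob.glob(pattern, recursive=True))
```

For example, `get_files("data/song_data")` would return every song file regardless of which letter folders it sits in, assuming the data directory is laid out as described above.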

Sample data:

{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}

## Database Schema

This project uses a star schema with one fact table and four dimension tables.

A star schema is suitable for this analysis because:
- The data is denormalized, which makes reads faster
- Queries are simpler and perform better because they need fewer joins
- There are no many-to-many relationships in the data

![Star Schema for Sparkify](https://github.com/RangaAmirapu/DataEngineeringProjects/blob/master/DataModelingWithPostgres/sparkify_erd.png?raw=true)

### Fact Table
**songplays** - records in the log data associated with song plays (records where page is NextSong)

### Dimension Tables

**users** - users in the app (user_id, first_name, last_name, gender, level)

**songs** - songs in the music database (song_id, title, artist_id, year, duration)

**artists** - artists in the music database (artist_id, name, location, latitude, longitude)

**time** - timestamps of records in songplays broken down into specific units (start_time, hour, day, week, month, year, weekday)
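
The tables above could be declared roughly as follows. This is a sketch of what a file like sql_queries.py might contain, not its actual contents: the column names follow the lists above, while every column type, the primary keys and the `songplay_id` surrogate key are assumptions. Only the list names `create_table_queries` and `drop_table_queries` come from the import in create_tables.py.

```python
# Hypothetical table definitions; all column types and keys are assumptions.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
)
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id INT PRIMARY KEY, first_name VARCHAR, last_name VARCHAR,
    gender VARCHAR, level VARCHAR
)
"""

song_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id VARCHAR PRIMARY KEY, title VARCHAR, artist_id VARCHAR,
    year INT, duration NUMERIC
)
"""

artist_table_create = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id VARCHAR PRIMARY KEY, name VARCHAR, location VARCHAR,
    latitude DOUBLE PRECISION, longitude DOUBLE PRECISION
)
"""

time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY, hour INT, day INT, week INT,
    month INT, year INT, weekday INT
)
"""

TABLES = ["songplays", "users", "songs", "artists", "time"]

# The two lists that create_tables.py iterates over.
create_table_queries = [
    songplay_table_create,
    user_table_create,
    song_table_create,
    artist_table_create,
    time_table_create,
]
drop_table_queries = [f"DROP TABLE IF EXISTS {name}" for name in TABLES]
```

The fact table holds only keys and per-event attributes, so an analytical query needs at most one join per dimension it touches.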

## Project Structure Explanation

- **data** directory contains the log_data and song_data datasets
- **sql_queries.py** contains all SQL queries
- **create_tables.py** drops and recreates the tables. It is used to reset the tables before each run of the ETL scripts
- **test.ipynb** displays the first few rows of each table and is used to check the loaded data
- **etl.ipynb** reads and processes a single file from song_data and from log_data and loads the data into the tables
- **etl.py** reads and processes all files from song_data and log_data and loads them into the database tables

## ETL Pipeline Explanation

The data is stored in two folders under the data directory: log_data and song_data.

**Process Song Data** - Each file in the song_data folder contains metadata about a song and the artist of that song.

Sample data:

    {
      "num_songs": 1,
      "artist_id": "ARD7TVE1187B99BFB1",
      "artist_latitude": null,
      "artist_longitude": null,
      "artist_location": "California - LA",
      "artist_name": "Casual",
      "song_id": "SOMZWCG12A8C13C480",
      "title": "I Didn't Mean To",
      "duration": 218.93179,
      "year": 0
    }

- Extract ***song_id, title, artist_id, year, duration*** from each file and insert them into the songs table
- Extract ***artist_id, artist_name, artist_location, artist_latitude, artist_longitude*** from each file and insert them into the artists table
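
The two extraction steps above can be sketched as a function that turns one parsed song file into a row for each table. The function name `extract_song_and_artist` is hypothetical, and the tuple order simply follows the column lists given for the songs and artists tables.

```python
import json


def extract_song_and_artist(record):
    """Split one song_data record into a songs row and an artists row."""
    song_row = (
        record["song_id"],
        record["title"],
        record["artist_id"],
        record["year"],
        record["duration"],
    )
    artist_row = (
        record["artist_id"],
        record["artist_name"],
        record["artist_location"],
        record["artist_latitude"],
        record["artist_longitude"],
    )
    return song_row, artist_row


# The sample record shown above.
sample = json.loads("""
{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null,
 "artist_longitude": null, "artist_location": "California - LA",
 "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480",
 "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}
""")
song_row, artist_row = extract_song_and_artist(sample)
```

JSON `null` values arrive as Python `None`, which a database driver such as psycopg2 would typically write as SQL NULL.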

**Process Log Data** - Each file in the log_data folder contains user activity events in JSON format.

Sample data:

    {
      "artist":null,
      "auth":"Logged In",
      "firstName":"Walter",
      "gender":"M",
      "itemInSession":0,
      "lastName":"Frye",
      "length":null,
      "level":"free",
      "location":"San Francisco-Oakland-Hayward, CA",
      "method":"GET",
      "page":"Home",
      "registration":1540919166796.0,
      "sessionId":38,
      "song":null,
      "status":200,
      "ts":1541105830796,
      "userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
      "userId":"39"
    }

- Extract ***ts*** (a Unix timestamp in milliseconds) from each entry, derive ***start_time, hour, day, week, month, year, weekday*** from it and insert them into the time table
- Extract ***userId, firstName, lastName, gender, level*** from each entry and insert them into the users table as ***user_id, first_name, last_name, gender, level***
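
As an illustration of the time breakdown, the sketch below converts a millisecond `ts` value into the seven columns of the time table using only the standard library. It assumes timestamps are interpreted as UTC, that week means the ISO week number, and that weekday counts from Monday as 0; the project itself may make different choices, and `time_row` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def time_row(ts_ms):
    """Break a millisecond Unix timestamp into the time table columns."""
    # Integer arithmetic keeps the millisecond part exact.
    start_time = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc)
    start_time += timedelta(milliseconds=ts_ms % 1000)
    iso_week = start_time.isocalendar()[1]
    return (
        start_time,
        start_time.hour,
        start_time.day,
        iso_week,
        start_time.month,
        start_time.year,
        start_time.weekday(),  # Monday is 0
    )


# The ts value from the sample log entry above.
row = time_row(1541105830796)
```

For the sample entry this gives 2018-11-01 20:57:10.796 UTC, which falls in ISO week 44 on a Thursday (weekday 3).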

**Load songplays table** - Each song play combines fields from a log entry with IDs looked up in the songs and artists tables.

- Extract ***start_time, user_id, level, session_id, location, user_agent*** from the log data
- Use the ***song, artist, length*** fields of each log entry to look up ***song_id, artist_id*** in the songs and artists tables, then insert the combined record into the songplays table
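
The lookup step can be sketched as a join between songs and artists that is filtered on title, artist name and duration. To keep the example runnable without a Postgres server, it uses an in-memory `sqlite3` database as a stand-in, so the placeholders are `?` rather than psycopg2's `%s`. The query text, the helper name `lookup_ids` and the IDs `SONG_1` and `ARTIST_1` are all made up for illustration.

```python
import sqlite3

# Hypothetical lookup query; the project's real query may differ.
SONG_SELECT = """
    SELECT s.song_id, a.artist_id
    FROM songs AS s
    JOIN artists AS a ON s.artist_id = a.artist_id
    WHERE s.title = ? AND a.name = ? AND s.duration = ?
"""


def lookup_ids(cur, entry):
    """Return (song_id, artist_id) for a log entry, or (None, None) if unmatched."""
    cur.execute(SONG_SELECT, (entry["song"], entry["artist"], entry["length"]))
    row = cur.fetchone()
    return row if row else (None, None)


# Stand-in database with one placeholder song and artist.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE songs (song_id TEXT, title TEXT, artist_id TEXT, year INT, duration REAL)")
cur.execute("CREATE TABLE artists (artist_id TEXT, name TEXT, location TEXT, latitude REAL, longitude REAL)")
cur.execute("INSERT INTO songs VALUES (?, ?, ?, ?, ?)", ("SONG_1", "You Gotta Be", "ARTIST_1", 0, 246.30812))
cur.execute("INSERT INTO artists VALUES (?, ?, ?, ?, ?)", ("ARTIST_1", "Des'ree", None, None, None))

matched = lookup_ids(cur, {"song": "You Gotta Be", "artist": "Des'ree", "length": 246.30812})
unmatched = lookup_ids(cur, {"song": "Flat 55", "artist": "Mr Oizo", "length": 144.03873})
```

Returning `(None, None)` for unmatched plays lets the songplays row still be inserted with empty song and artist references; whether the project does this or skips such rows is not stated in the README.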

## Project Execution

*Prerequisites: Postgres for the database, Python for the ETL*

1. Run **create_tables.py** to create the database and tables
2. Run **etl.py** to extract the data from the data folder and load it into the tables
3. Run **test.ipynb** to verify the data load
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
import psycopg2
from sql_queries import create_table_queries, drop_table_queries


def create_database():
    """
    - Creates and connects to the sparkifydb
    - Returns the cursor and connection to sparkifydb
    """

    # connect to default database
    conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()

    # create sparkify database with UTF8 encoding
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")

    # close connection to default database
    conn.close()

    # connect to sparkify database
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    return cur, conn


def drop_tables(cur, conn):
    """
    Drops each table using the queries in `drop_table_queries` list.
    """
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    """
    Creates each table using the queries in `create_table_queries` list.
    """
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    """
    - Drops (if exists) and creates the sparkify database.

    - Establishes connection with the sparkify database and gets
    cursor to it.

    - Drops all the tables.

    - Creates all tables needed.

    - Finally, closes the connection.
    """
    cur, conn = create_database()

    drop_tables(cur, conn)
    create_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":2,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Upgrade","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106132796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Mr Oizo","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":3,"lastName":"Summers","length":144.03873,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Flat 55","status":200,"ts":1541106352796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Tamba Trio","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":4,"lastName":"Summers","length":177.18812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Quem Quiser Encontrar O Amor","status":200,"ts":1541106496796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"The Mars Volta","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":5,"lastName":"Summers","length":380.42077,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Eriatarka","status":200,"ts":1541106673796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Infected Mushroom","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":6,"lastName":"Summers","length":440.2673,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Becoming Insane","status":200,"ts":1541107053796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Blue October \/ Imogen Heap","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":7,"lastName":"Summers","length":241.3971,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Congratulations","status":200,"ts":1541107493796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Girl Talk","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":8,"lastName":"Summers","length":160.15628,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Once again","status":200,"ts":1541107734796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Black Eyed Peas","auth":"Logged In","firstName":"Sylvie","gender":"F","itemInSession":0,"lastName":"Cruz","length":214.93506,"level":"free","location":"Washington-Arlington-Alexandria, DC-VA-MD-WV","method":"PUT","page":"NextSong","registration":1540266185796.0,"sessionId":9,"song":"Pump It","status":200,"ts":1541108520796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.77.4 (KHTML, like Gecko) Version\/7.0.5 Safari\/537.77.4\"","userId":"10"}
{"artist":null,"auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":0,"lastName":"Smith","length":null,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"GET","page":"Home","registration":1541016707796.0,"sessionId":169,"song":null,"status":200,"ts":1541109015796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"}
{"artist":"Fall Out Boy","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":1,"lastName":"Smith","length":200.72444,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Nobody Puts Baby In The Corner","status":200,"ts":1541109125796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"}
{"artist":"M.I.A.","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":2,"lastName":"Smith","length":233.7171,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Mango Pickle Down River (With The Wilcannia Mob)","status":200,"ts":1541109325796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"}
{"artist":"Survivor","auth":"Logged In","firstName":"Jayden","gender":"M","itemInSession":0,"lastName":"Fox","length":245.36771,"level":"free","location":"New Orleans-Metairie, LA","method":"PUT","page":"NextSong","registration":1541033612796.0,"sessionId":100,"song":"Eye Of The Tiger","status":200,"ts":1541110994796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"101"}
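
A file like the one above holds one JSON object per line, so it can be parsed line by line and reduced to song plays by keeping only events whose page is NextSong (11 of the 15 events shown). The sketch below is an assumption about how that filtering might look; `read_song_plays` is a hypothetical name, and the two sample lines are events from this file trimmed to a few fields.

```python
import json


def read_song_plays(lines):
    """Parse newline-delimited JSON and keep only song-play events."""
    events = (json.loads(line) for line in lines if line.strip())
    return [event for event in events if event.get("page") == "NextSong"]


# Two events from the file above, trimmed to a few fields for brevity.
sample_lines = [
    '{"artist": null, "page": "Home", "song": null, "userId": "39"}',
    '{"artist": "Des\'ree", "page": "NextSong", "song": "You Gotta Be", "userId": "8"}',
]
plays = read_song_plays(sample_lines)
```

The same function accepts an open file object, since iterating over a text file also yields one line at a time.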
