A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Their analytics team is particularly interested in understanding what songs users are listening to.

## Project Description

The goal of this project is to develop a data model and an ETL process for song play analysis.

The data model is based on the raw data available in JSON format. Fact and dimension tables are defined in a star schema in a Postgres database.

The ETL pipeline that transfers data from the JSON files to the Postgres database is developed in Python.

## Datasets

The data is available in two separate folders under the data directory: log_data and song_data.

### Log Data
The log_data folder consists of activity logs in JSON format. The log files are partitioned by year and month.

- log_data/2018/11/2018-11-12-events.json
- log_data/2018/11/2018-11-13-events.json

Sample data:

{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
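
As a minimal sketch of how these log files could be read, assuming each line of a log file holds one JSON record (as in the sample above), the snippet below parses the lines and keeps only song plays. The function name `read_song_plays` is illustrative and is not taken from the project code.

```python
import json


def read_song_plays(lines):
    """Parse JSON-lines log records and keep only song plays.

    `lines` is any iterable of strings, for example an open file handle.
    Assumes one JSON record per line.
    """
    plays = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        # Song plays are the records whose page is "NextSong"
        if event.get("page") == "NextSong":
            plays.append(event)
    return plays


if __name__ == "__main__":
    # Two shortened records in the shape of the sample data
    sample = [
        '{"artist": null, "page": "Home", "song": null, "userId": "39"}',
        '{"artist": "Mr Oizo", "page": "NextSong", "song": "Flat 55", "userId": "8"}',
    ]
    for event in read_song_plays(sample):
        print(event["userId"], event["artist"], event["song"])
```

Filtering at read time keeps the later steps simple, since only NextSong records feed the songplays table.
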
### Song Data
Each file in the song_data folder contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.

The schema used for this project is a star schema with one fact table and four dimension tables.

A star schema is suitable for this analysis because:
- The data is denormalized, which makes reads faster
- Queries are simpler and perform better because they need fewer joins
- The data has no many-to-many relationships

### Fact Table
**songplays** - records in the log data associated with song plays (records with page NextSong)

### Dimension Tables

**users** - users in the app (user_id, first_name, last_name, gender, level)

**songs** - songs in the music database (song_id, title, artist_id, year, duration)

**artists** - artists in the music database (artist_id, name, location, latitude, longitude)

**time** - timestamps of records in songplays broken down into specific units (start_time, hour, day, week, month, year, weekday)

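
To make the star schema concrete, here is a hedged sketch of table definitions in the style of a sql_queries.py module. The column names come from the table lists above and from the songplays fields named in the ETL section; the column types, the primary keys, and the `songplay_id` surrogate key are assumptions, not the project's actual definitions. `create_all` is a hypothetical helper that accepts any DB-API connection. The demo uses the standard-library `sqlite3` module only so the sketch runs without a Postgres server.

```python
import sqlite3

# Column names follow the README; types and keys are illustrative assumptions.
CREATE_TABLE_QUERIES = {
    "songplays": """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id SERIAL PRIMARY KEY,  -- assumed surrogate key
            start_time  TIMESTAMP NOT NULL,
            user_id     INTEGER NOT NULL,
            level       VARCHAR,
            song_id     VARCHAR,
            artist_id   VARCHAR,
            session_id  INTEGER,
            location    VARCHAR,
            user_agent  VARCHAR
        )
    """,
    "users": """
        CREATE TABLE IF NOT EXISTS users (
            user_id    INTEGER PRIMARY KEY,
            first_name VARCHAR,
            last_name  VARCHAR,
            gender     VARCHAR,
            level      VARCHAR
        )
    """,
    "songs": """
        CREATE TABLE IF NOT EXISTS songs (
            song_id   VARCHAR PRIMARY KEY,
            title     VARCHAR,
            artist_id VARCHAR,
            year      INTEGER,
            duration  DOUBLE PRECISION
        )
    """,
    "artists": """
        CREATE TABLE IF NOT EXISTS artists (
            artist_id VARCHAR PRIMARY KEY,
            name      VARCHAR,
            location  VARCHAR,
            latitude  DOUBLE PRECISION,
            longitude DOUBLE PRECISION
        )
    """,
    "time": """
        CREATE TABLE IF NOT EXISTS time (
            start_time TIMESTAMP PRIMARY KEY,
            hour       INTEGER,
            day        INTEGER,
            week       INTEGER,
            month      INTEGER,
            year       INTEGER,
            weekday    INTEGER
        )
    """,
}


def create_all(conn):
    """Run every CREATE TABLE statement on a DB-API connection."""
    cur = conn.cursor()
    for query in CREATE_TABLE_QUERIES.values():
        cur.execute(query)
    conn.commit()
    cur.close()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
    create_all(conn)
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    print(sorted(name for (name,) in rows))
```

Against Postgres, the same helper could be given a psycopg2 connection instead of the in-memory SQLite one.
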
## Project Structure Explanation

- **data** directory contains the log_data and song_data datasets
- **sql_queries.py** contains all SQL queries
- **create_tables.py** drops and creates the tables; run it to reset the tables before each run of the ETL scripts
- **test.ipynb** displays the first few rows of each table, used to check each table
- **etl.ipynb** reads and processes a single file from song_data and log_data and loads the data into the tables
- **etl.py** reads and processes all files from song_data and log_data and loads them into the database tables

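
As a rough illustration of how a script like etl.py might collect every file under a data folder, the sketch below walks a directory tree and returns all JSON files. The helper name `get_files` and the `.json` suffix check are assumptions; the actual script may do this differently.

```python
import os


def get_files(root):
    """Return the absolute paths of all .json files below `root`, sorted."""
    all_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".json"):
                all_files.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(all_files)


if __name__ == "__main__":
    # Prints nothing if the folder does not exist
    for path in get_files("data/log_data"):
        print(path)
```

Walking the tree rather than listing one folder matters here because both datasets are partitioned into nested subfolders.
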
## ETL Pipeline Explanation

The data is available in two separate folders under the data directory: log_data and song_data.

**Process Song Data** - Each file in the song_data folder contains metadata about a song and the artist of that song.

Sample data:

{
"num_songs": 1,
"artist_id": "ARD7TVE1187B99BFB1",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "California - LA",
"artist_name": "Casual",
"song_id": "SOMZWCG12A8C13C480",
"title": "I Didn't Mean To",
"duration": 218.93179,
"year": 0
}

- Extract ***song_id, title, artist_id, year, duration*** from each file and insert them into the songs table
- Extract ***artist_id, artist_name, artist_location, artist_latitude, artist_longitude*** from each file and insert them into the artists table

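
The two extraction steps above can be sketched as follows. This is an illustration only: the function name `split_song_record` and the tuple layout are assumptions, and the actual insert into Postgres is left out.

```python
import json

# Field order mirrors the column lists of the songs and artists tables
SONG_FIELDS = ("song_id", "title", "artist_id", "year", "duration")
ARTIST_FIELDS = (
    "artist_id",
    "artist_name",
    "artist_location",
    "artist_latitude",
    "artist_longitude",
)


def split_song_record(record):
    """Split one song_data record into a songs row and an artists row."""
    song_row = tuple(record[field] for field in SONG_FIELDS)
    artist_row = tuple(record[field] for field in ARTIST_FIELDS)
    return song_row, artist_row


if __name__ == "__main__":
    raw = """{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1",
              "artist_latitude": null, "artist_longitude": null,
              "artist_location": "California - LA", "artist_name": "Casual",
              "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To",
              "duration": 218.93179, "year": 0}"""
    song_row, artist_row = split_song_record(json.loads(raw))
    print(song_row)
    print(artist_row)
```

Each tuple could then be passed as the parameters of a parameterized INSERT statement.
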
**Process Log Data** - Each file in the log_data folder contains user activity logs in JSON format.

Sample data:

{
"artist":null,
"auth":"Logged In",
"firstName":"Walter",
"gender":"M",
"itemInSession":0,
"lastName":"Frye",
"length":null,
"level":"free",
"location":"San Francisco-Oakland-Hayward, CA",
"method":"GET",
"page":"Home",
"registration":1540919166796.0,
"sessionId":38,
"song":null,
"status":200,
"ts":1541105830796,
"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
"userId":"39"
}

- Extract ***ts*** from each entry, derive ***start_time, hour, day, week, month, year, weekday*** from the ***ts*** value, and insert them into the time table
- Extract ***user_id, first_name, last_name, gender, level*** from each file and insert them into the users table

---

- **Loading songplays table:** Extract ***start_time, user_id, level, session_id, location, user_agent*** from the log data
- Using the ***song, artist, length*** fields, find ***song_id, artist_id*** in the songs and artists tables and insert the data into the songplays table

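
The log-processing steps above can be sketched with the standard library alone. The assumptions here are that ***ts*** is a Unix timestamp in milliseconds interpreted as UTC, that week means the ISO week number, and that weekday counts from Monday as 0. The helper names `time_row` and `songplay_row` are hypothetical, and the `song_lookup` dictionary stands in for the query against the songs and artists tables.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break a millisecond epoch timestamp into the time table columns."""
    start_time = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return (
        start_time,
        start_time.hour,
        start_time.day,
        start_time.isocalendar()[1],  # ISO week number
        start_time.month,
        start_time.year,
        start_time.weekday(),  # Monday is 0
    )


def songplay_row(event, song_lookup):
    """Build one songplays row from a NextSong log record.

    `song_lookup` maps (title, artist name, duration) to (song_id, artist_id).
    Records without a match keep None for both ids.
    """
    key = (event["song"], event["artist"], event["length"])
    song_id, artist_id = song_lookup.get(key, (None, None))
    return (
        time_row(event["ts"])[0],  # start_time
        event["userId"],
        event["level"],
        song_id,
        artist_id,
        event["sessionId"],
        event["location"],
        event["userAgent"],
    )


if __name__ == "__main__":
    print(time_row(1541105830796))
```

For the sample record above (ts 1541105830796), `time_row` gives 20:57:10 UTC on Thursday, 1 November 2018, which falls in ISO week 44.
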
## Project Execution

*Prerequisite software: Postgres for the database, Python for the ETL*

1. Run **create_tables.py** to create your database and tables
2. Run **etl.py** to extract the data from the data folder and load it into the tables

{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":2,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Upgrade","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106132796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Mr Oizo","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":3,"lastName":"Summers","length":144.03873,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Flat 55","status":200,"ts":1541106352796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Tamba Trio","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":4,"lastName":"Summers","length":177.18812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Quem Quiser Encontrar O Amor","status":200,"ts":1541106496796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"The Mars Volta","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":5,"lastName":"Summers","length":380.42077,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Eriatarka","status":200,"ts":1541106673796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Infected Mushroom","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":6,"lastName":"Summers","length":440.2673,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Becoming Insane","status":200,"ts":1541107053796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Blue October \/ Imogen Heap","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":7,"lastName":"Summers","length":241.3971,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Congratulations","status":200,"ts":1541107493796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Girl Talk","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":8,"lastName":"Summers","length":160.15628,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Once again","status":200,"ts":1541107734796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Black Eyed Peas","auth":"Logged In","firstName":"Sylvie","gender":"F","itemInSession":0,"lastName":"Cruz","length":214.93506,"level":"free","location":"Washington-Arlington-Alexandria, DC-VA-MD-WV","method":"PUT","page":"NextSong","registration":1540266185796.0,"sessionId":9,"song":"Pump It","status":200,"ts":1541108520796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.77.4 (KHTML, like Gecko) Version\/7.0.5 Safari\/537.77.4\"","userId":"10"}
{"artist":null,"auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":0,"lastName":"Smith","length":null,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"GET","page":"Home","registration":1541016707796.0,"sessionId":169,"song":null,"status":200,"ts":1541109015796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"}
{"artist":"Fall Out Boy","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":1,"lastName":"Smith","length":200.72444,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Nobody Puts Baby In The Corner","status":200,"ts":1541109125796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"}
{"artist":"M.I.A.","auth":"Logged In","firstName":"Ryan","gender":"M","itemInSession":2,"lastName":"Smith","length":233.7171,"level":"free","location":"San Jose-Sunnyvale-Santa Clara, CA","method":"PUT","page":"NextSong","registration":1541016707796.0,"sessionId":169,"song":"Mango Pickle Down River (With The Wilcannia Mob)","status":200,"ts":1541109325796,"userAgent":"\"Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/537.36 (KHTML, like Gecko) Ubuntu Chromium\/36.0.1985.125 Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"26"}
{"artist":"Survivor","auth":"Logged In","firstName":"Jayden","gender":"M","itemInSession":0,"lastName":"Fox","length":245.36771,"level":"free","location":"New Orleans-Metairie, LA","method":"PUT","page":"NextSong","registration":1541033612796.0,"sessionId":100,"song":"Eye Of The Tiger","status":200,"ts":1541110994796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"101"}