If you're looking for setup instructions, I've included a short how-to underneath the summary (see "How to install/use").
This project is used to practice Python and SQL together.
The purpose of this project was to get a better understanding of Reddit and what drives engagement there. The hypothesis is that the number of comments on a submission drives its upvote count higher, and thus pushes it toward the front page. I was also curious about which titles or words appear most often for each subreddit on /r/all.
The question comes up: I have this cool idea and want to spread it far and wide. Which subreddit do I choose? Which wording do I go with, and which do I avoid? Do I want to focus on comment engagement to drive conversation, or do I just want some time to shine on the front page of Reddit? Maybe a mix of all of the above? With the right data we can answer all of these.
The initial question we'll answer together is: what drives the most upvotes? Is it the number of comments, the subreddit, or the number of subscribers? Most likely a mix of all three, but I went with the hypothesis that comments drive the number of upvotes higher. So what is the ratio of upvotes to comments for each subreddit? We will need a few bits of data:
- The title
- Subreddit
- Number of Upvotes
- Number of Comments
- Number of Subscribers
- The rank of the submission
Note: Version 3 has expanded this to include submission ID, time created, time of the pull, words of the title, name of the author, author ID, and the author's comment and link karma, all stored in a relational database to save space for later use and analysis.
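As a rough illustration of gathering the six fields listed above, here is a minimal sketch. It assumes PRAW-style submission objects (attributes `title`, `score`, `num_comments`, `subreddit.display_name`, `subreddit.subscribers`); the function names and dictionary keys are made up for this example and are not taken from Reddit.py. A stand-in object with invented values is used so the sketch runs without credentials.

```python
from types import SimpleNamespace


def submission_to_row(rank, submission):
    """Flatten one PRAW-style submission into the six fields listed above."""
    return {
        "rank": rank,
        "title": submission.title,
        "subreddit": submission.subreddit.display_name,
        "upvotes": submission.score,
        "comments": submission.num_comments,
        "subscribers": submission.subreddit.subscribers,
    }


def collect_rows(submissions):
    """Number submissions from 1 in listing order and flatten each one."""
    return [submission_to_row(rank, s) for rank, s in enumerate(submissions, start=1)]


# With real credentials (placeholders below), the listing would come from PRAW:
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   rows = collect_rows(reddit.subreddit("all").hot(limit=100))

# Stand-in submission with invented values, so no network access is needed.
fake = SimpleNamespace(
    title="Example title",
    score=1200,
    num_comments=85,
    subreddit=SimpleNamespace(display_name="pics", subscribers=20000000),
)
rows = collect_rows([fake])
print(rows[0])
```

The rank here is simply the position in the listing, which is an assumption about how "rank of the submission" is defined.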
- My hypothesis that comments drive engagement was supported by a correlation of 0.289899 between comments and upvotes; however, subscribers to the particular subreddit had a higher correlation with upvotes, at 0.313972. For a valid conclusion we'll have to wait a little for a bigger sample size, and the same goes for the wording question.
Use the Reddit API to gather information, then use it to gauge engagement.
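For readers who want to reproduce this kind of figure, here is a minimal sketch of a correlation calculation. It assumes the Pearson coefficient, which this README does not actually specify, and the sample numbers are invented purely to show the call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented values, not project data: comments and upvotes for four submissions.
comments = [1, 2, 3, 4]
upvotes = [2, 4, 6, 8]
print(pearson(comments, upvotes))
```

The same function can be called with subscriber counts in place of comment counts to compare the two relationships.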
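One simple way to gauge engagement is the upvote-to-comment ratio per subreddit mentioned earlier. The sketch below assumes each row is a dictionary with `subreddit`, `upvotes`, and `comments` keys; those key names and the sample rows are hypothetical.

```python
from collections import defaultdict


def upvotes_per_comment(rows):
    """Return {subreddit: total upvotes / total comments}, or None if no comments."""
    totals = defaultdict(lambda: [0, 0])  # subreddit -> [upvotes, comments]
    for row in rows:
        totals[row["subreddit"]][0] += row["upvotes"]
        totals[row["subreddit"]][1] += row["comments"]
    return {
        name: (up / com if com else None)
        for name, (up, com) in totals.items()
    }


# Invented sample rows, only to show the shape of the input.
sample = [
    {"subreddit": "a", "upvotes": 100, "comments": 10},
    {"subreddit": "a", "upvotes": 50, "comments": 5},
    {"subreddit": "b", "upvotes": 7, "comments": 0},
]
ratios = upvotes_per_comment(sample)
print(ratios)
```

Totals are summed per subreddit before dividing, so a single submission with very few comments does not dominate the ratio.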
You'll need:
PuTTY - https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
pgAdmin 4
Reddit API account/PRAW
psycopg2
BlockingScheduler (if you want data to pull at set times)
I'm using a PostgreSQL database, but if you want to use another SQL database, just swap psycopg2 for the corresponding library.
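To make the database step concrete, here is a minimal sketch of creating a table and inserting rows through a psycopg2-style connection. The table name `submissions`, the column names, and the helper names are all hypothetical and are not taken from Reddit.py; the connection values in the usage comment are placeholders.

```python
# Hypothetical column order shared by the table definition and the insert.
COLUMNS = ("post_rank", "title", "subreddit", "upvotes", "comments", "subscribers")

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS submissions (
    post_rank   INTEGER,
    title       TEXT,
    subreddit   TEXT,
    upvotes     INTEGER,
    comments    INTEGER,
    subscribers INTEGER
)
"""

# psycopg2 uses %s placeholders; values are passed separately, never formatted in.
INSERT_SQL = (
    "INSERT INTO submissions (" + ", ".join(COLUMNS) + ") "
    "VALUES (" + ", ".join(["%s"] * len(COLUMNS)) + ")"
)


def row_to_params(row):
    """Order a row dictionary to match COLUMNS."""
    return tuple(row[name] for name in COLUMNS)


def save_rows(conn, rows):
    """Create the table if needed, insert all rows, and commit."""
    with conn.cursor() as cur:
        cur.execute(CREATE_TABLE_SQL)
        cur.executemany(INSERT_SQL, [row_to_params(r) for r in rows])
    conn.commit()


# Usage with placeholder credentials:
#   import psycopg2
#   conn = psycopg2.connect(dbname="yourdb", user="yourname",
#                           password="yourpassword", host="localhost")
#   save_rows(conn, rows)
```

If you swap in a driver for another database, check its placeholder style: some drivers use `?` instead of `%s`, so INSERT_SQL may need adjusting.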
How to install/use - Windows
- Set up a server - I used DigitalOcean.com - the server has to support the database language you choose (SQL)
- Install PuTTY - for the host name/address, type in the I.P. address of your server; port 22
- When logged in on Ubuntu, type "adduser yourname" without quotes, then "usermod -aG sudo yourname". yourname is whatever you want it to be
- command list:
python3
sudo apt-get update
sudo apt-get install mc
sudo apt-get -y install python3-pip
sudo apt-get -y install python3-dev
sudo -H pip3 install --upgrade pip
sudo apt-get install postgresql postgresql-contrib (note: this changes if you want to use a different SQL database)
sudo -i -u postgres
CREATE USER yourname WITH PASSWORD 'yourpassword'; (this is an SQL statement: run it at the psql prompt, not in the shell)
sudo -i -u root
- Open pgAdmin 4 using yourname as the user name and yourpassword as the password. Keep in mind that your server may not be listening for connections from your program.
- Don't forget to install PRAW, psycopg2, and BlockingScheduler
- Open and save both files. Reddit.py is the initial setup: it makes the table and pulls the data. Reddit_pull.py pulls the data continuously.
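For pulling at set times, a scheduled job might look roughly like the sketch below. It assumes APScheduler 3.x, where `BlockingScheduler` lives in `apscheduler.schedulers.blocking`. The names `pull_once` and `run_forever` and the 30-minute interval are made up for illustration and do not describe what Reddit_pull.py actually does.

```python
def pull_once(fetch, store):
    """Run one pull: fetch rows, hand them to the storage callback, return the count."""
    rows = fetch()
    store(rows)
    return len(rows)


def run_forever(fetch, store, minutes=30):
    """Call pull_once on a fixed interval until the process is stopped."""
    # Imported here so pull_once can be used and tested without APScheduler installed.
    from apscheduler.schedulers.blocking import BlockingScheduler

    scheduler = BlockingScheduler()
    scheduler.add_job(pull_once, "interval", minutes=minutes, args=[fetch, store])
    scheduler.start()  # blocks the current process while jobs run


# One-off run with stand-in callables (invented data, no Reddit or database access).
stored = []
count = pull_once(lambda: ["row1", "row2", "row3"], stored.extend)
print(count, stored)
```

Keeping the fetch and store steps as separate callables means the same job can be run once by hand or handed to the scheduler unchanged.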