Scraping Reddit using Python Reddit API Wrapper (PRAW)

Nour Al-Rahman Al-Serw
4 min read · Feb 8, 2021

Content:
1. Introduction
2. Create Reddit API Account
3. Scraping a Reddit post
4. Scraping Reddit subreddits
5. Cleaning the data

Introduction:
With the sharp rise of data, it is only going to get easier to scrape, gather, collect, amass (and many other equally meaningful words) all sorts of information from sources such as Facebook, Twitter and Reddit. Reddit has long had an API, and there is a Python wrapper for it called the Python Reddit API Wrapper, or PRAW for short (the name already gives it away, I know!), that can be used to crawl data. In this article I will show, using code snippets, how to crawl individual posts and whole subreddits.

Create Reddit API Account:
First things first, a Reddit account is needed, so if you don’t have one you need to create it before moving forward. In order to use PRAW one must first register an app for the Reddit API, which can be done from Reddit’s app preferences page (reddit.com/prefs/apps). It’s pretty much straightforward and not that different from signing up for the Twitter API. After finishing the procedure there are three things needed in order to invoke the API: the client ID, the client secret and the user agent. How to use them will be explained later on.

Scraping a Reddit post:
Now that all the prerequisites have been met, scraping Reddit should be easy. There are two libraries that will be used for this exercise: PRAW and pandas. PRAW is used to access Reddit, and pandas to tabulate the data and clean it later on. First we import the libraries.
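The import snippet is not reproduced in this copy; it amounts to two lines (assuming PRAW and pandas are installed, e.g. via pip):

# PRAW talks to the Reddit API; pandas tabulates and later cleans the data
import praw
import pandas as pd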

Next we need to create the Reddit object where we will insert the credentials created when we registered for the Reddit API.
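As a sketch, with placeholder strings standing in for the credentials from your own registration:

# Substitute the three values obtained when registering for the API
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-scraper by u/YOUR_USERNAME",
)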

So now we have an object through which we can call the API. Next we need to pick a post to scrape. For the purpose of this article I already picked a suitable post. Get the full URL of said post, then create a submission object with the URL passed in as a parameter.
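Something along these lines, with a hypothetical URL standing in for the post picked for the article:

# Hypothetical URL -- replace it with the full URL of the post you picked
url = "https://www.reddit.com/r/AskReddit/comments/abc123/example_post/"
submission = reddit.submission(url=url)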

Next is the fun part! This is where we use the PRAW objects to retrieve the data. First we create a list named posts to store all the retrieved comments. Then we write a for-loop to go over the top-level comments of the submission. The reason I’m not retrieving the first comment (which also happens to be the topmost one) is that in this particular subreddit a bot usually takes that spot, and that is not something we ought to be scraping (then again, it’s up to you). Inside the loop is an if-statement that checks whether the current item is a regular top-level comment or a MoreComments object, a placeholder indicating that more comments can be loaded. If it is a regular comment, we append its text body to posts. After the for-loop is done, the posts list is turned into a pandas data frame with a single column, named “body”, holding all the comments.
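A minimal sketch of that loop (skipping the first, bot-occupied comment as described; drop the [1:] slice if you want it included):

from praw.models import MoreComments

posts = []
# Wrap the comment forest in list() so it can be sliced past the bot comment
for top_level_comment in list(submission.comments)[1:]:
    # MoreComments objects are placeholders for not-yet-loaded replies,
    # not actual comments, so skip them
    if isinstance(top_level_comment, MoreComments):
        continue
    posts.append(top_level_comment.body)

# One column named "body" holding all the comment texts
posts = pd.DataFrame(posts, columns=["body"])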

After running the previous code, something like the following should show up. The body column, as shown, has the comments pulled from the post, and on the left is simply the index.

Table showing retrieved comments

Now we have all the top-level comments present in this post.

Here is the full script. Concerning the last few lines, I will explain how to do some simple cleaning later.
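The embedded script is not reproduced in this copy; stitched together from the snippets above, it would look roughly like this, with the cleaning-related lines at the end:

import praw
import pandas as pd
from praw.models import MoreComments

# Placeholder credentials from the API registration
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-scraper by u/YOUR_USERNAME",
)

# Hypothetical post URL
url = "https://www.reddit.com/r/AskReddit/comments/abc123/example_post/"
submission = reddit.submission(url=url)

posts = []
for top_level_comment in list(submission.comments)[1:]:  # skip the bot comment
    if isinstance(top_level_comment, MoreComments):
        continue
    posts.append(top_level_comment.body)

posts = pd.DataFrame(posts, columns=["body"])

# The "last few lines": spot removed/deleted comments, explained below
print(posts[posts["body"].isin(["[removed]", "[deleted]"])].index)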

Scraping Reddit subreddits:
Previously we collected the comments found in a single post; next we will tackle how to scrape all the top-level comments found in a subreddit. This should not be that different from scraping a single post, that is, if you don’t factor in the time it takes. We will follow most of the steps done before, but with a few little tweaks.

We will still recreate the first two steps, where we import the libraries and create the Reddit object.

Next we will access all the submissions within a subreddit and collect all of their top-level comments. As before, we start with a list named posts. To pick the subreddit, we insert its name in the second line as the value passed to “reddit.subreddit(“INSERTSUBREDDIT”)”. To scrape as many comments as possible we append “.top(“all”)” to the subreddit, which returns its most up-voted posts, and loop directly over the submissions it yields. Inside is a nested for-loop going over all the top-level comments of the current submission; as before, if a comment is a real comment rather than a MoreComments placeholder, we append its text body to posts, and by the end of it posts is turned into a pandas data frame.
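A sketch of that loop, reusing the imports and the reddit object from the previous section (“INSERTSUBREDDIT” is the placeholder from the article):

posts = []
subreddit = reddit.subreddit("INSERTSUBREDDIT")

# .top(time_filter="all") is the keyword form of the .top("all") call
# described above; it yields the subreddit's most up-voted submissions
for submission in subreddit.top(time_filter="all"):
    for top_level_comment in submission.comments:
        if isinstance(top_level_comment, MoreComments):
            continue
        posts.append(top_level_comment.body)

posts = pd.DataFrame(posts, columns=["body"])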

Next let’s take a look at what we got.

Comments retrieved from all submissions within a subreddit

Cleaning the data:
Last but not least is data cleaning. Real data is more often than not riddled with incomplete or faulty entries. One thing we ought to do with Reddit data in particular is deal with removed and deleted comments, by first finding them and noting their indexes.
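Removed and deleted comments arrive with the literal bodies “[removed]” and “[deleted]”, so one way to flag them, assuming the posts data frame from before, is to swap those bodies for NaN and collect the affected indexes:

import numpy as np

# Replace the tombstone bodies with NaN, then record where they are
posts = posts.replace(["[removed]", "[deleted]"], np.nan)
bad_rows = posts[posts["body"].isna()].index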

Next we drop the rows at the selected indexes and replace the old index with a fresh one.
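For instance:

# Drop the flagged rows, then rebuild a clean, contiguous index
posts = posts.drop(index=bad_rows).reset_index(drop=True)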
