Tools Mass Reddit scraping brainstorm

rvlnnb

Nov 27, 2022
I'm trying to see if I can mass-scrape NSFW content from Reddit.

This assumes you have Python and gallery-dl installed.

Say you have already extracted a bunch of JSON files for some users into REDDIT_USERS_PATH (with gallery-dl -j).

The script's fetch command looks through these JSON files to find the NSFW subreddits those users have posted in. It then gets the top posts of each subreddit for every time period.
From this list of posts, we flatten out the list of users who posted.
We can rerun the fetch command multiple times to feed the cycle.
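
As a rough illustration of the data-shuffling part of a fetch cycle (not the script itself), here is a minimal Python sketch. It assumes each post is a dict carrying the Reddit API fields `subreddit`, `author`, and `over_18`; the function names `nsfw_subreddits` and `flatten_authors` are made up for this example, and the actual fetching with gallery-dl is left out.

```python
from __future__ import annotations

from typing import Iterable


def nsfw_subreddits(posts: Iterable[dict]) -> set[str]:
    """Collect the names of subreddits with at least one post flagged over_18.

    Assumption: each post dict has "subreddit" and "over_18" keys.
    """
    return {
        p["subreddit"]
        for p in posts
        if p.get("over_18") and p.get("subreddit")
    }


def flatten_authors(posts_by_subreddit: dict[str, list[dict]]) -> list[str]:
    """Flatten per-subreddit post lists into a sorted, de-duplicated author list.

    Assumption: each post dict has an "author" key.
    """
    authors = set()
    for posts in posts_by_subreddit.values():
        for post in posts:
            author = post.get("author")
            # Skip missing authors and deleted accounts ("[deleted]").
            if author and author != "[deleted]":
                authors.add(author)
    return sorted(authors)
```

Each cycle would feed the authors returned by `flatten_authors` back in as the next batch of users, which is why the subreddit list grows so quickly.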

Once we have a list of users, we can use the download command to actually download the content.

From a small initial seed of 35 subreddits, after a fetch cycle I had 6220 subreddits to fetch. Not sure how many users I will end up with afterwards.

There is also the issue of file de-duplication. I use Czkawka to replace duplicates with symlinks.
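
For anyone who wants to see the idea without Czkawka, here is a minimal Python sketch of the same symlink approach. It is a stand-in, not how Czkawka works internally: it treats files with the same SHA-256 digest as duplicates, keeps the first one found, and replaces the rest with symlinks. The function names `file_digest` and `symlink_duplicates` are hypothetical, and creating symlinks may need extra privileges on Windows.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def symlink_duplicates(root: Path) -> int:
    """Replace byte-identical files under root with symlinks to the first copy.

    Returns the number of files that were replaced.
    """
    seen: dict[str, Path] = {}
    replaced = 0
    # Materialize and sort the listing so the kept copy is deterministic.
    for path in sorted(root.rglob("*")):
        # Ignore existing symlinks and anything that is not a regular file.
        if path.is_symlink() or not path.is_file():
            continue
        original = seen.setdefault(file_digest(path), path)
        if original is not path:
            path.unlink()
            path.symlink_to(original.resolve())
            replaced += 1
    return replaced
```

Try it on a copy of the download folder first, since the duplicate files are deleted before being replaced by links.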