An experiment to see whether I can mass-scrape NSFW content from Reddit.
We assume you have Python and gallery-dl.
Say you have already extracted a batch of JSON files for some users into REDDIT_USERS_PATH (with gallery-dl -j).
The fetch command looks through these JSON files to find the NSFW subreddits those users have posted in. It then gets the top posts of each subreddit for every time period.
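As a rough illustration of the first half of that step, here is a minimal sketch of pulling NSFW subreddit names out of the JSON dumps. It is not the actual fetch command. It assumes the metadata carries Reddit's usual `subreddit` and `over_18` fields and that each dump is a single `*.json` file in REDDIT_USERS_PATH; the function names are made up for this example. Because the exact nesting of gallery-dl's output is not relied on, the sketch simply walks every dict it finds.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_dicts(node: Any) -> Iterator[dict]:
    """Yield every dict nested anywhere inside a parsed JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_dicts(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_dicts(item)


def nsfw_subreddits(parsed: Any) -> set[str]:
    """Collect subreddit names from entries flagged as NSFW.

    Assumption: entries use the field names "subreddit" and "over_18".
    """
    return {
        d["subreddit"]
        for d in iter_dicts(parsed)
        if d.get("over_18") and isinstance(d.get("subreddit"), str)
    }


def nsfw_subreddits_in_dir(users_path: str) -> set[str]:
    """Merge the NSFW subreddits found in every *.json file of a directory."""
    found: set[str] = set()
    for json_file in Path(users_path).glob("*.json"):
        with json_file.open(encoding="utf-8") as handle:
            found |= nsfw_subreddits(json.load(handle))
    return found
```

Calling `nsfw_subreddits_in_dir(REDDIT_USERS_PATH)` would then give the set of subreddits to query for top posts.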
From this list of posts, we flatten out the list of users who posted.
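The flattening can be sketched as below. This is only an illustration, assuming the posts are held as a dict mapping each subreddit name to a list of post dicts, and that each post has an `author` field (Reddit's usual name for it). `flatten_authors` is a hypothetical helper, not part of the real commands.

```python
from itertools import chain


def flatten_authors(posts_by_subreddit: dict[str, list[dict]]) -> list[str]:
    """Return unique author names in first-seen order.

    Assumption: each post dict stores the username under "author".
    Posts with no author, or with the "[deleted]" placeholder, are skipped.
    """
    all_posts = chain.from_iterable(posts_by_subreddit.values())
    seen: dict[str, None] = {}  # dict keeps insertion order and drops repeats
    for post in all_posts:
        author = post.get("author")
        if author and author != "[deleted]":
            seen.setdefault(author, None)
    return list(seen)
```

The result is a plain list of usernames, with each user listed once no matter how many subreddits they posted in.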
We can rerun the fetch command multiple times to feed the cycle.
Once we have a list of users, we can use the download command to actually download the content.
From a small initial seed of 35 subreddits, after a fetch cycle I had 6220 subreddits to fetch. Not sure how many users I will end up with afterwards.
There is also the issue of file de-duplication. I use Czkawka to replace duplicates with symlinks.
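For anyone without Czkawka, the same idea can be approximated in Python. This is a swapped-in technique, not how Czkawka works internally: it treats files as duplicates only when their SHA-256 digests match, keeps the first copy in sorted path order, and replaces the others with symlinks. The function names are made up for this sketch, and it assumes a filesystem where creating symlinks is allowed.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def symlink_duplicates(root: Path) -> int:
    """Replace byte-identical files under root with symlinks to the first copy.

    Returns the number of files that were replaced.
    """
    first_seen: dict[str, Path] = {}
    replaced = 0
    # sorted() materialises the listing before any file is modified
    for path in sorted(root.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        original = first_seen.setdefault(file_digest(path), path)
        if original is not path:
            path.unlink()
            path.symlink_to(original.resolve())
            replaced += 1
    return replaced
```

This deletes files, so it is worth trying on a copy of the download directory first.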