Question Scraping Links only from SimpCity Threads

harrypetson1123

Tier 3 Sub
Mar 11, 2022
39
415
639
Hi,

Is there a script that can scrape SC threads for URLs only, collecting the links without downloading the content behind them?

I'm just looking for convenience: some threads have a lot of pages to go through, and it would be easier to scrape and compile the URLs, then open them as needed to see whether they have the content you're looking to download.

I tried the search bar but couldn't find anything like what I'm looking for.
 

itbftw

Lurker
Mar 14, 2022
7
1
9
58
If you're just trying to get every URL (href) on a page, go to that page, open the developer tools, paste this into the console, and hit Enter.

Array.from(document.links, (link) => link.href).forEach((url) => console.log(url));
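
Since the console snippet prints every link on the page, including the forum's own navigation, it may help to filter the list first. The following is only a sketch: `externalLinks` is a hypothetical helper name, and treating "same hostname as the forum" as the filter is an assumption about what counts as noise.

```javascript
// Hypothetical helper: keep unique http(s) links that point away from the forum itself.
function externalLinks(hrefs, forumHost) {
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href);
    } catch {
      continue; // skip anything that is not an absolute URL
    }
    if (!/^https?:$/.test(url.protocol)) continue; // skip mailto:, javascript:, etc.
    if (url.hostname === forumHost) continue; // skip in-forum navigation links
    seen.add(url.href); // the Set drops duplicates
  }
  return [...seen];
}

// In the browser console, usage could look like this:
// console.log(externalLinks(Array.from(document.links, (a) => a.href), location.hostname).join("\n"));
```

Printing the result with a single `console.log` and `join("\n")` makes it easier to copy the whole list at once.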
 

harrypetson1123

I guess this will work. I was hoping for some way of doing multiple pages, but I can make this work; beggars can't be choosers.

I'll leave this question open for another day to see if anyone else has a solution; otherwise I'll close it.

Thanks dude!
 

itbftw

I mean, you can do this in PowerShell pretty easily.
You need to get your cookies first:
Open dev tools, go to the Network tab, refresh the page, then right-click the top request > Copy > Copy all as PowerShell.
Open Notepad or whatever editor you like, paste, and pull out the cookie info (you'll see where it goes in the code below).
I'm not sure what output you want, but you can tweak your own PowerShell into something like the below. Put all the links in the $parse variable, one per line, and it just dumps to text. Obviously update your cookies. I'm sure some of the header values don't need to be there, but this is just a copy/paste and a quick foreach loop.

Code: [not captured; visible only to logged-in users]
 

harrypetson1123

I should have been clearer: by multiple pages, I meant the multiple pages within one thread.

But the PowerShell script you provided works fantastically, as I can specify the URLs for each page in a thread, for other threads and their pages, and so on.

Thank you for the script (y)
 

itbftw

Oh, same concept, just loop through them. It may need testing, but you essentially put the first page of a multi-page thread in the url variable and it should go through them all.

Code: [not captured; visible only to logged-in users]
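
For readers who want to see the paging idea spelled out, here is a rough JavaScript sketch in the style of the console snippet earlier in the thread. It is not the PowerShell script from this post. The helper names `lastPageNumber` and `pageUrls` are hypothetical, the example URLs are made up, and the `/page-N` suffix is an assumption about how thread pages are addressed.

```javascript
// Hypothetical helpers; the "/page-N" URL suffix is an assumed URL scheme.

// Find the highest page number among a page's links (1 if none match).
function lastPageNumber(hrefs) {
  let last = 1;
  for (const href of hrefs) {
    const match = /\/page-(\d+)$/.exec(href);
    if (match) last = Math.max(last, Number(match[1]));
  }
  return last;
}

// Build the URL of every page in a thread from its first-page URL.
function pageUrls(firstPageUrl, lastPage) {
  const base = firstPageUrl.replace(/\/+$/, ""); // drop trailing slashes
  const urls = [firstPageUrl];
  for (let n = 2; n <= lastPage; n++) {
    urls.push(`${base}/page-${n}`);
  }
  return urls;
}

// Example with made-up links taken from a thread's first page.
const navLinks = [
  "https://forum.example/threads/t.1/page-2",
  "https://forum.example/threads/t.1/page-3",
];
console.log(pageUrls("https://forum.example/threads/t.1/", lastPageNumber(navLinks)));
```

Taking the maximum page number directly, rather than sorting the links and picking the last one, sidesteps any question of sort order.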
 
Solution

harrypetson1123

Fantastic - Even Better.

It ran perfectly on the first go, but I did notice an issue with threads that have more than 10 pages. It seems Sort-Object doesn't order them correctly: it compares the links as text, so when it sorted the page numbers from the breadcrumb it put 10 at the top. So when picking the last entry, it would pick 9.

So after a bit of fiddling and inspecting for about an hour or so, I managed to fix it with the below code.

I had to change this line:
$breadcrumbs = $page.Links.href | ? {$_ -match "\/page-(\d+)$"} |Select-Object -Unique | Sort-Object | select -Last 1

To be:
$breadcrumbs = $page.Links.href | ? {$_ -match "\/page-(\d+)$"} | Select-Object -Unique | Sort-Object {[int]($_ -replace '^.*\/page-(\d+)$', '$1')} | select -Last 1
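
The ordering problem is easy to reproduce outside PowerShell. Here is a small JavaScript sketch with made-up link values: a plain text sort puts `page-10` before `page-2`, while sorting on the extracted number gives the expected order. The `pageNumber` helper is a hypothetical name.

```javascript
// Made-up example links; only the trailing "/page-N" part matters here.
const links = [
  "/threads/t.1/page-2",
  "/threads/t.1/page-10",
  "/threads/t.1/page-9",
];

// Hypothetical helper: pull the page number out of a link as a real number.
const pageNumber = (href) => Number(/\/page-(\d+)$/.exec(href)[1]);

// Default sort compares strings character by character, so "1" < "2" < "9".
const lexical = [...links].sort();

// Sorting on the numeric key compares 2, 9 and 10 as numbers.
const numeric = [...links].sort((a, b) => pageNumber(a) - pageNumber(b));

console.log(lexical[lexical.length - 1]); // "/threads/t.1/page-9"
console.log(numeric[numeric.length - 1]); // "/threads/t.1/page-10"
```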