V12POWER Posted July 18, 2020

I'm trying to clone a vBulletin board for local viewing. I'm going to stick with wget, since HTTrack doesn't even get past the login screen.

With wget I managed to get past the login screen by saving the cookies from the login page, but the problem is that the cookie is only valid once. During normal browsing, each time you click a link within the domain, part of the cookie is updated, so it's no longer valid and you need to log in again. What changes each time is the "intercom-session" hash; everything else stays the same. So wget only returns ONE working link, and the rest return the login page.

I don't mind making wget pass credentials every time it downloads a new link, but that's also a bit tricky. For example, suppose you're browsing, want to look at a thread and click on it, and your authentication fails. What you get is a login form. This form appears on /showthread.php, /forumdisplay.php and any other part of the site whenever the cookie has an incorrect session value. But you can't log in from that form at all; the site asks you to go to the actual login page, for example www.domain.com/member/login, and gives you a link to it.

So if we were to do it manually, we would need to:

1. Follow the link to the actual login page
2. Send credentials
3. Return to the thread / forum section / whatever
4. Download

Or find a way to get the correct intercom-session value and update/get a new cookie each time. Either way I'm kind of lost, since I'm not really skilled on this subject. Any help appreciated.
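In case it clarifies the question, the loop I imagine would be something like this (just a sketch; the form field names vb_login_username / vb_login_password are guesses based on typical vBulletin login forms, so they'd need checking against the real page):

    # Re-authenticate before every single download, since the cookie only
    # survives one request.
    while read -r url; do
        # Steps 1-2: post credentials to the actual login page, saving a fresh cookie
        wget -q --save-cookies jar --keep-session-cookies \
             --post-data='vb_login_username=USER&vb_login_password=PASS' \
             -O /dev/null 'https://www.domain.com/member/login'
        # Steps 3-4: fetch the thread/forum page with that cookie
        wget --load-cookies jar -P mirror "$url"
    done < urls.txt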
kicken Posted July 18, 2020

Try looking into the --load-cookies, --save-cookies, and --keep-session-cookies options.
V12POWER Posted July 18, 2020

Yes, that's the method I'm using right now, but it doesn't work. Keeping the cookies is only valid for the first link that wget downloads; the following links all return the login page.
kicken Posted July 18, 2020

Works just fine for me using those options. First, log in and get the session set up:

    wget --load-cookies jar --save-cookies jar --keep-session-cookies \
         --post-data='user=test&pass=example' https://example.com/login

Then run the command to mirror the site:

    wget --load-cookies jar --save-cookies jar --keep-session-cookies -m https://example.com/

I set up a small script that requires a login session and regenerates the session ID on every page load to test those commands with, and had no problems fetching all the content.

Keep in mind that if there are links on the page that trigger a logout, you need to avoid crawling those links or you'll lose your session. That might be the issue you're running into. Resolve it with the --reject or --reject-regex option.
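For example, something like this sketch should work for the mirror step (the do=logout pattern is an assumption; check what your board's logout URL actually contains):

    # Mirror the site but never follow URLs matching the logout action.
    wget --load-cookies jar --save-cookies jar --keep-session-cookies \
         --reject-regex 'do=logout' \
         -m https://example.com/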
V12POWER Posted July 22, 2020 Author Share Posted July 22, 2020 On 7/18/2020 at 5:51 PM, kicken said: Works just fine for me using those options. First, login and get the session setup. wget --load-cookies jar --save-cookies jar --keep-session-cookies --post-data='user=test&pass=example' https://example.com/login Then run the command to mirror the site. wget --load-cookies jar --save-cookies jar --keep-session-cookies -m https://example.com/ I setup a small script that requires a login session and re-generates the session ID on every page load to test those commands with. Had no problems fetching all the content. Keep in mind that if there are links on the page that trigger a logout, you need to avoid crawling those links or you'll loose your session. That might be the issue your running into. Resolve that using the --reject or --reject-regex option. After a bit of tinkering I got it right, definitely the logout link was screwing me over, it was one of the first links wget went through. I can live with it now lol, just to clarify, if we use accept regex now to only download forumdisplay and showthread files, can we avoid downloading all the useless stuff like members, ids and pretty much everything aside from the threads themselves? I was thinking to dump all the links and then remove all the unnecessary ones, but don't know if the mirrored site will retain the hierarchy this way Quote Link to comment Share on other sites More sharing options...