wget clone website - need to get session for each time a new link is visited/downloaded


I'm trying to clone a vBulletin board for local viewing. I'm going to stick with wget, since HTTrack doesn't even get past the login screen (we can say it's just whack).

With wget I managed to get past the login screen by saving the cookies from the login page, but the problem is that the cookie is valid only once. During normal browsing, each time you click a link within the domain, part of the cookie is updated, so the old one is no longer valid and you need to log in again.

What is updated each time is the "intercom-session" hash; everything else remains the same. So wget only returns ONE working link, and the rest return the login page.

I don't mind making wget pass the credentials each time it downloads a new link, but that's also a bit tricky.

For example, say you're browsing and click on a thread you want to look at. Now suppose your authentication (or whatever it is) fails. What you get is a login form like this:

[Screenshot: vBulletin-Login-Timeout.jpg]

This login form appears on /showthread.php?, /forumdisplay.php? and any other part of the site if the cookie has an incorrect session value. But if we try to log in from here, it does not work at all; you can't log in using that form. The website asks us to go to the actual login page, which is for example www.domain.com/member/login, and gives us a link to it.

So if we were to do it manually, we would need to do:

1-Click on the redirect to the actual login page
2-Send the credentials
3-Return to the thread, forum section, or whatever
4-Download

Or find a way to get the correct intercom-session value and update/get a new cookie each time. Either way, I'm kind of lost now since I'm not really skilled in this subject... any help appreciated.
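
The manual steps above could be scripted per URL roughly like this. It is only a sketch: the form field names in LOGIN_DATA, the login URL, and the function name fetch_with_fresh_login are assumptions that have to be adapted to the real board.

```shell
# Hypothetical values: the field names and URLs depend on the board.
LOGIN_URL='https://www.domain.com/member/login'
LOGIN_DATA='user=test&pass=example'

fetch_with_fresh_login() {
  # Steps 1-2: post the credentials to the real login page and store the new cookies.
  wget -q --save-cookies jar --keep-session-cookies \
       --post-data="$LOGIN_DATA" -O /dev/null "$LOGIN_URL" || return 1
  # Steps 3-4: request the page with the fresh cookies, keeping its host/path layout.
  wget -q --load-cookies jar --save-cookies jar --keep-session-cookies -x "$1"
}

# Usage: fetch_with_fresh_login 'https://www.domain.com/showthread.php?t=1'
```

Logging in before every single page is slow and hammers the server, so this is more of a fallback than a first choice.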

Works just fine for me using those options. First, log in and get the session set up.

wget --load-cookies jar --save-cookies jar --keep-session-cookies --post-data='user=test&pass=example' https://example.com/login

Then run the command to mirror the site.

wget --load-cookies jar --save-cookies jar --keep-session-cookies -m https://example.com/

To test those commands, I set up a small script that requires a login session and regenerates the session ID on every page load. I had no problems fetching all the content.

Keep in mind that if there are links on the page that trigger a logout, you need to avoid crawling those links or you'll lose your session. That might be the issue you're running into. Resolve that using the --reject or --reject-regex option.
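
A minimal sketch of that exclusion, assuming the logout link contains the word "logout" somewhere in its URL (that is an assumption; check the real link on the board first). REJECT_RE and mirror_site are names made up for this example.

```shell
# Assumed pattern: any URL containing "logout". Adjust to the board's real logout link.
REJECT_RE='logout'

mirror_site() {
  # Same mirror command as above, but URLs matching REJECT_RE are never requested.
  wget --load-cookies jar --save-cookies jar --keep-session-cookies \
       --reject-regex "$REJECT_RE" -m "$1"
}

# Usage: mirror_site https://example.com/
```

The pattern is matched against the complete URL, so a short keyword is usually enough, but it will also skip any other page that happens to have "logout" in its address.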


In reply to kicken's post of 7/18/2020 at 5:51 PM:

After a bit of tinkering I got it right. The logout link was definitely what was screwing me over; it was one of the first links wget went through. I can live with it now, lol. Just to clarify: if we now use --accept-regex to download only the forumdisplay and showthread pages, can we avoid downloading all the useless stuff like members, IDs and pretty much everything aside from the threads themselves?
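
A sketch of that idea, assuming forum and thread URLs look like /forumdisplay.php?... and /showthread.php?... (the pattern, the "logout" keyword and the mirror_threads name are made up for illustration). As far as I understand it, the start URL is fetched regardless of the pattern, while images and stylesheets that do not match are skipped too, so the pattern may need widening.

```shell
# Assumed URL shapes; adjust to what the board actually uses.
ACCEPT_RE='/(forumdisplay|showthread)\.php\?'

mirror_threads() {
  # Follow only links whose full URL matches ACCEPT_RE, and never the logout link.
  wget --load-cookies jar --save-cookies jar --keep-session-cookies \
       --accept-regex "$ACCEPT_RE" --reject-regex 'logout' -m "$1"
}

# Usage: mirror_threads https://example.com/
```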

I was thinking of dumping all the links and then removing the unnecessary ones, but I don't know whether the mirrored site will retain its hierarchy that way.
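
One way to try the link-list approach, as a sketch: filter the dumped URLs and hand the result to wget with -i, adding -x (--force-directories) so each file is saved under its host and path. The file names all-links.txt and wanted.txt and both function names are placeholders, and links inside the saved pages are not rewritten by this sketch.

```shell
filter_links() {
  # Keep only forum index and thread URLs (assumed URL shapes), dropping duplicates.
  grep -E '/(forumdisplay|showthread)\.php\?' | LC_ALL=C sort -u
}

fetch_list() {
  # -i reads the URLs from a file; -x recreates the host/path directories locally.
  wget --load-cookies jar --save-cookies jar --keep-session-cookies -x -i "$1"
}

# Usage: filter_links < all-links.txt > wanted.txt && fetch_list wanted.txt
```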

