V12POWER Posted July 18, 2020

I'm trying to clone a vBulletin board for local viewing. I'm going to stick with wget, since HTTrack doesn't even get past the login screen.

With wget I managed to get past the login screen by saving the cookies from the login page, but the problem is that the cookie is only valid once. During normal browsing, each time you click a link within the domain, part of the cookie is updated, so it's no longer valid and you need to log in again. What changes each time is the "intercom-session" hash; everything else stays the same. So wget only returns ONE working link, and the rest return the login page.

I don't mind making wget pass credentials every time it downloads a new link, but that's also a bit tricky. For example, suppose you're browsing, want to look at a thread and click on it, and your authentication fails. What you get is a login form. This form appears on /showthread.php, /forumdisplay.php and any other part of the site whenever the cookie has an incorrect session value. But you can't log in from that form at all; the site asks you to go to the actual login page, for example www.domain.com/member/login, and gives you a link to it.

So if we were to do it manually, we would need to:

1. Follow the link to the actual login page
2. Send credentials
3. Return to the thread / forum section / whatever
4. Download

Or find a way to get the correct intercom-session value and update/get a new cookie each time. Either way I'm kind of lost, since I'm not really skilled on this subject. Any help appreciated.
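In case it clarifies the question, the loop I imagine would be something like this (just a sketch; the form field names vb_login_username / vb_login_password are guesses based on typical vBulletin login forms, so they'd need checking against the real page):

    # Re-authenticate before every single download, since the cookie only
    # survives one request.
    while read -r url; do
        # Steps 1-2: post credentials to the actual login page, saving a fresh cookie
        wget -q --save-cookies jar --keep-session-cookies \
             --post-data='vb_login_username=USER&vb_login_password=PASS' \
             -O /dev/null 'https://www.domain.com/member/login'
        # Steps 3-4: fetch the thread/forum page with that cookie
        wget --load-cookies jar -P mirror "$url"
    done < urls.txt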
kicken Posted July 18, 2020

Try looking into the --load-cookies, --save-cookies, and --keep-session-cookies options.
V12POWER Posted July 18, 2020

Yes, that's the method I'm using right now, but it doesn't work. Keeping the cookies is only valid for the first link that wget downloads; the following links all return the login page.
kicken Posted July 18, 2020

Works just fine for me using those options. First, log in and get the session set up:

    wget --load-cookies jar --save-cookies jar --keep-session-cookies \
         --post-data='user=test&pass=example' https://example.com/login

Then run the command to mirror the site:

    wget --load-cookies jar --save-cookies jar --keep-session-cookies -m https://example.com/

I set up a small script that requires a login session and regenerates the session ID on every page load to test those commands with, and had no problems fetching all the content.

Keep in mind that if there are links on the page that trigger a logout, you need to avoid crawling those links or you'll lose your session. That might be the issue you're running into. Resolve it with the --reject or --reject-regex option.
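For example, something like this sketch should work for the mirror step (the do=logout pattern is an assumption; check what your board's logout URL actually contains):

    # Mirror the site but never follow URLs matching the logout action.
    wget --load-cookies jar --save-cookies jar --keep-session-cookies \
         --reject-regex 'do=logout' \
         -m https://example.com/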
V12POWER Posted July 22, 2020 Author Share Posted July 22, 2020 On 7/18/2020 at 5:51 PM, kicken said: Works just fine for me using those options. First, login and get the session setup. wget --load-cookies jar --save-cookies jar --keep-session-cookies --post-data='user=test&pass=example' https://example.com/login Then run the command to mirror the site. wget --load-cookies jar --save-cookies jar --keep-session-cookies -m https://example.com/ I setup a small script that requires a login session and re-generates the session ID on every page load to test those commands with. Had no problems fetching all the content. Keep in mind that if there are links on the page that trigger a logout, you need to avoid crawling those links or you'll loose your session. That might be the issue your running into. Resolve that using the --reject or --reject-regex option. After a bit of tinkering I got it right, definitely the logout link was screwing me over, it was one of the first links wget went through. I can live with it now lol, just to clarify, if we use accept regex now to only download forumdisplay and showthread files, can we avoid downloading all the useless stuff like members, ids and pretty much everything aside from the threads themselves? I was thinking to dump all the links and then remove all the unnecessary ones, but don't know if the mirrored site will retain the hierarchy this way Quote Link to comment Share on other sites More sharing options...