dil_bert Posted February 9, 2018 (edited)

good day dear PHP-Freaks,

I have a little scraper/parser that works very well on one page. So far so good. Now I have to extend it: the scraper has to visit around 600 pages.

Some musings: at first glance, scraping from page to page can be solved via different approaches. We have pagination at the bottom of the page, see for example:

http://mypage.com=&&page=5
http://mypage.com=&&page=6
http://mypage.com=&&page=7

and so forth. We could use these URLs as a base: if we have an array from which we load the URLs that need to be visited, we would come across all the pages. So if we have approx. 305 pages to visit, we could simply increment the page number shown above and count up to 305.

However, hardcoding the total number of pages isn't practical, as it could vary. Instead we could:

- extract the number of results from the first page, or
- extract the URL from the "last" link at the bottom of the page, create a URI object and read the page number from the query string.

Note that the query page number begins at 0, not 1, so the last page number is one less than the total page count.

Looping over the pages could be as simple as:

my $url_pattern = 'https://mypage.com/..... &page=%s';

for my $page ( 0 .. $last ) {
    my $url = sprintf $url_pattern, $page;
    ...
}

Another possible solution: one could incorporate paging into the $conf, perhaps as an iterator which upon each call fetches the next node; behind the scenes it automatically increments the page when there are no nodes left, until there are no pages left.
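The "read the page number from the query string" idea above could be sketched like this with the CPAN URI module. The href and its query parameters here are hypothetical stand-ins; in real use the href would be scraped from the "last" link in the page's HTML:

```perl
use strict;
use warnings;
use URI;    # CPAN module, not in core Perl

# Hypothetical href taken from the "last" pagination link on the first page.
my $last_href = 'https://mypage.com/results?foo=bar&page=304';

# Build a URI object and read the page number from the query string.
my $uri  = URI->new($last_href);
my %qs   = $uri->query_form;
my $last = $qs{page};    # 304 -- page numbers begin at 0

# Assumed URL pattern matching the pagination links above.
my $url_pattern = 'https://mypage.com/results?foo=bar&page=%s';
for my $page ( 0 .. $last ) {
    my $url = sprintf $url_pattern, $page;
    # ... fetch and parse $url here ...
}
```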
Another solution is taken from https://perlmaven.com/simple-way-to-fetch-many-web-pages, which gives an example of downloading many pages using HTTP::Tiny:

my $ht = HTTP::Tiny->new;

foreach my $url (@urls) {
    say "Start $url";
    my $response = $ht->get($url);
    if ($response->{success}) {
        say 'Length: ', length $response->{content};
    } else {
        say "Failed: $response->{status} $response->{reason}";
    }
}

Regarding the code: it is quite straightforward. We have a list of URLs in the @urls array. An HTTP::Tiny object is created and assigned to the $ht variable. Then in a for-loop we go over each URL and fetch it. In order to save space in the article, only the size of each page is printed.

Well, how would you solve this? Love to hear from you.

Edited February 9, 2018 by dil_bert
Link to comment: https://forums.phpfreaks.com/topic/306480-how-to-loop-over-600-pages/
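Combining the two ideas, one could generate the @urls array from the page counter and then feed it into the HTTP::Tiny loop. A minimal sketch, assuming a hypothetical URL pattern and a last-page number of 304 discovered from the first page; the network part is guarded behind an environment variable so the URL generation can be checked without actually fetching anything:

```perl
use strict;
use warnings;
use HTTP::Tiny;    # core since Perl 5.14

# Assumed base pattern -- substitute the real search URL here.
my $url_pattern = 'https://mypage.com/results?foo=bar&page=%s';
my $last        = 304;    # discovered from the first page; pages start at 0

# Build the full list of page URLs up front.
my @urls = map { sprintf $url_pattern, $_ } 0 .. $last;

# The fetch loop is the same shape as the perlmaven example; guarded here
# so this script can be run without hitting the network.
if ( $ENV{RUN_FETCH} ) {
    my $ht = HTTP::Tiny->new( timeout => 20 );
    for my $url (@urls) {
        my $response = $ht->get($url);
        warn "failed $url: $response->{status} $response->{reason}\n"
            unless $response->{success};
        sleep 1;    # be polite to the server between requests
    }
}
```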
Psycho Posted February 9, 2018

Not knowing the context of what you are scraping, I can't say which solution I would follow. I have no experience with HTTP::Tiny, so I can't comment on that. The second option, interrogating the first page to get the total number and then creating an iterator, sounds logical. Another option in that same style would be to see what is returned when a page is requested that does not exist. E.g., if there are 305 pages, request http://mypage.com=&&page=306 and see what is returned. Then create a loop that continually increments the page value until you get a response back matching a page that does not exist. Or, just continually get the next page as long as the current page contains the content that you expect it to have.
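The last suggestion, fetching pages until one no longer contains the expected content, might look like this. The marker string "result-item" and the fetch callback are assumptions for illustration; in real use the callback would call HTTP::Tiny->get and the marker would be whatever distinguishes a results page from an empty one:

```perl
use strict;
use warnings;

# Keep requesting the next page until the response no longer contains
# the content we expect. $fetch_page is a callback: page number in, HTML out.
sub scrape_until_empty {
    my ($fetch_page) = @_;
    my @collected;
    my $page = 0;
    while (1) {
        my $html = $fetch_page->($page);
        # "result-item" is an assumed marker for a non-empty results page.
        last unless defined $html && $html =~ /result-item/;
        push @collected, $page;
        $page++;
    }
    return @collected;
}

# Fake fetcher simulating a site with 3 pages of results, for demonstration.
my $fake = sub {
    my ($page) = @_;
    return $page < 3
        ? qq{<div class="result-item">page $page</div>}
        : '<p>No results</p>';
};

my @pages = scrape_until_empty($fake);    # collects pages 0, 1, 2
```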
dil_bert Posted February 11, 2018 (Author)

hello dear Psycho, many thanks for your posting with the ideas and tips. Note: the page I am getting the data from is the following: http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=5 and http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=6 and http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7

Well, one solution could be the hardcoding approach: I put all the URLs in an array.
Psycho Posted February 12, 2018 (edited)

OK, looking at the site, you can get the number of pages by going to the first page and inspecting the href of the "last »" link: europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=288

Get that value, then create a loop to iterate over the remaining pages.

Edited February 12, 2018 by Psycho
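For this particular site, pulling the page number out of that href can be done with a plain regex on the query string, no extra modules needed. A minimal sketch using the "last »" href quoted above:

```perl
use strict;
use warnings;

# The href scraped from the "last »" link on the first results page.
my $last_href = 'europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=288';

# Pull the page number out of the query string.
my ($last_page) = $last_href =~ /[?&]page=(\d+)/;

# Iterate over all pages, 0 through the last one.
for my $page ( 0 .. $last_page ) {
    # build and fetch each page URL here
}
```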
dil_bert Posted February 12, 2018 (Author)

hello dear Psycho, many thanks for the quick reply. I will do as advised. Greetings, dil_bert