
good day dear PHP-Freaks

I have a little scraper / parser that works very well on one page. So far so good.

Well, now I have to extend this: for this little scraper / parser I have to visit 600 pages.

Here are some musings. At first glance, the issue of scraping from page to page can be solved via different approaches. We have the pagination at the bottom of the page; see for example:

http://mypage.com=&&page=5

and

http://mypage.com=&&page=6

and

http://mypage.com=&&page=7

and so forth.


Well, we can use these URLs as a base: if we have an array from which we load the URLs that need to be visited, we would get to all the pages. So we have approximately 305 pages to visit; we can increment the page number shown above and count up to 305.


Some musings: hardcoding the total number of pages isn't practical as it could vary. We could:

- extract the number of results from the first page, divide by the results per page, and round down, or
- extract the URL from the "last" link at the bottom of the page, create a URI object and read the page number from the query string (sketched below).

Note that I say round down above because the query page number begins at 0, not 1.
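A minimal sketch of that second option, assuming the href of the "last" link has already been scraped out of the page (the URL below is just a placeholder): URI together with URI::QueryParam can read the page number straight from the query string.

use strict;
use warnings;
use URI;
use URI::QueryParam;

# hypothetical: $last_href holds the href scraped from the "last" pager link
my $last_href = 'http://mypage.com/results?foo=bar&page=288';

my $uri  = URI->new($last_href);
my $last = $uri->query_param('page');   # 0-based index of the final page

print "pages run from 0 to $last\n";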

Looping over the pages could then be as simple as:

my $url_pattern = 'https://mypage.com/..... &page=%s';
 
for my $page ( 0 .. $last )
{
    my $url = sprintf $url_pattern, $page;
     
    ...
}

Musings about solutions:

Well, another solution: one would probably try to incorporate paging into the $conf, perhaps as an iterator which on each call fetches the next node; behind the scenes it automatically increments the page when there are no nodes left on the current one, until there are no pages left. A rough sketch follows below.
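That iterator could look roughly like this. It is only an illustration: fetch_page() and parse_nodes() are hypothetical helpers standing in for the existing scraper code, and $url_pattern / $last come from the musings above.

use strict;
use warnings;

# returns a closure that hands back one scraped node per call and silently
# moves on to the next page once the current one is exhausted
sub make_node_iterator {
    my ($url_pattern, $last) = @_;
    my $page = 0;
    my @nodes;

    return sub {
        # refill the buffer from the next page whenever it runs dry
        while (!@nodes && $page <= $last) {
            my $html = fetch_page(sprintf $url_pattern, $page);   # hypothetical helper
            @nodes   = parse_nodes($html);                        # hypothetical helper
            $page++;
        }
        return shift @nodes;   # undef once every page has been consumed
    };
}

# usage:
# my $next_node = make_node_iterator($url_pattern, $last);
# while (defined(my $node = $next_node->())) { ... }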
 

And this solution is gathered from https://perlmaven.com/simple-way-to-fetch-many-web-pages, which finally gives an example of downloading many pages using HTTP::Tiny.

See the example:

use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

my $ht = HTTP::Tiny->new;

foreach my $url (@urls) {
    say "Start $url";
    my $response = $ht->get($url);
    if ($response->{success}) {
        say 'Length: ', length $response->{content};
    } else {
        say "Failed: $response->{status} $response->{reason}";
    }
}

Regarding the code: it is quite straightforward. We have a list of URLs in the @urls array. An HTTP::Tiny object is created and assigned to the $ht variable. Then, in a for-loop, we go over each URL and fetch it. To save space in this article, only the size of each page is printed.
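For completeness, the @urls array used above could be built from the sprintf pattern discussed earlier, assuming $url_pattern and $last are already known:

my @urls = map { sprintf $url_pattern, $_ } 0 .. $last;   # one URL per page, 0-based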

 

Well - how would you solve this thing?!

 

I'd love to hear from you.




 

Edited by dil_bert

Not knowing the context of what you are scraping, I can't say which solution I would follow. I have no experience with HTTP::Tiny, so I can't comment on that. The second option of interrogating the first page to get the total number and then creating an iterator sounds logical. Another option in the same style would be to see what is returned when a page is requested that does not exist. E.g., if there are 305 pages, request http://mypage.com=&&page=306 and see what is returned. Then create a loop that continually increments the page value until you get a response matching a page that does not exist. Or, just keep getting the next page as long as the current page contains the content that you expect it to have.
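A rough sketch of that last idea in the same HTTP::Tiny style as the earlier example; the URL pattern and the marker string are placeholders, not taken from the real site:

use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

my $url_pattern = 'http://mypage.com/?foo=bar&page=%d';            # placeholder
my $marker      = 'TEXT THAT ONLY APPEARS ON A REAL RESULT PAGE';  # placeholder

my $ht   = HTTP::Tiny->new;
my $page = 0;

while (1) {
    my $response = $ht->get(sprintf $url_pattern, $page);
    last unless $response->{success};                      # request failed
    last if index($response->{content}, $marker) == -1;    # expected content missing: past the last page
    say "page $page: ", length($response->{content}), ' bytes';
    $page++;
}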


Hello dear Psycho,

many thanks for your post with the ideas and tips.

 

Note: the page I am getting the data from is the following:

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=5


and

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=6


and

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7

Well, one solution could be the hardcoding approach: I put all the URLs in an array.

OK, looking at the site, you can get the number of pages by going to the first page and inspecting the href of the "last »" link:

 

 

europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=288

 

Get that value then create a loop to iterate over the remaining pages.
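A sketch of that approach, again with HTTP::Tiny; the base URL is shortened here, and the regex simply takes the largest page=N value found in the first page's HTML, which assumes the "last »" link is present in the pager markup:

use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

# shortened placeholder for the long evs-organisation_en URL shown above
my $base = 'http://europa.eu/youth/volunteering/evs-organisation_en?...&&page=%d';

my $ht    = HTTP::Tiny->new;
my $first = $ht->get(sprintf $base, 0);
die "could not fetch first page: $first->{status}\n" unless $first->{success};

# the highest page=N in the pager markup, i.e. the "last »" link's page number
my ($last) = sort { $b <=> $a } $first->{content} =~ /page=(\d+)/g;
$last //= 0;
say "last page index: $last";

for my $page (0 .. $last) {
    my $res = $ht->get(sprintf $base, $page);
    next unless $res->{success};
    # ... hand $res->{content} to the existing parser here ...
}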

Edited by Psycho