
how to loop over 600 pages


dil_bert


good day dear PHP-Freaks,

I have a little scraper/parser that works very well on a single page. So far so good.

 

Now I have to extend this: for this little scraper/parser I have to visit 600 pages.

Here are some musings: at first glance, the issue of scraping from page to page can be solved via different approaches. We have the pagination at the bottom of the page, see for example:

http://mypage.com=&&page=5

http://mypage.com=&&page=6

http://mypage.com=&&page=7

and so forth.


Well, we can use these URLs as a base. If we have an array from which we load the URLs that need to be visited, we would cover all the pages. So we have approx. 305 pages to visit; we could increment the page number (as shown above) and count up to 305.

Some musings: hardcoding the total number of pages isn't practical, as it could vary. We could:

- extract the number of results from the first page and divide by the results per page (rounding down), or
- extract the URL from the "last" link at the bottom of the page, create a URI object, and read the page number from the query string.

Note that I say round down above, because the query page number begins at 0, not 1.
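A minimal sketch of that second idea, assuming the href of the "last" link has already been pulled out of the first page (the URL below is only a placeholder):

use strict;
use warnings;
use feature 'say';
use URI;

# placeholder href of the "last" pagination link, extracted from page one
my $last_href = 'http://mypage.com/?foo=bar&page=304';

my $uri    = URI->new($last_href);
my %params = $uri->query_form;   # flatten the query string into key/value pairs
my $last   = $params{page};      # pages start at 0, so this is the last page index

say "will loop from page 0 to page $last";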

Looping over the pages could then be as simple as:

my $url_pattern = 'https://mypage.com/.....&page=%s';

# loop from the first page (0) up to the last page index
for my $page ( 0 .. $last )
{
    my $url = sprintf $url_pattern, $page;
    ...
}

Musings about solutions:

Another solution: one could try to incorporate paging into the $conf, perhaps as an iterator which upon each call fetches the next node; behind the scenes it automatically increments the page when there are no nodes left on the current page, until there are no pages left.
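Just to make that idea a bit more concrete, a rough sketch of such an iterator as a closure; extract_nodes and the URL pattern are placeholders that would have to match the real parser and site:

use strict;
use warnings;
use HTTP::Tiny;

# returns a sub that hands back one node per call; behind the scenes it
# fetches the next page whenever the current page has no nodes left
sub make_node_iterator {
    my ($url_pattern, $last_page) = @_;
    my $ht   = HTTP::Tiny->new;
    my $page = 0;
    my @nodes;

    return sub {
        while (!@nodes) {
            return undef if $page > $last_page;        # no pages left
            my $res = $ht->get(sprintf $url_pattern, $page++);
            return undef unless $res->{success};
            @nodes = extract_nodes($res->{content});   # stub: parse the HTML here
        }
        return shift @nodes;
    };
}

sub extract_nodes { return () }   # placeholder for the real parsing code

my $next_node = make_node_iterator('http://mypage.com/?page=%s', 304);
while (defined(my $node = $next_node->())) {
    # process $node here ...
}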
 

And this solution is taken from https://perlmaven.com/simple-way-to-fetch-many-web-pages, which gives an example of downloading many pages using HTTP::Tiny.


see the example:

use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

my @urls = ();   # fill in the list of pages to fetch
my $ht   = HTTP::Tiny->new;

foreach my $url (@urls) {
    say "Start $url";
    my $response = $ht->get($url);
    if ($response->{success}) {
        say 'Length: ', length $response->{content};
    } else {
        # HTTP::Tiny reports failures in the 'status' and 'reason' fields
        say "Failed: $response->{status} $response->{reason}";
    }
}

Regarding the code: it is quite straightforward. We have a list of URLs in the @urls array. An HTTP::Tiny object is created and assigned to the $ht variable. Then, in a for-loop, we go over each URL and fetch it. In order to save space here, only the length of each page is printed.

 

Well - how would you solve this thing?!

 

Love to hear from you




 


Not knowing the context of what you are scraping, I can't say which solution I would follow. I have no experience with HTTP::Tiny, so I can't comment on that. The second option, interrogating the first page to get the total number and then creating an iterator, sounds logical. Another option in that same style would be to see what is returned when a page is requested that does not exist. E.g. if there are 305 pages, request http://mypage.com=&&page=306 and see what is returned. Then create a loop that continually increments the page value until you get a response matching a page that does not exist. Or, just continually get the next page as long as the current page contains the content that you expect it to have.
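A rough sketch of that last idea, keep requesting the next page until the response no longer looks like a results page; the URL pattern is a placeholder and has_expected_content is a hypothetical check that would have to match the real markup:

use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

my $ht   = HTTP::Tiny->new;
my $page = 0;

while (1) {
    my $url = sprintf 'http://mypage.com/?page=%s', $page;   # placeholder URL pattern
    my $res = $ht->get($url);
    last unless $res->{success};                        # network/HTTP error: stop
    last unless has_expected_content($res->{content});  # page past the end: stop
    say "scraping page $page";
    # ... parse the page here ...
    $page++;
}

# hypothetical check: does the page contain the listing markup we expect?
sub has_expected_content {
    my ($html) = @_;
    return $html =~ /marker-you-expect/;   # adjust to the real page structure
}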


hello dear Psycho,

many thanks for your posting with the ideas and tips.

Note: the page I am getting the data from is the following:

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=5


and
 

 
 
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=6


and
 

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7

Well, one solution could be the hardcoding approach: I put all the URLs in an array, as in the sketch below.
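The array does not have to be typed out by hand; the page number is the only part that changes, so something like this should do (a sketch, where $last_page is a placeholder that should really come from the pagination):

use strict;
use warnings;

# base URL from above, with the page number replaced by a sprintf placeholder
my $base = 'http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=%s';

my $last_page = 305;   # placeholder: take the real value from the "last" pagination link
my @urls = map { sprintf($base, $_) } 0 .. $last_page;

# @urls can now be fed into the HTTP::Tiny loop shown earlier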


OK, looking at the site, you can get the number of pages by going to the first page and inspecting the href of the "last »" link:

 

 

europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=288

 

Get that value, then create a loop to iterate over the remaining pages.
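For what it's worth, a sketch of that in code; the regex is an assumption about how the page numbers appear in the pagination hrefs and may need adjusting:

use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

my $first_url = 'http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=0';

my $ht  = HTTP::Tiny->new;
my $res = $ht->get($first_url);
die "could not fetch first page: $res->{status}\n" unless $res->{success};

# assumption: the highest page=N value found in a pagination href is the last page
my ($last) = sort { $b <=> $a } $res->{content} =~ /[?&;]page=(\d+)/g;
die "could not find any page links\n" unless defined $last;

say "last page is $last";
# now loop over pages 1 .. $last in the same way as page 0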
