
good day dear PHP-Freaks

I have a little scraper / parser that works very well on one page. So far so good.

Well, now I have to extend this: for this little scraper / parser I have to visit 600 pages.

Here are some musings. At first glance, the issue of scraping from page to page can be solved via different approaches. We have the pagination at the bottom of the page; see for example:

http://mypage.com=&&page=5

and

http://mypage.com=&&page=6

and

http://mypage.com=&&page=7

and so forth.


Well, we can use these URLs as a base: if we have an array from which we load the URLs that need to be visited, we would get to all the pages. So we have approximately 305 pages to visit; we can increment the page number shown above and count up to 305.


Some musings: hardcoding the total number of pages isn't practical as it could vary. We could:

- extract the number of results from the first page, divide by the results per page, and round down, or
- extract the URL from the "last" link at the bottom of the page, create a URI object and read the page number from the query string (sketched below).

Note that I say round down above because the query page number begins at 0, not 1.
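A minimal sketch of that second option, assuming the href of the "last" link has already been scraped out of the page (the URL below is just a placeholder): URI together with URI::QueryParam can read the page number straight from the query string.

use strict;
use warnings;
use URI;
use URI::QueryParam;

# hypothetical: $last_href holds the href scraped from the "last" pager link
my $last_href = 'http://mypage.com/results?foo=bar&page=288';

my $uri  = URI->new($last_href);
my $last = $uri->query_param('page');   # 0-based index of the final page

print "pages run from 0 to $last\n";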

Looping over the pages could then be as simple as:

my $url_pattern = 'https://mypage.com/..... &page=%s';
 
for my $page ( 0 .. $last )
{
    my $url = sprintf $url_pattern, $page;
     
    ...
}

Musings about solutions:

Well, another solution: one would probably try to incorporate paging into the $conf, perhaps as an iterator which on each call fetches the next node; behind the scenes it automatically increments the page when there are no nodes left on the current one, until there are no pages left. A rough sketch follows below.
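That iterator could look roughly like this. It is only an illustration: fetch_page() and parse_nodes() are hypothetical helpers standing in for the existing scraper code, and $url_pattern / $last come from the musings above.

use strict;
use warnings;

# returns a closure that hands back one scraped node per call and silently
# moves on to the next page once the current one is exhausted
sub make_node_iterator {
    my ($url_pattern, $last) = @_;
    my $page = 0;
    my @nodes;

    return sub {
        # refill the buffer from the next page whenever it runs dry
        while (!@nodes && $page <= $last) {
            my $html = fetch_page(sprintf $url_pattern, $page);   # hypothetical helper
            @nodes   = parse_nodes($html);                        # hypothetical helper
            $page++;
        }
        return shift @nodes;   # undef once every page has been consumed
    };
}

# usage:
# my $next_node = make_node_iterator($url_pattern, $last);
# while (defined(my $node = $next_node->())) { ... }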
 

And this solution is gathered from https://perlmaven.com/simple-way-to-fetch-many-web-pages, which finally gives an example of downloading many pages using HTTP::Tiny.

See the example:

use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

my $ht = HTTP::Tiny->new;

foreach my $url (@urls) {
    say "Start $url";
    my $response = $ht->get($url);
    if ($response->{success}) {
        say 'Length: ', length $response->{content};
    } else {
        say "Failed: $response->{status} $response->{reason}";
    }
}

Regarding the code: it is quite straightforward. We have a list of URLs in the @urls array. An HTTP::Tiny object is created and assigned to the $ht variable. Then, in a for-loop, we go over each URL and fetch it. To save space in this article, only the size of each page is printed.
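For completeness, the @urls array used above could be built from the sprintf pattern discussed earlier, assuming $url_pattern and $last are already known:

my @urls = map { sprintf $url_pattern, $_ } 0 .. $last;   # one URL per page, 0-based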

 

Well - how would you solve this thing?!

 

I'd love to hear from you.




 

Edited by dil_bert

Not knowing the context of what you are scraping, I can't say which solution I would follow. I have no experience with HTTP::Tiny, so I can't comment on that. The second option of interrogating the first page to get the total number and then creating an iterator sounds logical. Another option in the same style would be to see what is returned when a page is requested that does not exist. E.g., if there are 305 pages, request http://mypage.com=&&page=306 and see what is returned. Then create a loop that continually increments the page value until you get a response matching a page that does not exist. Or, just keep getting the next page as long as the current page contains the content that you expect it to have.
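A rough sketch of that last idea in the same HTTP::Tiny style as the earlier example; the URL pattern and the marker string are placeholders, not taken from the real site:

use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

my $url_pattern = 'http://mypage.com/?foo=bar&page=%d';            # placeholder
my $marker      = 'TEXT THAT ONLY APPEARS ON A REAL RESULT PAGE';  # placeholder

my $ht   = HTTP::Tiny->new;
my $page = 0;

while (1) {
    my $response = $ht->get(sprintf $url_pattern, $page);
    last unless $response->{success};                      # request failed
    last if index($response->{content}, $marker) == -1;    # expected content missing: past the last page
    say "page $page: ", length($response->{content}), ' bytes';
    $page++;
}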


Hello dear Psycho,

many thanks for your post with the ideas and tips.

 

Note: the page I am getting the data from is the following:

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=5


and

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=6


and

http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7

Well, one solution could be the hardcoding approach: I put all the URLs in an array.

OK, looking at the site, you can get the number of pages by going to the first page and inspecting the href of the "last »" link:

 

 

europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=288

 

Get that value then create a loop to iterate over the remaining pages.
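A sketch of that approach, again with HTTP::Tiny; the base URL is shortened here, and the regex simply takes the largest page=N value found in the first page's HTML, which assumes the "last »" link is present in the pager markup:

use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

# shortened placeholder for the long evs-organisation_en URL shown above
my $base = 'http://europa.eu/youth/volunteering/evs-organisation_en?...&&page=%d';

my $ht    = HTTP::Tiny->new;
my $first = $ht->get(sprintf $base, 0);
die "could not fetch first page: $first->{status}\n" unless $first->{success};

# the highest page=N in the pager markup, i.e. the "last »" link's page number
my ($last) = sort { $b <=> $a } $first->{content} =~ /page=(\d+)/g;
$last //= 0;
say "last page index: $last";

for my $page (0 .. $last) {
    my $res = $ht->get(sprintf $base, $page);
    next unless $res->{success};
    # ... hand $res->{content} to the existing parser here ...
}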

Edited by Psycho