
porting over a Parser from BS4 to simplehtmldom-parser


dil_bert


hello dear Freaks 

 

i am currently musing about porting a python bs4 parser over to php - working with the simplehtmldom-parser or the DOM-selectors... (see below).

The project: collecting the meta-data for a list of wordpress-plugins. approx 50 plugins are of interest! but the challenge is: i want to fetch the meta-data of all the existing plugins. What i want to filter out after the fetch are the plugins that have the newest timestamp, i.e. those that were updated most recently. It is all about being up to date...

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database ....and so on and so forth.
 

we have the following set of meta-data for each wordpress-plugin:

Version: 1.9.5.12
Active installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database, member, sign-up form, volunteer
Last updated: 19 hours ago

 

the project consists of two parts:

- the looping-part (which seems to be pretty straightforward).
- the parser-part, where i have some issues - see below.

I'm trying to loop through an array of URLs and scrape the data listed above from a list of wordpress-plugins. See my loop below.

as a base i think it is a good starting point to work from the following target-url:

wordpress.org/plugins/browse/popular with 99 pages of content, cf.
wordpress.org/plugins/browse/popular/page/1
wordpress.org/plugins/browse/popular/page/2
...
wordpress.org/plugins/browse/popular/page/99
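
The listing pages follow a simple pattern, so the URLs for the looping-part can be generated up front. This is only a minimal sketch: it assumes the page count of 99 mentioned above still holds and that the pages are served over https; `browse_page_urls` is a hypothetical helper name, not something from the original script.

```python
# Assumed base URL, taken from the listing address above with https:// added.
BASE = "https://wordpress.org/plugins/browse/popular"


def browse_page_urls(last_page=99):
    """Return the listing-page URLs for pages 1..last_page."""
    return [f"{BASE}/page/{n}" for n in range(1, last_page + 1)]


urls = browse_page_urls()
print(len(urls), urls[0], urls[-1])
```

Each of these pages would then be fetched and scanned for links to the individual plugin pages, which become the input of the parser-part.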

 

the Output of text_nodes:

['Version: 1.9.5.12', 'Active installations: 10,000+', 'Tested up to: 5.6 ']  
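
For sorting and filtering later on, it may be easier to work with a dict per plugin than with a flat list of strings. A small sketch of that conversion, assuming every entry has the "Key: value" shape seen in the output above; `to_record` is a hypothetical helper name.

```python
def to_record(text_nodes):
    """Split 'Key: value' strings into a dict, stripping whitespace."""
    record = {}
    for node in text_nodes:
        key, sep, value = node.partition(":")
        if sep:  # skip strings that contain no colon at all
            record[key.strip()] = value.strip()
    return record


nodes = ['Version: 1.9.5.12', 'Active installations: 10,000+', 'Tested up to: 5.6 ']
print(to_record(nodes))
# {'Version': '1.9.5.12', 'Active installations': '10,000+', 'Tested up to': '5.6'}
```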

but what if we want to fetch the data of all the wordpress-plugins and subsequently sort them to show, let us say, the 50 most recently updated plugins? This would be an interesting task:

 

first of all we need to fetch the urls

then we fetch the information and have to sort by the newest timestamp, i.e. find the plugins that were updated most recently

finally we list the 50 newest items, i.e. the 50 plugins that were updated most recently.
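
The sorting step could be sketched as follows. This assumes each plugin has already been turned into a dict with a "Last updated" entry holding a relative string such as "19 hours ago"; the helper names `age_of` and `newest` and the approximate unit lengths are assumptions made for illustration only.

```python
import re
from datetime import timedelta

# Approximate unit lengths in hours; months and years are deliberately rough.
UNIT_HOURS = {"minute": 1 / 60, "hour": 1, "day": 24, "week": 24 * 7,
              "month": 24 * 30, "year": 24 * 365}


def age_of(last_updated):
    """Turn 'Last updated: 19 hours ago' into a timedelta (None if unparsable)."""
    m = re.search(r"(\d+)\s+(minute|hour|day|week|month|year)s?\s+ago", last_updated)
    if not m:
        return None
    return timedelta(hours=int(m.group(1)) * UNIT_HOURS[m.group(2)])


def newest(records, limit=50):
    """Return up to `limit` records, most recently updated first."""
    dated = [(age_of(r["Last updated"]), r) for r in records if "Last updated" in r]
    dated = [(age, r) for age, r in dated if age is not None]
    dated.sort(key=lambda pair: pair[0])  # smallest age = most recent update
    return [r for _, r in dated[:limit]]
```

Relative strings like "5 months ago" are coarse, so many plugins will tie; the order among tied entries is simply the order in which they were fetched.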

 

see here the soup code:

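import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor


def parser(url):
    # Assumed: the original snippet starts after the request, so the
    # fetch below is a guess at how `r` was obtained.
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # First 8 <li> items of the meta-data list that follows the hidden heading
    target = [item.get_text(strip=True, separator=" ") for item in soup.find(
        "h3", class_="screen-reader-text").find_next("ul").find_all("li")[:8]]
    head = [soup.find("h1", class_="plugin-title").text]
    # Keep only Version, Last updated, Active installations,
    # WordPress Version, Tested up to / Tags and PHP Version
    new = [x for x in target if x.startswith(
        ("V", "Las", "Ac", "W", "T", "P"))]
    return head + new


# `allin` is the list of plugin URLs collected beforehand (not shown here)
with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]

for future in futures1:
    print(future.result())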

 

see the resulting output:

 

Quote


['lorem ipsum dolor sit amet', 'Version: 2.34.1', 'Last updated: 5 months ago', 'Tags: magna aliquyam erat, sed diam voluptua. At vero eos et accusam']
['consetetur sadipscing elitr', 'Version: 6.54.1', 'Last updated: 5 months ago', 'Tags: lorem ipsum dolor sit amet']
['sed diam nonumy eirmod tempor invidunt ut labore', 'Version: 7.16.1', 'Last updated: 5 months ago', 'Tags: tarifa, sevilla lisabin invidunt ut labore et dolore magna aliquyam erat']
['tempor invidunt ut taria malaga jerusalem labore', 'Version: 9.58.1', 'Last updated: 5 months ago', 'Tags: ilabore et lissabon dolore magna aliquyam erat']

 

background: https://stackoverflow.com/questions/61106309/fetching-multiple-urls-with-beautifulsoup-gathering-meta-data-in-wp-plugins

Well - i guess that we can do this with the simple DOM Parser - here is the selector reference:

https://stackoverflow.com/questions/1390568/how-can-i-match-on-an-attribute-that-contains-a-certain-string

 

look forward to any hint and help.

 

have a great day 

Edited by dil_bert