Jump to content

simplehtmldom vs Pyhon and BS 4 for a little parser-project


dil_bert

Recommended Posts

hello dear PHP-Freaks,

 

for a little programme i want to fetch the data of various plugins of Wordpress: to be concrete it is about 50 plugins that have each a domain - see below.the following data are needed: of the "Version", "Acitve installations" and "Tested up to:"

 

Question: I can use simplehtmldom or BS4 - which solution os more apropiate.

 

 

The project: for a list of wordpress-plugins: - approx 50 plugins are of interest!

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database and so on and so forth.

 

 

These plugins are listed in my favorites - so if i create a login with BS4 then i can log in and parse all those favorite-pages. The first approach: Otherwise i can loop through a set of URL to fetch all the necessary pages. I can use simplehtmldom or BS4 - which solution os more apropiate.

 

 

i need the data of the following three lines - in the above mentioned  example:

https://wordpress.or.../wp-job-manager

Quote

Version: <strong>1.29.3</strong>
Active installations: <strong>100,000+</strong>
Tested up to: <strong>4.9.4</strong>


possible solutions:

we can solve this task with other methods than ousing only BeautifulSoup, but we can do it for example with BS + regular expressions
assuming were able to do this with regular expression we need to locate the script tag in the HTML.
The idea is to define a regular expression that would be used for both locating the element with BeautifulSoup and extracting
the above mentioned text:

But i guess that we can do this also with DOM-Parser

cf:  http://simplehtmldom.sourceforge.net/manual.htm

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';


again: i need the data of the following three lines - in the above mentioned  example:

https://wordpress.org/plugins/wp-job-manager


Version: <strong>1.29.3</strong>
Active installations: <strong>100,000+</strong>
Tested up to: <strong>4.9.4</strong>

 


How to create HTML DOM object?

$html = str_get_html('<html><body>Hello!</body></html>');

// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');

// Create a DOM object from a HTML file
$html = file_get_html('test.htm');

How to access the HTML element's attributes?

// Find all anchors, returns a array of element objects
$ret = $html->find('a');

// Find (N)th anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', 0);

// Find lastest anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', -1);

// Find all <div> with the id attribute
$ret = $html->find('div[id]');

// Find all <div> which attribute id=foo
$ret = $html->find('div[id=foo]');




How to traverse the DOM tree?

// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
// or
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');



and ....


function my_callback($element) {
        // Hide all <b> tags
        if ($element->tag=='b')
                $element->outertext = '';
}

// Register the callback function with it's function name
$html->set_callback('my_callback');

// Callback function will be invoked while dumping
echo $html;
Edited by dil_bert
Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.