simplehtmldom vs Python and BS4 for a little parser project

dil_bert · February 7, 2018

Hello dear PHP-Freaks,

for a little program I want to fetch data about various WordPress plugins. To be concrete, it is about 50 plugins, each with its own page (see below). The following data are needed for each plugin: "Version", "Active installations" and "Tested up to".

Question: I can use simplehtmldom or BS4. Which solution is more appropriate?

The project: a list of WordPress plugins, of which approx. 50 are of interest:

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database
and so on.

These plugins are listed in my favorites, so one approach is to script a login (Python with BS4), log in, and parse all those favorite pages. The other approach is to loop through a set of URLs and fetch all the necessary pages. For either one I can use simplehtmldom or BS4.
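The "loop through a set of URLs" approach could look roughly like the sketch below. It is Python with the standard library only. `SLUGS`, `plugin_url` and `fetch_pages` are names made up for illustration, and the one-second pause between requests is an assumed courtesy delay, not a documented wordpress.org requirement.

```python
import time
import urllib.request

# Assumption: every plugin page lives under this prefix, as in the URLs above.
BASE_URL = "https://wordpress.org/plugins/"

# Hypothetical list; in the real project this would hold all ~50 slugs.
SLUGS = ["wp-job-manager", "ninja-forms", "participants-database"]


def plugin_url(slug):
    """Build the plugin page URL for one slug."""
    return BASE_URL + slug.strip("/")


def fetch_pages(slugs, delay_seconds=1.0, opener=urllib.request.urlopen):
    """Fetch each plugin page and return a dict of slug -> HTML text.

    `opener` is injectable so the loop can be tried without network access.
    """
    pages = {}
    for slug in slugs:
        with opener(plugin_url(slug)) as response:
            pages[slug] = response.read().decode("utf-8", errors="replace")
        time.sleep(delay_seconds)  # be polite between requests
    return pages
```

Calling `fetch_pages(SLUGS)` would then give one HTML string per plugin, ready to hand to whichever parser is chosen.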

I need the data of the following three lines, taken from the example mentioned above:

https://wordpress.or.../wp-job-manager

Version: 1.29.3
Active installations: 100,000+
Tested up to: 4.9.4

Possible solutions:

We can solve this task with methods other than BeautifulSoup alone, for example with BS plus regular expressions.
Assuming we are able to do this with a regular expression, we need to locate the script tag in the HTML.
The idea is to define one regular expression that is used both for locating the element with BeautifulSoup and for extracting
the above-mentioned text.
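A minimal sketch of the regular-expression idea, under some loud assumptions: it uses only Python's standard library instead of BS4 so that it runs without installing anything, and `SAMPLE_HTML` is a hypothetical stand-in for the real page markup, which I have not verified. `TextExtractor` and `extract_plugin_meta` are illustrative names. With BS4, the tag-stripping step would be replaced by `soup.get_text()`.

```python
import re
from html.parser import HTMLParser

LABELS = ("Version", "Active installations", "Tested up to")


class TextExtractor(HTMLParser):
    """Collect text content, starting a new line after each closing </li>."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "li":
            self.chunks.append("\n")


def extract_plugin_meta(html):
    """Return a dict label -> value (None if a label is not found)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    result = {}
    for label in LABELS:
        # Anchor the label at the start of a line so that longer labels
        # ending in the same word (hypothetically "... Version:") are skipped.
        pattern = rf"^\s*{re.escape(label)}:[ \t]*([^\n]+)"
        match = re.search(pattern, text, re.MULTILINE | re.IGNORECASE)
        result[label] = match.group(1).strip() if match else None
    return result


# Hypothetical markup; the real wordpress.org page may be structured differently.
SAMPLE_HTML = """
<ul>
  <li>Version: <strong>1.29.3</strong></li>
  <li>Active installations: <strong>100,000+</strong></li>
  <li>Tested up to: <strong>4.9.4</strong></li>
</ul>
"""

if __name__ == "__main__":
    print(extract_plugin_meta(SAMPLE_HTML))
```

The regex works on the flattened text, so it does not care which tags wrap the values. That makes it tolerant of markup changes but blind to page structure.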

But I guess we can also do this with a DOM parser.

cf. http://simplehtmldom.sourceforge.net/manual.htm

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';

Again: I need the data of the following three lines from the example mentioned above:

https://wordpress.org/plugins/wp-job-manager

Version: 1.29.3
Active installations: 100,000+
Tested up to: 4.9.4
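The three lines above could also be pulled out element by element, DOM-style, rather than by regex. The sketch below is Python with the standard library, and it assumes each value sits in its own `<li>` as "Label: value". That layout is a guess about the page, and `ListItemCollector`, `meta_from_list_items` and `SAMPLE_HTML` are hypothetical names. In simplehtmldom the same walk would presumably be a `find('li')` loop over each element's `plaintext`.

```python
from html.parser import HTMLParser

WANTED = ("Version", "Active installations", "Tested up to")


class ListItemCollector(HTMLParser):
    """Collect the text of every (non-nested) <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # text pieces of the <li> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None


def meta_from_list_items(html, wanted=WANTED):
    """Split each list item at its first colon and keep the wanted labels."""
    collector = ListItemCollector()
    collector.feed(html)
    collector.close()
    meta = {}
    for item in collector.items:
        label, sep, value = item.partition(":")
        if sep and label.strip() in wanted:
            meta[label.strip()] = value.strip()
    return meta


# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<ul>
  <li>Version: <strong>1.29.3</strong></li>
  <li>Some other field: <strong>ignored</strong></li>
  <li>Active installations: <strong>100,000+</strong></li>
  <li>Tested up to: <strong>4.9.4</strong></li>
</ul>
"""

if __name__ == "__main__":
    print(meta_from_list_items(SAMPLE_HTML))
```

Compared with a regex over the whole page text, this only looks inside list items, so stray matches elsewhere on the page are less likely.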

How to create an HTML DOM object?

// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');

// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');

// Create a DOM object from a HTML file
$html = file_get_html('test.htm');

How to find HTML elements?

// Find all anchors, returns an array of element objects
$ret = $html->find('a');

// Find (N)th anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', 0);

// Find the last anchor, returns element object or null if not found
$ret = $html->find('a', -1);

// Find all <div> with the id attribute
$ret = $html->find('div[id]');

// Find all <div> whose id attribute equals foo
$ret = $html->find('div[id=foo]');

How to traverse the DOM tree?

// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
// or
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');

And a callback example:


function my_callback($element) {
        // Hide all <b> tags
        if ($element->tag=='b')
                $element->outertext = '';
}

// Register the callback function with its function name
$html->set_callback('my_callback');

// Callback function will be invoked while dumping
echo $html;

Edited February 7, 2018 by dil_bert
