simplehtmldom vs Python and BS4 for a little parser project


dil_bert

Hello dear PHP-Freaks,

For a little program I want to fetch data about various WordPress plugins. To be concrete, it is about 50 plugins, each with its own URL (see below). For each plugin I need three values: "Version", "Active installations" and "Tested up to".

Question: I can use simplehtmldom or BS4. Which solution is more appropriate?

 

 

The project: a list of WordPress plugins, of which approx. 50 are of interest.

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database and so on and so forth.

 

 

These plugins are listed in my favorites, so if I script a login, I can log in and parse all those favorite pages with BS4. Alternatively, I can loop through a set of URLs to fetch all the necessary pages. Either way, the question stays the same: is simplehtmldom or BS4 the more appropriate tool?
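A minimal sketch of the "loop through a set of URLs" approach, using only the Python standard library. The helper names (`plugin_url`, `fetch_html`, `fetch_all`) and the User-Agent string are made up for illustration; the slugs are taken from the three URLs listed above.

```python
from urllib.request import Request, urlopen

BASE_URL = "https://wordpress.org/plugins/"

# Slugs taken from the URLs listed above; extend this to the full list of ~50.
PLUGIN_SLUGS = ["wp-job-manager", "ninja-forms", "participants-database"]


def plugin_url(slug):
    """Build the plugin page URL from its slug."""
    return BASE_URL + slug.strip("/")


def fetch_html(url, timeout=10):
    """Download one page and return its HTML as text."""
    # The User-Agent value is a hypothetical placeholder.
    request = Request(url, headers={"User-Agent": "plugin-stats-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_all(slugs):
    """Return {slug: html} for every slug, skipping pages that fail to load."""
    pages = {}
    for slug in slugs:
        try:
            pages[slug] = fetch_html(plugin_url(slug))
        except OSError as error:  # URLError and timeouts are OSError subclasses
            print(f"skipping {slug}: {error}")
    return pages
```

Calling `fetch_all(PLUGIN_SLUGS)` would then give one HTML string per plugin, ready to be handed to whichever parser is chosen. No login is needed for this variant, since the plugin pages are fetched directly.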

 

 

I need the data from the following three lines of the example mentioned above:

https://wordpress.or.../wp-job-manager

Quote

Version: <strong>1.29.3</strong>
Active installations: <strong>100,000+</strong>
Tested up to: <strong>4.9.4</strong>


Possible solutions:

We can solve this task with methods other than BeautifulSoup alone, for example with BS plus regular expressions.
Assuming we are able to do this with a regular expression, we need to locate the script tag in the HTML.
The idea is to define one regular expression that is used both for locating the element with BeautifulSoup and for extracting
the text mentioned above.
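A rough sketch of the extraction half of that idea. It is regex-only, so BeautifulSoup is left out to keep it free of dependencies, and it assumes the markup looks exactly like the three quoted lines ("Label: <strong>value</strong>"); the live page may differ. `FIELD_PATTERN` and `extract_fields` are names invented for this example.

```python
import re

# Assumed markup, copied from the three quoted lines:
# "Label: <strong>value</strong>"
FIELD_PATTERN = re.compile(
    r"(Version|Active installations|Tested up to):\s*"
    r"<strong>\s*([^<]+?)\s*</strong>",
    re.IGNORECASE,
)


def extract_fields(html):
    """Return a dict mapping each label found in html to its value."""
    return {label: value for label, value in FIELD_PATTERN.findall(html)}


sample = """
Version: <strong>1.29.3</strong>
Active installations: <strong>100,000+</strong>
Tested up to: <strong>4.9.4</strong>
"""

print(extract_fields(sample))
# {'Version': '1.29.3', 'Active installations': '100,000+', 'Tested up to': '4.9.4'}
```

The same compiled pattern could in principle also be passed to a parser's text search to locate the element first, which is the combined approach described above.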

But I guess we can also do this with a DOM parser,

cf. http://simplehtmldom.sourceforge.net/manual.htm

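// Assumes simple_html_dom.php from the library download sits next to this script
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images and print their src attribute
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links and print their href attribute
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}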
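
For comparison, here is a sketch of the parser-based route on the Python side. It uses the standard library's `html.parser`, which is event-based rather than a real DOM, so it is a stand-in for BS4 and not the same API. It again assumes the "Label: <strong>value</strong>" markup from the quoted lines; `StrongFieldParser` and `parse_fields` are hypothetical names.

```python
from html.parser import HTMLParser

LABELS = ("Version", "Active installations", "Tested up to")


class StrongFieldParser(HTMLParser):
    """Collect the text of each <strong> that directly follows a known label."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._pending_label = None  # label seen just before a <strong>
        self._in_strong = False

    def handle_starttag(self, tag, attrs):
        # Only track a <strong> if a known label came right before it.
        if tag == "strong" and self._pending_label:
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False
            self._pending_label = None

    def handle_data(self, data):
        text = data.strip()
        if self._in_strong:
            if text:
                self.fields[self._pending_label] = text
        elif text.rstrip(":") in LABELS:
            self._pending_label = text.rstrip(":")
        elif text:
            # Any other text cancels a label that was waiting for its value.
            self._pending_label = None


def parse_fields(html):
    """Parse html and return the label/value pairs that were found."""
    parser = StrongFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Compared with a single regular expression, this variant only accepts a value when it really sits inside a `<strong>` element that follows one of the known labels, at the cost of more code.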


Again, I need the data from the following three lines of the example mentioned above:

https://wordpress.org/plugins/wp-job-manager


Version: <strong>1.29.3</strong>
Active installations: <strong>100,000+</strong>
Tested up to: <strong>4.9.4</strong>

 


How to create an HTML DOM object?

// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');

// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');

// Create a DOM object from an HTML file
$html = file_get_html('test.htm');

How to find HTML elements?

// Find all anchors, returns an array of element objects
$ret = $html->find('a');

// Find the (N)th anchor, returns an element object or null if not found (zero based)
$ret = $html->find('a', 0);

// Find the last anchor, returns an element object or null if not found
$ret = $html->find('a', -1);

// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');

// Find all <div> elements whose id attribute is foo
$ret = $html->find('div[id=foo]');




How to traverse the DOM tree?

// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
// or
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');



and ....


function my_callback($element) {
        // Hide all <b> tags
        if ($element->tag=='b')
                $element->outertext = '';
}

// Register the callback function by its function name
$html->set_callback('my_callback');

// Callback function will be invoked while dumping
echo $html;