dil_bert, posted February 7, 2018 (edited February 7, 2018)

Hello dear PHP-Freaks,

for a little program I want to fetch data about various WordPress plugins. To be concrete, it is about 50 plugins, each with its own page (see below). For each plugin I need three values: "Version", "Active installations" and "Tested up to".

Question: I can use simplehtmldom or BS4 (BeautifulSoup 4). Which solution is more appropriate?

The project: a list of WordPress plugins, approx. 50 of which are of interest:

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database

and so on and so forth.

These plugins are listed in my favorites, so if I create a login with BS4 I can log in and parse all those favorite pages.

The first approach: otherwise I can simply loop through a set of URLs to fetch all the necessary pages.

I need the data of the following three lines, in the above-mentioned example https://wordpress.or.../wp-job-manager:

Version: <strong>1.29.3</strong>
Active installations: <strong>100,000+</strong>
Tested up to: <strong>4.9.4</strong>

Possible solutions: we can solve this task with methods other than BeautifulSoup alone, for example with BS + regular expressions. Assuming we are able to do this with a regular expression, we need to locate the script tag in the HTML. The idea is to define a regular expression that would be used both for locating the element with BeautifulSoup and for extracting the above-mentioned text.

But I guess that we can also do this with a DOM parser, cf. http://simplehtmldom.sourceforge.net/manual.htm

    // Create DOM from URL or file
    $html = file_get_html('http://www.google.com/');

    // Find all images
    foreach($html->find('img') as $element)
        echo $element->src . '<br>';

    // Find all links
    foreach($html->find('a') as $element)
        echo $element->href . '<br>';

Again: I need the three lines shown above (Version, Active installations, Tested up to) for https://wordpress.org/plugins/wp-job-manager

How to create an HTML DOM object?

    // Create a DOM object from a string
    $html = str_get_html('<html><body>Hello!</body></html>');

    // Create a DOM object from a URL
    $html = file_get_html('http://www.google.com/');

    // Create a DOM object from an HTML file
    $html = file_get_html('test.htm');

How to access the HTML element's attributes?

    // Find all anchors, returns an array of element objects
    $ret = $html->find('a');

    // Find (N)th anchor, returns element object or null if not found (zero based)
    $ret = $html->find('a', 0);

    // Find last anchor, returns element object or null if not found (zero based)
    $ret = $html->find('a', -1);

    // Find all <div> with the id attribute
    $ret = $html->find('div[id]');

    // Find all <div> whose attribute id=foo
    $ret = $html->find('div[id=foo]');

How to traverse the DOM tree?

    // Example
    echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
    // or
    echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');

and ....

    function my_callback($element) {
        // Hide all <b> tags
        if ($element->tag=='b')
            $element->outertext = '';
    }

    // Register the callback function with its function name
    $html->set_callback('my_callback');

    // Callback function will be invoked while dumping
    echo $html;
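To make the regular-expression idea from the post a bit more concrete, here is a minimal sketch in Python. It uses only the standard library `re` module, so it is a stand-in for, not an example of, BS4 or simplehtmldom. It assumes each label is followed directly by a `<strong>` element, exactly as in the three quoted lines; the live wordpress.org markup may well differ. The names `extract_plugin_info`, `collect` and `fetch` are hypothetical helpers invented for this illustration.

```python
import re

# The three labels named in the post. Assumption: each label is followed
# by a <strong> element, as in the snippet quoted in the post.
LABELS = ("Version", "Active installations", "Tested up to")


def extract_plugin_info(html):
    """Return a dict mapping each label to the text inside its <strong> tag.

    A label that is not found maps to None.
    """
    info = {}
    for label in LABELS:
        pattern = re.escape(label) + r":\s*<strong>\s*([^<]+?)\s*</strong>"
        match = re.search(pattern, html)
        info[label] = match.group(1) if match else None
    return info


def collect(urls, fetch):
    """Run extract_plugin_info on every URL.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page's HTML as a string; it is passed in so the loop can be tried
    without network access.
    """
    return {url: extract_plugin_info(fetch(url)) for url in urls}


# The three lines quoted in the post, used as sample input.
sample = (
    "Version: <strong>1.29.3</strong>"
    "Active installations: <strong>100,000+</strong>"
    "Tested up to: <strong>4.9.4</strong>"
)

if __name__ == "__main__":
    print(extract_plugin_info(sample))
```

For the "loop through a set of URLs" approach, `collect` could be called with the list of plugin URLs and a `fetch` function built, for example, on `urllib.request.urlopen`. Passing the fetcher in as a parameter keeps the parsing step separate from the download step, so either one can be swapped out (for a BS4-based extractor, say) without touching the other.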