
simple_html_dom Versus DomDocument



Fellow Programmers,

 

General PHP to extract meta tags

	<?php
	$meta_tags = get_meta_tags('http://www.example.com/');
	print_r($meta_tags);
	?>
	 
	// Output:
	Array
	(
	    [keywords] => this is the keywords
	    [description] => this is the description
	)
	

 

DomDocument to extract Meta Tags

	<?php
	function file_get_contents_curl($url)
	{
	    $ch = curl_init();
	 
	    curl_setopt($ch, CURLOPT_HEADER, 0);
	    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	    curl_setopt($ch, CURLOPT_URL, $url);
	    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
	 
	    $data = curl_exec($ch);
	    curl_close($ch);
	 
	    return $data;
	}
	 
	$url = "http://www.example.com/code";
	 
	$html = file_get_contents_curl($url);
	 
	// Load HTML to DOM object
	$doc = new DOMDocument();
	@$doc->loadHTML($html);
	 
	// Parse DOM to get Title data
	 
	$nodes = $doc->getElementsByTagName('title');
	$title = $nodes->item(0)->nodeValue;
	 
	// Parse DOM to get metadata
	$metas = $doc->getElementsByTagName('meta');
	 
	for ($i = 0; $i < $metas->length; $i++)
	{
	    $meta = $metas->item($i);
	    if($meta->getAttribute('name') == 'description')
	        $description = $meta->getAttribute('content');
	    if($meta->getAttribute('name') == 'keywords')
	        $keywords = $meta->getAttribute('content');
	}
	 
	echo "Title: $title". '<br/><br/>';
	echo "Description: $description". '<br/><br/>';
	echo "Keywords: $keywords";
	?>
	

Can anyone shorten the above without sacrificing quality?
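One possible direction, offered as a minimal sketch rather than a definitive answer: collect every named meta tag in one loop with DOMXPath instead of testing for each name separately. The function name `extract_meta` and the inline sample markup are assumptions made for illustration; the cURL fetch from the code above would supply the real HTML.

```php
<?php
// Hypothetical helper (not from the original post): returns the <title> text
// and every named <meta> tag found in an HTML string.
function extract_meta(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = ['title' => null, 'meta' => []];

    // item(0) is null when the page has no <title>, so check before reading it.
    $titleNode = $doc->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $result['title'] = trim($titleNode->textContent);
    }

    // Select only <meta> elements that carry both a name and a content attribute.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//meta[@name][@content]') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        $result['meta'][$name] = $meta->getAttribute('content');
    }

    return $result;
}

// Inline sample markup (an assumption, so the sketch runs without network access).
$sample = '<html><head><title>Example Page</title>'
    . '<meta name="description" content="this is the description">'
    . '<meta name="keywords" content="this is the keywords">'
    . '</head><body></body></html>';

$info = extract_meta($sample);
echo "Title: {$info['title']}\n";
echo "Description: {$info['meta']['description']}\n";
echo "Keywords: {$info['meta']['keywords']}\n";
```

Keeping the meta tags in their own sub-array avoids a clash if a page also has a meta tag named "title".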

Edited by TheStudent2023

@requinix

 

I believe you have done web scraping before, but with what? DomDocument or simple_html_dom?

Do you know how to scrape the options of an HTML form's dropdown?

Say, you want to scrape all the options from the dropdown you see here:

https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_select

How would you write the code with DomDocument, and how would you write it with simple_html_dom?

If you would be kind enough to show me both, I will study them and then try writing a scraper myself that pulls the options from radio buttons and checkboxes. After that I can go deeper and scrape options from multi-select dropdowns and so on, further down the rabbit hole. As of now, I have no clue where to start. So, care to guide me?
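As a starting point for the DomDocument half of that question, here is a minimal sketch. The helper name `scrape_select_options` and the inline sample form are assumptions for illustration, not taken from the page linked above; a real scraper would fetch the page HTML first.

```php
<?php
// Hypothetical helper: returns [value => label] for every <option> inside the
// <select> element whose name attribute equals $selectName.
function scrape_select_options(string $html, string $selectName): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $options = [];
    // $selectName is interpolated straight into the XPath expression, so this
    // sketch is only suitable for fixed, trusted names.
    foreach ($xpath->query("//select[@name='{$selectName}']//option") as $option) {
        $options[$option->getAttribute('value')] = trim($option->textContent);
    }

    return $options;
}

// Invented sample markup so the sketch runs without network access.
$sample = '<form><select name="cars" id="cars">'
    . '<option value="volvo">Volvo</option>'
    . '<option value="saab">Saab</option>'
    . '<option value="opel">Opel</option>'
    . '<option value="audi">Audi</option>'
    . '</select></form>';

print_r(scrape_select_options($sample, 'cars'));
```

The same pattern should carry over to radio buttons and checkboxes by changing the XPath expression, for example `//input[@type='radio']`, and reading the `name` and `value` attributes.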

 


@barand

 

Thanks a lot. I appreciate it. Unfortunately, it is nearly sunrise here, so I will have to check your link out tomorrow night.

In the meantime, care to show me 2 snippets (one using DomDocument, the other using simple_html_dom) that scrape a text input field (<input type="text">) from an HTML form or from a search box (like Google's)?

Once I have learnt that from you, I will try scraping multi-line text fields (<textarea>).
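For the DomDocument side, a rough sketch of what that could look like is below. Note that a scraper only sees what is in the page markup (field names and default values), not what a visitor types. The helper name `scrape_text_fields` and the sample form are assumptions for illustration.

```php
<?php
// Hypothetical helper: lists every <input type="text"> and <textarea> in an
// HTML string together with its name and default value.
function scrape_text_fields(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $fields = [];
    // The union (|) returns both kinds of element in document order.
    foreach ($xpath->query("//input[@type='text'] | //textarea") as $node) {
        $fields[] = [
            'tag'   => $node->nodeName,
            'name'  => $node->getAttribute('name'),
            // <input> keeps its default in the value attribute;
            // <textarea> keeps it as text content.
            'value' => $node->nodeName === 'textarea'
                ? $node->textContent
                : $node->getAttribute('value'),
        ];
    }

    return $fields;
}

// Invented sample markup so the sketch runs without network access.
$sample = '<form>'
    . '<input type="text" name="q" value="php dom">'
    . '<input type="submit" value="Search">'
    . '<textarea name="notes">hello</textarea>'
    . '</form>';

print_r(scrape_text_fields($sample));
```

One limitation of this sketch: the expression `//input[@type='text']` skips inputs that omit the `type` attribute, even though browsers treat those as text fields too.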

I am getting warmed up to learn scraping, as it will make it easier to finish building my web crawler (spider).

 

Thanks!


@barand

 

Ok. This is how to use DomDocument to extract meta tags. Before I look into your suggested link to learn how to extract meta tags using the simple_html_dom parser, what do you think of the following code?

	<?php
	$url = "
	$html = file_get_contents($url);
	// Initiate ability to manipulate the DOM and load that baby up
	$doc = new DOMDocument();
	libxml_use_internal_errors(true);
	$doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);
	libxml_clear_errors();
	// Fetch all <meta> tags
	$meta_tags = $doc->getElementsByTagName('meta');
	if ($meta_tags->length > 0)
	{
	    foreach ($meta_tags as $tag)
	    {
	        // e.g. name="robots" and content="noindex"
	        echo '<b>Meta Name: </b>' .$name = $tag->getAttribute('name'); echo '<br>';
	        echo '<b>Meta Description: </b>' .$content = $tag->getAttribute('content');  echo '<br>';
	    }
	}
	

Can this be improved in quality and cut down in quantity (lines of code)?

Do you see any errors in the code?


https://enb.iisd.org/_inc/simple_html_dom/manual/manual.htm#section_find

	// Find all anchors and images with the "title" attribute
$ret = $html->find('a[title], img[title]');
	

I do not understand the comment. What does that code do?

How can an anchor have a "title" attribute? Show me an example.

The same goes for an img with a "title" attribute.
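To make the question concrete, here is a small sketch with invented sample markup: one anchor and one image that carry a `title` attribute (browsers show it as a tooltip), plus one anchor without it. The XPath expression is my assumed DomDocument counterpart of the `a[title], img[title]` selector, not something taken from the simple_html_dom manual.

```php
<?php
// Invented sample markup: two elements with a title attribute, one without.
$sample = '<a href="https://www.example.com/" title="Go to the example site">Example</a>'
    . '<a href="/plain">No tooltip</a>'
    . '<img src="logo.png" alt="Logo" title="Company logo">';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Match every <a> and every <img> that has a title attribute, whatever its value.
$xpath = new DOMXPath($doc);
$titles = [];
foreach ($xpath->query('//a[@title] | //img[@title]') as $node) {
    $titles[] = $node->nodeName . ': ' . $node->getAttribute('title');
}

print_r($titles);
```

The second anchor is skipped because it has no `title` attribute, which is the filtering the `[title]` part of the selector describes.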


@Barand

 

Thanks! I forgot about that tooltip, and that it is called title. In fact, I wrote this code a few weeks ago ....

	function item_submission_form_part_one()
{
    ?>
    <div style='font-family:verdana;font-size:15px;color:black;text-align:center;' name="item_submission_form" id="item_submission_form" align="center" size="50%">
    <form style="background-color:white;" method="POST" action="" name="submit_form_p1" id="submit_form_p1">
    <fieldset>
    <legend align="center"><h3 style="color:black;">Link Submission Form  - Part 1/3</h3></legend>
    <label for="product_type">Product Type:</label>
    <select name="product_type" id="product_type" title="Select product type">
    <option value=""></option>
    <option value="physical product" <?php if(ISSET($_POST['product_type']) && !EMPTY($_POST['product_type']) && $_POST['product_type']=='physical product'){$product_type = $_POST['product_type']; echo 'selected';}?>>
    Physical Product</option>
    <option value="intangible product" <?php if(ISSET($_POST['product_type']) && !EMPTY($_POST['product_type']) && $_POST['product_type']=='intangible product'){$product_type = $_POST['product_type']; echo 'selected';}?>>
    Intangible Product</option>
    <option value="service" <?php if(ISSET($_POST['product_type']) && !EMPTY($_POST['product_type']) && $_POST['product_type']=='service'){$product_type = $_POST['product_type']; echo 'selected';}?>>
    Service</option>
    </select>
    <br>
    Listing Type:
    <input type="radio" name="listing_type" id="wanted" title="Check listing type" value="wanted" <?php if(ISSET($_POST['listing_type']) && !EMPTY($_POST['listing_type']) && $_POST['listing_type']=='wanted'){$listing_type = $_POST['listing_type']; echo 'checked';}?>>
    <label for="wanted">Wanted Item:</label>
    <input type="radio" name="listing_type" id="have" title="Check listing type" value="have" <?php if(ISSET($_POST['listing_type']) && !EMPTY($_POST['listing_type']) && $_POST['listing_type']=='have'){$listing_type = $_POST['listing_type']; echo 'checked';}?>>
    <label for="have">Have Item:</label>
    <br>
    <label for="item">Item</label>
    <input type="text" name="item" id="item" size="50" minlength="2" maxlength="255" title="Input your item" <?php if(ISSET($_POST['item']) && !EMPTY($_POST['item'])){$item = $_POST['item']; echo 'value="'.$item.'"';}else{echo 'placeholder="'.'Item ...'.'"';}?>>
    <br>
    <label for="manufacturer">Manufacturer</label>
    <input type="text" name="manufacturer" id="manufacturer" size="50" minlength="2" maxlength="255" title="Input your item manufacturer" <?php if(ISSET($_POST['manufacturer']) && !EMPTY($_POST['manufacturer'])){$manufacturer = $_POST['manufacturer']; echo 'value="'.$manufacturer.'"';}else{echo 'placeholder="'.'Manufacturer ...'.'"';}?>>
    <br>
    <label for="brand">Brand</label>
    <input type="text" name="brand" id="brand" size="50" minlength="2" maxlength="255" title="Input your item brand" <?php if(ISSET($_POST['brand']) && !EMPTY($_POST['brand'])){$brand = $_POST['brand']; echo 'value="'.$brand.'"';}else{echo 'placeholder="'.'Brand ...'.'"';}?>>
    <br>
    <label for="model">Model</label>
    <input type="text" name="model" id="model" size="50" minlength="2" maxlength="255" title="Input your item model" <?php if(ISSET($_POST['model']) && !EMPTY($_POST['model'])){$model = $_POST['model']; echo 'value="'.$model.'"';}else{echo 'placeholder="'.'Model ...'.'"';}?>>
    <br>
    <label for="serial_number">Serial Number</label>
    <input type="text" name="serial_number" id="serial_number" size="50" minlength="2" maxlength="255" title="Input your item serial_number" <?php if(ISSET($_POST['serial_number']) && !EMPTY($_POST['serial_number'])){$serial_number = $_POST['serial_number']; echo 'value="'.$serial_number.'"';}else{echo 'placeholder="'.'Serial Number ...'.'"';}?>>
    <br>
    <label for="year">Year</label>
    <input type="text" name="year" id="year" size="50" minlength="2" maxlength="255" title="Input your item year" <?php if(ISSET($_POST['year']) && !EMPTY($_POST['year'])){$year = $_POST['year']; echo 'value="'.$year.'"';}else{echo 'placeholder="'.'Year ...'.'"';}?>>
    <br>
    <label for="currency">Currency</label>
    <input type="text" name="currency" id="currency" size="50" minlength="2" maxlength="255" title="Input your item currency" <?php if(ISSET($_POST['currency']) && !EMPTY($_POST['currency'])){$currency = $_POST['currency']; echo 'value="'.$currency.'"';}else{echo 'placeholder="'.'Currency ...'.'"';}?>>
    <br>
    <label for="price">Price</label>
    <input type="text" name="price" id="price" size="50" minlength="2" maxlength="255" title="Input your item price" <?php if(ISSET($_POST['price']) && !EMPTY($_POST['price'])){$price = $_POST['price']; echo 'value="'.$price.'"';}else{echo 'placeholder="'.'Price ...'.'"';}?>>
    <br>
    <label for="title">Title</label>
    <input type="text" name="title" id="title" size="50" minlength="2" maxlength="255" title="Input your product page's title" <?php if(ISSET($_POST['title']) && !EMPTY($_POST['title'])){$title = $_POST['title']; echo 'value="'.$title.'"';}else{echo 'placeholder="'.'product page title'.'"';}?>>
    <br>
    </fieldset>
    <fieldset>
    <button type="submit" name="submit_button_1" id="submit_button_1" title="Submit Form - Part 1/3">submit - Part 1/3</button>
    </fieldset>
    </form>
    </div>
    <?php
}
	

 

Anyway, that DOM parser manual is not showing me how to extract meta tags and page titles.

The closest match I found is this:

	[attribute]	Matches elements that have the specified attribute.
	

It is under the attribute filters tab, but that does not help me extract meta tags and page titles. Do you know where in the parser manual it teaches what I am looking for?

https://enb.iisd.org/_inc/simple_html_dom/manual/manual.htm#section_find

Or better, if you know, then show me the code snippets, and show me where in the doc you found them. The extraction must use the simple_html_dom parser.

And can you check my DomDocument code above?

 

Thanks!


@kicken

 

Tonight, can you teach me how to read the simple_html_dom parser syntax?

I want to extract the page title.

On this page:

https://stackoverflow.com/questions/11385774/how-to-extract-title-and-meta-description-using-php-simple-html-dom-parser

I found 4 different programmers showing 4 different ways to code it using the simple_html_dom parser. Look:

1

	$meta_title = $html->find("meta[name='title']", 0)->content;
	

 

2

	$title = $html->find('title',0)->innertext;
	

 

3

	$title = array_shift($html->find('title'))->innertext;
	

 

4

	$title = $html->load('title')->simpletext; //<title>**Text from here**</title>
	

 

Q1.

But where did they find these syntaxes in the parser's manual? I cannot find any of them! Check the mini doc:

https://enb.iisd.org/_inc/simple_html_dom/manual/manual.htm#section_find

It seems I am missing where they are looking, and I have to learn to look in the right direction. So, I need your assistance again, I'm afraid.

 

Q2.

From your experience, can you rank these 4 snippets, with the best on top? And let me know why you ranked them the way you did. This should teach me to spot the best and most effective coding practices.

Thanks!
