PHP Simple HTML DOM Parser fails on a simple example - [driving me nuts]

dilbertone · December 5, 2010

Hi everyone,

I'm trying to select either a class or an id using PHP Simple HTML DOM Parser with absolutely no luck. My example is very simple and seems to comply with the examples given in the manual (http://simplehtmldom.sourceforge.net/manual.htm), but it just won't work. It's driving me up the wall.

Here is my example: http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=94468&lschb=

I think the HTML is invalid: I cannot parse it.

Well, I need more examples. I have probably overlooked something!

If anybody has a working example for Simple HTML DOM Parser, I would be happy.

The examples on the developer's site are not very helpful.

your dilbertone

BlueSkyIS · December 5, 2010

Can you show us your code?

BlueSkyIS · December 5, 2010

Also: I haven't tried parsing websites with Simple HTML DOM Parser. I just use preg_match. The page you are trying to parse seems very simple, even providing a label just before each piece of data (Schule:, Straße:, etc.).

If I were going to parse that page, I would take all content between the table tags, strip off the first table row, then use strip_tags to leave only the label: value pairs in plain text. Then I'd loop over the lines, matching each label with its value.
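
A rough sketch of the steps described above might look like the following. The sample markup and the function name parseLabelTable are assumptions made up for illustration; the real page's HTML may be laid out differently, so the patterns would likely need adjusting.

```php
<?php
// Hypothetical markup standing in for the real page; the actual HTML may differ.
$page = '<html><body><table>'
      . '<tr><th colspan="2">Schuldaten</th></tr>'
      . '<tr><td>Schule:</td><td>Beispielschule</td></tr>'
      . '<tr><td>Straße:</td><td>Musterweg 1</td></tr>'
      . '</table></body></html>';

// Hypothetical helper: returns an array of label => value pairs.
function parseLabelTable($page) {
    // 1. take all content between the table tags
    if (!preg_match('#<table[^>]*>(.*?)</table>#is', $page, $m)) {
        return array();
    }
    // 2. split into rows and strip off the first row (the heading)
    $rows = preg_split('#</tr>#i', $m[1], -1, PREG_SPLIT_NO_EMPTY);
    array_shift($rows);

    $data = array();
    foreach ($rows as $row) {
        // 3. put a space between cells, then strip_tags leaves plain text
        $text = trim(strip_tags(str_replace('</td>', ' </td>', $row)));
        // 4. match "label: value"
        if (preg_match('/^([^:]+):\s*(.*)$/s', $text, $pair)) {
            $data[trim($pair[1])] = trim($pair[2]);
        }
    }
    return $data;
}

print_r(parseLabelTable($page));
```

For the real page, $page would come from something like file_get_contents($url) instead of the hard-coded string.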

dilbertone · December 5, 2010

Hi BlueSkyIS, many thanks for posting.

You would do it with regex, but I am not very familiar with regex. Note: I tried to use Simple HTML DOM Parser to get all the content within the table. Well, it failed...

Can you give me a helping hand and some starting points for the regex approach? That would be great!

PS: my attempt only prints the e-mail address. I made other attempts that tried to get the whole class, but without any luck.

Here is the code snippet:


<?php 

include('simple_html_dom.php');

// Create DOM from URL or file

$html = file_get_html('nrw_test.html');

// Find all links 
foreach($html->find('a') as $element) 
       echo $element->href. '<br>';

?>

Well, I would love to see the snippet that you were successful with!

Greetings, dilbertone

PS: my friends always say that PHP regular expressions are quite a complicated area, especially if you are not an experienced Unix user. So I think I have to get started with this technique.

QuickOldCar · December 5, 2010

That is interesting, my page parser can't detect the href links either.

http://get.blogdns.com/dynaindex/page-parser?domainname=http%3A%2F%2Fschulnetz.nibis.de%2Fdb%2Fschulen%2Fschule.php%3Fschulnr%3D94468%26lschb

QuickOldCar · December 5, 2010

I got it working for that page. Some of their links start with ../, so I added a rule to put the parsed host in front of those types of links. I already had rules for links starting with / and for hrefs with no http in front.

So I have now added ./ and ../ as well. One day I will need to do more.

Parsing links from many places takes a lot more than the simple code you have.

http://get.blogdns.com/dynaindex/page-parser?domainname=http%3A%2F%2Fschulnetz.nibis.de%2Fdb%2Fschulen%2Fschule.php%3Fschulnr%3D94468%26lschb

dilbertone · December 5, 2010

Hello QuickOldCar,

many thanks for the help. Great to see these results.

I am trying to figure out what went wrong here. At the end of the day, I want to parse all the content: strip the tags and get all the values for the labels.

Do you want to share your code?

Looking forward to hearing from you.

QuickOldCar · December 5, 2010

Let me rephrase everything that needs to be done.

You need to make a link checking system.

Every href discovered needs substring checks. If it follows a regular format such as http://, www., ftp:// and so on, you keep the link as it is. Otherwise you do more checks: if there is anything odd at the beginning of the link, you have to trim it off. Then take the parsed host, with an end slash added, and place it at the beginning of the href.

The parser I made does all this, using curl to find the resolved pages.

DOM to find the href links.

A very complex parse for the host, which handles the host, the main host, any queries, and second-level domains as well.

I place them all in arrays and a loop.

I still didn't get that site's logo image. It was using ../../../, and I added some rules but haven't quite got it right yet.
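
One possible way to handle the ./ and ../ cases mentioned above is to resolve each href against the URL of the page it was found on, rather than stripping the dots. The sketch below is only an illustration: resolveLink is a made-up helper name, it assumes the page URL is an absolute http(s) URL, and it does not cover every case in the URL standard. Note that, unlike the checks described above, it treats a bare www.example.com link as relative, which is how browsers treat it too.

```php
<?php
// Hypothetical helper, not part of Simple HTML DOM Parser.
// Assumes $base is an absolute URL such as http://host/dir/page.php
function resolveLink($base, $href) {
    // links with a scheme (http:, ftp:, mailto:, ...) are already absolute
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }
    $b = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    // protocol-relative link: //host/path
    if (substr($href, 0, 2) === '//') {
        return $b['scheme'] . ':' . $href;
    }
    if (substr($href, 0, 1) === '/') {
        // relative to the site root
        $path = $href;
    } else {
        // relative to the directory of the page
        $basePath = isset($b['path']) ? $b['path'] : '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }
    // split off the query string or fragment so it is left untouched
    $cut = strcspn($path, '?#');
    $tail = (string) substr($path, $cut);
    $path = (string) substr($path, 0, $cut);
    // collapse "." and ".." segments
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '' && $segment !== '.') {
            $out[] = $segment;
        }
    }
    $resolved = '/' . implode('/', $out);
    if ($out && substr($path, -1) === '/') {
        $resolved .= '/';
    }
    return $root . $resolved . $tail;
}

// The image path here is invented for the example.
$base = 'http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=94468&lschb=';
echo resolveLink($base, '../../img/logo.gif');
```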

QuickOldCar · December 5, 2010

Here is the code for a very simple host parse. Apply your HTML DOM code above it to find all the href links. I just wrote this up inside the comment box, so I hope I got it right, but it will show you what you need to do.

<?php

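function getHost($url) {
            $parseUrl = parse_url(trim($url));
            //array keys must be quoted; without a scheme the host ends up at the start of the path
            $path_parts = explode('/', isset($parseUrl['path']) ? $parseUrl['path'] : '', 2);
            return trim(!empty($parseUrl['host']) ? $parseUrl['host'] : $path_parts[0]);
        }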

//Usage:
//href's from dom
$href_link = "value from dom element";
//parse the url host
$parsed_url = getHost($url);
//add http:// and end slash to parsed host for to be a href link again
$http_parsed_host = "http://$parsed_url/";

//check for some common href beginnings, if there leave link alone, else modify it.
if ((substr($href_link, 0, 8) == "https://") OR (substr($href_link, 0, 12) == "https://www.") OR (substr($href_link, 0, 7) == "http://") OR (substr($href_link, 0, 11) == "http://www.") OR (substr($href_link, 0, 4) == "www.") OR (substr($href_link, 0, 6) == "ftp://")  OR (substr($href_link, 0, 11) == "feed://www.")OR (substr($href_link, 0, 7) == "feed://")) {

         $final_href_link[] = $href_link; {

} else {
if ((substr($href_link, 0, 1) == "/")) {
  $href_link = ltrim($href_link, "/");
}
//remove ../ before ./ so that no stray dot is left behind
$href_links_input = str_replace(array("../", "./"), '', $href_link);
                $final_link = "$http_parsed_host$href_links_input";
                $final_href_link[] = $final_link;
}
$links_array = array_unique($final_href_link);
sort($links_array);
foreach ($links_array as $links) {

//echo "$links<br />";
echo "<a href='$links'>$links</a><br />";
}
?>

dilbertone · December 5, 2010

hello QuickOldCar,

many many thanks for the quick reply!

This is more than I expected. I will try to figure out what your code does. It is great!

Note: I only wanted to get the six or seven values for the labels in the example. Your code does much, much more...

Many thanks, this gives me a great starting point!

Greetings, dilbertone

QuickOldCar · December 6, 2010

I wasn't happy with anything I did before, as it still had complications.

So I sat down and really thought this out, and I came up with better working code.

I set it up to take the URL as a GET value, so try links like this one on my site, with url= at the end:

http://get.blogdns.com/dynaindex/simple-parse?url=http://www.phpfreaks.com/forums/index.php

Put simple_html_dom.php in the same folder.

<?php
include('simple_html_dom.php');

function getHost($url) {
    $parseUrl = parse_url(trim($url));
    //array keys must be quoted; without a scheme the host ends up at the start of the path
    $path_parts = explode('/', isset($parseUrl['path']) ? $parseUrl['path'] : '', 2);
    return trim(!empty($parseUrl['host']) ? $parseUrl['host'] : $path_parts[0]);
}

//the page to parse comes from ?url=...
$url = mysql_real_escape_string($_GET['url']);
//simple way to add the http:// that dom requires, using curl is a better option
if (substr($url, 0, 4) != "http") {
    $url = "http://$url";
}

$parsed_url = getHost($url);

$http_parsed_host = "http://$parsed_url/";
$html = file_get_html($url);
if (!$html) {
    die("could not load the page");
}

//hand the fetched markup to DOMDocument as a string
$dom = new DOMDocument();
@$dom->loadHTML((string) $html);

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

$final_href_link = array();
$mail_links = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $href_link = $href->getAttribute('href');

    //collect mail links separately, they are not pages
    if (substr($href_link, 0, 7) == "mailto:") {
        $mail_links[] = $href_link;
        continue;
    }

    if ((substr($href_link, 0, 8) == "https://") OR (substr($href_link, 0, 7) == "http://") OR (substr($href_link, 0, 6) == "ftp://") OR (substr($href_link, 0, 7) == "feed://")) {
        //already a full link, leave it alone
        $final_href_link[] = $href_link;
    } else {
        //strip leading slashes and put the parsed host in front
        $final_href_link[] = $http_parsed_host . ltrim($href_link, "/");
    }
}

$links_array = array_unique($final_href_link);
sort($links_array);
foreach ($links_array as $links) {
    //echo "$links<br />";
    echo "<a href='$links'>$links</a><br />";
}
foreach (array_unique($mail_links) as $mail_link) {
    echo "<a href='$mail_link'>$mail_link</a><br />";
}

?>

Some other thoughts: you could look at the endings of the href links and sort them by type, such as images in an array of jpg, jpeg, bmp, png, gif, or even audio or video types.
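
As an illustration of sorting links by their endings, something along these lines could work. The function name groupLinksByType and the audio and video extension lists are assumptions chosen for the example; only the image list comes from the post above.

```php
<?php
// Hypothetical helper: groups links by the file extension of their path.
function groupLinksByType(array $links) {
    $types = array(
        'image' => array('jpg', 'jpeg', 'bmp', 'png', 'gif'),
        'audio' => array('mp3', 'wav', 'ogg'),   // assumed list
        'video' => array('mp4', 'avi', 'mpg'),   // assumed list
    );
    $groups = array();
    foreach ($links as $link) {
        // look only at the path, so query strings do not confuse the check
        $path = parse_url($link, PHP_URL_PATH);
        $ext = strtolower(pathinfo((string) $path, PATHINFO_EXTENSION));
        $group = 'other';
        foreach ($types as $name => $extensions) {
            if (in_array($ext, $extensions, true)) {
                $group = $name;
                break;
            }
        }
        $groups[$group][] = $link;
    }
    return $groups;
}

$groups = groupLinksByType(array(
    'http://example.com/a/logo.PNG?x=1',
    'http://example.com/song.mp3',
    'http://example.com/index.php',
));
print_r($groups);
```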

dilbertone · December 6, 2010

Hello QuickOldCar,

great to hear from you again! Overwhelming! I am very happy. You have given me many hints and a great learning curve for diving into PHP programming.

I like it very much that you refer to PHP Simple HTML DOM Parser. That is great. I have PHP Simple HTML DOM Parser running here.

I am trying to get some insight into your code. It is great, and it has some non-trivial parts. I am trying to apply it to the target, this very simple-looking site that I want to parse:

http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=35877&lschb=

Note: these five or six labels are all I want to parse! And if we can do it with PHP Simple HTML DOM Parser, I am happy.

QuickOldCar, you are a great coder, and I like this introduction to this great technique.

Did I put the URL in the right position? I am not very sure.

<?php
include('simple_html_dom.php');

function getHost($url) {
    $parseUrl = parse_url(trim($url));
    //array keys must be quoted; without a scheme the host ends up at the start of the path
    $path_parts = explode('/', isset($parseUrl['path']) ? $parseUrl['path'] : '', 2);
    return trim(!empty($parseUrl['host']) ? $parseUrl['host'] : $path_parts[0]);
}

$url = mysql_real_escape_string($_GET['http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=35877&lschb=']);
//simple way to add the http:// that dom requires, using curl is a better option
if (substr($url, 0, 4) != "http") {
    $url = "http://$url";
}

$parsed_url = getHost($url);

$http_parsed_host = "http://$parsed_url/";
$html = file_get_html($url);
if (!$html) {
    die("could not load the page");
}

//hand the fetched markup to DOMDocument as a string
$dom = new DOMDocument();
@$dom->loadHTML((string) $html);

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

$final_href_link = array();
$mail_links = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $href_link = $href->getAttribute('href');

    //collect mail links separately, they are not pages
    if (substr($href_link, 0, 7) == "mailto:") {
        $mail_links[] = $href_link;
        continue;
    }

    if ((substr($href_link, 0, 8) == "https://") OR (substr($href_link, 0, 7) == "http://") OR (substr($href_link, 0, 6) == "ftp://") OR (substr($href_link, 0, 7) == "feed://")) {
        //already a full link, leave it alone
        $final_href_link[] = $href_link;
    } else {
        //strip leading slashes and put the parsed host in front
        $final_href_link[] = $http_parsed_host . ltrim($href_link, "/");
    }
}

$links_array = array_unique($final_href_link);
sort($links_array);
foreach ($links_array as $links) {
    //echo "$links<br />";
    echo "<a href='$links'>$links</a><br />";
}
foreach (array_unique($mail_links) as $mail_link) {
    echo "<a href='$mail_link'>$mail_link</a><br />";
}

?>

love to hear from you...

greetings

dilbertone
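
For the narrower goal stated above (only the handful of label values), a DOM-based sketch could look like this. It uses PHP's built-in DOMDocument rather than Simple HTML DOM Parser so that it runs on its own. The sample markup, the assumption that each label and value sit in two neighbouring td cells, and the function name labelValues are all guesses for illustration, not the real page's structure.

```php
<?php
// Hypothetical markup; the real page's structure is an assumption.
$page = '<table>'
      . '<tr><td colspan="2">Schuldaten</td></tr>'
      . '<tr><td>Schule:</td><td>Beispielschule</td></tr>'
      . '<tr><td>Ort:</td><td>Musterstadt</td></tr>'
      . '</table>';

// Hypothetical helper: returns an array of label => value pairs.
function labelValues($html) {
    $dom = new DOMDocument();
    // collect warnings about sloppy markup instead of printing them
    libxml_use_internal_errors(true);
    // the xml prefix asks libxml to read the markup as UTF-8
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $data = array();
    foreach ($xpath->query('//tr') as $row) {
        $cells = $xpath->query('td', $row);
        // skip heading rows that do not have a label cell and a value cell
        if ($cells->length < 2) {
            continue;
        }
        $label = rtrim(trim($cells->item(0)->textContent), ':');
        $data[$label] = trim($cells->item(1)->textContent);
    }
    return $data;
}

print_r(labelValues($page));
```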

QuickOldCar · December 7, 2010

As for the first code I posted:

This line:

$final_href_link[] = $href_link; {

Should be this:

$final_href_link[] = $href_link;

But the last code works much better.

Thanks for the kind words. I was just letting you know that in the code I wrote, the simple_html_dom file was in the same folder. You can place it anywhere that can be accessed. I just find it easier to keep it all in an include folder in the root, or in the same folder as what you are working on.

QuickOldCar · December 7, 2010

Ha ha, sorry, I missed what you said about the URL.

The code below will do what you need.

I had

$url = mysql_real_escape_string($_GET['url']);

That was so you would be able to parse any URL with a GET request. The links would have looked something like http://mysite.com/simple-parse.php?url=http://somesite.com

With the code below, all you have to do is visit the page, or do an include, with the site specified in the code.

<?php
include('simple_html_dom.php');

function getHost($url) {
    $parseUrl = parse_url(trim($url));
    //array keys must be quoted; without a scheme the host ends up at the start of the path
    $path_parts = explode('/', isset($parseUrl['path']) ? $parseUrl['path'] : '', 2);
    return trim(!empty($parseUrl['host']) ? $parseUrl['host'] : $path_parts[0]);
}

$url = "http://schulnetz.nibis.de/db/schulen/schule.php?schulnr=35877&lschb=";
//simple way to add the http:// that dom requires, using curl is a better option
if (substr($url, 0, 4) != "http") {
    $url = "http://$url";
}

$parsed_url = getHost($url);

$http_parsed_host = "http://$parsed_url/";
$html = file_get_html($url);
if (!$html) {
    die("could not load the page");
}

//hand the fetched markup to DOMDocument as a string
$dom = new DOMDocument();
@$dom->loadHTML((string) $html);

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

$final_href_link = array();
$mail_links = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $href_link = $href->getAttribute('href');

    //collect mail links separately, they are not pages
    if (substr($href_link, 0, 7) == "mailto:") {
        $mail_links[] = $href_link;
        continue;
    }

    if ((substr($href_link, 0, 8) == "https://") OR (substr($href_link, 0, 7) == "http://") OR (substr($href_link, 0, 6) == "ftp://") OR (substr($href_link, 0, 7) == "feed://")) {
        //already a full link, leave it alone
        $final_href_link[] = $href_link;
    } else {
        //strip leading slashes and put the parsed host in front
        $final_href_link[] = $http_parsed_host . ltrim($href_link, "/");
    }
}

$links_array = array_unique($final_href_link);
sort($links_array);
foreach ($links_array as $links) {
    //echo "$links<br />";
    echo "<a href='$links'>$links</a><br />";
}
foreach (array_unique($mail_links) as $mail_link) {
    echo "<a href='$mail_link'>$mail_link</a><br />";
}

?>
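
For the ?url= variant mentioned above, one option is to validate the parameter instead of escaping it. mysql_real_escape_string is meant for strings that go into SQL queries, needs an open MySQL connection, and was removed in PHP 7.0, so it is not a natural fit here. The sketch below is one possible alternative; cleanUrlParam is a made-up helper name.

```php
<?php
// Hypothetical helper: returns a usable http(s) URL, or null if the input is unusable.
function cleanUrlParam($raw) {
    $url = trim((string) $raw);
    if ($url === '') {
        return null;
    }
    // simple way to add the http:// when it is missing
    if (!preg_match('#^https?://#i', $url)) {
        $url = "http://$url";
    }
    // reject anything that still does not look like a URL
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    return $url;
}

$url = cleanUrlParam(isset($_GET['url']) ? $_GET['url'] : '');
if ($url === null) {
    echo "no valid url given";
}
```

A script that fetches whatever URL a visitor supplies can be abused to reach other hosts, so on a public server it may be worth restricting which hosts are allowed.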

dilbertone · December 7, 2010

Hi QuickOldCar,

you are the man of the year! Many thanks!

Wow, did I get you right: I can access any URL and get some results?

That sounds great.

I will test-run this code later today, then come back and report all my findings.

Many thanks for everything so far!

Greetings,

dilbertone
