N-Bomb(Nerd) Posted June 15, 2009

I've never used DOM before (in fact, I've never parsed an XML file or even used JavaScript, for that matter), and I'm having quite a bit of trouble with it at the moment. I'm trying to extract all of the contents of a div element. However, there are a ton more divs nested within the one I'm trying to extract. Luckily, the element I'm after has a unique id (e7593), so that shouldn't be a problem.

Unfortunately, I have no idea what I'm doing. I've looked at examples and tried studying this, but I can't find anything that works in my case. Everything I've read focuses on XML, when in fact I'm dealing with HTML.

Could someone kindly point me to some resources that may be of use? Or, if I'm lucky enough (and you're quite bored), could someone show me how to accomplish this step by step? Code without an explanation is always accepted, but I wouldn't mind learning so I don't have to bug everyone.

Also, did I mention my very attractive sister (who is available) finds it very attractive when people help me learn about DOM?
rhodesa Posted June 15, 2009

Try this:

<?php
// Load the HTML into SimpleXML
$xml = simplexml_load_file('test.html');
// Find the DIV with id e7593
$nodes = $xml->xpath('//div[@id="e7593"]');
$contents = "";
// Loop over that node's children and add them to $contents
foreach ($nodes[0]->children() as $child) {
    $contents .= $child->asXML() . "\n";
}
echo $contents;
?>
N-Bomb(Nerd) Posted June 15, 2009

I seem to be getting this error with the above code:

Fatal error: Cannot use object of type DOMNodeList as array in test.php on line 42

Whoops, that was from the wrong attempt, lol. The correct errors are:

Warning: simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity

and:

Fatal error: Call to a member function xpath() on a non-object in test.php on line 36
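The fatal error follows from the warning: simplexml_load_file() returns false when it cannot load or parse the file, and calling xpath() on false is what produces "Call to a member function xpath() on a non-object". A minimal sketch of checking for that, assuming a hypothetical helper named loadXmlOrReport() (not part of the thread or of PHP) that collects the libxml messages instead of letting them print as warnings:

```php
<?php
// Hypothetical helper (not from the thread): load an XML file with SimpleXML
// and return the parsed document (or null) plus any parser messages.
function loadXmlOrReport(string $path): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $xml = simplexml_load_file($path); // false on I/O or parse failure

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ['xml' => $xml === false ? null : $xml, 'errors' => $messages];
}
```

A caller would test `$result['xml'] === null` before calling xpath(), and print `$result['errors']` to see why the file was rejected.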
rhodesa Posted June 15, 2009

What does the HTML you are trying to parse look like?
N-Bomb(Nerd) Posted June 15, 2009

Quoting rhodesa: "What does the HTML you are trying to parse look like?"

I accidentally posted the wrong error; I've edited my post above.
rhodesa Posted June 15, 2009

Where is the HTML you are trying to load: a local file or a remote page? What did you change test.html to for your tests?
N-Bomb(Nerd) Posted June 15, 2009

Quoting rhodesa: "Where is the HTML you are trying to load: a local file or a remote page?"

Actually, at first I was trying to load a remote page. Then I saved the file locally and got a BUNCH of parse errors, for example:

Warning: simplexml_load_file() [function.simplexml-load-file]: test.html:138: parser error : Opening and ending tag mismatch: img line 138 and a in test.php on line 40
rhodesa Posted June 15, 2009

K, unfortunately you won't be able to use DOM then. DOM is VERY strict and doesn't allow for small errors here and there (unlike a browser). It would be great if all web developers out there wrote strict XHTML, but that's just not the case.

So, instead, we'll have to use regex. Do you have a link to the HTML page you are parsing? If not, can you post either all the HTML or at least the part you want, with a few extra lines before and after?
N-Bomb(Nerd) Posted June 15, 2009

Quoting rhodesa: "...so, instead we'll have to use REGEX..."

Well, therein lies the problem: the divs I'm trying to grab don't have fixed contents. It's mostly user-submitted data from a website, so the data inside the div will be different every time, as will the code outside of the div. The only fixed thing about the div is its unique id.

Would regex still be a possibility given the above information?
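On the regex question: a unique id is enough to find where the div starts, but not where it ends, because a simple pattern cannot tell the div's own closing tag from those of nested divs. A small sketch on made-up markup (the string below is an assumption, not the real page) shows both ways such a pattern goes wrong:

```php
<?php
// Illustrative markup (an assumption, not the real page): the target div
// contains a nested div and is followed by an unrelated one.
$html = '<div id="e7593">outer <div>inner</div> tail</div><div>other</div>';

// Non-greedy: stops at the first closing tag, which belongs to the nested div.
preg_match('~<div id="e7593">(.*?)</div>~s', $html, $lazy);

// Greedy: runs on to the last closing tag in the whole string.
preg_match('~<div id="e7593">(.*)</div>~s', $html, $greedy);

echo $lazy[1], "\n";   // cut short: "tail" is missing
echo $greedy[1], "\n"; // too much: includes the unrelated div
```

Neither capture equals the real contents of the div, which is why a parser that tracks nesting is the safer tool here.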
RichardRotterdam Posted June 15, 2009

Maybe this is a long shot, but could it be that you're using an HTML 4 doctype? HTML 4 isn't XML but SGML, so it would be logical that it doesn't parse with the simplexml functions.

I did a quick test using DOMDocument instead of simplexml. You might want to try that. Here is the test code, which might be useful to you.

<?php
// put some html into the string $html
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Insert title here</title>
</head>
<body>
<br>
<div id="anassfullofninjas"><br><img src="someimage.jpg"></div>
</body>
</html>
HTML;

// create a DOM document and parse the HTML
$doc = new DOMDocument();
$doc->validateOnParse = true;
$doc->loadHTML($html);

/**
 * get innerHTML of node
 */
$innerHTML = '';
$elem = $doc->getElementById('anassfullofninjas');

// loop through all childNodes, getting html
$children = $elem->childNodes;
foreach ($children as $child) {
    $tmp_doc = new DOMDocument();
    $tmp_doc->appendChild($tmp_doc->importNode($child, true));
    $innerHTML .= $tmp_doc->saveHTML();
}

echo $innerHTML;
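On PHP 5.3.6 and later, saveHTML() accepts a node argument, so the temporary document per child is not strictly needed. A sketch of the same innerHTML loop under that assumption; innerHtml() is a hypothetical helper name, and the element is looked up by tag name only to keep the example independent of getElementById():

```php
<?php
// Hypothetical helper (the name is ours): serialize the children of an
// element, i.e. its "innerHTML", using the element's own document.
function innerHtml(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML() takes an optional node argument since PHP 5.3.6.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

libxml_use_internal_errors(true); // keep parser warnings off the page
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="box"><br><img src="someimage.jpg"></div></body></html>');

$div = $doc->getElementsByTagName('div')->item(0);
echo innerHtml($div), "\n";
```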
N-Bomb(Nerd) Posted June 16, 2009

Quoting RichardRotterdam: "You might want to try that. [code omitted]"

Nice, we're making some progress now. After a bit of testing, I've discovered that the variable $html has to contain the actual HTML code for me to use it. I'm getting content from a website that isn't mine, and I've used curl to read the page into a variable. However, when I plug that variable into "$doc->loadHTML($html);" I get the following errors:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 418 in dom.php on line 26

That would be the line "$doc->loadHTML($page);", where $page comes from the curl execution I did earlier.

Warning: Invalid argument supplied for foreach() in dom.php on line 32

Not much more to say about that error, eh? It's exactly what you posted in your code.

However, when I save the source of the website ($page) to a local file and attempt to parse it, I get only the following:

Warning: Invalid argument supplied for foreach() in dom.php on line 32

Yet when the HTML to parse is already inside the script, like in your example, it works. I'm trying to figure out why.

Edit: does it matter if the div I'm looking for is nested inside other elements?
N-Bomb(Nerd) Posted June 16, 2009

To continue my edit (I accidentally hit submit and it wouldn't let me edit again): does it matter if the div I'm looking for is nested inside other elements? I ask because when I recreate some of the code inside the script to test it (using the same format as you did with $html), it's very basic and the divs aren't nested.

After some more testing, I took a whole div from the website, put it inside the script (just like you did), and ran it. Here's the output I got:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 12 in dom.php on line 60
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 12 in dom.php on line 60
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 27 in dom.php on line 60
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 28 in dom.php on line 60

However, below all of that, the desired output was printed. When I take $html out of the script and load it from an external (still local) file instead, I only get this:

Warning: Invalid argument supplied for foreach() in dom.php on line 67

In that case I don't even get the desired output below the error.
N-Bomb(Nerd) Posted June 16, 2009

Furthermore, after more testing I've figured out some more useful information, I believe.

When I put the HTML into its own separate file, using the website's source stripped down to the basic html and body tags plus the single div (the whole div from the website, with all of its content), it works as long as I suppress the errors using "@" on loadHTML. When I don't suppress the errors, I get the four "DOMDocument::loadHTML()" warnings from the post above, but the output is still there. With suppression I get exactly the desired outcome.

However, when I copy and paste the complete source of the website into the external (still local) file, I get a whole bunch of "DOMDocument::loadHTML()" warnings, and at the bottom it reads:

Warning: Invalid argument supplied for foreach() in dom.php on line 69

How could simply adding more source code stop my desired output and give me a ton more warnings plus a "foreach()" error?
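Two separate things are probably going on here. The loadHTML() warnings are recoverable parser complaints (typically an unescaped & in a URL), and libxml_use_internal_errors() is a cleaner way to silence them than "@" because the messages stay available for inspection. The foreach() warning most likely means getElementById() returned null, so $elem->childNodes was not a list. A sketch with both guards, on made-up markup that is an assumption rather than the real page:

```php
<?php
// Made-up markup for illustration; the unescaped "&" is the kind of thing
// that triggers "htmlParseEntityRef: expecting ';'" on some libxml versions.
$html = '<html><body><div id="e7593"><a href="?a=1&b=2">link</a></div></body></html>';

// Collect parser messages instead of printing them (replaces the "@" operator).
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$warnings = libxml_get_errors(); // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);

// getElementById() can return null; guard before touching childNodes.
$elem = $doc->getElementById('e7593');
if ($elem === null) {
    echo "No element with that id was found\n";
} else {
    echo $elem->childNodes->length, " child node(s)\n";
}
echo count($warnings), " parser message(s) collected\n";
```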
N-Bomb(Nerd) Posted June 16, 2009

Well, I've confused the heck out of myself. I highly doubt anyone will be able to read and understand what I said above.
N-Bomb(Nerd) Posted June 16, 2009

Well, I've figured out my problem: the website I was trying to extract the div from didn't have a DOCTYPE. I simply put a DOCTYPE before the rest of the website's code and it parsed just fine (with errors suppressed, of course). Many hours wasted on a problem whose fix was as simple as that.
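If prepending a DOCTYPE to someone else's markup feels fragile, an XPath query on the id attribute is an alternative that should not depend on it, since it compares attribute values directly rather than relying on the parser's notion of an ID. A sketch, with findById() as a hypothetical helper name and made-up markup that deliberately has no DOCTYPE:

```php
<?php
// Hypothetical helper: find an element by id with XPath, matching on the
// attribute value itself rather than on getElementById()'s ID lookup.
function findById(DOMDocument $doc, string $id): ?DOMElement
{
    $xpath = new DOMXPath($doc);
    // Assumes $id contains no double quote; ids on real pages normally don't.
    $nodes = $xpath->query(sprintf('//*[@id="%s"]', $id));
    if ($nodes === false) {
        return null; // malformed expression
    }
    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node : null;
}

libxml_use_internal_errors(true); // keep parser warnings off the page
$doc = new DOMDocument();
// Made-up markup with no DOCTYPE at all.
$doc->loadHTML('<html><body><div id="e7593"><p>user content</p></div></body></html>');

$div = findById($doc, 'e7593');
echo $div === null ? "not found\n" : $doc->saveHTML($div) . "\n";
```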