I've never used DOM before (in fact, I've never parsed an XML file or even used JavaScript, for that matter), and I'm having quite a bit of trouble with DOM at the moment.

 

I'm trying to extract all of the contents of a div element. However, there are a ton more divs nested within the one I'm trying to extract. Luckily, the element I'm trying to extract has a unique id (e7593), so that shouldn't be a problem.

 

Unfortunately, I have no idea whatsoever what I'm doing. I've looked at examples and tried studying this, but I can't quite figure out anything that works in my case. Everything I've read focuses on XML, when in fact I'm dealing with HTML.

 

Could someone be so kind as to point me to some resources that may be of use?

 

Or, if I'm lucky enough (and you're quite bored), could someone perhaps show me how to accomplish what I need to do, step by step? Sure, code without an explanation is always accepted, but I wouldn't mind learning so I don't have to bug everyone.

 

Also, did I mention my very attractive sister (who is available) finds it very attractive when people help me learn about DOM?  ;D


try this:

<?php
  $xml = simplexml_load_file('test.html'); //Load the HTML into SimpleXML
  $nodes = $xml->xpath('//div[@id="e7593"]'); //Find the DIV with id e7593
  $contents = "";
  foreach($nodes[0]->children() as $child){ //Loop over that node's children and add them to $contents
    $contents .= $child->asXML()."\n";
  }
  echo $contents;
?>

I seem to be getting this error with the above code

 

 

Fatal error: Cannot use object of type DOMNodeList as array in test.php on line 42

 

Whoops that was the wrong attempt.. lol the correct errors are:

Warning: simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity

 

and:

 

Fatal error: Call to a member function xpath() on a non-object in test.php on line 36
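
Side note on the two errors above: simplexml_load_file() returns false when it can't load or parse its input, and calling xpath() on that false value is what triggers the fatal error. A minimal sketch of checking the result first follows. The helper name load_xml_or_null and the sample strings are made up for illustration, and it parses a string rather than a file so it can be run on its own.

```php
<?php
// Hypothetical helper (not from the thread): returns a SimpleXMLElement,
// or null when the input is not well-formed XML.
function load_xml_or_null(string $source): ?SimpleXMLElement
{
    // Collect parser errors internally instead of printing warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml === false ? null : $xml;
}

$xml = load_xml_or_null('<root><div id="e7593"><p>hi</p></div></root>');
if ($xml === null) {
    echo "Could not parse the input as XML\n";
} else {
    // Only call xpath() once we know we have an object.
    $nodes = $xml->xpath('//div[@id="e7593"]');
    echo count($nodes), " matching div(s)\n";
}
```

With a guard like this, a bad path or malformed markup shows up as a readable message instead of "Call to a member function xpath() on a non-object".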

where is the HTML you are trying to load? a local file or a remote page? what did you change test.html to for your tests?

 

Actually at first I was trying to do a remote page, then I saved the file and got a BUNCH of parse errors.. for example:

Warning: simplexml_load_file() [function.simplexml-load-file]: test.html:138: parser error : Opening and ending tag mismatch: img line 138 and a in test.php on line 40

k, unfortunately you won't be able to use DOM then. DOM is VERY strict, and doesn't allow for small errors here and there (unlike a browser). It would be great if all web developers out there had strict XHTML, but it's just not true.

 

so, instead we'll have to use REGEX...do you have a link to the HTML page you are parsing? If not, can you post either all the HTML or at least the code you want with a few extra lines before and after?


Well, see, therein lies the problem: the contents of the divs I'm trying to acquire are not fixed. It's mostly user-submitted data from a website, so the data within the div will be different each time, as will the code outside of the div. The only fixed thing about the div is its unique id.

 

Would REGEX still be a possibility given the above information?
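
To illustrate why nested divs make this awkward for a regex, here is a small sketch. The sample markup is invented for the demonstration, not taken from the page in question. A lazy pattern stops at the first closing tag it sees, and a greedy one runs on to the last closing tag in the whole string, so neither reliably finds the matching end of the div.

```php
<?php
// Made-up sample: the target div contains a nested div and is followed
// by an unrelated sibling div.
$html = '<div id="e7593"><div>inner</div>tail</div><div>other</div>';

// Lazy match: stops at the FIRST </div>, cutting the content short.
preg_match('~<div id="e7593">(.*?)</div>~s', $html, $lazy);

// Greedy match: runs to the LAST </div>, swallowing the sibling div too.
preg_match('~<div id="e7593">(.*)</div>~s', $html, $greedy);

echo "lazy:   ", $lazy[1], "\n";
echo "greedy: ", $greedy[1], "\n";
```

The content actually wanted here is `<div>inner</div>tail`, which neither pattern returns for this input.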

Maybe this is a long shot, but could it be that you're using an HTML 4 doctype?

HTML 4 isn't XML but SGML, so it would be logical that it doesn't parse using the SimpleXML functions.

 

I did a quick test using DOMDocument instead of SimpleXML.

You might want to try that.

 

Here is the test code I wrote, which might be useful to you.

<?php
// put some html into the string $html
$html=<<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Insert title here</title>
</head>
<body>
<br>
<div id="anassfullofninjas"><br><img src="someimage.jpg"></div>
</body>
</html>
HTML;

// create dom element
$doc = new DOMDocument();
$doc->validateOnParse = true;
$doc->loadHTML($html);

/**
*get innerHTML of node
*/
$innerHTML = '';
$elem = $doc->getElementById('anassfullofninjas');

// loop through all childNodes, getting html       
$children = $elem->childNodes;
foreach ($children as $child) {
    $tmp_doc = new DOMDocument();
    $tmp_doc->appendChild($tmp_doc->importNode($child,true));       
    $innerHTML .= $tmp_doc->saveHTML();
} 

echo $innerHTML;
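
A slightly more defensive variant of the same idea is sketched below, under a couple of assumptions: the helper name inner_html_by_id and the sample markup are made up, and it relies on DOMDocument::saveHTML() accepting a node argument, which needs PHP 5.3.6 or later. The main difference is the null check: getElementById() returns null when nothing matches, and looping over the children of that null is what produces "Invalid argument supplied for foreach()".

```php
<?php
// Hypothetical helper (not from the thread): returns the inner HTML of the
// element with the given id, or null when no such element is found.
function inner_html_by_id(DOMDocument $doc, string $id): ?string
{
    $elem = $doc->getElementById($id);
    if ($elem === null) {
        return null; // nothing matched, so there are no children to loop over
    }

    $html = '';
    foreach ($elem->childNodes as $child) {
        // Serialize each child directly, without a temporary document.
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE html><html><body><div id="demo"><p>Hello</p></div></body></html>');

var_dump(inner_html_by_id($doc, 'demo'));
var_dump(inner_html_by_id($doc, 'missing'));
```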


Nice, we're making some progress now. After a bit of testing, I've discovered that the variable "$html" has to have the actual HTML code in it for me to use it.

I'm getting content off of a website that isn't mine, and I've used curl to read the website into a variable. However, when I pass that variable to "$doc->loadHTML($html);" I get the following errors:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 418 in dom.php on line 26

 

That refers to the line "$doc->loadHTML($page);", where "$page" comes from the curl call I made earlier.

 

Warning: Invalid argument supplied for foreach() in dom.php on line 32

 

Not much more to say about that error, eh? It comes from exactly the code you posted.

 

 

However, when I save the source of the website ($page) to a local file and attempt to parse it I get the following:

 

Warning: Invalid argument supplied for foreach() in dom.php on line 32

However, when I have the html code to parse inside of the script already like you provided, it works..

Trying to figure out why this is so..

 

Edit: does it matter if the div I'm looking for is nested inside other elements?

To continue my edit ( accidentally hit submit and it wouldn't let me edit again.. )..

 

Does it matter if the div I'm looking for is nested inside other elements? I ask because when I recreate some of the code inside the script to test it (using the same format as you did with $html), it's very basic and the divs aren't nested.

 

--

 

After some more testing, I took a whole div from the website, put it inside the script (just like you had done), and ran it. Here's the output I got:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 12 in dom.php on line 60

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 12 in dom.php on line 60

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 27 in dom.php on line 60

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 28 in dom.php on line 60

 

However, below all of that, the desired output was printed.

 

When I take $html out of the script, though, and try to load the same HTML from an external (but still local) file, I only get this:

 

Warning: Invalid argument supplied for foreach() in dom.php on line 67

 

With this, I don't even get the desired output below the error.

Furthermore, after more testing I've figured out some more useful information, I believe.

 

When I recreate the local $html in its own separate file, using the website's HTML source stripped down to the basic html and body tags with the single div (the whole div from the website, with all of its content) inside, it works when I suppress the errors using "@" on the loadHTML call.

 

When I don't suppress the errors, I get the 4 "DOMDocument::loadHTML()" warnings from the post above, but the output is there. By suppressing them I get the desired outcome.
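
As an alternative to the "@" operator, libxml can be told to collect parser messages instead of printing them, which keeps them available for inspection. A minimal sketch follows. The helper name load_messy_html and the sample markup are assumptions for illustration, and exactly which messages libxml reports for a given input can vary between libxml versions.

```php
<?php
// Hypothetical helper (not from the thread): parses possibly malformed HTML
// without printing warnings, and returns the document plus any messages.
function load_messy_html(string $html): array
{
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ['doc' => $doc, 'messages' => $messages];
}

// Made-up sample with an unescaped "&" in a URL, the kind of markup that
// can trigger "htmlParseEntityRef" warnings.
$result = load_messy_html(
    '<html><body><div id="e7593"><a href="?a=1&b=2">link</a></div></body></html>'
);

echo $result['doc']->getElementsByTagName('a')->length, " link(s) found\n";
foreach ($result['messages'] as $message) {
    echo $message, "\n";
}
```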

 

However, when I copy and paste the complete source of the website into the external (still local) file and parse that, I get a whole bunch of "DOMDocument::loadHTML()" warnings, and at the bottom it reads:

Warning: Invalid argument supplied for foreach() in dom.php on line 69

 

How could simply adding more source code stop my desired output and give me a ton more warnings plus a "foreach()" error?

 

Well, I've figured out my problem: the website I was trying to extract the div from didn't have a DOCTYPE. I simply put a DOCTYPE before the rest of the website's code and it parsed just fine (errors suppressed, of course).
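
For anyone landing here later, a rough sketch of that fix is below, together with an XPath lookup that matches on the id attribute directly and so doesn't depend on how the parser registers ids. Both helper names, ensure_doctype and find_by_id_attribute, are made up for illustration, and the HTML 4.01 doctype string is simply the one from the test code earlier in the thread.

```php
<?php
// Hypothetical helper: prepend a DOCTYPE when the markup doesn't start with one.
function ensure_doctype(string $html): string
{
    if (stripos(ltrim($html), '<!DOCTYPE') === 0) {
        return $html; // already has one
    }
    $doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
             . '"http://www.w3.org/TR/html4/loose.dtd">';
    return $doctype . "\n" . $html;
}

// Hypothetical helper: find an element by its id attribute using XPath.
// Assumes $id contains no double quotes.
function find_by_id_attribute(DOMDocument $doc, string $id): ?DOMElement
{
    $xpath = new DOMXPath($doc);
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);
    return $node instanceof DOMElement ? $node : null;
}

$page = '<html><body><div id="e7593"><span>user content</span></div></body></html>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML(ensure_doctype($page));
libxml_clear_errors();

$div = find_by_id_attribute($doc, 'e7593');
echo $div === null ? "not found\n" : "found <" . $div->tagName . ">\n";
```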

 

Many hours wasted trying to figure out this problem when it was as simple as that..
