[SOLVED] Regular Expressions - Parsing xhtml?

hdshngout · February 27, 2007

I have been working on a basic parser for XHTML, but I have run into a problem. I used curl to get the page and was able to retrieve the title, but from there I hit a snag. After doing some research, I discovered that the code I was using to get the data does not match across newlines (\n), and I can't very well parse without them. I need help extracting what's between <body> and </body>. The code I am using is:

//Parser
preg_match('/<body>(.*)<\/body>/', $content, $file_match);
print_r($file_match);

(It's not reading at all right now.) If anyone could help me, I would greatly appreciate it!

-Nick

c4onastick · February 27, 2007

So you're not getting any output from the above code? Could you post a little example of the xhtml you're going to be parsing?

You've got the right approach here. You probably can get away with greedy quantifiers since there should only be one '</body>' in the xhtml. I try to stay away from using '.*' and '.*?' as much as I can. It just gets you into trouble sometimes when you let anything in. Not that the solution I propose below is much more stringent (a lot uglier too), but at least you've set some conditions.

//Parser
preg_match('#<body>((?:[^<]+|<(?!/body))+)</body>#', $content, $file_match);
print_r($file_match);
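For what it's worth, here is a small self-contained sketch of why that pattern needs no modifier: the negated class [^<] happily matches newlines. The $sample string is made up for illustration, not taken from a real page.

```php
<?php
// Hypothetical sample input with newlines inside <body>.
$sample = "<html>\n<body>\n  <p>Hello world !</p>\n</body>\n</html>";

// [^<]+ consumes any run of non-'<' characters (newlines included);
// <(?!/body) allows a '<' only when it does not start '</body'.
$pattern = '#<body>((?:[^<]+|<(?!/body))+)</body>#';

$matched = preg_match($pattern, $sample, $file_match);
$body = $matched === 1 ? trim($file_match[1]) : null;

var_dump($matched); // int(1)
var_dump($body);    // the <p> element, without the surrounding whitespace
```

Note that this only handles a bare <body> tag; a <body class="..."> would need the opening part of the pattern loosened.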

Have you tried using the /s modifier with your code?

preg_match('/<body>(.*)<\/body>/s', $content, $file_match);

That's really the only reason why yours wouldn't work (and the only way I could get it to fail when I was testing it with a little HTML example I conjured up).
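A quick side-by-side, as a sketch with a made-up $content string, shows the difference the modifier makes. By default '.' does not match "\n"; with /s (PCRE's "dotall" mode) it does.

```php
<?php
// Hypothetical sample input with newlines inside <body>.
$content = "<body>\n  <p>Hello world !</p>\n</body>";

// Without /s, '.' stops at "\n", so '.*' can never reach '</body>'.
$without = preg_match('/<body>(.*)<\/body>/', $content, $m1);

// With /s, '.' also matches newlines.
$with = preg_match('/<body>(.*)<\/body>/s', $content, $m2);

var_dump($without); // int(0)
var_dump($with);    // int(1)
```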

Are you getting these XHTML files from/on a Windows machine? Windows uses a different line-ending sequence than '\n': '\r\n', if memory serves. You may need to run a preg_replace on it first to get it back into the UNIX (read: correct) scheme.
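If you do want to normalize, something along these lines should do it. This is only a sketch; the $content string is a made-up example with Windows line endings.

```php
<?php
// Hypothetical input using Windows (CRLF) line endings.
$content = "<body>\r\n  <p>Hello world !</p>\r\n</body>";

// Convert CRLF, and any lone CR, to a single LF.
$normalized = preg_replace('/\r\n?/', "\n", $content);

var_dump(strpos($normalized, "\r")); // bool(false): no carriage returns left
```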

hdshngout · February 28, 2007

Basically, what I am doing is letting users create "widgets" which can be used on other parts of the site. In keeping with the standard set out by other sites, the code for these widgets is written in basic XHTML. For example, if a "widget" were to contain

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head profile="http://www.netvibes.com/api/0.3/profile">
  <title>Hello world</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

</head>
<body>
  <p>Hello world !</p>
</body>
</html>

The information I would need from this widget would be between,

<title></title>

and

<body></body>

I have been able to "parse" the text between <title></title>, but I have not been successful with <body></body>. I don't know how to get the code out from between <body> and </body>.

My test code, if anyone needs it, is

<?php
$address = $_GET['address'];
$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, $address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);

$content2 = htmlentities($file_contents);
$content = $content2;
echo $content;
preg_match('/<title>(.*)<\/title>/', $content, $title_match);
preg_match('/<body>(.*)<\/body>/', $content, $file_match);
?>
<table>
<tr><td><strong><?php echo $title_match[1]; ?></strong></td></tr>
<tr><td><?php echo $file_match[1]; ?></td></tr>
</table>

I know it's not secure, but I am just attempting to get a basic handle on getting what I need from the file.

c4onastick · February 28, 2007

Is this:

htmlentities($file_contents);

required?

Seems like extra processing to me, when you could just match one character, '<', instead of four, '&lt;'.

Tell you what I do when scraping like this. Just do the curl step, echo the output to the terminal, paste the whole thing into RegexTester and see what you can't get to work. (It's way easier than modifying the PHP each try, and quicker too.)
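To illustrate what that call does to the markup (a sketch only; $file_contents here is a made-up fragment standing in for the curl output): once the string has been through htmlentities(), a pattern written against the raw tags has nothing left to match.

```php
<?php
// Hypothetical fragment standing in for the curl output.
$file_contents = "<body>\n  <p>Hello world !</p>\n</body>";

$escaped = htmlentities($file_contents);

// After escaping, the literal '<body>' tag is no longer in the string...
$raw_hit = preg_match('/<body>(.*)<\/body>/s', $escaped);
// ...but its entity-encoded form is.
$entity_hit = preg_match('/&lt;body&gt;(.*)&lt;\/body&gt;/s', $escaped);

var_dump($raw_hit);    // int(0)
var_dump($entity_hit); // int(1)
```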

Your code from before:

preg_match('/<body>(.*)<\/body>/s', $content, $file_match);

works with your example above.
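Putting it together, here is a rough sketch of how I might pull both pieces out. It assumes the markup is matched raw and only escaped when printed. extract_tag() is a hypothetical helper I made up for this post, not a PHP built-in, and it uses a lazy '.*?' rather than the greedy '.*' above so it stops at the first closing tag. The $content string is a trimmed copy of the widget you posted, standing in for the curl output.

```php
<?php
// Stand-in for the curl output: a trimmed copy of the posted widget.
$content = <<<'XHTML'
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Hello world</title>
</head>
<body>
  <p>Hello world !</p>
</body>
</html>
XHTML;

// Hypothetical helper: returns the trimmed text between <tag> and </tag>,
// or null when the tag is not found. Only handles tags without attributes.
function extract_tag(string $html, string $tag): ?string
{
    $t = preg_quote($tag, '#');
    // /s lets '.' cross newlines; '.*?' stops at the first closing tag.
    if (preg_match("#<$t>(.*?)</$t>#s", $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

$title = extract_tag($content, 'title');
$body  = extract_tag($content, 'body');

// Escape only at output time, after matching against the raw markup.
echo htmlentities((string) $title), "\n";
echo htmlentities((string) $body), "\n";
```

For anything much messier than this (attributes on the tags, nested widgets), a real XML parser would probably be a safer bet than regular expressions.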
