
[SOLVED] Regular Expressions - Parsing xhtml?


hdshngout


I have been working on a basic parser for XHTML, but I have run into a problem. I used cURL to fetch the page and was able to retrieve the title, but from there I hit a snag. After doing some research, I discovered that the code I was using to get the data could not read across newlines (\n), and I can't very well parse without them. I need help parsing between <body> and </body>. The code I am using is:

 

//Parser
preg_match('/<body>(.*)<\/body>/', $content, $file_match);
print_r($file_match);

 

(It's not reading anything at all right now.) If anyone could help me, I would greatly appreciate it!

 

-Nick


So you're not getting any output from the above code? Could you post a little example of the xhtml you're going to be parsing?

 

You've got the right approach here. You probably can get away with greedy quantifiers since there should only be one '</body>' in the xhtml. I try to stay away from using '.*' and '.*?' as much as I can. It just gets you into trouble sometimes when you let anything in. Not that the solution I propose below is much more stringent (a lot uglier too), but at least you've set some conditions.

 

//Parser
preg_match('#<body>((?:[^<]+|<(?!/body))+)</body>#', $content, $file_match);
print_r($file_match);
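To show what the extra conditions buy you, here is a small sketch comparing the two patterns. The sample string, with a stray '</body>' inside a comment, is made up for the demonstration. Because [^<] also matches newlines, the stricter pattern does not need any modifier to cross line breaks, and it stops at the first '</body>' rather than the last.

```php
<?php
// Hypothetical input: a multi-line body plus a stray closing tag in a comment.
$content = "<body>\n<p>one</p>\n</body>\n<!-- stray </body> -->";

// Greedy dot with /s: runs to the LAST '</body>' in the string.
preg_match('/<body>(.*)<\/body>/s', $content, $greedy);

// Stricter pattern: any run of non-'<' characters, or a '<' that does not
// start '</body'. It stops at the FIRST '</body>'.
preg_match('#<body>((?:[^<]+|<(?!/body))+)</body>#', $content, $strict);

echo trim($strict[1]), "\n";  // <p>one</p>
var_dump(strpos($greedy[1], '</body>') !== false); // bool(true)
```

On very large pages the nested quantifiers in the stricter pattern may backtrack heavily, so it is worth testing against realistic input.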

 

Have you tried using the /s modifier with your code?

preg_match('/<body>(.*)<\/body>/s', $content, $file_match);

That's really the only reason why yours wouldn't work (and the only way I could get it to fail when I was testing it with a little HTML example I conjured up).
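A minimal sketch of the difference, using a made-up sample string: by default the dot matches any character except a newline, and the /s modifier (PCRE_DOTALL) lifts that restriction.

```php
<?php
// Hypothetical sample input: a body that spans several lines.
$content = "<body>\n  <p>Hello world !</p>\n</body>";

// Without /s the dot cannot cross the newline right after <body>,
// so the pattern never reaches </body>.
$without = preg_match('/<body>(.*)<\/body>/', $content, $m1);

// With /s the dot also matches newlines, so the whole body is captured.
$with = preg_match('/<body>(.*)<\/body>/s', $content, $m2);

var_dump($without);        // int(0)
var_dump($with);           // int(1)
echo trim($m2[1]), "\n";   // <p>Hello world !</p>
```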

 

Are you getting these XHTML files from/on a Windows machine? Windows uses a different line-ending sequence: '\r\n' rather than '\n', if memory serves. You may need to run a preg_replace on it first to get it back into the UNIX (read: correct) scheme.
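Something along these lines should do for the line endings; the sample string is hypothetical. With /s in place the body pattern should match either way, so this step is mostly about keeping the captured text consistent.

```php
<?php
// Hypothetical sample with Windows-style line endings.
$content = "<body>\r\n  <p>Hello world !</p>\r\n</body>";

// Convert "\r\n" (Windows) and any lone "\r" to "\n".
$normalized = preg_replace('/\r\n?/', "\n", $content);

var_dump(strpos($normalized, "\r")); // bool(false)
```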


Basically, what I am doing is letting users create "widgets" which can be used on other parts of the site.  In keeping with the standard set out by other sites, the coding for these widgets is done in basic XHTML format.  For example, if a "widget" were to contain

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head profile="http://www.netvibes.com/api/0.3/profile">
  <title>Hello world</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

</head>
<body>
  <p>Hello world !</p>
</body>
</html>

 

The information I would need from this widget would be between,

 

<title></title>

 

and

 

<body></body>

 

I have been able to "parse" the text between <title></title>, but I have not been successful with <body></body>.  I don't know how to get the code out from between <body> and </body>.

 

My test code, if anyone needs it, is

 

<?php
$address = $_GET['address'];
$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, $address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);

$content2 = htmlentities($file_contents);
$content = $content2;
echo $content;
preg_match('/<title>(.*)<\/title>/', $content, $title_match);
preg_match('/<body>(.*)<\/body>/', $content, $file_match);
?>
<table>
<tr><td><strong><?php echo $title_match[1]; ?></strong></td></tr>
<tr><td><?php echo $file_match[1]; ?></td></tr>
</table>

 

I know it's not secure, but I am just attempting to get a basic handle on getting what I need from the file.


Is this:

htmlentities($file_contents);

required?

 

Seems like extra processing to me, when you could just match one character, '<', instead of four, '&lt;'.
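A quick sketch of what I mean, with a made-up string standing in for the curl output. Once htmlentities() has run, the tags in the string are entity-encoded, so a pattern written with a literal '<' finds nothing. One option (my assumption about what you want, not a requirement) is to match the raw markup and escape only when echoing it back.

```php
<?php
// Hypothetical stand-in for the string returned by curl_exec().
$file_contents = "<title>Hello world</title>\n<body>\n  <p>Hello world !</p>\n</body>";

// After htmlentities() every '<' is '&lt;' and every '>' is '&gt;',
// so the literal-tag pattern no longer matches.
$escaped = htmlentities($file_contents);
$hits_escaped = preg_match('/<body>(.*)<\/body>/s', $escaped, $unused);

// Matching the raw string works; escape only for display.
preg_match('/<title>(.*)<\/title>/s', $file_contents, $title_match);
preg_match('/<body>(.*)<\/body>/s', $file_contents, $file_match);

var_dump($hits_escaped);                        // int(0)
echo $title_match[1], "\n";                     // Hello world
echo htmlentities(trim($file_match[1])), "\n";  // &lt;p&gt;Hello world !&lt;/p&gt;
```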

Tell you what I do when scraping like this. Just do the curl step, echo the output to the terminal, paste the whole thing into RegexTester and see what you can't get to work. (It's way easier than modifying the PHP on each try, and quicker too.)

 

Your code from before:

preg_match('/<body>(.*)<\/body>/s', $content, $file_match);

works with your example above.

