hdshngout Posted February 27, 2007

I have been working on a basic parser for XHTML, but I have run into a problem. I used curl to fetch the page and was able to retrieve the title, but from there I hit a snag. After some research, I discovered that the code I was using to get the data does not match across newlines (\n), and I can't very well parse without them. I need help parsing between <body> and </body>. The code I am using is:

    //Parser
    preg_match('/<body>(.*)<\/body>/', $content, $file_match);
    print_r($file_match);

(It's not matching anything at all right now.)

If anyone could help me, I would greatly appreciate it!

-Nick

c4onastick Posted February 27, 2007

So you're not getting any output from the above code? Could you post a small example of the XHTML you're going to be parsing?

You've got the right approach here. You can probably get away with greedy quantifiers, since there should only be one '</body>' in the XHTML. I try to stay away from '.*' and '.*?' as much as I can; letting anything in sometimes gets you into trouble. Not that the solution I propose below is much more stringent (it's a lot uglier, too), but at least you've set some conditions:

    //Parser
    preg_match('#<body>((?:[^<]+|<(?!/body))+)</body>#', $content, $file_match);
    print_r($file_match);

Have you tried using the /s modifier with your code?

    preg_match('/<body>(.*)<\/body>/s', $content, $file_match);

Without /s, '.' does not match a newline, and that's really the only reason why yours wouldn't work. (It was also the only way I could get it to fail when testing with a little HTML example I conjured up.)

Are you getting these XHTML files from a Windows machine? Windows uses a different newline sequence: '\r\n' rather than '\n', if memory serves. You may need to run a preg_replace on the content first to convert it back to the UNIX (read: correct) scheme.
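
A minimal sketch of the two suggestions above, the /s modifier and line-ending normalization. The sample string is made up for illustration; in the real script the content would come from curl_exec(). The variable names $without, $with and $body are likewise only illustrative.

```php
<?php
// Hypothetical sample input with Windows (\r\n) line endings.
$content = "<html>\r\n<body>\r\n<p>Hello world !</p>\r\n</body>\r\n</html>";

// Normalize \r\n (and a lone \r) to \n before matching.
$content = preg_replace('/\r\n?/', "\n", $content);

// Without /s, "." stops at "\n", so a multi-line body never matches.
$without = preg_match('/<body>(.*)<\/body>/', $content, $m1);

// With /s (PCRE_DOTALL), "." also matches newlines.
$with = preg_match('/<body>(.*)<\/body>/s', $content, $m2);

// Trim the newlines that surround the captured body.
$body = $with === 1 ? trim($m2[1]) : null;
```

Here $without ends up as 0 (no match) and $with as 1, which is the difference the /s modifier makes on multi-line input.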

hdshngout Posted February 28, 2007

Basically, what I am doing is letting users create "widgets" which can be used on other parts of the site. In keeping with the standard set by other sites, the coding for these widgets is done in basic XHTML format. For example, a "widget" might contain:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head profile="http://www.netvibes.com/api/0.3/profile">
    <title>Hello world</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    </head>
    <body>
    <p>Hello world !</p>
    </body>
    </html>

The information I would need from this widget is whatever sits between <title></title> and between <body></body>. I have been able to "parse" the text between <title></title>, but I have not been successful with the body. I don't know how to get the code out from between <body> and </body>.

My test code, if anyone needs it, is:

    <?php
    $address = $_GET['address'];
    $ch = curl_init();
    $timeout = 5; // set to zero for no timeout
    curl_setopt ($ch, CURLOPT_URL, $address);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $file_contents = curl_exec($ch);
    curl_close($ch);
    $content2 = htmlentities($file_contents);
    $content = $content2;
    echo $content;
    preg_match('/<title>(.*)<\/title>/', $content, $title_match);
    preg_match('/<body>(.*)<\/body>/', $content, $file_match);
    ?>
    <table>
    <tr><td><strong><?php echo $title_match[1]; ?></strong></td></tr>
    <tr><td><?php echo $file_data[1]; ?></td></tr>
    </table>

I know it's not secure, but I am just attempting to get a basic handle on getting what I need from the file.

c4onastick Posted February 28, 2007

Is this required?

    htmlentities($file_contents);

It seems like extra processing to me, when you could just match one character, '<', instead of four, '&lt;'.

Here's what I do when scraping like this: just do the curl step, echo the output to the terminal, paste the whole thing into RegexTester, and work on the pattern there. (It's way easier than modifying the PHP on each try, and quicker too.)

Your code from before:

    preg_match('/<body>(.*)<\/body>/s', $content, $file_match);

works with your example above.
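
A rough sketch of the htmlentities() point, assuming PHP 7.1 or later: match against the raw markup first, and escape only if the result is going to be printed as source. The helper extract_between() is hypothetical (it is not from the thread), and the heredoc is a shortened stand-in for the curl_exec() result, based on the widget example posted earlier. Note also that the posted test code stores its match in $file_match but echoes $file_data[1], so the body would not print even after a successful match.

```php
<?php
// Hypothetical helper: returns the text between <tag ...> and </tag>,
// or null when the tag is not found.
function extract_between(string $markup, string $tag): ?string
{
    $t = preg_quote($tag, '#');
    // s: "." matches newlines; i: case-insensitive tag names.
    // (.*?) is lazy, so the match stops at the first closing tag.
    $pattern = '#<' . $t . '\b[^>]*>(.*?)</' . $t . '>#si';
    return preg_match($pattern, $markup, $m) === 1 ? trim($m[1]) : null;
}

// Stand-in for the curl_exec() result (shortened widget example).
$file_contents = <<<XHTML
<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://www.netvibes.com/api/0.3/profile">
<title>Hello world</title>
</head>
<body>
<p>Hello world !</p>
</body>
</html>
XHTML;

// Match against the raw markup first...
$title = extract_between($file_contents, 'title');
$body  = extract_between($file_contents, 'body');

// ...and escape only when the source itself should be displayed.
$escaped_body = htmlentities((string) $body);

// Escaping the whole page first turns "<body>" into "&lt;body&gt;",
// so a pattern that contains a literal "<body>" finds nothing.
$no_match = preg_match('/<body>(.*)<\/body>/s', htmlentities($file_contents));
```

Whether the extracted body should be escaped at all depends on whether the widget is meant to be rendered or shown as source; the sketch only shows that the escaping step belongs after the match, not before it.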