[SOLVED] PHP Spider and Regular Expressions

jjacquay712 · October 27, 2008

I need to extract the text out of the first <p> tag in a web page for my spider. I'm using preg_match_all with this pattern: /<p>([^<]*)<\/p>/ but it's not working. I have no idea how to use regular expressions, so any help would be appreciated.

Jeremysr · October 27, 2008

The only problem with your regex is that it won't work if there is a '<' inside the tags. This should work:

preg_match('/<p>(.*?)<\/p>/', $text, $matches);
$p_tag_text = $matches[1];
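
One thing to watch with the snippet above: preg_match returns 0 when nothing matches, and then $matches[1] is not set. A minimal sketch of a guarded version, assuming the page is already loaded into a string (the function name first_p_text is made up for this example, not something from the thread):

```php
<?php
// Hypothetical helper: returns the contents of the first <p>...</p>, or null if there is none.
function first_p_text(string $html): ?string
{
    // preg_match returns 1 on a match, 0 on no match, and false on error.
    if (preg_match('/<p>(.*?)<\/p>/', $html, $matches) === 1) {
        return $matches[1]; // group 1 is the text between the tags
    }
    return null;
}

var_dump(first_p_text('<div><p>Hello <b>world</b></p><p>second</p></div>'));
var_dump(first_p_text('no paragraphs here'));
```

Because .*? is lazy, it stops at the first </p>, so inline tags such as <b> inside the paragraph are kept in the result.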

msiekkinen · October 27, 2008

A more robust regex:

~<\s*p\b[^>]*>(.*?)<\s*/\s*p\s*>~is
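
To show what the extra pieces buy you, here is a small sketch that runs this pattern against a paragraph with an attribute, an uppercase tag name, and a line break inside it. The sample markup and variable names are invented for illustration:

```php
<?php
// Made-up sample input: uppercase tag, an attribute, and a newline inside the paragraph.
$text = "<body>\n<P class=\"intro\">First\nparagraph</P>\n<p>Second</p>\n</body>";

// i: match tag names case-insensitively; s: let . match newlines as well.
// [^>]* skips over any attributes, and \b keeps <pre> from matching as <p>.
$pattern = '~<\s*p\b[^>]*>(.*?)<\s*/\s*p\s*>~is';

$first = preg_match($pattern, $text, $m) === 1 ? $m[1] : null;
var_dump($first);
```

Using ~ as the delimiter means the / in the closing tag does not need escaping.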

However, if you have something like

<p>some text
<p>More text</p>
Final text</p>

You'll only capture up to "More text", which I imagine might not be what you want. A better approach would be to use tidy or the DOM processing libraries to access the first <p> you find, so you can properly get all its children.
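
For the DOM route, a rough sketch using PHP's bundled DOMDocument class might look like the following. The helper name first_p_text_dom and the sample markup are assumptions for the example, and it expects a non-empty HTML string:

```php
<?php
// Hypothetical helper built on PHP's DOM extension.
function first_p_text_dom(string $html): ?string
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is the first <p> in document order, or null if there is none.
    $p = $doc->getElementsByTagName('p')->item(0);
    // textContent joins the text of the element and all of its descendants.
    return $p === null ? null : $p->textContent;
}

echo first_p_text_dom('<html><body><p>some <b>bold</b> text</p><p>More text</p></body></html>');
```

Note that the parser decides where each paragraph ends, so with broken or nested markup the first <p> it reports may not line up with what a regex would have captured.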

ghostdog74 · October 28, 2008

There's no need to use a regex if you're not familiar with them. PHP has many string functions you can use instead, such as strpos:

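$startpos = strpos($data, "<p>");
// Look for the closing tag only after the opening one
$endpos = ($startpos === false) ? false : strpos($data, "</p>", $startpos);
if ($endpos !== false) {
    $start = $startpos + strlen("<p>"); // first character after <p>
    echo substr($data, $start, $endpos - $start); // third argument is a length, not an end position
}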
