Need a little regex help. My first little scrape.

sptrsn · February 27, 2012

After going through a dozen php scraper classes, I finally got one to give me back the html in a string. Woohoo!!

I only need one little piece of data from this page. And the data that I want is in a span tag with an id.. Like so....

<span id="TotalDue">$1,004.28</span>

I want that dollar figure from between the tags: "$1,004.28"

I've tried umpteen different ways. But, as you well know, regular expressions can be hard for newbs.

Is there some kind soul out there that can help me with this?

JAY6390 · February 27, 2012

~<span id="TotalDue">([^<]+)</span>~

The value will be stored in capture group 1
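
For anyone following along, here is a minimal sketch of how that pattern might be used with preg_match(). It assumes the page HTML is already in a string; $html and $totalDue are stand-in names for illustration, not something from the original post.

```php
<?php
// $html is a hypothetical stand-in for the scraped page.
$html = '<div><span id="TotalDue">$1,004.28</span></div>';

// ~ is the delimiter, so the / in </span> needs no escaping.
$pattern = '~<span id="TotalDue">([^<]+)</span>~';

// preg_match() returns 1 on a match, 0 on no match, false on error.
if (preg_match($pattern, $html, $matches) === 1) {
    // $matches[0] is the whole match; $matches[1] is capture group 1.
    $totalDue = $matches[1];
} else {
    $totalDue = null;
}

echo $totalDue, PHP_EOL; // $1,004.28
```

preg_match() stops at the first match, which should be enough when the id is unique on the page.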

sptrsn · February 27, 2012

ok. That helps some. So then should this work?

I've obviously got something wrong as it doesn't work.

$str = '<span id="TotalDue">$1,004.28</span>';

$pattern = '/<span id="TotalDue">([^<]+)</span>/';

preg_match_all($pattern, $str, $matches);

print_r($matches[0]);

sptrsn · February 27, 2012

Well I found something else that worked. (Gotta love Google)

$str = '<span id="TotalDue">$1,004.28</span>';

$doc = new DOMDocument();
$doc->loadHtml($str);
$el = $doc->getElementById('TotalDue');
echo $el->textContent;

Hadn't ever even heard of that function, but I found an example that looked like it might do the trick.. and voila!

JAY6390 · February 27, 2012

Note that in my regex I used ~ ~ as the start/end delimiters, instead of /

That is why you are getting an error. While using the DOM Document is great, it's probably overkill for your example
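
To spell that out: with / as the delimiter, the / inside </span> is taken as the end of the pattern, and whatever follows is read as modifiers, which PHP rejects with a warning. A small sketch of the two usual fixes (the variable names are only for illustration):

```php
<?php
$str = '<span id="TotalDue">$1,004.28</span>';

// Fix 1: keep / as the delimiter and escape the / in the closing tag.
$escaped = '/<span id="TotalDue">([^<]+)<\/span>/';

// Fix 2: pick a delimiter that does not occur in the pattern.
$tilde = '~<span id="TotalDue">([^<]+)</span>~';

preg_match($escaped, $str, $a);
preg_match($tilde, $str, $b);

// Both patterns capture the same text in group 1.
var_dump($a[1] === $b[1]); // bool(true)
echo $a[1], PHP_EOL;       // $1,004.28
```

Also note that index 0 of the matches array holds the full match, tags included; the captured value is at index 1.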

AyKay47 · February 27, 2012

Note that in my regex I used ~ ~ as the start/end delimiters, instead of /

That is why you are getting an error. While using the DOM Document is great, it's probably overkill for your example

really, so is using a regex.

Simple string functions will do.

$str = '<span id="TotalDue">$1,004.28</span>';
$start_pos = strpos($str,">") + 1;
$end_pos = strpos($str, "<", $start_pos);
$string = substr($str, $start_pos, $end_pos - $start_pos);

*untested*

JAY6390 · February 27, 2012

Surely that would match the data between all tags on a page, not just the specific data in the span tag that the OP requested?

I'm assuming the string given is only part of the page, not the whole thing

silkfire · February 27, 2012

sptrsn, who taught you to use HTML scraping with regex? Please use DOM with XPath; it's so much faster and more precise. Your query would be as simple as: //span[@id="TotalDue"]
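
A rough sketch of what that could look like with PHP's DOMDocument and DOMXPath classes. The $html string is a made-up stand-in for the scraped page, and the libxml error handling is an assumption about how messy the real page is, not something from the thread.

```php
<?php
// Hypothetical stand-in for the scraped page.
$html = '<html><body><p>Other data</p>'
      . '<span id="TotalDue">$1,004.28</span></body></html>';

// Real-world pages are rarely valid HTML; collect parser warnings
// instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Note the quotes around the id value inside the XPath expression.
$nodes = $xpath->query('//span[@id="TotalDue"]');

// query() returns a node list; take the first hit if there is one.
$totalDue = ($nodes !== false && $nodes->length > 0)
    ? trim($nodes->item(0)->textContent)
    : null;

echo $totalDue, PHP_EOL; // $1,004.28
```

Without the quotes, TotalDue would be read as the name of a child element rather than a string, and the query would match nothing.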

AyKay47 · February 27, 2012

Surely that would match the data between all tags on a page, not just the specific data in the span tag that the OP requested?

I'm assuming the string given is only part of the page, not the whole thing

if $str contains an entire page rather than one element, then yeah, string functions are useless.

JAY6390 · February 27, 2012

I only need one little piece of data from this page. And the data that I want is in a span tag with an id.. Like so....

<span id="TotalDue">$1,004.28</span>

I'm pretty sure from this quote that it's part of a full page of text

AyKay47 · February 27, 2012

I only need one little piece of data from this page. And the data that I want is in a span tag with an id.. Like so....

<span id="TotalDue">$1,004.28</span>

I'm pretty sure from this quote that it's part of a full page of text

well, then a few questions would be asked before I created any regex for the OP.

Is there only one page that you want to get a match from?

If no, then these questions would also be asked:

Will the desired match always be in a span?

Will the element always have the ID of "TotalDue"?

sptrsn · February 29, 2012

To answer a couple questions.

This little snippet of data is on a county website and contains the back taxes for a specific property.

So... yes.... it is a consistent and unique ID on a page full of all kinds of data. The only thing I'm interested in is the back taxes... which... I intend to either fetch it live on page load when you view the detail page on my site for that property, or I'll run a job to fetch them all for the day and put it in the database. Depends on how much drag there is on my detail page. (foreclosure data etc)

So using the string function suggested by AyKay47 won't work for me, since I'm interested in one specific span tag, not all the tags.

I did not notice the ~~ on either side of the expression. My bad. I'm going to give that a try and see if that works.

Silkfire asked who taught me to use regex for scraping. I simply found several examples of people using regex. I'd love to figure out how to use DOM and XPath, but in all the searches I did, I couldn't find an example of how to go about using it. I simply attempted using the stuff I could find examples for. However, upon your suggestion, I'm going to do a little searching and see if that might work for me. I appreciate the suggestion.

At the moment.... I do have a working solution. But just for my education, I'm going to attempt to get it working with these other ideas. Looking forward to learning more.

Thanks,

Steve

AyKay47 · February 29, 2012

I did not notice the ~~ on either side of the expression. My bad. I'm going to give that a try and see if that works.

You can use a variety of delimiters; / and ~ serve the same purpose when wrapping a regex. Whichever one you pick has to be escaped if it also appears inside the pattern.

silkfire · February 29, 2012

You must not have searched for more than a minute. Here's a great guide:

http://www.2basetechnologies.com/blog/2010/08/03/1-screen-scraping-with-xpath-in-php.html

xyph · February 29, 2012

A lot of bad and uninformative advice is given in this thread. He never asked for the answer, he asked for help.

String functions will be the fastest (processing) method of solving this problem. It will also require the most complex code. Here's an example.

<?php

$body = '<a><whole><bunch><of><html><span id="TotalDue">$1,004.28</span><and><a><bunch><of><stuff><after>';

$start = '<span id="TotalDue">';
$end = '</span>';

// This will get us the character position of the search string
$offset = strpos( $body, $start );
// strpos() will return FALSE if the search string isn't found. This needs to be checked for
if( $offset === FALSE ) {
    echo 'Unable to find search string';
} else {
    // Since strpos() returns the offset at the BEGINNING of the start tag, we add the length
    // of the start tag to the offset to find the offset at the END of the start tag
    $offset += strlen( $start );
    // Now that we know where the value we want starts, we need to find where it ends. strpos()
    // supports starting at an offset, so we'll use it to find the first occurrence of $end after
    // that offset.
    $offset_end = strpos( $body, $end, $offset );
    // The end tag could be missing too, so check for FALSE again before slicing
    if( $offset_end === FALSE ) {
        echo 'Unable to find end tag';
    } else {
        // We can then use substr() to slice out the information we want. We find the length of the data
        // by subtracting the offset of the end of the start tag from the offset of the end tag
        echo substr( $body, $offset, $offset_end - $offset );
    }
}

?>

The RegEx solution is probably faster (processing) than DOM-XPath, in this situation. Because there are very few variable-length requirements in the RegEx, it will run relatively quickly. It is also very easy to code, and should be accurate unless there are multiple SPANs with that ID, in which case XPath could fail just as easily.

IMO - Parsing the entire HTML document is overkill when 'scraping' for a single, easily isolated substring. For more complex scraping, I agree that using DOM or a similar markup parser would be a much easier solution.

Hope this clears up the differences between the solutions available. JAY6390 posted an ideal RegEx pattern, and if you want us to elaborate further on how it works, feel free to ask.
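
As a starting point for that, here is an annotated sketch of the same pattern, written with the x modifier so each piece can carry a comment. The sample $body string is made up, and the space inside the tag is written as \s+ because x mode ignores literal whitespace; that is a small change from the pattern as posted.

```php
<?php
// The x modifier lets the pattern span lines and carry # comments.
// Whitespace in the pattern is ignored in this mode, so the space in
// the tag is written as \s+ here.
$pattern = '~
    <span\s+id="TotalDue">   # literal opening tag with the unique id
    (                        # start of capture group 1
        [^<]+                # one or more characters that are not <
    )                        # end of capture group 1
    </span>                  # literal closing tag
~x';

// Hypothetical stand-in for a full page.
$body = '<p>before</p><span id="TotalDue">$1,004.28</span><p>after</p>';

if (preg_match($pattern, $body, $m) === 1) {
    echo $m[1], PHP_EOL; // $1,004.28
}
```

Because [^<]+ stops at the first < it meets, the capture cannot run past the closing tag, which is why the pattern stays cheap even on a large page.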
