Need a little regex help. My first little scrape.

After going through a dozen php scraper classes, I finally got one to give me back the html in a string. Woohoo!!

 

I only need one little piece of data from this page. And the data that I want is in a span tag with an id.. Like so....

 

<span id="TotalDue">$1,004.28</span>

 

I want the dollar figure from between the tags: "$1,004.28"

 

I've tried umpteen different ways. But, as you well know, regular expressions can be hard for newbs.

 

Is there some kind soul out there that can help me with this?

 

 

ok. That helps some. So then should this work?

I've obviously got something wrong as it doesn't work.

 

$str = '<span id="TotalDue">$1,004.28</span>';

$pattern = '/<span id="TotalDue">([^<]+)</span>/';

preg_match_all($pattern, $str, $matches);

print_r($matches[0]);

Well I found something else that worked. (Gotta love Google)

 

$str = '<span id="TotalDue">$1,004.28</span>';

$doc = new DOMDocument();
$doc->loadHtml($str);
$el = $doc->getElementById('TotalDue');
echo $el->textContent;

 

Hadn't ever even heard of that function, but I found an example that looked like it might do the trick.. and voila!

Note that in my regex I used ~ ~ as the start/end delimiters, instead of /

That is why you are getting an error: with / as the delimiter, the unescaped / in </span> ends the pattern early. While using DOMDocument is great, it's probably overkill for your example.
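
To make the delimiter point concrete, here is a minimal sketch. The pattern is the one from the post above; the escaped and tilde variants are my own illustration of the two usual fixes, and the variable names ($escaped, $tilde, $m1, $m2) are just placeholders. Note that the captured text ends up in index 1 of the matches array, while index 0 holds the whole matched tag.

```php
<?php

$str = '<span id="TotalDue">$1,004.28</span>';

// Broken: with "/" as the delimiter, the "/" in "</span>" ends the pattern early,
// so preg_match() emits a warning instead of matching.
// $pattern = '/<span id="TotalDue">([^<]+)</span>/';

// Fix 1: keep "/" as the delimiter and escape the slash inside the pattern.
$escaped = '/<span id="TotalDue">([^<]+)<\/span>/';

// Fix 2: use a delimiter that doesn't appear in the pattern, such as "~".
$tilde = '~<span id="TotalDue">([^<]+)</span>~';

preg_match($escaped, $str, $m1);
preg_match($tilde, $str, $m2);

// Index 0 is the full match (tags included); index 1 is the first capture group.
echo $m1[1], "\n";
echo $m2[1], "\n";
```

Both versions should print the dollar figure on its own. Since there is only one span to find, preg_match() is enough here; preg_match_all() would also work but returns nested arrays.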

 

Really, so is using a regex.

Simple string functions will do.

 

$str = '<span id="TotalDue">$1,004.28</span>';
$start_pos = strpos($str,">") + 1;
$end_pos = strpos($str, "<", $start_pos);
// substr() takes a length as its third argument, not an end position
$string = substr($str, $start_pos, $end_pos - $start_pos);

 

*untested*

Surely that would match the data between all tags on a page, not just the specific data in the span tag that the OP requested?

I'm assuming the string given is only part of the page, not the whole thing.

 

If $str contains an entire page rather than one element, then yeah, string functions are useless.

I only need one little piece of data from this page. And the data that I want is in a span tag with an id.. Like so....

 

<span id="TotalDue">$1,004.28</span>

I'm pretty sure from this quote that it's part of a full page of text.

Well, then a few questions would be asked before I created any regex for the OP.

Is there only one page that you want to get a match from?

If no, then these questions would also be asked:

Will the desired match always be in a span?

Will the element always have the ID of "TotalDue"?

To answer a couple questions.

 

This little snippet of data is on a county website and contains the back taxes for a specific property.

So... yes.... it is a consistent and unique ID on a page full of all kinds of data. The only thing I'm interested in is the back taxes... which... I intend to either fetch it live on page load when you view the detail page on my site for that property, or I'll run a job to fetch them all for the day and put it in the database. Depends on how much drag there is on my detail page. (foreclosure data etc)

 

So using the string function suggested by AyKay47 won't work for me, since I'm interested in one specific span tag, not all the tags.

I did not notice the ~~ on either side of the expression. My bad. I'm going to give that a try and see if that works.

 

Silkfire asked who taught me to use regex for scraping. I simply found several examples of people using regex. I'd love to figure out how to use DOM and XPath, but in all the searches I did, I couldn't find an example of how to go about using them. I simply attempted using the stuff I could find examples for. However, upon your suggestion, I'm going to do a little searching and see if that might work for me. I appreciate the suggestion.

 

At the moment.... I do have a working solution. But just for my education, I'm going to attempt to get it working with these other ideas. Looking forward to learning more.

 

Thanks,

Steve

 

 

 

 

I did not notice the ~~ on either side of the expression. My bad. I'm going to give that a try and see if that works.

 

You can use a variety of delimiters, / and ~ serve the same purpose when wrapping a regex.

 

A lot of bad and uninformative advice is given in this thread. He never asked for the answer, he asked for help.

 

String functions will be the fastest (processing) method of solving this problem. They will also require the most complex code. Here's an example.

<?php

$body = '<a><whole><bunch><of><html><span id="TotalDue">$1,004.28</span><and><a><bunch><of><stuff><after>';

$start = '<span id="TotalDue">';
$end = '</span>';

// This will get us the character position of the search string
$offset = strpos( $body, $start );
// strpos() will return FALSE if the search string isn't found. This needs to be checked for
if( $offset === FALSE ) {
    echo 'Unable to find search string';
} else {
    // Since strpos() returns the offset at the BEGINNING of the start tag, we add the length
    // of the start tag to the offset to find the offset at the END of the start tag
    $offset += strlen( $start );
    // Now that we know where the value we want starts, we need to find where it ends. strpos()
    // supports starting at an offset, so we'll use it to find the first occurrence of $end after
    // that offset.
    $offset_end = strpos( $body, $end, $offset );
    // The closing tag could be missing too, so check for FALSE again
    if( $offset_end === FALSE ) {
        echo 'Unable to find closing tag';
    } else {
        // We can then use substr() to slice out the information we want. We find the length of the
        // data by subtracting the offset of the end of the start tag from the offset of the end tag
        echo substr( $body, $offset, $offset_end - $offset );
    }
}

?>

 

The RegEx solution is probably faster (processing) than DOM-XPath, in this situation. Because there are very few variable-length requirements in the RegEx, it will run relatively quickly. It is also very easy to code, and should be accurate unless there are multiple SPANs with that ID, in which case XPath could fail just as easily.

 

IMO - Parsing the entire HTML document is overkill when 'scraping' for a single, easily isolated substring. For more complex scraping, I agree that using DOM or a similar markup parser would be a much easier solution.
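
For comparison, and since an XPath example was asked about earlier in the thread, here is a rough sketch of the DOM/XPath route. The sample markup in $html is made up for illustration and stands in for the fetched page; the variable names are placeholders too. It assumes the real page has a single span with that id, as described above.

```php
<?php

// Hypothetical stand-in for the scraped page; single quotes keep "$1" literal.
$html = '<html><body><div><span id="Other">n/a</span>'
      . '<span id="TotalDue">$1,004.28</span></div></body></html>';

$doc = new DOMDocument();

// Real-world pages are often not valid HTML; collect libxml's parse
// warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Select every <span> whose id attribute is exactly "TotalDue".
$nodes = $xpath->query('//span[@id="TotalDue"]');

// Take the first hit if there is one; otherwise leave the value as null.
$totalDue = ($nodes !== false && $nodes->length > 0)
    ? trim($nodes->item(0)->textContent)
    : null;

echo $totalDue === null ? 'TotalDue not found' : $totalDue;
```

The query string is the part to adapt: swapping span for another tag or matching on a class works the same way, which is where this approach starts to pay off over string functions or a regex.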

 

Hope this clears up the differences between the solutions available. JAY6390 posted an ideal RegEx pattern, and if you want us to elaborate further on how it works, feel free to ask.
