sptrsn Posted February 27, 2012
After going through a dozen PHP scraper classes, I finally got one to give me back the HTML in a string. Woohoo!!
I only need one little piece of data from this page, and the data that I want is in a span tag with an id, like so:
    <span id="TotalDue">$1,004.28</span>
I want the dollar figure from between the tags: "$1,004.28". I've tried umpteen different ways but, as you well know, regular expressions can be hard for newbs. Is there some kind soul out there who can help me with this?
JAY6390 Posted February 27, 2012
    ~<span id="TotalDue">([^<]+)</span>~
The value will be stored in capture group 1.
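For anyone unsure how to read capture group 1 back out, here is a minimal sketch using preg_match() with the pattern above. The function name extractTotalDue() is hypothetical and not from the thread, and the sketch assumes the span appears at most once in the page and that PHP 7.1 or later is available for the nullable return type.

```php
<?php
// Hypothetical helper for illustration; the name is not from the thread.
function extractTotalDue(string $html): ?string
{
    // "~" is the delimiter, so the "/" in </span> needs no escaping.
    $pattern = '~<span id="TotalDue">([^<]+)</span>~';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, $html, $matches) === 1) {
        // $matches[0] is the whole match, $matches[1] is capture group 1.
        return $matches[1];
    }

    return null;
}

// Single quotes keep PHP from treating "$1" as a variable.
$html = '<div><span id="TotalDue">$1,004.28</span></div>';
echo extractTotalDue($html), "\n";
```

Returning null when nothing matches lets the caller tell "span not found" apart from an empty value.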
sptrsn Posted February 27, 2012 (Author)
OK, that helps some. So should this work? I've obviously got something wrong, as it doesn't work.
    $str = '<span id="TotalDue">$1,004.28</span>';
    $pattern = '/<span id="TotalDue">([^<]+)</span>/';
    preg_match_all($pattern, $str, $matches);
    print_r($matches[0]);
sptrsn Posted February 27, 2012 (Author)
Well, I found something else that worked. (Gotta love Google.)
    $str = '<span id="TotalDue">$1,004.28</span>';
    $doc = new DOMDocument();
    $doc->loadHtml($str);
    $el = $doc->getElementById('TotalDue');
    echo $el->textContent;
I hadn't ever even heard of that function, but I found an example that looked like it might do the trick, and voila!
JAY6390 Posted February 27, 2012
Note that in my regex I used ~ ~ as the start/end delimiters instead of /. That is why you are getting an error: with / as the delimiter, the unescaped / in </span> ends the pattern early.
While using the DOM Document is great, it's probably overkill for your example.
AyKay47 Posted February 27, 2012
Quoting JAY6390: "Note that in my regex I used ~ ~ as the start/end delimiters, instead of / That is why you are getting an error. While using the DOM Document is great, it's probably overkill for your example"
Really, so is using a regex. Simple string functions will do.
    $str = '<span id="TotalDue">$1,004.28</span>';
    $start_pos = strpos($str, ">") + 1;
    $end_pos = strpos($str, "<", $start_pos);
    // substr() takes a length as its third argument, not an end position
    $string = substr($str, $start_pos, $end_pos - $start_pos);
*untested*
JAY6390 Posted February 27, 2012
Surely that would match the data between all tags on a page, not just the specific data in the span tag that the OP requested? I'm assuming the string given is only part of the page, not the whole thing.
silkfire Posted February 27, 2012
sptrsn, who taught you to use HTML scraping with regex? Please use DOM with XPath; it's so much faster and more precise. Your query would be as simple as:
    //span[@id="TotalDue"]
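Since an XPath query on its own does not show how to run it from PHP, here is a minimal sketch using the built-in DOMDocument and DOMXPath classes. The function name totalDueViaXPath() is hypothetical, and the sketch assumes the DOM extension is enabled and that the id is unique on the page. Note that the attribute value has to be quoted inside the XPath expression.

```php
<?php
// Hypothetical helper for illustration; the name is not from the thread.
function totalDueViaXPath(string $html): ?string
{
    $doc = new DOMDocument();

    // Scraped pages are rarely valid markup, so collect libxml warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Select every span whose id attribute equals "TotalDue".
    $nodes = $xpath->query('//span[@id="TotalDue"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Take the first match and return its text without surrounding whitespace.
    return trim($nodes->item(0)->textContent);
}

$page = '<html><body><p>Parcel</p><span id="TotalDue">$1,004.28</span></body></html>';
echo totalDueViaXPath($page), "\n";
```

This parses the whole document first, which is the trade-off discussed later in the thread: more work per page, but no dependence on the exact spelling of the tag.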
AyKay47 Posted February 27, 2012
Quoting JAY6390: "Surely that would match the data between all tags on a page, not just the specific data in the span tag that the OP requested? I'm assuming the string given is only part of the page, not the whole thing"
If $str contains an entire page rather than one element, then yeah, string functions are useless.
JAY6390 Posted February 27, 2012
Quoting sptrsn: "I only need one little piece of data from this page. And the data that I want is in a span tag with an id.. Like so.... <span id="TotalDue">$1,004.28</span>"
I'm pretty sure from this quote that it's part of a full page of text.
AyKay47 Posted February 27, 2012
Quoting sptrsn: "I only need one little piece of data from this page. And the data that I want is in a span tag with an id.. Like so.... <span id="TotalDue">$1,004.28</span>"
Quoting JAY6390: "I'm pretty sure from this quote that its part of a full page of text"
Well, then a few questions would be asked before I created any regex for the OP. Is there only one page that you want to get a match from? If not, then these questions would also be asked: Will the desired match always be in a span? Will the element always have the ID of "TotalDue"?
sptrsn Posted February 29, 2012 (Author)
To answer a couple of questions: this little snippet of data is on a county website and contains the back taxes for a specific property. So yes, it is a consistent and unique ID on a page full of all kinds of data. The only thing I'm interested in is the back taxes. I intend to either fetch them live on page load when you view the detail page on my site for that property, or run a job to fetch them all for the day and put them in the database. It depends on how much drag there is on my detail page (foreclosure data etc.).
So using the string functions suggested by AyKay47 won't work for me, since I'm interested in one specific span tag, not all the tags.
I did not notice the ~~ on either side of the expression. My bad. I'm going to give that a try and see if that works.
Silkfire asked who taught me to use regex for scraping. I simply found several examples of people using regex. I'd love to figure out how to use DOM and XPath, but in all the searches I did, I couldn't find an example of how to go about using it. I simply attempted using the stuff I could find examples for. However, upon your suggestion, I'm going to do a little searching and see if that might work for me. I appreciate the suggestion.
At the moment I do have a working solution. But just for my education, I'm going to attempt to get it working with these other ideas. Looking forward to learning more.
Thanks, Steve
AyKay47 Posted February 29, 2012
Quoting sptrsn: "I did not notice the ~~ on either side of the expression. My bad. I'm going to give that a try and see that works."
You can use a variety of delimiters; / and ~ serve the same purpose when wrapping a regex. Whichever character you pick has to be escaped wherever it appears inside the pattern itself.
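To make the delimiter point concrete, here is a small sketch that runs the same pattern with both delimiters against the one-element string from earlier in the thread. The variable names are made up for illustration. With / as the delimiter, the / in </span> has to be written as \/; with ~ nothing needs escaping.

```php
<?php
$str = '<span id="TotalDue">$1,004.28</span>';

// "/" as delimiter: the slash in </span> must be escaped as "\/",
// otherwise PCRE treats it as the end of the pattern.
$slashPattern = '/<span id="TotalDue">([^<]+)<\/span>/';

// "~" as delimiter: the pattern contains no "~", so nothing is escaped.
$tildePattern = '~<span id="TotalDue">([^<]+)</span>~';

preg_match($slashPattern, $str, $a);
preg_match($tildePattern, $str, $b);

// Index 0 holds the full match (tags included); index 1 holds group 1.
echo $a[1], "\n";
echo $b[1], "\n";
```

The same indexing applies to preg_match_all(), except that $matches[1] is then an array holding the group 1 text of every match, so printing $matches[0] shows the full matches with the tags still attached.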
silkfire Posted February 29, 2012
You must not have searched for more than a minute. Here's a great guide: http://www.2basetechnologies.com/blog/2010/08/03/1-screen-scraping-with-xpath-in-php.html
xyph Posted February 29, 2012
A lot of bad and uninformative advice is given in this thread. He never asked for the answer, he asked for help.
String functions will be the fastest (processing) method of solving this problem. They will also require the most complex code. Here's an example.
    <?php
    $body = '<a><whole><bunch><of><html><span id="TotalDue">$1,004.28</span><and><a><bunch><of><stuff><after>';
    $start = '<span id="TotalDue">';
    $end = '</span>';

    // This will get us the character position of the search string
    $offset = strpos( $body, $start );

    // strpos() will return FALSE if the search string isn't found. This needs to be checked for
    if( $offset === FALSE ) {
        echo 'Unable to find search string';
    } else {
        // Since strpos() returns the offset at the BEGINNING of the start tag, we add the length
        // of the start tag to the offset to find the offset at the END of the start tag
        $offset += strlen( $start );

        // Now that we know where the value we want starts, we need to find where it ends. strpos()
        // supports starting at an offset, so we'll use it to find the first occurrence of $end after
        // that offset.
        $offset_end = strpos( $body, $end, $offset );

        // The end tag could be missing as well, so check for FALSE again
        if( $offset_end === FALSE ) {
            echo 'Unable to find end tag';
        } else {
            // We can then use substr() to slice out the information we want. We find the length of the
            // data by subtracting the offset of the end of the start tag from the offset of the end tag
            echo substr( $body, $offset, $offset_end - $offset );
        }
    }
    ?>
The RegEx solution is probably faster (processing) than DOM-XPath in this situation. Because there are very few variable-length requirements in the RegEx, it will run relatively quickly. It is also very easy to code, and should be accurate unless there are multiple SPANs with that ID, in which case XPath could fail just as easily.
IMO, parsing the entire HTML document is overkill when 'scraping' for a single, easily isolated substring. For more complex scraping, I agree that using DOM or a similar markup parser would be a much easier solution.
Hope this clears up the differences between the solutions available. JAY6390 posted an ideal RegEx pattern, and if you want us to elaborate further as to how it works, feel free to ask.
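As a rough elaboration on how JAY6390's pattern works, the sketch below spreads it over several lines using the x (extended) modifier, which makes PCRE ignore unescaped whitespace and treat # as the start of a comment. The sample $page string is made up for illustration, and the pattern is meant to behave the same as the one-line version.

```php
<?php
// JAY6390's pattern, annotated piece by piece with the x modifier.
$pattern = '~
    <span\ id="TotalDue">   # literal opening tag; the space is escaped because x mode skips bare whitespace
    (                       # open capture group 1
        [^<]+               # one or more characters that are not "<"
    )                       # close capture group 1
    </span>                 # literal closing tag
~x';

// Made-up sample page fragment for illustration.
$page = '<td>Owner</td><span id="TotalDue">$1,004.28</span><td>Other</td>';

if (preg_match($pattern, $page, $m) === 1) {
    // $m[1] is whatever [^<]+ consumed between the two tags.
    echo $m[1], "\n";
}
```

Because [^<]+ stops at the first "<", the match cannot run past the closing tag, which is part of why this pattern stays fast even on a large page.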