Pulling data from an XML RSS feed with preg_match


8 replies to this topic

#1 c4onastick

c4onastick
  • Members
  • Advanced Member
  • 216 posts

Posted 13 October 2006 - 01:20 AM

Hi, I'm trying to get some data from an RSS feed into a more useable format. I'd like to extract the item number (N82E168... etc.) and the price ($249.99) from the XML.
Here's the part of the feed I'm looking at:

<strong>Item #:</strong> N82E16819103540</div> <div><strong>Price:</strong> <span style="color:red;font-weight:bold;">$249.99</span></div>

I'm getting the feed via curl, which is working fine. I need a little help with the regular expression part.

Here's what I've got. I tested it at http://regexlib.com/RETester.aspx and it worked fine there.

$ch = curl_init($url);
ob_start();
curl_exec($ch);
curl_close($ch);
$data = ob_get_contents();
ob_end_clean();

preg_match( '/Item #:<\/strong> (.*?)<\/div>/is', $data, $productid );

echo $productid[1];

Now this should just echo that item number, but I don't get anything out of it.

Thanks in advance for the help!
Regex Tester::Unicode Regex::PHP Function List::MySQL 5.1
"Sorry sweetheart... but this all day sucker is down to the soggy white stick." -- Topper Harley

#2 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 13 October 2006 - 02:02 PM

That should work. Are you sure $data contains what you think it does?
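
One way to check, sketched here under assumptions: a browser renders tags and decodes entities, so printing an escaped copy shows the string as PHP actually sees it. The helper name debug_dump and the sample string are made up for illustration; the sample stands in for whatever curl returned.

```php
<?php
// Hypothetical helper: wrap an escaped copy of the string in <pre>
// so a browser displays every character literally.
function debug_dump(string $s): string
{
    // htmlspecialchars() escapes &, <, > and both kinds of quotes.
    return '<pre>' . htmlspecialchars($s, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Stand-in for the curl output (assumed, not the real feed).
$data = '&lt;strong&gt;Item #:&lt;/strong&gt; N82E16819103540&lt;/div&gt;';

echo debug_dump($data), "\n";

// The length is another quick sanity check on what came back.
echo strlen($data), "\n";
```

If the dump shows &lt; where the pattern expects <, the pattern cannot match no matter how it is written.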
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#3 dwees

dwees
  • Members
  • Advanced Member
  • 47 posts
  • LocationUnited Kingdom

Posted 13 October 2006 - 02:28 PM

To make your life a bit easier why don't you use:

$data = str_replace(' ', '', $data);

first to make sure you aren't trying to match any spaces that don't exist.
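
One caveat with stripping every space: the pattern in the first post contains literal spaces itself ("Item #:" and the space after </strong>), so it would stop matching. A possible alternative, sketched below with a made-up sample string, is to collapse runs of whitespace and let \s* absorb whatever is left.

```php
<?php
// Sample with a newline and extra spaces after the closing tag (assumed data).
$data = "<strong>Item #:</strong>\n   N82E16819103540</div>";

// Collapse every run of whitespace (spaces, tabs, newlines) to one space.
$normalized = preg_replace('/\s+/', ' ', $data);

// \s* tolerates zero or more whitespace characters after </strong>.
preg_match('/Item #:<\/strong>\s*(.*?)<\/div>/is', $normalized, $m);

echo $m[1], "\n"; // N82E16819103540
```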



#4 c4onastick

c4onastick

Posted 13 October 2006 - 04:23 PM

Thanks, I did get that to work. But you're right, it looks like the $data string from curl is messed up a little. I thought that you didn't need to replace the spaces if you use the /s flag (makes it treat it as a single line). Thanks, I'll give that a try.

I ran into another problem: I would like to get the item number, price and description out of the <item></item> tags. There are multiple items in the feed. And I figured I'd do it with regular expressions since I'm a little weak on them and they seem like such a powerful tool (instead of the xml parser). Plus there's all kinds of html in the actual feed that I'm going to have to sift through anyway.

I just copied a chunk of the feed into the data variable this time, without using curl, and it worked.

I'm trying to build the second part of the regular expression like this (I copy/pasted this directly, so if there are any errors in the $data variable hopefully they'll be evident):
$data =         <item>
.
        <title>
$169.00 - AMD Athlon 64 X2 3800+ Windsor 2.0GHz Socket AM2 Dual Core
Processor Model ADA3800CUBOX - Retail
</title>
/*
* More junk in here
*/
                                        <div
style="margin-bottom:15px;"><strong>Item #:</strong>
N82E16819103735</div>
                                        <div><strong>Price:</strong>
<span style="color:red;font-weight:bold;">$169.00</span></div>
/*
*More here
*/
</item>;

        preg_match_all( '/Item #:<\/strong>(.*?)<\/div>.+?Price:<\/strong><span.*>\$([0-9]+.[0-9]{2})<\/span>/is', $data, $productid );
       
print_r($productid);

Then I should get an array with the price and item number. Although this is all I get now:
Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) )

It looks like I need to clean up $data to make it work better if I'm getting it through curl. How can I do that? dwees, I tried str_replace and got the same problem, so I think it may be a problem with the regexp. (I've gotten similar stuff to work on html feeds; I'm wondering why xml is being so ornery.) Thank you both very much for the help!
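
One likely reason for the empty arrays with the pasted sample: the pattern requires <span to follow Price:</strong> immediately, but in the sample a newline sits between them. A hedged sketch of a more tolerant pattern follows; the sample is trimmed from the post above, and the \s*, [^>]* and escaped dot are suggested changes, not something confirmed elsewhere in the thread.

```php
<?php
// Trimmed copy of the pasted sample; a nowdoc keeps "$169.00" literal.
$data = <<<'XML'
<item>
<title>
$169.00 - AMD Athlon 64 X2 3800+ Windsor 2.0GHz Socket AM2 Dual Core
Processor Model ADA3800CUBOX - Retail
</title>
<div
style="margin-bottom:15px;"><strong>Item #:</strong>
N82E16819103735</div>
<div><strong>Price:</strong>
<span style="color:red;font-weight:bold;">$169.00</span></div>
</item>
XML;

// \s* allows line breaks between tags, [^>]* stays inside the <span> tag,
// and \. matches a literal decimal point instead of any character.
$pattern = '/Item #:<\/strong>\s*(.*?)\s*<\/div>.+?Price:<\/strong>\s*'
         . '<span[^>]*>\$([0-9]+\.[0-9]{2})<\/span>/is';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n"; // N82E16819103735 => 169.00
}
```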

#5 c4onastick

c4onastick

Posted 13 October 2006 - 05:03 PM

It just dawned on me when I was editing some of the junk out of the $data variable for that last post: I might be being too vague in the regexp. I did get just the item number grab to work. But for multiple sub-queries, since I only want data between the <item> tags, maybe it should be more like this:

Pattern:
'/<item>.*?<title>.*?\$[0-9]+[.][0-9]{2} - (.*?)<\/title>.*?Item #:<\/strong>(.*?)<\/div>.*?Price:<\/strong><span.*>\$([0-9]+.[0-9]{2})<\/span>.*?<\/item>/is'

This should give me an array something like this:
Array (
   [0] => Array (Whole query, each match)
   [1] => Array (Item Description between the title tags w/o the price, each match)
   [2] => Array (Item Number, each match)
   [3] => Array (Price, each match)
)
Another question on the side: I'm trying to understand the '?' symbol's (not a pun!) function. According to the RegExLib.com definition, "0 or 1 of the previous expression; also forces minimal matching when an expression might match several strings within a search string." So essentially the '.*?' character set matches as few characters as possible until the next explicit string I define (in the example above, at its first occurrence, the <title> tag). Is that correct?

#6 c4onastick

c4onastick

Posted 13 October 2006 - 05:28 PM

Bear with me here:

I tried the pattern above, and again nothing. I think the problem lies with the data variable. I added a
$data = '';

$ch = curl_init($url);
ob_start();
curl_exec($ch);
curl_close($ch);
$data = ob_get_contents();
ob_end_clean();

print_r($data);
to see what was coming out. It looked fine on the page, but I took a look at the source, and here's a chunk of what came out:
<item>
			<title><![CDATA[$185.00 - AMD Athlon 64 X2 3800+ Manchester 2.0GHz Socket 939 Dual Core Processor Model ADA3800BVBOX - Retail]]></title>
			<link>http://www.newegg.com/Product/Product.asp?Item=N82E16819103562&amp;CMP=OTC-RSS</link>
			<description>
				&lt;div style=&quot;width:125px;float:left;clear:none;border:1px solid #ccc;background-color:#fff;padding:15px 5px;margin:10px 10px 10px 0px;&quot;&gt;

				&lt;img border=&quot;0&quot; src=&quot;http://images10.newegg.com/NeweggImage/ProductImageCompressAll125/19-103-562-01.jpg&quot; width=&quot;125&quot; title=&quot;&quot; alt=&quot;&quot; /&gt;
				&lt;/div&gt;

				&lt;p style=&quot;margin:15px;&quot;&gt;
					&lt;div&gt;&lt;strong&gt;Model #:&lt;/strong&gt; ADA3800BVBOX&lt;/div&gt;
					&lt;div style=&quot;margin-bottom:15px;&quot;&gt;&lt;strong&gt;Item #:&lt;/strong&gt; N82E16819103562&lt;/div&gt;

					&lt;div&gt;&lt;strong&gt;Price:&lt;/strong&gt; &lt;span style=&quot;color:red;font-weight:bold;&quot;&gt;$185.00&lt;/span&gt;&lt;/div&gt;
					&lt;div&gt;&lt;a href=&quot;http://secure.newegg.com/NewVersion/Shopping/AddToCart.asp?submit=ADD&amp;ItemList=N82E16819103562&amp;CMP=OTC-RSS&quot; target=&quot;NEWEGGCART&quot;&gt;Add To Cart&lt;/a&gt;&lt;/div&gt;

				&lt;/p&gt;
			</description>
			<pubDate>Fri, 13 Oct 2006 10:19:08 GMT</pubDate>
			<guid>http://www.newegg.com/Product/Product.asp?Item=N82E16819103562&amp;CMP=OTC-RSS</guid>
			<comments>http://www.newegg.com/Product/CustratingReview.asp?Item=N82E16819103562&amp;CMP=OTC-RSS</comments>
		</item>

Rut-roh. It's taking out all the html entities, and specifically ones that I'm counting on being there for my regexp! How do I stop/fix this?

#7 effigy

effigy

Posted 13 October 2006 - 07:43 PM

I thought that you didn't need to replace the spaces if you use the /s flag (makes it treat it as a single line).


/s makes the dot (.) match newlines as well. It does nothing about spaces.
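
A small sketch of the difference, using a made-up two-line string:

```php
<?php
$text = "<b>first\nsecond</b>";

// Without /s the dot stops at the newline, so the pattern cannot reach </b>.
$without = preg_match('/<b>(.*)<\/b>/', $text, $a);

// With /s the dot also matches "\n", so the whole span is captured.
$with = preg_match('/<b>(.*)<\/b>/s', $text, $b);

var_dump($without); // int(0)
var_dump($with);    // int(1)
```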

And I figured I'd do it with regular expressions since I'm a little weak on them and they seem like such a powerful tool (instead of the xml parser).


Use the right tool for the job. It sounds like you need a SAX parser.
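
A rough sketch of what that could look like with PHP's expat-style xml_* functions, assuming the XML extension is available. The feed below is a trimmed stand-in, and the variable names are illustrative only. A side effect worth noting: the parser hands over character data with entities such as &lt; already decoded, so the description comes out as ordinary HTML.

```php
<?php
// Trimmed stand-in for the feed; only the elements used here are kept.
$xml = <<<'XML'
<rss><channel>
<item>
<title><![CDATA[$185.00 - AMD Athlon 64 X2 3800+ Manchester]]></title>
<description>&lt;div&gt;&lt;strong&gt;Item #:&lt;/strong&gt; N82E16819103562&lt;/div&gt;</description>
</item>
</channel></rss>
XML;

$items = [];      // finished items
$current = null;  // item being built, or null when outside <item>
$field = null;    // name of the element whose text is being collected

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Start tag: open a new item, or note which field we are inside.
    function ($p, $name, $attrs) use (&$current, &$field) {
        if ($name === 'item') {
            $current = ['title' => '', 'description' => ''];
        } elseif ($current !== null && isset($current[$name])) {
            $field = $name;
        }
    },
    // End tag: store the finished item and stop collecting text.
    function ($p, $name) use (&$items, &$current, &$field) {
        if ($name === 'item') {
            $items[] = $current;
            $current = null;
        }
        $field = null;
    }
);

// Text may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($p, $text) use (&$current, &$field) {
        if ($current !== null && $field !== null) {
            $current[$field] .= $text;
        }
    }
);

xml_parse($parser, $xml, true);

// The description is now plain HTML, so a simple pattern works on it.
preg_match('/Item #:<\/strong>\s*(.*?)<\/div>/is', $items[0]['description'], $m);
echo $items[0]['title'], "\n";
echo $m[1], "\n";
```

The regular expression is still used at the end, but only on one item's description at a time rather than on the whole feed.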

I'm trying to understand the '?' symbol's (not a pun!) function.


Quantifiers (*, +, ? and {n,m}) are greedy by default--they try to match as much as possible. Adding a ? after a quantifier makes it ungreedy, so it matches as little as possible.
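
A short illustration with a made-up string containing two <div> blocks:

```php
<?php
$text = '<div>one</div><div>two</div>';

// Greedy: .* runs to the end, then backs up to the LAST </div>.
preg_match('/<div>(.*)<\/div>/', $text, $greedy);

// Ungreedy: .*? grows one character at a time and stops at the FIRST </div>.
preg_match('/<div>(.*?)<\/div>/', $text, $lazy);

echo $greedy[1], "\n"; // one</div><div>two
echo $lazy[1], "\n";   // one
```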


Do the entities go away if you use echo instead of print_r?




#8 c4onastick

c4onastick

Posted 13 October 2006 - 10:11 PM

No, actually, echo produces the same results. I'm not entirely sure it's curl's fault though, because it only changes out a few of the entities, and keeps all the others. So maybe I'll just have to go back to html scraping! (I'd like to stick with XML though, since it's more resilient to change.)

#9 dwees

dwees

Posted 18 October 2006 - 05:56 AM

Isn't there a PHP function that converts HTML entities back into the characters you'd like to see? If so, then you can just run $data through that function first.

Dave
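
PHP's html_entity_decode() does this. A minimal sketch, assuming the description text arrives entity-encoded as in the chunk posted earlier; the sample string is shortened from that chunk, and the two patterns are only one possible way to pull out the values afterwards.

```php
<?php
// Entity-encoded fragment shortened from the feed chunk above.
// Single quotes keep "$185.00" from being treated as a variable.
$data = '&lt;div style=&quot;margin-bottom:15px;&quot;&gt;&lt;strong&gt;Item #:&lt;/strong&gt; N82E16819103562&lt;/div&gt;'
      . '&lt;div&gt;&lt;strong&gt;Price:&lt;/strong&gt; &lt;span style=&quot;color:red;font-weight:bold;&quot;&gt;$185.00&lt;/span&gt;&lt;/div&gt;';

// Turn &lt; &gt; &quot; &amp; and friends back into real characters.
$html = html_entity_decode($data, ENT_QUOTES, 'UTF-8');

// With real tags in place, patterns like the ones in this thread can match.
preg_match('/Item #:<\/strong>\s*(.*?)<\/div>/is', $html, $item);
preg_match('/Price:<\/strong>\s*<span[^>]*>\$([0-9]+\.[0-9]{2})<\/span>/is', $html, $price);

echo $item[1], ' ', $price[1], "\n"; // N82E16819103562 185.00
```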



