Pulling data from an XML RSS feed with preg_match

c4onastick · October 13, 2006

Hi, I'm trying to get some data from an RSS feed into a more useable format. I'd like to extract the item number (N82E168... etc.) and the price ($249.99) from the XML.
Here's the part of the feed I'm looking at:

[code]<strong>Item #:</strong> N82E16819103540</div> <div><strong>Price:</strong> <span style="color:red;font-weight:bold;">$249.99</span></div>[/code]

I'm getting the feed via curl, which is working fine. I need a little help with the regular expression part.

Here's what I've got. I tested it at http://regexlib.com/RETester.aspx and it worked fine there.

[code]$ch = curl_init($url);
ob_start();
curl_exec($ch);
curl_close($ch);
$data = ob_get_contents();
ob_end_clean();

preg_match( '/Item #:<\/strong> (.*?)<\/div>/is', $data, $productid );

echo $productid[1];[/code]

Now this should just echo that item number, but I don't get anything out of it.

Thanks in advance for the help!

effigy · October 13, 2006

That should work. Are you sure [tt]$data[/tt] contains what you think it does?
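
One way to check this is to look at what preg_match returns rather than only echoing the capture. The following is a minimal sketch; describe_match is a hypothetical helper name (not from the thread), and it assumes the pattern has at least one capture group.

```php
<?php
// Hypothetical helper: report what preg_match actually did with $data.
// Assumes $pattern contains at least one capture group.
function describe_match(string $pattern, string $data): string
{
    $result = preg_match($pattern, $data, $m);
    if ($result === false) {
        // The regex engine itself failed (bad pattern, backtrack limit, ...).
        return 'regex error (code ' . preg_last_error() . ')';
    }
    if ($result === 0) {
        // The pattern is valid but found nothing; the length hints at empty data.
        return 'no match; data is ' . strlen($data) . ' bytes';
    }
    return 'matched: ' . trim($m[1]);
}

$pattern = '/Item #:<\/strong> (.*?)<\/div>/is';
$sample  = '<strong>Item #:</strong> N82E16819103540</div>';

echo describe_match($pattern, $sample), "\n"; // matched: N82E16819103540
echo describe_match($pattern, ''), "\n";      // no match; data is 0 bytes
```

If the second line is what comes back for the real feed, the problem is likely in how $data is filled rather than in the pattern.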

dwees · October 13, 2006

To make your life a bit easier why don't you use:

[code]$data = str_replace(' ', '', $data);[/code]

first to make sure you aren't trying to match any spaces that don't exist.
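
Note that stripping every space also turns "Item #:" into "Item#:", so the pattern would have to change to match. An alternative, sketched below, is to leave $data alone and let the pattern tolerate optional whitespace with \s*. The sample string is modelled on the snippet from the first post, with extra whitespace added for illustration.

```php
<?php
// Sample text modelled on the first post, with extra whitespace added.
$data = "<strong>Item #:</strong>\n   N82E16819103540 </div>";

// \s* absorbs any spaces, tabs, or newlines on either side of the value.
if (preg_match('/Item #:<\/strong>\s*(.*?)\s*<\/div>/is', $data, $m)) {
    echo $m[1], "\n"; // N82E16819103540
}
```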

c4onastick · October 13, 2006

Thanks, I did get that to work. But you're right, it looks like the $data string from curl is a little messed up. I thought that you didn't need to replace the spaces if you use the /s flag (which makes it treat the input as a single line). Thanks, I'll give that a try.

I ran into another problem: I would like to get the item number, price and description out of the <item></item> tags. There are multiple items in the feed. I figured I'd do it with regular expressions (instead of the XML parser) since I'm a little weak on them and they seem like such a powerful tool. Plus, there's all kinds of HTML in the actual feed that I'm going to have to sift through anyway.

I just copied a chunk of the feed into the $data variable this time, without using curl, and it worked.

I'm trying to build the second part of the regular expression like this (I copy/pasted this directly, so if there are any errors in the $data variable hopefully they'll be evident):
[code]// Nowdoc keeps the pasted feed text literal (no $ interpolation)
$data = <<<'XML'
<item>
.
<title>
$169.00 - AMD Athlon 64 X2 3800+ Windsor 2.0GHz Socket AM2 Dual Core
Processor Model ADA3800CUBOX - Retail
</title>
/*
* More junk in here
*/
<div
style="margin-bottom:15px;"><strong>Item #:</strong>
N82E16819103735</div>
<div><strong>Price:</strong>
<span style="color:red;font-weight:bold;">$169.00</span></div>
/*
*More here
*/
</item>
XML;

preg_match_all( '/Item #:<\/strong>(.*?)<\/div>.+?Price:<\/strong><span.*>\$([0-9]+.[0-9]{2})<\/span>/is', $data, $productid );

print_r($productid);[/code]

Then I should get an array with the price and item number. But this is all I get now:
[code]Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) )[/code]

It looks like I need to clean up $data to make it work better if I'm getting it through curl. How can I do that? dwees, I tried str_replace and got the same problem, so I think it may be a problem with the regexp. (I've gotten similar stuff to work on HTML feeds; I'm wondering why XML is being so ornery.) Thank you both very much for the help!
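
One likely reason for the empty arrays, judging from the pasted data, is that the pattern expects `</strong>` to be followed directly by the item number and `<span`, while the data has a line break there. A minimal sketch under that assumption follows: it adds \s* where whitespace can appear, swaps the greedy `<span.*>` for `<span[^>]*>`, and escapes the decimal point. The sample is a trimmed copy of the data above, not the real feed.

```php
<?php
// Trimmed copy of the pasted item; nowdoc keeps the $ signs literal.
$data = <<<'XML'
<item>
<div
style="margin-bottom:15px;"><strong>Item #:</strong>
N82E16819103735</div>
<div><strong>Price:</strong>
<span style="color:red;font-weight:bold;">$169.00</span></div>
</item>
XML;

// \s* tolerates line breaks; [^>]* keeps the span match inside one tag.
$pattern = '/Item #:<\/strong>\s*(.*?)\s*<\/div>.+?Price:<\/strong>\s*'
         . '<span[^>]*>\$([0-9]+\.[0-9]{2})<\/span>/is';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[1], ' costs $', $match[2], "\n"; // N82E16819103735 costs $169.00
}
```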

c4onastick · October 13, 2006

It just dawned on me when I was editing some of the junk out of the $data variable for that last post: I might be too vague in the regexp. I did get just the item number grab to work. But for multiple sub-queries, since I only want data between the <item> tags, maybe it should be more like this:

Pattern:
[code]'/<item>.*?<title>.*?\$[0-9]+[.][0-9]{2} - (.*?)<\/title>.*?Item #:<\/strong>(.*?)<\/div>.*?Price:<\/strong><span.*>\$([0-9]+.[0-9]{2})<\/span>.*?<\/item>/is'[/code]

This should give me an array something like this:
[code]Array (
[0] => Array (Whole query, each match)
[1] => Array (Item Description between the title tags w/o the price, each match)
[2] => Array (Item Number, each match)
[3] => Array (Price, each match)
)[/code]
Another question on the side. I'm trying to understand the function of the '?' symbol (not a pun!). According to the RegExLib.com definition, "0 or 1 of the previous expression; also forces minimal matching when an expression might match several strings within a search string." So essentially the '.*?' sequence matches as few characters as possible until the next explicit string I define (in the example above, at its first occurrence, the <title> tag). Is that correct?

c4onastick · October 13, 2006

Bear with me here:

I tried the pattern above, and again nothing. I think the problem lies with the $data variable. I added a
[code]$data = '';

$ch = curl_init($url);
ob_start();
curl_exec($ch);
curl_close($ch);
$data = ob_get_contents();
ob_end_clean();

print_r($data);[/code]
to see what was coming out. It looked fine on the page, but I took a look at the source, and here's a chunk of what came out:
[code] <item>
<title><![CDATA[$185.00 - AMD Athlon 64 X2 3800+ Manchester 2.0GHz Socket 939 Dual Core Processor Model ADA3800BVBOX - Retail]]></title>
<link>http://www.newegg.com/Product/Product.asp?Item=N82E16819103562&CMP=OTC-RSS</link>
<description>
<div style="width:125px;float:left;clear:none;border:1px solid #ccc;background-color:#fff;padding:15px 5px;margin:10px 10px 10px 0px;">

<img border="0" src="http://images10.newegg.com/NeweggImage/ProductImageCompressAll125/19-103-562-01.jpg" width="125" title="" alt="" />
</div>

<p style="margin:15px;">
<div><strong>Model #:</strong> ADA3800BVBOX</div>
<div style="margin-bottom:15px;"><strong>Item #:</strong> N82E16819103562</div>

<div><strong>Price:</strong> <span style="color:red;font-weight:bold;">$185.00</span></div>
<div><a href="http://secure.newegg.com/NewVersion/Shopping/AddToCart.asp?submit=ADD&ItemList=N82E16819103562&CMP=OTC-RSS" target="NEWEGGCART">Add To Cart</a></div>

</p>
</description>
<pubDate>Fri, 13 Oct 2006 10:19:08 GMT</pubDate>
<guid>http://www.newegg.com/Product/Product.asp?Item=N82E16819103562&CMP=OTC-RSS</guid>
<comments>http://www.newegg.com/Product/CustratingReview.asp?Item=N82E16819103562&CMP=OTC-RSS</comments>
</item>[/code]

Rut-roh. It's taking out all the HTML entities, specifically the ones that I'm counting on being there for my regexp! How do I stop/fix this?

effigy · October 13, 2006

[quote]I thought that you didn't need to replace the spaces if you use the /s flag (makes it treat it as a single line).[/quote]

[tt]/s[/tt] makes [tt].[/tt] match newlines as well. It does nothing about spaces.

[quote]And I figured I'd do it with regular expressions since I'm a little weak on them and they seem like such a powerful tool (instead of the xml parser).[/quote]

Use the right tool for the job. It sounds like you need a SAX parser.
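
For reference, PHP's event-based (SAX-style) interface is the xml_parser_* family. The sketch below is only an illustration of that approach: the feed is an invented two-item sample in the "$price - description" shape seen in this thread, and it collects just the title text.

```php
<?php
// Invented sample feed; nowdoc keeps the $ signs literal.
$feed = <<<'XML'
<rss><channel>
<item><title><![CDATA[$185.00 - Example CPU A]]></title></item>
<item><title><![CDATA[$169.00 - Example CPU B]]></title></item>
</channel></rss>
XML;

$titles  = [];
$inTitle = false;
$buffer  = '';

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Start tag: begin buffering when a <title> opens.
    function ($p, $name, $attrs) use (&$inTitle, &$buffer) {
        if ($name === 'title') { $inTitle = true; $buffer = ''; }
    },
    // End tag: store the buffered text when </title> closes.
    function ($p, $name) use (&$inTitle, &$buffer, &$titles) {
        if ($name === 'title') { $titles[] = trim($buffer); $inTitle = false; }
    }
);
// Text (including CDATA content) may arrive in several chunks, so append.
xml_set_character_data_handler($parser, function ($p, $text) use (&$inTitle, &$buffer) {
    if ($inTitle) { $buffer .= $text; }
});

xml_parse($parser, $feed, true);

print_r($titles);
```

The parser unwraps the CDATA section itself, so each title could then be split on the first " - " to separate price from description without any tag-matching regex.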

[quote]I'm trying to understand the '?' symbol's (not a pun!) function.[/quote]

Quantifiers are greedy by default--they try to match as much as possible. A [tt]?[/tt] after a quantifier makes it ungreedy.

Do the entities go away if you use echo instead of print_r?
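
The greedy/ungreedy difference and the effect of /s can both be seen in a small, self-contained sketch. The strings here are made up for illustration.

```php
<?php
$text = "<b>one</b> and <b>two</b>";

preg_match('/<b>(.*)<\/b>/', $text, $greedy);  // .* grabs as much as it can
preg_match('/<b>(.*?)<\/b>/', $text, $lazy);   // .*? stops at the first </b>

echo $greedy[1], "\n"; // one</b> and <b>two
echo $lazy[1], "\n";   // one

// Without /s the dot does not match a newline, so the first call finds nothing.
$multiline = "<b>line1\nline2</b>";
$withoutS = preg_match('/<b>(.*?)<\/b>/', $multiline);
$withS    = preg_match('/<b>(.*?)<\/b>/s', $multiline);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```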

c4onastick · October 13, 2006

No, actually, echo produces the same results. I'm not entirely sure it's curl's fault though, because it only changes a few of the entities and keeps all the others. So maybe I'll just have to go back to HTML scraping! (I'd like to stick with XML though; it's more resilient to change.)

dwees · October 18, 2006

Isn't there a PHP function that converts HTML entities back into the characters you'd like to see? If so, then you can just run $data through that function first.

Dave
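
PHP's html_entity_decode does this. Here is a minimal sketch, assuming the markup arrives entity-escaped inside the feed; the $escaped string is an invented fragment, not copied from the real feed.

```php
<?php
// Invented fragment: markup as it might appear entity-escaped in the feed.
$escaped = '&lt;strong&gt;Item #:&lt;/strong&gt; N82E16819103562&lt;/div&gt;';

// Turn &lt; &gt; &amp; etc. back into literal characters before matching.
$decoded = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
echo $decoded, "\n"; // <strong>Item #:</strong> N82E16819103562</div>

if (preg_match('/Item #:<\/strong>\s*(.*?)\s*<\/div>/is', $decoded, $m)) {
    echo $m[1], "\n"; // N82E16819103562
}
```

When checking the result in a browser, view the page source or wrap the output in htmlspecialchars, since the browser renders decoded tags instead of showing them.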
