
Pulling data from an XML RSS feed with preg_match


c4onastick


Hi, I'm trying to get some data from an RSS feed into a more useable format. I'd like to extract the item number (N82E168... etc.) and the price ($249.99) from the XML.
Here's the part of the feed I'm looking at:

[code]<strong>Item #:</strong> N82E16819103540</div> <div><strong>Price:</strong> <span style="color:red;font-weight:bold;">$249.99</span></div>[/code]

I'm getting the feed via curl, which is working fine. I need a little help with the regular expression part.

Here's what I've got. I tested it at http://regexlib.com/RETester.aspx and it worked fine there.

[code]$ch = curl_init($url);
ob_start();
curl_exec($ch);
curl_close($ch);
$data = ob_get_contents();
ob_end_clean();

preg_match( '/Item #:<\/strong> (.*?)<\/div>/is', $data, $productid );

echo $productid[1];[/code]

Now this should just echo that item number, but I don't get anything out of it.

Thanks in advance for the help!
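
For reference, here is a minimal sketch of the same fetch-and-match step without output buffering. It assumes the curl extension is loaded; `fetch_feed` and `extract_item_number` are hypothetical helper names, not part of the original script. `CURLOPT_RETURNTRANSFER` makes `curl_exec` return the body as a string, and `\s*` on either side of the capture group tolerates the space or newline around the item number.

```php
<?php
// Hypothetical helper: fetch a URL and return the response body as a string.
function fetch_feed(string $url): ?string
{
    $ch = curl_init($url);
    // Return the body from curl_exec() instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: pull the item number out of a chunk of markup.
function extract_item_number(string $data): ?string
{
    // \s* absorbs whitespace (including newlines) around the number.
    if (preg_match('/Item #:<\/strong>\s*(.*?)\s*<\/div>/is', $data, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Sample taken from the snippet in the post above.
$sample = '<strong>Item #:</strong> N82E16819103540</div> '
    . '<div><strong>Price:</strong> '
    . '<span style="color:red;font-weight:bold;">$249.99</span></div>';

echo extract_item_number($sample), "\n"; // N82E16819103540
```

If the pattern works on a pasted sample but not on the fetched string, dumping the raw string (for example through `htmlspecialchars`) is a quick way to see what the regex is really being given.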

Thanks, I did get that to work. But you're right, it looks like the $data string from curl is messed up a little. I thought that you didn't need to replace the spaces if you use the /s flag (makes it treat it as a single line). Thanks, I'll give that a try.

I ran into another problem: I would like to get the item number, price and description out of the <item></item> tags. There are multiple items in the feed. And I figured I'd do it with regular expressions since I'm a little weak on them and they seem like such a powerful tool (instead of the xml parser). Plus there's all kinds of HTML in the actual feed that I'm going to have to sift through anyway.

I just copied a chunk of the feed into the data variable this time, without using curl, and it worked.

I'm trying to build the second part of the regular expression like this (I copied and pasted this directly, so if there are any errors in the $data variable, hopefully they'll be evident):
[code]$data =        <item>
.
        <title>
$169.00 - AMD Athlon 64 X2 3800+ Windsor 2.0GHz Socket AM2 Dual Core
Processor Model ADA3800CUBOX - Retail
</title>
/*
* More junk in here
*/
                                        <div
style="margin-bottom:15px;"><strong>Item #:</strong>
N82E16819103735</div>
                                        <div><strong>Price:</strong>
<span style="color:red;font-weight:bold;">$169.00</span></div>
/*
*More here
*/
</item>;

        preg_match_all( '/Item #:<\/strong>(.*?)<\/div>.+?Price:<\/strong><span.*>\$([0-9]+.[0-9]{2})<\/span>/is', $data, $productid );
     
print_r($productid);[/code]

Then I should get an array with the price and item number. Although this is all I get now:
[code]Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) )[/code]

It looks like I need to clean up $data to make it work better if I'm getting it through curl. How can I do that? dwees, I tried str_replace and got the same problem, so I think it may be a problem with the regexp. (I've gotten similar stuff to work on HTML feeds; I'm wondering why XML is being so ornery.) Thank you both very much for the help!
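
One likely reason for the empty result, judging only from the pasted data: the pattern requires `<span` to follow `Price:</strong>` immediately, but the sample has a newline between them. The `/s` flag lets `.` match newlines; it does not make literal whitespace optional. The sketch below is one possible fix under that assumption. It adds `\s*` where whitespace may appear, replaces the greedy `<span.*>` with `<span[^>]*>` so the match cannot run into a later item, and escapes the decimal point. The sample string is a trimmed copy of the data above, held in a nowdoc so `$` is not interpolated.

```php
<?php
// Trimmed copy of the pasted sample; a nowdoc keeps "$169.00" literal.
$data = <<<'HTML'
<item>
<title>
$169.00 - AMD Athlon 64 X2 3800+ Windsor 2.0GHz Socket AM2 Dual Core
Processor Model ADA3800CUBOX - Retail
</title>
<div
style="margin-bottom:15px;"><strong>Item #:</strong>
N82E16819103735</div>
<div><strong>Price:</strong>
<span style="color:red;font-weight:bold;">$169.00</span></div>
</item>
HTML;

// \s* tolerates newlines between tags; [^>]* stays inside the <span> tag;
// \. matches a literal decimal point.
$pattern = '/Item #:<\/strong>\s*(.*?)\s*<\/div>'
    . '.*?Price:<\/strong>\s*<span[^>]*>\$([0-9]+\.[0-9]{2})<\/span>/is';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[1], ' costs $', $match[2], "\n";
}
```

With `PREG_SET_ORDER`, each element of `$matches` holds one item's number and price together, which tends to be easier to loop over when the feed contains several items.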

It just dawned on me when I was editing some of the junk out of the $data variable for that last post: I might be being too vague in the regexp. I did get just the item number grab to work. But for multiple sub-queries, since I only want data between the <item> tags, maybe it should be more like this:

Pattern:
[code]'/<item>.*?<title>.*?\$[0-9]+[.][0-9]{2} - (.*?)<\/title>.*?Item #:<\/strong>(.*?)<\/div>.*?Price:<\/strong><span.*>\$([0-9]+.[0-9]{2})<\/span>.*?<\/item>/is'[/code]

This should give me an array something like this:
[code]Array (
  [0] => Array (Whole query, each match)
  [1] => Array (Item Description between the title tags w/o the price, each match)
  [2] => Array (Item Number, each match)
  [3] => Array (Price, each match)
)[/code]
Another question on the side. I'm trying to understand the '?' symbol's (not a pun!) function. According to the RegExLib.com definition, "0 or 1 of the previous expression; also forces minimal matching when an expression might match several strings within a search string." So essentially the '.*?' sequence matches as few characters as possible until the next explicit string I define (in the example above, the first occurrence of the <title> tag). Is that correct?

Bear with me here:

I tried the pattern above, and again nothing. I think the problem lies with the data variable. I added a
[code]$data = '';

$ch = curl_init($url);
ob_start();
curl_exec($ch);
curl_close($ch);
$data = ob_get_contents();
ob_end_clean();

print_r($data);[/code]
to see what was coming out. It looked fine on the page, but I took a look at the source, and here's a chunk of what came out:
[code] <item>
<title><![CDATA[$185.00 - AMD Athlon 64 X2 3800+ Manchester 2.0GHz Socket 939 Dual Core Processor Model ADA3800BVBOX - Retail]]></title>
<link>http://www.newegg.com/Product/Product.asp?Item=N82E16819103562&amp;CMP=OTC-RSS</link>
<description>
&lt;div style=&quot;width:125px;float:left;clear:none;border:1px solid #ccc;background-color:#fff;padding:15px 5px;margin:10px 10px 10px 0px;&quot;&gt;

&lt;img border=&quot;0&quot; src=&quot;http://images10.newegg.com/NeweggImage/ProductImageCompressAll125/19-103-562-01.jpg&quot; width=&quot;125&quot; title=&quot;&quot; alt=&quot;&quot; /&gt;
&lt;/div&gt;

&lt;p style=&quot;margin:15px;&quot;&gt;
&lt;div&gt;&lt;strong&gt;Model #:&lt;/strong&gt; ADA3800BVBOX&lt;/div&gt;
&lt;div style=&quot;margin-bottom:15px;&quot;&gt;&lt;strong&gt;Item #:&lt;/strong&gt; N82E16819103562&lt;/div&gt;

&lt;div&gt;&lt;strong&gt;Price:&lt;/strong&gt; &lt;span style=&quot;color:red;font-weight:bold;&quot;&gt;$185.00&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&lt;a href=&quot;http://secure.newegg.com/NewVersion/Shopping/AddToCart.asp?submit=ADD&amp;ItemList=N82E16819103562&amp;CMP=OTC-RSS&quot; target=&quot;NEWEGGCART&quot;&gt;Add To Cart&lt;/a&gt;&lt;/div&gt;

&lt;/p&gt;
</description>
<pubDate>Fri, 13 Oct 2006 10:19:08 GMT</pubDate>
<guid>http://www.newegg.com/Product/Product.asp?Item=N82E16819103562&amp;CMP=OTC-RSS</guid>
<comments>http://www.newegg.com/Product/CustratingReview.asp?Item=N82E16819103562&amp;CMP=OTC-RSS</comments>
</item>[/code]

Rut-roh. It's taking out all the HTML tags and leaving entities in their place, and specifically the tags that I'm counting on being there for my regexp! How do I stop/fix this?
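
A probable explanation, offered as an assumption since the raw feed is not shown elsewhere in the thread: RSS feeds commonly carry the description as entity-escaped HTML, so `&lt;strong&gt;` is what the feed really contains. The browser renders those entities as tags, which is why the page looked fine while the page source did not. A minimal sketch of one workaround is to decode the entities with `html_entity_decode` before matching. The sample string below is copied from two lines of the dump above.

```php
<?php
// Two lines copied from the dumped <description>, still entity-encoded.
$encoded = '&lt;div style=&quot;margin-bottom:15px;&quot;&gt;'
    . '&lt;strong&gt;Item #:&lt;/strong&gt; N82E16819103562&lt;/div&gt;'
    . "\n"
    . '&lt;div&gt;&lt;strong&gt;Price:&lt;/strong&gt; '
    . '&lt;span style=&quot;color:red;font-weight:bold;&quot;&gt;'
    . '$185.00&lt;/span&gt;&lt;/div&gt;';

// Turn &lt; &gt; &quot; &amp; back into the characters the regex expects.
$html = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

$pattern = '/Item #:<\/strong>\s*(.*?)\s*<\/div>'
    . '.*?Price:<\/strong>\s*<span[^>]*>\$([0-9]+\.[0-9]{2})<\/span>/is';

if (preg_match($pattern, $html, $m) === 1) {
    echo 'Item ', $m[1], ' is priced at $', $m[2], "\n";
}
```

Decoding only the text of each description, rather than the whole feed, avoids turning escaped markup into stray tags inside the XML itself.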

[quote]I thought that you didn't need to replace the spaces if you use the /s flag (makes it treat it as a single line).[/quote]

[tt]/s[/tt] makes [tt].[/tt] match newlines as well; it does not make whitespace in the pattern optional.
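
A small illustration of that point, using a made-up two-line string modeled on the feed data: without `/s` the dot stops at the newline and the match fails, while with `/s` it succeeds but the captured text still includes the newline.

```php
<?php
// Made-up sample with a newline between the tag and the item number.
$text = "Item #:</strong>\nN82E16819103735</div>";

// Without /s, "." cannot cross the newline, so there is no match.
$withoutS = preg_match('/strong>(.*)<\/div>/', $text, $a);

// With /s, "." matches the newline too, so the pattern matches.
$withS = preg_match('/strong>(.*)<\/div>/s', $text, $b);

var_dump($withoutS === 0);                     // bool(true)
var_dump($withS === 1);                        // bool(true)
var_dump(trim($b[1]) === 'N82E16819103735');   // bool(true)
```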

[quote]And I figured I'd do it with regular expressions since I'm a little weak on them and they seem like such a powerful tool (instead of the xml parser).[/quote]

Use the right tool for the job. It sounds like you need a SAX parser.
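
As a rough sketch of the parser route, the example below uses SimpleXML rather than a true SAX parser (PHP's event-based one is `xml_parser_create`), simply because it is shorter to show. It assumes the SimpleXML extension is enabled. The feed string is a cut-down, made-up sample in the shape of the dump earlier in the thread, and `parse_items` is a hypothetical helper name. The XML parser decodes the entities in each description, so only a small regex on the plain text is left.

```php
<?php
// Cut-down, made-up feed shaped like the one dumped earlier in the thread.
$xml = <<<'XML'
<rss version="2.0"><channel>
<item>
<title><![CDATA[$185.00 - AMD Athlon 64 X2 3800+ Manchester]]></title>
<description>&lt;div&gt;&lt;strong&gt;Item #:&lt;/strong&gt; N82E16819103562&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Price:&lt;/strong&gt; &lt;span style=&quot;color:red;&quot;&gt;$185.00&lt;/span&gt;&lt;/div&gt;</description>
</item>
</channel></rss>
XML;

// Hypothetical helper: return one array per <item> in the feed.
function parse_items(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return [];
    }
    $items = [];
    foreach ($feed->channel->item as $item) {
        // Casting to string yields the decoded HTML of the description.
        $text = strip_tags((string) $item->description);
        preg_match('/Item #:\s*(\S+)/', $text, $num);
        preg_match('/Price:\s*\$([0-9]+\.[0-9]{2})/', $text, $price);
        $items[] = [
            // Drop the leading "$185.00 - " from the title.
            'title'  => preg_replace('/^\$[0-9]+\.[0-9]{2} - /', '', trim((string) $item->title)),
            'number' => $num[1] ?? null,
            'price'  => $price[1] ?? null,
        ];
    }
    return $items;
}

$items = parse_items($xml);
print_r($items);
```

The parser handles the `<item>` boundaries and the CDATA title, so the regular expressions no longer need to span a whole item.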

[quote]I'm trying to understand the '?' symbol's (not a pun!) function.[/quote]

Quantifiers are greedy by default: they try to match as much as possible. A [tt]?[/tt] placed after a quantifier makes it ungreedy, so it matches as little as possible.
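
A short illustration with a made-up string containing two `<div>` blocks: the greedy `.*` runs to the last `</div>`, while the ungreedy `.*?` stops at the first one.

```php
<?php
// Made-up sample with two sibling <div> elements.
$html = '<div>first</div><div>second</div>';

// Greedy: .* grabs as much as it can, up to the LAST </div>.
preg_match('/<div>(.*)<\/div>/', $html, $greedy);

// Ungreedy: .*? grabs as little as it can, up to the FIRST </div>.
preg_match('/<div>(.*?)<\/div>/', $html, $lazy);

echo $greedy[1], "\n"; // first</div><div>second
echo $lazy[1], "\n";   // first
```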


Do the entities go away if you use echo instead of print_r?



No, actually, echo produces the same results. I'm not entirely sure it's curl's fault though, because it only changes out a few of the entities and keeps all the others. So maybe I'll just have to go back to HTML scraping! (I'd like to stick with XML though; it's more resilient to change.)
