c4onastick Posted October 13, 2006

Hi, I'm trying to get some data from an RSS feed into a more usable format. I'd like to extract the item number (N82E168... etc.) and the price ($249.99) from the XML.

Here's the part of the feed I'm looking at:
[code]<strong>Item #:</strong> N82E16819103540</div> <div><strong>Price:</strong> <span style="color:red;font-weight:bold;">$249.99</span></div>[/code]
I'm getting the feed via curl, which is working fine. I need a little help with the regular expression part.

Here's what I've got. I tested it at http://regexlib.com/RETester.aspx and it worked fine there.
[code]$ch = curl_init($url);
ob_start();
curl_exec($ch);
curl_close($ch);
$data = ob_get_contents();
ob_end_clean();
preg_match( '/Item #:<\/strong> (.*?)<\/div>/is', $data, $productid );
echo $productid[1];[/code]
Now this should just echo that item number, but I don't get anything out of it. Thanks in advance for the help!

Link to comment https://forums.phpfreaks.com/topic/23811-pulling-data-from-an-xml-rss-feed-with-preg_match/
effigy Posted October 13, 2006

That should work. Are you sure [tt]$data[/tt] contains what you think it does?
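One way to check what the string really holds is to escape it before printing, so a browser shows the raw characters instead of rendering them, and to test the return value of preg_match() before reading the captures. A minimal sketch, assuming a hypothetical sample string in place of the real curl output:

```php
<?php
// Hypothetical stand-in for the string curl returned; the real feed may differ.
$data = '<div><strong>Item #:</strong> N82E16819103540</div>';

// Escape markup so a browser shows the raw characters instead of rendering them.
$visible = htmlspecialchars($data, ENT_QUOTES, 'UTF-8');
echo $visible, "\n";

// preg_match() returns 1 on a match, 0 on no match, and false on error,
// so check the return value before reading the captures.
$found = preg_match('/Item #:<\/strong> (.*?)<\/div>/is', $data, $productid);
if ($found === 1) {
    echo $productid[1], "\n";
} else {
    echo "no match; inspect \$data\n";
}
```

If the escaped output shows sequences such as &lt; where the pattern expects a literal <, the pattern and the data disagree even though the rendered page looks right.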
dwees Posted October 13, 2006

To make your life a bit easier, why don't you use
[code]$data = str_replace(' ', '', $data);[/code]
first, to make sure you aren't trying to match any spaces that don't exist?
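Note that stripping every space also removes the one inside "Item #:", so the pattern would have to change along with the data. An alternative sketch, using hypothetical sample strings, leaves the data alone and lets the pattern tolerate optional whitespace with \s*:

```php
<?php
// Two hypothetical variants of the same fragment: with and without a space after </strong>.
$samples = [
    '<strong>Item #:</strong> N82E16819103540</div>',
    '<strong>Item #:</strong>N82E16819103540</div>',
];

// \s* accepts zero or more whitespace characters, so both variants match
// without editing the data first.
$pattern = '/Item #:<\/strong>\s*(.*?)\s*<\/div>/is';

$ids = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m) === 1) {
        $ids[] = $m[1];
    }
}
print_r($ids);
```

Both samples should yield the same item number, with no surrounding whitespace in the capture.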
c4onastick Posted October 13, 2006

Thanks, I did get that to work. But you're right, it looks like the $data string from curl is messed up a little. I thought that you didn't need to replace the spaces if you use the /s flag (makes it treat it as a single line). Thanks, I'll give that a try.

I ran into another problem. I would like to get the item number, price and description out of the <item></item> tags. There are multiple items in the feed. And I figured I'd do it with regular expressions, since I'm a little weak on them and they seem like such a powerful tool (instead of the XML parser). Plus there's all kinds of HTML in the actual feed that I'm going to have to sift through anyway.

I just copied a chunk of the feed into the data variable this time, without using curl, and it worked.

I'm trying to build the second part of the regular expression like this (I copied and pasted this directly, so if there are any errors in the $data variable hopefully they'll be evident):
[code]$data = '<item>
<title>$169.00 - AMD Athlon 64 X2 3800+ Windsor 2.0GHz Socket AM2 Dual Core Processor Model ADA3800CUBOX - Retail</title>
/** More junk in here */
<div style="margin-bottom:15px;"><strong>Item #:</strong>N82E16819103735</div>
<div><strong>Price:</strong><span style="color:red;font-weight:bold;">$169.00</span></div>
/** More here */
</item>';
preg_match_all( '/Item #:<\/strong>(.*?)<\/div>.+?Price:<\/strong><span.*>\$([0-9]+.[0-9]{2})<\/span>/is', $data, $productid );
print_r($productid);[/code]
Then I should get an array with the price and item number. Although this is all I get now:
[code]Array
(
    [0] => Array ( )
    [1] => Array ( )
    [2] => Array ( )
)[/code]
It looks like I need to clean up $data to make it work better if I'm getting it through curl. How can I do that?

dwees, I tried str_replace, same problem. I think it may be a problem with the regexp. (I've gotten similar stuff to work on HTML feeds; I'm wondering why XML is being so ornery.)

Thank you both very much for the help!
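Two details in the pattern above are looser than they need to be: the greedy `<span.*>` can run far past the end of the tag, and the unescaped `.` in the price matches any character. A minimal sketch of a tighter variant, run against a hypothetical two-item sample modeled on the fragment in the post (the real feed may be structured differently):

```php
<?php
// Hypothetical two-item sample modeled on the fragment in the post.
$data = <<<'XML'
<item>
<div style="margin-bottom:15px;"><strong>Item #:</strong> N82E16819103735</div>
<div><strong>Price:</strong> <span style="color:red;font-weight:bold;">$169.00</span></div>
</item>
<item>
<div style="margin-bottom:15px;"><strong>Item #:</strong> N82E16819103562</div>
<div><strong>Price:</strong> <span style="color:red;font-weight:bold;">$185.00</span></div>
</item>
XML;

// [^>]* cannot run past the end of the <span> tag (unlike a greedy .*),
// and \. matches a literal decimal point rather than any character.
$pattern = '/Item #:<\/strong>\s*(.*?)\s*<\/div>.*?Price:<\/strong>\s*<span[^>]*>\$([0-9]+\.[0-9]{2})<\/span>/is';

// PREG_SET_ORDER groups the captures per match: [whole match, item number, price].
$count = preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

$products = [];
foreach ($matches as $m) {
    $products[$m[1]] = $m[2];
}
print_r($products);
```

PREG_SET_ORDER is a design choice here: it keeps each item's number and price together, which is usually easier to loop over than the default layout of one array per capture group.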
c4onastick Posted October 13, 2006

It just dawned on me while I was editing some of the junk out of the $data variable for that last post: I might be being too vague in the regexp. I did get just the item number grab to work. But for multiple sub-queries, since I only want data between the <item> tags, maybe it should be more like this:

Pattern:
[code]'/<item>.*?<title>.*?\$[0-9]+[.][0-9]{2} - (.*?)<\/title>.*?Item #:<\/strong>(.*?)<\/div>.*?Price:<\/strong><span.*>\$([0-9]+.[0-9]{2})<\/span>.*?<\/item>/is'[/code]
This should give me an array something like this:
[code]Array
(
    [0] => Array (Whole query, each match)
    [1] => Array (Item Description between the title tags w/o the price, each match)
    [2] => Array (Item Number, each match)
    [3] => Array (Price, each match)
)[/code]
Another question on the side. I'm trying to understand the '?' symbol's (not a pun!) function. According to the RegExLib.com definition, "0 or 1 of the previous expression; also forces minimal matching when an expression might match several strings within a search string." So essentially the '.*?' character set matches as few characters as possible until the next explicit string I define (in the example above, at its first occurrence, the <title> tag). Is that correct?
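That reading of '.*?' is essentially right. The difference between the greedy and the lazy form can be seen in a small sketch, using a hypothetical string with two title elements:

```php
<?php
// Hypothetical sample with two <title> elements on one line.
$data = '<title>first</title><title>second</title>';

// Greedy: .* grabs as much as possible, then backtracks to the LAST </title>.
preg_match('/<title>(.*)<\/title>/s', $data, $greedy);

// Lazy: .*? grabs as little as possible, stopping at the FIRST </title>.
preg_match('/<title>(.*?)<\/title>/s', $data, $lazy);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
```

The greedy capture spans both elements, while the lazy capture stops at the first closing tag. The two meanings in the quoted definition depend on position: a '?' after a token makes it optional, while a '?' after a quantifier such as '*' or '+' makes that quantifier lazy.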
c4onastick Posted October 13, 2006

Bear with me here: I tried the pattern above, and again nothing. I think the problem lies with the data variable. I added a print_r
[code]$data = '';
$ch = curl_init($url);
ob_start();
curl_exec($ch);
curl_close($ch);
$data = ob_get_contents();
ob_end_clean();
print_r($data);[/code]
to see what was coming out. It looked fine on the page, but I took a look at the source, and here's a chunk of what came out:
[code]<item>
  <title><![CDATA[$185.00 - AMD Athlon 64 X2 3800+ Manchester 2.0GHz Socket 939 Dual Core Processor Model ADA3800BVBOX - Retail]]></title>
  <link>http://www.newegg.com/Product/Product.asp?Item=N82E16819103562&CMP=OTC-RSS</link>
  <description>
    <div style="width:125px;float:left;clear:none;border:1px solid #ccc;background-color:#fff;padding:15px 5px;margin:10px 10px 10px 0px;">
      <img border="0" src="http://images10.newegg.com/NeweggImage/ProductImageCompressAll125/19-103-562-01.jpg" width="125" title="" alt="" />
    </div>
    <p style="margin:15px;">
      <div><strong>Model #:</strong> ADA3800BVBOX</div>
      <div style="margin-bottom:15px;"><strong>Item #:</strong> N82E16819103562</div>
      <div><strong>Price:</strong> <span style="color:red;font-weight:bold;">$185.00</span></div>
      <div><a href="http://secure.newegg.com/NewVersion/Shopping/AddToCart.asp?submit=ADD&ItemList=N82E16819103562&CMP=OTC-RSS" target="NEWEGGCART">Add To Cart</a></div>
    </p>
  </description>
  <pubDate>Fri, 13 Oct 2006 10:19:08 GMT</pubDate>
  <guid>http://www.newegg.com/Product/Product.asp?Item=N82E16819103562&CMP=OTC-RSS</guid>
  <comments>http://www.newegg.com/Product/CustratingReview.asp?Item=N82E16819103562&CMP=OTC-RSS</comments>
</item>[/code]
Rut-roh. It's taking out all the HTML entities, and specifically ones that I'm counting on being there for my regexp! How do I stop/fix this?
effigy Posted October 13, 2006

[quote]I thought that you didn't need to replace the spaces if you use the /s flag (makes it treat it as a single line).[/quote]
[tt]/s[/tt] makes [tt].[/tt] match newlines as well; it does nothing about spaces.

[quote]And I figured I'd do it with regular expressions since I'm a little weak on them and they seem like such a powerful tool (instead of the xml parser).[/quote]
Use the right tool for the job. It sounds like you need a SAX parser.

[quote]I'm trying to understand the '?' symbol's (not a pun!) function.[/quote]
Quantifiers are greedy by default: they try to match as much as possible. A [tt]?[/tt] after a quantifier makes it ungreedy.

Do the entities go away if you use echo instead of print_r?
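The parser suggestion can be sketched as follows. Note the swap: effigy recommends a SAX (event-based) parser, while this sketch uses SimpleXML, a tree-based extension commonly bundled with PHP, because it is shorter to demonstrate. The feed below is a hypothetical, trimmed-down sample, and it assumes the description's HTML arrives inside CDATA, which the real feed may not do:

```php
<?php
// Hypothetical, trimmed-down feed; the real feed has more elements per <item>.
$xml = <<<'XML'
<rss version="2.0"><channel>
<item>
<title><![CDATA[$185.00 - AMD Athlon 64 X2 3800+ Manchester - Retail]]></title>
<description><![CDATA[<div><strong>Item #:</strong> N82E16819103562</div>
<div><strong>Price:</strong> <span style="color:red;">$185.00</span></div>]]></description>
</item>
</channel></rss>
XML;

$feed = simplexml_load_string($xml);
if ($feed === false) {
    exit("could not parse feed\n");
}

$rows = [];
foreach ($feed->channel->item as $item) {
    // Casting an element to string yields its text, CDATA included.
    $title = (string) $item->title;
    $html  = (string) $item->description;

    // Each regex now only has to cope with one item's description.
    preg_match('/Item #:<\/strong>\s*([^<\s]+)/i', $html, $id);
    preg_match('/\$([0-9]+\.[0-9]{2})/', $html, $price);

    $rows[] = [
        'title' => $title,
        'item'  => $id[1] ?? null,
        'price' => $price[1] ?? null,
    ];
}
print_r($rows);
```

Letting the parser split the feed into items means the patterns no longer need to span from one tag to the next, which removes most of the greedy-matching pitfalls.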
c4onastick Posted October 13, 2006

No, actually, echo produces the same results. I'm not entirely sure it's curl's fault though, because it only changes out a few of the entities and keeps all the others. So maybe I'll just have to go back to HTML scraping! (I'd like to stick with XML though; it's more resilient to change.)
dwees Posted October 18, 2006

Isn't there a PHP function that converts HTML entities back into the characters you'd like to see? If so, then you can just run the $data through that function first.

Dave
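PHP does have such a function: html_entity_decode() (with htmlspecialchars_decode() covering only the basic set). A minimal sketch, assuming a hypothetical fragment in which the feed's embedded HTML arrives entity-encoded; whether the real feed does this is not confirmed in the thread:

```php
<?php
// Hypothetical fragment in which the embedded HTML arrives entity-encoded.
$data = '&lt;div&gt;&lt;strong&gt;Item #:&lt;/strong&gt; N82E16819103562&lt;/div&gt;';

// html_entity_decode() turns &lt; &gt; &amp; &quot; etc. back into literal characters.
$decoded = html_entity_decode($data, ENT_QUOTES, 'UTF-8');

// The original pattern expects literal angle brackets, so it matches only after decoding.
$pattern = '/Item #:<\/strong> (.*?)<\/div>/is';
$before = preg_match($pattern, $data, $unused);
$after  = preg_match($pattern, $decoded, $productid);

echo $before, ' ', $after, ' ', $productid[1], "\n";
```

The encoded string does not match and the decoded one does, which would explain a pattern that works in an online tester on pasted text but finds nothing in the raw download.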