playairguitar Posted September 5, 2009

Ok. I've been going at this for a while before giving up and posting. A document I'm parsing has sections which contain items in addition to various useless data. I'm trying to search the document for these sections (which have their own regular expression) and then find all the items, grouped by section.

The only solution I have found is the obvious (and boring) one: preg_match_all the section heading and footer and match anything between them, then use a for loop to traverse each match and preg_match_all again. What I would like to do is do this in one step, if only to satisfy my curiosity and know it can be done. So far I've only been able to match the last item.

An example is in order. I have a document which is similar to the following:

    here lies useless data
    <section>
    <item>apple</item>
    <item>orange</item>
    <item>pear</item>
    </section>
    here lies more useless data
    <section>
    <item>potatoe</item>
    <item>broccoli</item>
    <item>onion</item>
    </section>
    useless data piled high and deep

Now, let's say this document is contained in the $doc variable. My code is as follows:

    preg_match_all("/<section>(\\s+<item>([a-z]+)<\\/item>\\s+)+<\\/section>/s", $doc, $matches);
    print_r($matches);

Output:

    Array
    (
        [0] => Array
            (
                [0] => <section>
                       <item>apple</item>
                       <item>orange</item>
                       <item>pear</item>
                       </section>
                [1] => <section>
                       <item>potatoe</item>
                       <item>broccoli</item>
                       <item>onion</item>
                       </section>
            )
        [1] => Array
            (
                [0] => <item>pear</item>
                [1] => <item>onion</item>
            )
        [2] => Array
            (
                [0] => pear
                [1] => onion
            )
    )

As you can see, preg_match_all only returns the last item per section. Is there any way to have it return all the items in each section? Thanks for your help.

thebadbad Posted September 5, 2009

It can't be done with a single pattern, because the group count in a regular expression is fixed: a capturing group that repeats keeps only the text of its last iteration. You could argue that a pattern like ~foo(bar){3}~ has a fixed group count of three, and therefore should be able to capture three occurrences of bar, but it just doesn't work that way. So you have to start the good ol' regex engine at least twice. But you'd be better off using a foreach loop on the array of matches from the first preg_match_all() call, instead of a for loop ('cause that's what foreach loops are for).

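The two-pass approach described above could look roughly like the following. This is only a sketch against the sample document from the first post; the patterns `~<section>(.*?)</section>~s` and `~<item>([a-z]+)</item>~` and the variable name `$grouped` are assumptions made for illustration, and the real section and item expressions would take their place.

```php
<?php
// Sample input copied from the original post.
$doc = <<<EOF
here lies useless data
<section>
<item>apple</item>
<item>orange</item>
<item>pear</item>
</section>
here lies more useless data
<section>
<item>potatoe</item>
<item>broccoli</item>
<item>onion</item>
</section>
useless data piled high and deep
EOF;

// Pass 1: capture the body of every section (lazy match, dot matches newlines).
preg_match_all('~<section>(.*?)</section>~s', $doc, $sections);

// Pass 2: run the item pattern against each section body separately,
// so the items stay grouped by the section they came from.
$grouped = [];
foreach ($sections[1] as $body) {
    preg_match_all('~<item>([a-z]+)</item>~', $body, $items);
    $grouped[] = $items[1];
}

print_r($grouped);
```

Each element of `$grouped` is then the list of items for one section, in document order.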
nrg_alpha Posted September 5, 2009

I would use DOM / XPath instead of regex:

    $doc = <<<EOF
    here lies useless data
    <section>
    <item>apple</item>
    <item>orange</item>
    <item>pear</item>
    </section>
    here lies more useless data
    <section>
    <item>potatoe</item>
    <item>broccoli</item>
    <item>onion</item>
    </section>
    useless data piled high and deep
    EOF;

    $dom = new DOMDocument;
    @$dom->loadHTML($doc); // you can change loadHTML to loadHTMLFile and use a url in single quotes in the parenthesis
    $xpath = new DOMXPath($dom);
    $itemTag = $xpath->query('//item');

    foreach ($itemTag as $val) {
        echo $val->nodeValue . "<br /> ";
    }

Output:

    apple
    orange
    pear
    potatoe
    broccoli
    onion

thebadbad Posted September 5, 2009

@nrg_alpha Good call - I keep forgetting about the DOM approach. But your code grabs every item tag, right? Even if it doesn't appear as a child of a section tag. I'm sure that could be easily solved, though.

Garethp Posted September 5, 2009

Excuse me, but what's <<<EOF ? It's the second time I've seen it, and it seems to replace the quotes that I'm used to.

thebadbad Posted September 5, 2009

It's called heredoc syntax.

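For readers who have not met heredoc before, a minimal sketch follows. The identifier `EOF` is arbitrary (any label works as long as the opening and closing labels match), and the variable names `$name` and `$greeting` are made up for this illustration.

```php
<?php
$name = 'world';

// Heredoc behaves like a double-quoted string: variables are interpolated,
// but literal double quotes inside it need no escaping.
$greeting = <<<EOF
Hello, $name!
She said "hi" without any escaping.
EOF;

echo $greeting, "\n";
```

The closing label sits on its own line, which is what lets the multi-line sample document in the earlier post be embedded without any quoting.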
nrg_alpha Posted September 5, 2009

thebadbad wrote: "@nrg_alpha Good call - I keep forgetting about the DOM approach But your code grabs every item tag, right? Also if it doesn't appear as a child to a section tag. I'm sure that could be easily solved, though."

Yes, it grabs all item values. We can force it to select only items that are children of section:

    $itemTag = $xpath->query('//section/item'); // select all item tags that are children of section

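Since the original question asked for items grouped by section, the XPath idea can be taken one step further by querying each section node in turn. The following is a sketch only, using the sample document from the thread; the relative query `./item`, the `$grouped` array, and the use of `libxml_use_internal_errors()` in place of the `@` operator are choices made for this illustration, not something posted in the thread.

```php
<?php
// Sample input copied from the original post.
$doc = <<<EOF
here lies useless data
<section>
<item>apple</item>
<item>orange</item>
<item>pear</item>
</section>
here lies more useless data
<section>
<item>potatoe</item>
<item>broccoli</item>
<item>onion</item>
</section>
useless data piled high and deep
EOF;

// <item> is not a real HTML tag, so the parser reports errors;
// collect them internally instead of emitting warnings.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($doc);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Outer query finds the sections; the inner query is evaluated relative
// to each section node, so items stay grouped by their parent section.
$grouped = [];
foreach ($xpath->query('//section') as $section) {
    $items = [];
    foreach ($xpath->query('./item', $section) as $item) {
        $items[] = $item->nodeValue;
    }
    $grouped[] = $items;
}

print_r($grouped);
```

Passing the section node as the second argument to `query()` is what scopes the inner lookup to that one section.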
playairguitar (author) Posted September 6, 2009

Thanks for the reply. However, the actual document isn't an XML document. It's slightly more complicated than that. I used XML as a way to illustrate the problem. The section header and footer can vary in syntax and are found by a regular expression, as are the items.

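For a regex-only document like the one described here, one further possibility, which nobody in the thread proposed, is PCRE's `\G` anchor. It does not lift the fixed-group-count limit; instead each item becomes its own match, and `\G` forces every match after a section header to start exactly where the previous one ended. The sketch below is written against the sample document, and the pattern and variable names are assumptions for illustration. Note that it never checks for the closing `</section>`, so it only fits input where items follow the header contiguously.

```php
<?php
// Sample input copied from the original post.
$doc = <<<EOF
here lies useless data
<section>
<item>apple</item>
<item>orange</item>
<item>pear</item>
</section>
here lies more useless data
<section>
<item>potatoe</item>
<item>broccoli</item>
<item>onion</item>
</section>
useless data piled high and deep
EOF;

// Either start at a section header (captured in group 1), or continue
// exactly where the previous match ended (\G). The (?!\A) lookahead
// stops \G from matching at the very start of the document.
$pattern = '~(?:(<section>)|\G(?!\A))\s*<item>([a-z]+)</item>~';

preg_match_all($pattern, $doc, $matches, PREG_SET_ORDER);

$grouped = [];
foreach ($matches as $m) {
    if ($m[1] !== '') {
        $grouped[] = [];   // group 1 took part in the match: a new section begins
    }
    $grouped[count($grouped) - 1][] = $m[2];
}

print_r($grouped);
```

This is a single preg_match_all() call, but the grouping still happens in the loop afterwards, so the two-pass version may well be easier to maintain.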