OK, I've been going at this for a while before giving up and posting. A document I'm parsing has sections that contain items, mixed in with various useless data. I'm trying to search the document for these sections (which have their own regular expression) and then find all the items, grouped by section. The only solution I've found is the obvious (and boring) one: preg_match_all on the section header and footer to match everything between them, then loop over each match with a for loop and preg_match_all again. What I'd like is to do this in one step, if only to satisfy my curiosity and know it can be done.

 

So far I've only been able to match the last item in each section.

 

An example is in order. I have a document which is similar to the following:

 

here lies useless data

<section>
<item>apple</item>
<item>orange</item>
<item>pear</item>
</section>

here lies more useless data

<section>
<item>potatoe</item>
<item>broccoli</item>
<item>onion</item>
</section>

useless data piled high and deep

 

Now, let's say this document is contained in the $doc variable. My code is as follows:

 

 

preg_match_all("/<section>(\\s+<item>([a-z]+)<\\/item>\\s+)+<\\/section>/s", $doc, $matches);
print_r($matches);

 

 

Output:

Array
(
    [0] => Array
        (
            [0] => <section>
<item>apple</item>
<item>orange</item>
<item>pear</item>
</section>
            [1] => <section>
<item>potatoe</item>
<item>broccoli</item>
<item>onion</item>
</section>
        )

    [1] => Array
        (
            [0] => <item>pear</item>

            [1] => <item>onion</item>

        )

    [2] => Array
        (
            [0] => pear
            [1] => onion
        )

)

 

As you can see, preg_match_all only returns the last item per section. Is there any way to have it return all the items in each section?

 

 

Thanks for your help

It can't be done with a single pattern, because the group count in a regular expression is fixed. You could argue that a pattern like

 

~foo(bar){3}~

 

has a fixed group count of three, and therefore should be able to capture three occurrences of bar, but it just doesn't work that way: the pattern has a single capturing group, and a repeated group keeps only its last iteration (which is why you only get pear and onion). So you have to start the good ol' regex engine at least twice. But you'd be better off using a foreach loop on the array of matches from the first preg_match_all() call, instead of a for loop ('cause that's what foreach loops are for :)).
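For what it's worth, here is a minimal sketch of that two-pass approach with foreach. It assumes the XML-like header, footer, and item patterns from the question (the real ones would have to be swapped in), and $sections, $items, and $grouped are just names picked for illustration:

```php
<?php
// Sample document from the question.
$doc = <<<EOF
here lies useless data

<section>
<item>apple</item>
<item>orange</item>
<item>pear</item>
</section>

here lies more useless data

<section>
<item>potatoe</item>
<item>broccoli</item>
<item>onion</item>
</section>

useless data piled high and deep
EOF;

// Pass 1: capture the body of each section.
// The lazy .*? stops at the first footer; the s flag lets . match newlines.
preg_match_all('~<section>(.*?)</section>~s', $doc, $sections);

// Pass 2: collect every item inside each section body.
$grouped = [];
foreach ($sections[1] as $body) {
    preg_match_all('~<item>([a-z]+)</item>~', $body, $items);
    $grouped[] = $items[1]; // one sub-array of item values per section
}

print_r($grouped);
```

This should leave one sub-array per section in $grouped, each holding that section's items in document order.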

I would use DOM / XPath instead of regex:

 

$doc = <<<EOF
here lies useless data

<section>
   <item>apple</item>
   <item>orange</item>
   <item>pear</item>
</section>

here lies more useless data

<section>
   <item>potatoe</item>
   <item>broccoli</item>
   <item>onion</item>
</section>

useless data piled high and deep
EOF;


$dom = new DOMDocument;
@$dom->loadHTML($doc); // @ hides warnings about unknown tags; to read from a file or URL instead, use loadHTMLFile() with the path in quotes
$xpath = new DOMXPath($dom);
$itemTag = $xpath->query('//item');

foreach ($itemTag as $val) {
    echo $val->nodeValue . "<br /> ";
}

 

Output:

apple
orange
pear
potatoe
broccoli
onion

@nrg_alpha

Good call - I keep forgetting about the DOM approach :) But your code grabs every item tag, right? Even the ones that aren't children of a section tag. I'm sure that could be easily solved, though.

Yes, it grabs all item values. We can force it to select only items that are children of section:

 

$itemTag = $xpath->query('//section/item'); // select all item tags that are children of section
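If the items also need to stay grouped by section, one way is to query the sections first and then run a relative query inside each one. This is only a sketch: it assumes the same sample document as above and that loadHTML keeps the unknown tags nested as written, and $grouped and $items are names made up for the example:

```php
<?php
$doc = <<<EOF
here lies useless data

<section>
   <item>apple</item>
   <item>orange</item>
   <item>pear</item>
</section>

here lies more useless data

<section>
   <item>potatoe</item>
   <item>broccoli</item>
   <item>onion</item>
</section>

useless data piled high and deep
EOF;

$dom = new DOMDocument;
@$dom->loadHTML($doc); // @ hides warnings about the unknown tags
$xpath = new DOMXPath($dom);

$grouped = [];
foreach ($xpath->query('//section') as $section) {
    $items = [];
    // Relative query: 'item' is evaluated with $section as the context node,
    // so only direct item children of this section are returned.
    foreach ($xpath->query('item', $section) as $item) {
        $items[] = trim($item->nodeValue);
    }
    $grouped[] = $items;
}

print_r($grouped);
```

The second argument to query() is what keeps each inner search confined to its own section.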

Thanks for the reply. However, the actual document isn't an XML document. It's slightly more complicated than that. I used XML as a way to illustrate the problem. The section header and footer can vary in syntax and are found by regular expressions, as are the items.
