Jump to content

Regexp which will work as XML Parser


OsvaldoM

Recommended Posts

Hi all, hope you can help me out in this one, have been struggling all day with this issue:

 

We get an XML feed from a third party and sometimes the feed is corrupt, though we still need to extract as much info as we can from it, obviously being XML a very strict language, trying to parse it with any PHP-XML libraries will not work, and cleanup libraries such as Tidy HTML haven't done the job, so I am trying to manually break the feed in parts, then process it.

 

We usually get stuff like this:

 

<?xml version="1.0" encoding="utf-8"?>
<articles extbatchid="877" nextextbatchid="903" profileid="2012234">
<article articleid="1141402135">
<url>http://www.courant.com/features/bal-friends-sweeps0429,0,7443638.story</url>
<headline_text>TV lines up big doings, grand finales for the sweeps</headline_text>
<outlet>Hartford Courant</outlet>
<influential>By Hal Boedeker</influential>
<language>English</language>
<country>United States</country>
<publish_date>2004-04-29 12:45:29 UTC</publish_date>
<extract>Last \'Friends\' episode, \'Idol\' conclusion will drive May...</extract>
</article>
<article articleid="114140sdfsdf2135">
<url>http://www.mysite.com/y</url>
<headline_text>Osvaldo makes the headlines</headline_text>
<outlet>Dont Know</outlet>
<influential>By Hsfsdfsdfsdf</influential>
<language>English</language>
<country>MEXICO</country>
<publish_date>2004-04-29 12:45:29 UTC</publish_date>
<extract>HELLLLLLLLLLLOOOOOOOOOOOOOOO!!</extract>
</article>
<article Bad, broken down>This is a bad article<arcle>

What i am trying to accomplish is to make a regex for preg_match_all that will break down the article info into an array, and each key of it will hold all the article info, e.g:

 

array 
[0] => <article articleid="1141402135"> EVERYTHING IN BETWEEN </article>
[1] => <article articleid="112345677"> EVERYTHING IN BETWEEN </article>
[2] => <article articleid="123353457"> EVERYTHING IN BETWEEN </article>

 

I have already accomplished to get everything between two tags with

preg_match_all('/<tag(.*)(.*)?<\/tag>/', $articlesData, $pieces);

 

which works fine with most of the tags, except the one I really need:

preg_match_all('/<article(.*)(.*)?<\/article>/', $articlesData, $pieces);

 

the problem is that if I ran the above code i will get everything from the parent node <articles>, instead of the child <article>, i haven't been able to apply the proper "/b" nor to actually get closer to what i need.

 

Any help is highly appreciated, thanks!

 

Link to comment
Share on other sites

My first suggest would be to not accept invalid XML; to get the XML provider to fix their broken feed. 

 

However, to focus on the issue that you're having you need to look at the greedy/lazy behaviour of quantifiers (like * for zero-or-more, + for one or more, {3,6} for three-to-six [inclusive]) in your regular expression. .* is greedy (it will match as much as possible) whilst .*? is lazy (it will match only as much as is necessary).

 

More info: http://php.net/regexp.reference.repetition

Link to comment
Share on other sites

My first suggest would be to not accept invalid XML; to get the XML provider to fix their broken feed. 

I know, we used to do this before, though the feeds we get are quite large and if we escaped one of the feeds the data loss was considerable... anyways, thanks for the link, reading at the moment...

Link to comment
Share on other sites

Is there any consistency in "how" the data is corrupted? It might be easier to fix the corruption than to build your own XML parser.

The main two issues are: unclosed tags or feed isn't complete and illegal characters make the feed XML un-parsable. The feed comes in several languages and the guys which send us this feed apparently don't know what C-DATA and validation is...

Link to comment
Share on other sites

Just so you know, i was able to get what i want:

 

preg_match_all('~<article .*(.*)</article>~isU', $articlesData, $pieces);

 

notice the space after the word article... that was the trick. I am know building a regexp which will erase espaces between tags "> <" should be "><".

Also do notice that preg_match_all might not be the best idea if  you are looking for good performance of your query, 500 articles almost crash my firefox, good thing this is for a cron job... Hope the above code help someone out!

 

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.