Regexp which will work as XML Parser

OsvaldoM · June 9, 2010

Hi all, hope you can help me out in this one, have been struggling all day with this issue:

We get an XML feed from a third party and sometimes the feed is corrupt, though we still need to extract as much info as we can from it, obviously being XML a very strict language, trying to parse it with any PHP-XML libraries will not work, and cleanup libraries such as Tidy HTML haven't done the job, so I am trying to manually break the feed in parts, then process it.

We usually get stuff like this:

<?xml version="1.0" encoding="utf-8"?>
<articles extbatchid="877" nextextbatchid="903" profileid="2012234">
<article articleid="1141402135">
<url>http://www.courant.com/features/bal-friends-sweeps0429,0,7443638.story</url>
<headline_text>TV lines up big doings, grand finales for the sweeps</headline_text>
<outlet>Hartford Courant</outlet>
<influential>By Hal Boedeker</influential>
<language>English</language>
<country>United States</country>
<publish_date>2004-04-29 12:45:29 UTC</publish_date>
<extract>Last \'Friends\' episode, \'Idol\' conclusion will drive May...</extract>
</article>
<article articleid="114140sdfsdf2135">
<url>http://www.mysite.com/y</url>
<headline_text>Osvaldo makes the headlines</headline_text>
<outlet>Dont Know</outlet>
<influential>By Hsfsdfsdfsdf</influential>
<language>English</language>
<country>MEXICO</country>
<publish_date>2004-04-29 12:45:29 UTC</publish_date>
<extract>HELLLLLLLLLLLOOOOOOOOOOOOOOO!!</extract>
</article>
<article Bad, broken down>This is a bad article<arcle>

What i am trying to accomplish is to make a regex for preg_match_all that will break down the article info into an array, and each key of it will hold all the article info, e.g:

array 
[0] => <article articleid="1141402135"> EVERYTHING IN BETWEEN </article>
[1] => <article articleid="112345677"> EVERYTHING IN BETWEEN </article>
[2] => <article articleid="123353457"> EVERYTHING IN BETWEEN </article>

I have already accomplished to get everything between two tags with

preg_match_all('/<tag(.*)(.*)?<\/tag>/', $articlesData, $pieces);

which works fine with most of the tags, except the one I really need:

preg_match_all('/<article(.*)(.*)?<\/article>/', $articlesData, $pieces);

the problem is that if I ran the above code i will get everything from the parent node <articles>, instead of the child <article>, i haven't been able to apply the proper "/b" nor to actually get closer to what i need.

Any help is highly appreciated, thanks!

salathe · June 10, 2010

My first suggest would be to not accept invalid XML; to get the XML provider to fix their broken feed.

However, to focus on the issue that you're having you need to look at the greedy/lazy behaviour of quantifiers (like * for zero-or-more, + for one or more, {3,6} for three-to-six [inclusive]) in your regular expression. .* is greedy (it will match as much as possible) whilst .*? is lazy (it will match only as much as is necessary).

More info: http://php.net/regexp.reference.repetition

OsvaldoM · June 10, 2010

My first suggest would be to not accept invalid XML; to get the XML provider to fix their broken feed.

I know, we used to do this before, though the feeds we get are quite large and if we escaped one of the feeds the data loss was considerable... anyways, thanks for the link, reading at the moment...

Psycho · June 10, 2010

Is there any consistency in "how" the data is corrupted? It might be easier to fix the corruption than to build your own XML parser.

OsvaldoM · June 10, 2010

Is there any consistency in "how" the data is corrupted? It might be easier to fix the corruption than to build your own XML parser.

The main two issues are: unclosed tags or feed isn't complete and illegal characters make the feed XML un-parsable. The feed comes in several languages and the guys which send us this feed apparently don't know what C-DATA and validation is...

OsvaldoM · June 11, 2010

Just so you know, i was able to get what i want:

preg_match_all('~<article .*(.*)</article>~isU', $articlesData, $pieces);

notice the space after the word article... that was the trick. I am know building a regexp which will erase espaces between tags "> <" should be "><".

Also do notice that preg_match_all might not be the best idea if you are looking for good performance of your query, 500 articles almost crash my firefox, good thing this is for a cron job... Hope the above code help someone out!

Sign In

Regexp which will work as XML Parser

Recommended Posts

OsvaldoM

Link to comment

Share on other sites

salathe

Link to comment

Share on other sites

OsvaldoM

Link to comment

Share on other sites

Psycho

Link to comment

Share on other sites

OsvaldoM

Link to comment

Share on other sites

OsvaldoM

Link to comment

Share on other sites

Join the conversation

Browse

Activity

Important Information