Parsing wikipedia text

mhykhh · June 7, 2008

Hey guys, I've given up on experimenting with different patterns. The goal is to parse raw Wikipedia text and split it into its sections, each with its own text.

A sample raw output from wikipedia:

==Early life==
Gates was born in [[seattle]], [[Washington]], to [[William H. Gates, Sr.]] and [[Mary Maxwell Gates]]. His family was wealthy; his father was a prominent lawyer, his mother served on the board of directors for [[First Interstate BancSystem]] and the [[united Way of America|United Way]], and her father, J. W. Maxwell, was a [[National bank#United States|national bank]] president. Gates has one older sister, Kristi (Kristianne), and one younger sister, Libby. He was the fourth of his name in his family, but was known as William Gates III or "Trey" because his father had dropped his own "III" suffix.{{harv|Manes|1994|p=15}} Early on in his life, Gates' parents had a law career in mind for him.{{harv|Manes|1994|p=47}}

And the most effective pattern I've been using is:

/[=]{2,6}([^==]+)[=]{2,6}([^==]+)/is

One problem is that even a single "=" stops the match, so the section text gets cut off (see the last part of the raw text, {{harv|Manes|1994|p=47}}).

The other problem is that headings can use two or more "=" characters, so the [^==] in the pattern could cause problems later on.
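To make both problems concrete, here is a minimal sketch using a shortened, hypothetical sample string (not the full text above). It relies on the fact that [^==] is a character class, so it means "any character except =", exactly like [^=]; the second group therefore stops at the first "=" it meets, wherever that is.

```php
<?php
// Hypothetical, shortened sample: a heading, then text containing a template with "=".
$raw = "==Early life==\nGates was born in [[Seattle]].{{harv|Manes|1994|p=15}} More text.";

// [^==] is a character class: "any character except =", same as [^=].
preg_match('/[=]{2,6}([^==]+)[=]{2,6}([^==]+)/is', $raw, $m);

echo $m[1], "\n"; // the heading title
echo $m[2], "\n"; // the body, which ends right before the "=" in "p=15"
```

Everything after the "p" in the template is lost, which matches the cut-off described above.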

TIA

effigy · June 9, 2008

Will this suffice?

<pre>
<?php
$data = <<<DATA
==A==
A stuff
==B==
B stuff
==C==
C stuff
DATA;
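// Split on lines that consist solely of a level-2 heading (==Title==).
// The m flag makes ^ and $ match at line boundaries; PREG_SPLIT_DELIM_CAPTURE
// keeps each captured title in the result, so titles and bodies alternate.
$pieces = preg_split('/^\s*==([^=]+)==\s*$/m', $data, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($pieces);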
?>
</pre>
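The snippet above only matches headings written with exactly two "=" on each side. As a possible extension, here is a minimal sketch that accepts two to six "=" and uses a backreference so the closing run has to match the opening one. The function name split_wiki_sections, the array keys, and the sample text are all hypothetical, and it assumes Unix ("\n") line endings.

```php
<?php
// Hypothetical helper: split raw wikitext into sections (level, title, text).
function split_wiki_sections(string $raw): array
{
    // (={2,6}) captures the opening run of "="; \1 requires the same run to
    // close the heading. A lone "=" inside body text (e.g. p=47) never matches,
    // because the heading has to fill a whole line.
    $pattern = '/^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$/m';
    $parts = preg_split($pattern, $raw, -1, PREG_SPLIT_DELIM_CAPTURE);

    $sections = [];
    // $parts[0] is whatever precedes the first heading; after that the
    // entries come in triples: "=" run, title, body.
    for ($i = 1; $i + 2 < count($parts); $i += 3) {
        $sections[] = [
            'level' => strlen($parts[$i]),
            'title' => $parts[$i + 1],
            'text'  => trim($parts[$i + 2]),
        ];
    }
    return $sections;
}

// Hypothetical sample input.
$raw = "Intro\n==Early life==\nBorn.{{harv|Manes|1994|p=15}}\n===School===\nLakeside.\n";
$sections = split_wiki_sections($raw);
print_r($sections);
```

PREG_SPLIT_NO_EMPTY is left out on purpose here: keeping empty pieces means the result always has the same triple layout, even when a section has no text.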
