
Parsing Wikipedia text


mhykhh


Hey guys, I've given up on experimenting with different patterns. The goal is to parse raw Wikipedia text so I can get the different sections and the text that belongs to each one.

 

A sample of raw output from Wikipedia:

==Early life==
Gates was born in [[seattle]], [[Washington]], to [[William H. Gates, Sr.]] and [[Mary Maxwell Gates]]. His family was wealthy; his father was a prominent lawyer, his mother served on the board of directors for [[First Interstate BancSystem]] and the [[united Way of America|United Way]], and her father, J. W. Maxwell, was a [[National bank#United States|national bank]] president. Gates has one older sister, Kristi (Kristianne), and one younger sister, Libby. He was the fourth of his name in his family, but was known as William Gates III or "Trey" because his father had dropped his own "III" suffix.{{harv|Manes|1994|p=15}} Early on in his life, Gates' parents had a law career in mind for him.{{harv|Manes|1994|p=47}}

 

The most effective pattern I've found so far is:

/[=]{2,6}([^==]+)[=]{2,6}([^==]+)/is

 

One problem is that even a single "=" is treated as a delimiter, so the captured text gets cut off there (see the {{harv|Manes|1994|p=47}} template at the end of the raw text).

The other problem is that headings can have two or more "=" characters, so the [^==] in the pattern could cause problems later on.
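
To make the first problem concrete, here is a minimal sketch that runs the pattern above against a shortened version of the sample. The shortened string is an assumption for illustration; the point is that `[^==]` is a character class, so it means the same as `[^=]` and the second capture stops at the first "=" it meets, including the one inside the `{{harv}}` template.

```php
<?php
// Shortened sample (an assumption for illustration); the heading and the
// template are taken from the raw text above.
$raw = "==Early life==\nGates was born in [[seattle]].{{harv|Manes|1994|p=15}} More text.";

// The pattern from the post. [^==] is a character class, so it is
// equivalent to [^=]: "any character except =".
preg_match('/[=]{2,6}([^==]+)[=]{2,6}([^==]+)/is', $raw, $m);

echo $m[1], "\n"; // the heading title
echo $m[2], "\n"; // the body, which ends just before the "=" in "p=15"
```

Everything from "=15}}" onward is missing from the second capture, which is the truncation described above.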

 

TIA


Will this suffice?

 

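<pre>
<?php
// Sample wikitext with three level-2 headings.
$data = <<<DATA
==A==
A stuff
==B==
B stuff
==C==
C stuff
DATA;
// Split on lines that contain only a ==heading==. PREG_SPLIT_DELIM_CAPTURE keeps
// each heading title in the result; PREG_SPLIT_NO_EMPTY drops empty pieces.
$pieces = preg_split('/^\s*==([^=]+)==\s*$/m', $data, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($pieces);
?>
</pre>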
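
The snippet above only covers headings with exactly two "=" on each side. A possible extension for two to six, as a rough sketch: capture the opening run of "=" and require the same run to close the heading with a backreference, still anchored to whole lines so a stray "=" inside a template is ignored. The function name `split_wiki_sections` and the returned array shape are hypothetical, and the sketch assumes "\n" line endings.

```php
<?php
// Hypothetical helper (not from the thread): split raw wikitext into sections.
function split_wiki_sections(string $text): array
{
    // (={2,6}) captures the opening run of "="; \1 requires the same run to
    // close the heading. ^ and $ with the /m flag limit a match to one line.
    $pattern = '/^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$/m';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $sections = [];
    // $parts[0] is any text before the first heading. After that the entries
    // repeat in groups of three: "=" run, title, body.
    for ($i = 1; $i + 2 < count($parts); $i += 3) {
        $sections[] = [
            'level' => strlen($parts[$i]),
            'title' => $parts[$i + 1],
            'body'  => trim($parts[$i + 2]),
        ];
    }
    return $sections;
}

// Shortened sample (an assumption for illustration) with a nested heading.
$raw = "==Early life==\nGates was born in [[seattle]].{{harv|Manes|1994|p=15}}\n===Family===\nGates has one older sister.";
$sections = split_wiki_sections($raw);
print_r($sections);
```

PREG_SPLIT_NO_EMPTY is left out on purpose here, so that the pieces always come in the same order and a section with an empty body does not shift the grouping.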
