
Parsing wikipedia text


mhykhh


Hey guys, I've given up on experimenting with different patterns. The goal here is to parse raw Wikipedia text into its sections, so that I can get each section heading together with its text.

 

A sample of raw output from Wikipedia:

==Early life==
Gates was born in [[seattle]], [[Washington]], to [[William H. Gates, Sr.]] and [[Mary Maxwell Gates]]. His family was wealthy; his father was a prominent lawyer, his mother served on the board of directors for [[First Interstate BancSystem]] and the [[united Way of America|United Way]], and her father, J. W. Maxwell, was a [[National bank#United States|national bank]] president. Gates has one older sister, Kristi (Kristianne), and one younger sister, Libby. He was the fourth of his name in his family, but was known as William Gates III or "Trey" because his father had dropped his own "III" suffix.{{harv|Manes|1994|p=15}} Early on in his life, Gates' parents had a law career in mind for him.{{harv|Manes|1994|p=47}}

 

And the most effective pattern I've been using is:

/[=]{2,6}([^==]+)[=]{2,6}([^==]+)/is

 

One problem is that even a single "=" stops the match, so the section text gets cut off at the first "=" it contains (see the {{harv|Manes|1994|p=15}} and {{harv|Manes|1994|p=47}} templates in the raw text).

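The other problem is that headings can be wrapped in two or more "=" characters, and [^==] is just a character class that excludes any single "=" (the same as [^=]), so it cannot tell a heading marker apart from a stray "=" in the text.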

 

TIA
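
One possible approach, as a rough sketch rather than a full wikitext parser: match only the heading lines, anchored to the start and end of a line, and use a backreference so the closing run of "=" has the same length as the opening one. Everything between two headings is then the section text, so a "=" inside a template no longer matters. The sketch is in Python for illustration; the pattern relies only on multiline anchors and a backreference, which PCRE also supports. The sample text is shortened from the post, and its second section, the names RAW, OLD, HEADING and the helper split_sections are made up for the example.

```python
import re

# Shortened sample; the "Education" section is a made-up placeholder.
RAW = (
    "==Early life==\n"
    "Gates was born in [[Seattle]].{{harv|Manes|1994|p=15}} "
    "His parents had a law career in mind for him.{{harv|Manes|1994|p=47}}\n"
    "===Education===\n"
    "Placeholder text for a nested section.\n"
)

# The pattern from the post: [^==] is the same as [^=], so the body
# stops at the first "=" it meets (here the one in "p=15").
OLD = re.compile(r"[=]{2,6}([^==]+)[=]{2,6}([^==]+)", re.I | re.S)

# Heading line: 2 to 6 "=", the title, then the same number of "=" (\1).
# "." does not match newlines here, so a heading stays on one line.
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(text):
    """Return a list of (level, title, body) tuples, one per heading."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # The body runs from the end of this heading to the next heading,
        # or to the end of the text for the last section.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        sections.append((len(m.group(1)), m.group(2), body))
    return sections


if __name__ == "__main__":
    print(repr(OLD.search(RAW).group(2)))  # body truncated before "=15"
    for level, title, body in split_sections(RAW):
        print(level, title, "->", body)
```

Text before the first heading (the article lead) is ignored by this sketch, and headings inside nowiki or comment blocks would still be matched, so it is only a starting point.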

https://forums.phpfreaks.com/topic/109102-parsing-wikipedia-text/