mhykhh Posted June 7, 2008

Hey guys, I've given up on experimenting with different patterns. The goal here is to parse raw Wikipedia text so that I can get the different sections and their respective text.

A sample of raw output from Wikipedia:

==Early life==
Gates was born in [[seattle]], [[Washington]], to [[William H. Gates, Sr.]] and [[Mary Maxwell Gates]]. His family was wealthy; his father was a prominent lawyer, his mother served on the board of directors for [[First Interstate BancSystem]] and the [[united Way of America|United Way]], and her father, J. W. Maxwell, was a [[National bank#United States|national bank]] president. Gates has one older sister, Kristi (Kristianne), and one younger sister, Libby. He was the fourth of his name in his family, but was known as William Gates III or "Trey" because his father had dropped his own "III" suffix.{{harv|Manes|1994|p=15}} Early on in his life, Gates' parents had a law career in mind for him.{{harv|Manes|1994|p=47}}

The most effective pattern I've used so far is:

/[=]{2,6}([^==]+)[=]{2,6}([^==]+)/is

One problem is that even a single "=" ends the match, so the section text gets cut off (see the last part of the raw text, {{harv|Manes|1994|p=47}}). This happens because [^==] is a character class, which means "any character except =", not "anything except ==". The other problem is that headings can use more than two "=" signs, so the [^==] in the pattern could cause problems later on.

TIA
effigy Posted June 9, 2008

Will this suffice?

<pre>
<?php
$data = <<<DATA
==A==
A stuff
==B==
B stuff
==C==
C stuff
DATA;

// Split on lines that consist only of a ==heading==; the captured title
// is returned alongside the text pieces thanks to PREG_SPLIT_DELIM_CAPTURE.
$pieces = preg_split('/^\s*==([^=]+)==\s*$/m', $data, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($pieces);
?>
</pre>
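The original question also mentions headings with more than two "=" signs. One way the `preg_split` approach above might be extended is to capture the opening run of "=" and require the same run to close the heading with a backreference. The following is only a minimal sketch under some assumptions: the function name `split_wiki_sections` and the returned array keys are made up for illustration, line endings are assumed to be "\n", and each heading is assumed to sit on its own line.

```php
<?php
// Hypothetical helper (not from the thread): split wikitext into sections,
// keeping the heading level, the title, and the body text of each section.
function split_wiki_sections(string $text): array
{
    // (={2,6}) captures the opening run of "=" and \1 requires the same run
    // to close the heading. Anchoring with ^ and $ under the /m modifier
    // means a lone "=" inside the body (e.g. p=47) never starts a heading.
    $pattern = '/^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$/m';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $sections = [];
    // $parts[0] is whatever precedes the first heading (possibly empty).
    // After that the pieces repeat in threes: "=" run, title, body.
    for ($i = 1; $i + 2 < count($parts); $i += 3) {
        $sections[] = [
            'level' => strlen($parts[$i]),
            'title' => $parts[$i + 1],
            'body'  => trim($parts[$i + 2]),
        ];
    }
    return $sections;
}

// Shortened sample in the style of the raw text from the question.
$sample = <<<'WIKI'
==Early life==
Gates was born in [[Seattle]].{{harv|Manes|1994|p=15}}
===Family===
Gates has two sisters.{{harv|Manes|1994|p=47}}
WIKI;

$sections = split_wiki_sections($sample);
print_r($sections);
```

Because `PREG_SPLIT_NO_EMPTY` is deliberately left out here, the pieces keep a fixed order, which is what lets the loop step through them in groups of three.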