
Parsing HTML from Wikipedia


DIM3NSION


Hi guys. I have been using the Wikipedia API to retrieve information about a topic. I've managed to get a response and retrieve the first section of the topic (in this case, football).

 

I'm using this request URL, built inside a PHP string with $search holding the topic: http://en.wikipedia.org/w/api.php?action=parse&page='.$search.'&redirects=1&format=json&prop=text&section=0
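For reference, the same request can be assembled with http_build_query() so that topics containing spaces or parentheses are URL-encoded. This is only a sketch: the function name buildWikipediaParseUrl is hypothetical and not part of my actual code, and the parameters are simply the ones from the URL above.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the post):
// builds the action=parse request URL for a given topic.
function buildWikipediaParseUrl(string $search): string
{
    $params = [
        'action'    => 'parse',
        'page'      => $search,
        'redirects' => 1,
        'format'    => 'json',
        'prop'      => 'text',
        'section'   => 0,
    ];

    // http_build_query() URL-encodes every value; the explicit '&'
    // separator avoids depending on the arg_separator.output ini setting.
    return 'http://en.wikipedia.org/w/api.php?' . http_build_query($params, '', '&');
}

echo buildWikipediaParseUrl('Football (ball)'), "\n";
```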

 

However, the first section that is retrieved includes the pictures, and I just want the main text from the introduction.

 

The data sent back from Wikipedia is this:

Array
(
    [parse] => Array
        (
            [text] => Array
                (
                    [*] => <div class="dablink">This article is about sports known as football.  For the ball used in these sports, see <a href="/wiki/Football_(ball)">Football (ball)</a>.</div> 
<div class="thumb tright"> 
<div class="thumbinner" style="width:227px;"><a href="/wiki/File:Football4.png" class="image"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Football4.png/225px-Football4.png" width="225" height="274" class="thumbimage" /></a> 
<div class="thumbcaption"> 
<div class="magnify"><a href="/wiki/File:Football4.png" class="internal" title="Enlarge"><img src="http://bits.wikimedia.org/skins-1.17/common/images/magnify-clip.png" width="15" height="11" alt="" /></a></div> 
Some of the many different games known as football. From top left to bottom right: <a href="/wiki/Association_football">Association football</a> or soccer, <a href="/wiki/Australian_rules_football">Australian rules football</a>, <a href="/wiki/International_rules_football">International rules football</a>, <a href="/wiki/Rugby_Union" class="mw-redirect" title="Rugby Union">Rugby Union</a>, <a href="/wiki/Rugby_League" class="mw-redirect" title="Rugby League">Rugby League</a>, and <a href="/wiki/American_Football" class="mw-redirect" title="American Football">American Football</a>.</div> 
</div> 
</div> 
<p>The game of <b>football</b> is any of several similar <a href="/wiki/Team_sport" title="Team sport">team sports</a>, of similar origins which involve advancing a ball into a goal area in an attempt to score. Many of these involve <a href="/wiki/Kick_(football)" title="Kick (football)">kicking</a> a ball with the foot to score a <a href="/wiki/Goal_(sport)" title="Goal (sport)">goal</a>, though not all codes of football using kicking as a primary means of advancing the ball or scoring. The most popular of these sports worldwide is <a href="/wiki/Association_football">association football</a>, more commonly known as just "football" or "soccer". Unqualified, the word <i><a href="/wiki/Football_(word)" title="Football (word)">football</a></i> applies to whichever form of football is the most popular in the regional context in which the word appears, including <a href="/wiki/American_football">American football</a>, <a href="/wiki/Australian_rules_football">Australian rules football</a>, <a href="/wiki/Canadian_football">Canadian football</a>, <a href="/wiki/Gaelic_football">Gaelic football</a>, <a href="/wiki/Rugby_league">rugby league</a>, <a href="/wiki/Rugby_union">rugby union</a> and other related games. These variations are known as "codes".</p> 

 

 

I want the content that resides in the <p> tags. How would I go about parsing this and removing the rest? I've tried to get Simple HTML DOM Parser to work, but with no luck.

 

Any help would be greatly appreciated  8)

 

Thanks,

 

DIM3NSION
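
One possible approach to the question above, sketched with PHP's built-in DOM extension instead of Simple HTML DOM Parser. The function name extractParagraphs and the shortened sample response are assumptions made for illustration. The array keys ['parse']['text']['*'] follow the dump in the post. Note that this collects every <p> element in the fragment, so a paragraph nested inside an image box would be returned as well.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the post):
// returns the inner HTML of every <p> element in an HTML fragment.
function extractParagraphs(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    // The XML encoding hint makes libxml treat the fragment as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // Serialize the children so links and bold text are kept,
        // but the surrounding <p> tag itself is dropped.
        $inner = '';
        foreach ($p->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $paragraphs[] = $inner;
    }

    return $paragraphs;
}

// Shortened stand-in for the decoded API response shown in the post.
$response = ['parse' => ['text' => ['*' =>
    '<div class="thumb tright">image markup</div>'
    . '<p>The game of <b>football</b> is a team sport.</p>',
]]];

$paragraphs = extractParagraphs($response['parse']['text']['*']);
echo $paragraphs[0], "\n";             // intro paragraph with inline markup
echo strip_tags($paragraphs[0]), "\n"; // same paragraph as plain text
```

If only plain text is needed, strip_tags() on each returned paragraph (as in the last line) removes the remaining links and formatting.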

