
screen scraping help with preg_match


ngng


I'm having trouble screen scraping with preg_match. For example, I'm trying to pull everything from Wikipedia between these tags:

 

<h5 style="white-space: nowrap;"> and </h5>

 

Ideally, it should return:

<h5 style="white-space: nowrap;"><label for="searchInput">
<span lang="en" xml:lang="en">Search</span> <b>·</b>
<span lang="de" xml:lang="de">Suche</span> <b>·</b>
<span lang="fr" xml:lang="fr">Rechercher</span> <b>·</b>
<span lang="pl" xml:lang="pl">Szukaj</span> <b>·</b>
<span lang="ja" xml:lang="ja" title="Kensaku">検索</span> <b>·</b>
<span lang="it" xml:lang="it">Ricerca</span> <b>·</b>
<span lang="nl" xml:lang="nl">Zoeken</span> <b>·</b>
<span lang="pt" xml:lang="pt">Busca</span> <b>·</b>
<span lang="es" xml:lang="es">Buscar</span><br />
<span lang="sv" xml:lang="sv">Sök</span> <b>·</b>
<span lang="ru" xml:lang="ru" title="Poisk">Поиск</span> <b>·</b>
<span lang="zh" xml:lang="zh" title="Sōusuǒ">搜索</span> <b>·</b>
<span lang="nb" xml:lang="nb">Søk</span> <b>·</b>
<span lang="fi" xml:lang="fi">Haku</span> <b>·</b>
<span lang="vo" xml:lang="vo">Suk</span> <b>·</b>
<span lang="ca" xml:lang="ca">Cerca</span> <b>·</b>
<span lang="ro" xml:lang="ro">Căutare</span> <b>·</b>
<span lang="tr" xml:lang="tr">Ara</span> <b>·</b>
<span lang="uk" xml:lang="uk" title="Pošuk">Пошук</span>
</label></h5>

 

I can't seem to get it to work. Yes, I know content changes and screen scraping is not the best way to get data, but for the sake of learning, I want to try this.

 

<?

$url = file_get_contents('http://wikipedia.org/');
$regex = '/\<h5(.*)\>\<\/h5\>/m';


// match 
preg_match($regex, $url, $output);

var_dump($output);

?>
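For reference, here is a minimal sketch of one way the pattern could be adjusted. Two things in the attempt above probably get in the way: the `m` modifier only changes how `^` and `$` behave, so `.` still stops at newlines (the `s` modifier is the one that lets `.` span lines), and the pattern expects `>` to sit directly before `</h5>` with the captured text inside the opening tag. The `$html` sample below is a shortened, hypothetical stand-in for the fetched page so the snippet runs without network access; the extra `<h5>` is there only to show that the lazy `.*?` stops at the first closing tag.

```php
<?php
// Hypothetical, shortened stand-in for the page returned by file_get_contents().
$html = <<<'HTML'
<div>
<h5 style="white-space: nowrap;"><label for="searchInput">
<span lang="en" xml:lang="en">Search</span> <b>·</b>
<span lang="de" xml:lang="de">Suche</span>
</label></h5>
<h5>another heading</h5>
</div>
HTML;

// s  : let "." match newlines, so the match can span several lines
// .*?: lazy quantifier, stops at the first </h5> instead of the last one
$regex = '/<h5 style="white-space: nowrap;">.*?<\/h5>/s';

if (preg_match($regex, $html, $output)) {
    // $output[0] holds the whole match, including both <h5> tags
    echo $output[0], "\n";
} else {
    echo "no match\n";
}
```

This assumes the opening tag appears in the page exactly as written in the pattern; if the attributes change, the match fails. For anything beyond a learning exercise, an HTML parser such as PHP's DOMDocument is usually less brittle than a regex.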

 


