
Preg_match crawling from HTML tags


htorbov


I have a small preg_match problem in my new torrent crawler (it uses cURL and preg_match):

 

crawler.php

$pattern = "/Category<\/td><td[^>]>(.*?)<\/td>/s";
preg_match($pattern, $contents, $categorymatches);
if (!isset($categorymatches[1])) {
    echo "FAILED TO MATCH CATEGORY!";
    return FALSE;
}

 

Returns:

FAILED TO MATCH CATEGORY!

 

HTML that needs to be crawled (the word MOVIES):

Type</td><td valign="top" align=left>MOVIES</td>

 

Any help is appreciated!

 


"Category" in your pattern will never match "Type" in the string.

 

do you want to match anything in between table columns?

 

if so...

 

$str = "Type</td><td valign='top' align=left>MOVIES</td>";
$pattern = "~<td(?>[^>]+)>((?>[^<]+))</td>~";
preg_match($pattern,$str,$ms);
print_r($ms);

 

if not, specify the requirements more thoroughly.
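Besides the label mismatch, the pattern in the first post has a second problem worth noting: `<td[^>]>` has no quantifier, so `[^>]` matches exactly one character and cannot span ` valign="top" align=left`. A minimal sketch of the difference, using the HTML snippet from the first post (the variable names are only for illustration):

```php
<?php
// HTML snippet from the first post.
$html = 'Type</td><td valign="top" align=left>MOVIES</td>';

// Attribute part as originally written: [^>] without a quantifier
// matches exactly one character.
$withoutQuantifier = '/Type<\/td><td[^>]>(.*?)<\/td>/s';

// Same pattern with * added, so any run of attribute characters is accepted.
$withQuantifier = '/Type<\/td><td[^>]*>(.*?)<\/td>/s';

$first  = preg_match($withoutQuantifier, $html, $m1); // 0: no match
$second = preg_match($withQuantifier, $html, $m2);    // 1: match

var_dump($first, $second);
echo $m2[1], PHP_EOL; // captured cell text
```

So even with the right label, the original pattern would need `[^>]*` (or `[^>]+`) before the closing `>`.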

Thanks a lot, but I also want to include "Category" in the pattern, for example:

$str = "Category</td><td valign='top' align=left>MOVIES</td>";
$pattern = "/Category<td(?>[^>]+)>((?>[^<]+))<\/td>/s";
preg_match($pattern,$str,$ms);
print_r($ms);

 

But it gives me an empty array.

 

EDIT: I made a mistake in my last post: the HTML says Category, not Type :)

So, in a few words, I want this:

$str = "Category</td><td valign='top' align=left>MOVIES</td>";
$pattern = "/Category<td(?>[^>]+)>((?>[^<]+))<\/td>/s";
preg_match($pattern,$str,$ms);
print_r($ms);

 

to return:

Array ( [0] => MOVIES [1] => MOVIES )

 

But it returns:

Array ( )

 

ah,

 

$str = "Category</td><td valign='top' align=left>MOVIES</td>";
$pattern = "~Category</td><td(?>[^>]+)>((?>[^<]+))</td>~";
preg_match($pattern,$str,$ms);
print_r($ms);

 

$ms[1] will hold the captured value, in this case "MOVIES"
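If the same lookup is needed for several labels, the pattern above can be wrapped in a small helper. This is only a sketch: `extractCellAfterLabel()` is a hypothetical name, and it assumes the label cell is immediately followed by the value cell, as in the snippet above. It uses `[^>]*` instead of the atomic group so that a bare `<td>` with no attributes also matches.

```php
<?php
/**
 * Return the text of the <td> that directly follows a cell ending in $label,
 * or null if no such pair of cells is found.
 * (Hypothetical helper, not part of the original crawler.)
 */
function extractCellAfterLabel($html, $label)
{
    // preg_quote() escapes regex metacharacters in the label and the "~" delimiter.
    $pattern = '~' . preg_quote($label, '~') . '</td>\s*<td[^>]*>([^<]+)</td>~';

    if (preg_match($pattern, $html, $ms) === 1) {
        return trim($ms[1]);
    }
    return null;
}

$str = "Category</td><td valign='top' align=left>MOVIES</td>";

var_dump(extractCellAfterLabel($str, 'Category')); // string(6) "MOVIES"
var_dump(extractCellAfterLabel($str, 'Type'));     // NULL
```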

Hi, thanks a lot for the help. It works now, but only if I put it in a new file.

 

Here's what happens when I use it in the crawler:

// ... the cURL code (it works) ...
// Content of the Page
$contents = curl_exec($crawler->curl);

// Find the Title
$pattern = "/<title>(.*?)<\/title>/s";
preg_match($pattern, $contents, $titlematches);
echo $titlematches[1]."<br/>";

// Find the Category
$pattern = "~Тип</td><td(?>[^>]+)>((?>[^<]+))</td>~";
preg_match($pattern, $contents, $categorymatches);
echo $categorymatches[1]."<br/>";

 

The HTML page: ("Тип" means Category and "Филми" means Movies)

<title>The Matrix</title>
<!-- Some code here -->
<tr><td>Тип</td><td valign="top" align=left>Филми</td></tr>
<!-- Some code here -->

 

The result:

The Matrix
Notice: Undefined offset: 1 in /var/www/spider.php on line 117

 

Very strange! It shows the title but not the category.

I've tried echoing $categorymatches[0], $categorymatches[2], and $categorymatches[3] without any luck.
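The notice itself comes from reading `$categorymatches[1]` after a failed match: when preg_match() finds nothing it returns 0 and leaves the matches array empty, so no index exists at all. A hedged sketch of checking the return value first; the `$contents` string here is a stand-in for the cURL response, and its content is assumed for illustration:

```php
<?php
// Stand-in for the cURL response; deliberately contains no category row.
$contents = '<title>The Matrix</title>';

$pattern = '~Тип</td><td(?>[^>]+)>((?>[^<]+))</td>~';
$result  = preg_match($pattern, $contents, $categorymatches);

if ($result === 1) {
    $category = $categorymatches[1];
} elseif ($result === 0) {
    // Pattern is valid but nothing matched; $categorymatches is an empty array.
    $category = null;
} else {
    // false: the regex engine reported an error; preg_last_error() returns its code.
    $category = null;
    echo 'preg error code: ', preg_last_error(), PHP_EOL;
}

echo $category === null ? 'FAILED TO MATCH CATEGORY!' : $category, PHP_EOL;
```

This does not explain why the match fails on the live page, but it turns the notice into an explicit failure branch.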

 

matches for me when tested..

 

$str = '<td>Тип</td><td valign="top" align=left>Филми</td></tr>';
$pattern = '~Тип</td><td(?>[^>]+)>((?>[^<]+))</td>~';
preg_match($pattern,$str,$ms);
print_r($ms);

 

results:

 

Array
(
    [0] => Тип</td><td valign="top" align=left>Филми</td>
    [1] => Филми
)

Most likely a charset issue. Also, if you want curl_exec() to return the transfer as a string instead of printing it, the CURLOPT_RETURNTRANSFER option needs to be set.

 

$ch = curl_init("http://www.test.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
$transf = curl_exec($ch);
curl_close($ch);
if($transf !== false)
    var_dump($transf);
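To illustrate the charset point: the pattern in the crawler is stored in whatever encoding the PHP file uses (UTF-8 in this sketch), so it can only match if the downloaded bytes use the same encoding. The example below assumes the page is served as Windows-1251, which is a guess and not something stated in this thread; the real charset should be taken from the response's Content-Type header or the page's meta tag.

```php
<?php
// This file is assumed to be saved as UTF-8, so the literals below are UTF-8 bytes.
$pattern = '~Тип</td><td(?>[^>]+)>((?>[^<]+))</td>~';
$utf8Row = '<tr><td>Тип</td><td valign="top" align=left>Филми</td></tr>';

// Simulate a response body encoded as Windows-1251 (assumed charset).
$contents = iconv('UTF-8', 'Windows-1251', $utf8Row);

// Raw bytes: the UTF-8 label in the pattern is not present in the Windows-1251 text.
$rawResult = preg_match($pattern, $contents, $rawMatches);

// Convert the response to UTF-8 before matching.
$converted       = iconv('Windows-1251', 'UTF-8', $contents);
$convertedResult = preg_match($pattern, $converted, $ms);

var_dump($rawResult);       // int(0)
var_dump($convertedResult); // int(1)
echo $ms[1], PHP_EOL;       // Филми
```

That would also be consistent with the title matching while the category does not: `<title>` is plain ASCII and looks the same in both encodings, whereas the Cyrillic label does not.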
