Preg_match crawling from HTML tags


htorbov

I have a small preg_match problem in my new torrent crawler (using cURL and preg_match):

 

crawler.php

$pattern = "/Category<\/td><td[^>]>(.*?)<\/td>/s";
preg_match($pattern, $contents, $categorymatches);
if(!isset($categorymatches[1])) {
echo "FAILED TO MATCH CATEGORY!";
return FALSE;
}

 

Returns:

FAILED TO MATCH CATEGORY!

 

HTML that needs to be crawled (the word MOVIES):

Type</td><td valign="top" align=left>MOVIES</td>

 

Any help is appreciated!

 


"Category" in your pattern will never match "Type" in the string.

Do you want to match anything in between table columns?

 

If so:

 

$str = "Type</td><td valign='top' align=left>MOVIES</td>";
$pattern = "~<td(?>[^>]+)>((?>[^<]+))</td>~";
preg_match($pattern,$str,$ms);
print_r($ms);

 

If not, specify the requirements more thoroughly.
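
A side note on the pattern from the first post: apart from the label mismatch, the piece <td[^>]> has no quantifier, so it allows exactly one character between "<td" and ">" and cannot span the attributes. The sketch below compares that form with <td[^>]*> on the sample row from the question. It assumes PHP 7 or later (for the ?? operator); the variable names are only for illustration.

```php
<?php
// Sample row taken from the question.
$contents = 'Type</td><td valign="top" align=left>MOVIES</td>';

// [^>] without a quantifier matches exactly one character, so "<td[^>]>"
// cannot span ' valign="top" align=left'.
$strict = '/Type<\/td><td[^>]>(.*?)<\/td>/s';

// [^>]* allows any run of attribute characters (including none).
$loose = '/Type<\/td><td[^>]*>(.*?)<\/td>/s';

$strictResult = preg_match($strict, $contents, $m1); // 0: no match
$looseResult  = preg_match($loose, $contents, $m2);  // 1: match

var_dump($strictResult, $looseResult, $m2[1] ?? null);
```

With the quantifier in place, $m2[1] holds "MOVIES".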


Thanks a lot, but I also want to include "Category" in the pattern, for example:

$str = "Category</td><td valign='top' align=left>MOVIES</td>";
$pattern = "/Category<td(?>[^>]+)>((?>[^<]+))<\/td>/s";
preg_match($pattern,$str,$ms);
print_r($ms);

 

But it gives me an empty array.

 

EDIT: I made a mistake in my last post: the label in the HTML is not Type, it's also Category :)


So, in a few words, I want this code:

$str = "Category</td><td valign='top' align=left>MOVIES</td>";
$pattern = "/Category<td(?>[^>]+)>((?>[^<]+))<\/td>/s";
preg_match($pattern,$str,$ms);
print_r($ms);

 

to return:

Array ( [0] => MOVIES [1] => MOVIES )

 

But instead it returns:

Array ( )

 


Ah, the pattern was missing the closing </td> after Category:

 

$str = "Category</td><td valign='top' align=left>MOVIES</td>";
$pattern = "~Category</td><td(?>[^>]+)>((?>[^<]+))</td>~";
preg_match($pattern,$str,$ms);
print_r($ms);

 

$ms[1] will hold the captured value, in this case "MOVIES".
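
If several labelled cells need to be read, the same idea can be wrapped in a small function. This is only a sketch: cell_after_label is a hypothetical name, not something from the crawler, and the nullable return type assumes PHP 7.1 or later. It escapes the label with preg_quote, tolerates whitespace between the tags, and returns null instead of leaving the caller to read a missing array offset.

```php
<?php
/**
 * Hypothetical helper: returns the text of the <td> that follows a label
 * cell, or null when the label is not found.
 */
function cell_after_label(string $html, string $label): ?string
{
    // preg_quote escapes regex metacharacters in the label; "~" is the delimiter.
    $pattern = '~' . preg_quote($label, '~')
             . '\s*</td>\s*<td[^>]*>\s*([^<]+?)\s*</td>~';

    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

$str = "Category</td><td valign='top' align=left>MOVIES</td>";
var_dump(cell_after_label($str, 'Category')); // "MOVIES"
var_dump(cell_after_label($str, 'Type'));     // NULL
```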


Hi, thanks a lot for the help. It works now, but only if I put it in a new file.

 

Here's what happens when I use it in the crawler:

// ... the cURL codes (they're working) ...
// Content of the Page
$contents = curl_exec($crawler->curl);

// Find the Title
$pattern = "/<title>(.*?)<\/title>/s";
preg_match($pattern, $contents, $titlematches);
echo $titlematches[1]."<br/>";

// Find the Category
$pattern = "~Тип</td><td(?>[^>]+)>((?>[^<]+))</td>~";
preg_match($pattern, $contents, $categorymatches);
echo $categorymatches[1]."<br/>";

 

The HTML page: ("Тип" means Category and "Филми" means Movies)

<title>The Matrix</title>
<!--Some Codes Here--!>
<tr><td>Тип</td><td valign="top" align=left>Филми</td></tr>
<!--Some Codes Here--!>

 

The result:

The Matrix
Notice: Undefined offset: 1 in /var/www/spider.php on line 117

 

Very strange! It shows the title but not the category.

I've tried to echo $categorymatches[0], $categorymatches[2], $categorymatches[3] without any luck.

 


It matches for me when tested.

 

$str = '<td>Тип</td><td valign="top" align=left>Филми</td></tr>';
$pattern = '~Тип</td><td(?>[^>]+)>((?>[^<]+))</td>~';
preg_match($pattern,$str,$ms);
print_r($ms);

 

results:

 

Array
(
    [0] => Тип</td><td valign="top" align=left>Филми</td>
    [1] => Филми
)
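
As for the "Undefined offset: 1" notice in the crawler: when preg_match finds nothing it returns 0 and leaves the matches array empty, so no offset exists at all. A minimal sketch of checking the return value before reading the capture follows; the $contents string is a stand-in for the downloaded page, with the label deliberately absent.

```php
<?php
// Stand-in for the downloaded page; the label is deliberately absent here.
$contents = '<title>The Matrix</title>';
$pattern  = '~Category</td><td[^>]*>([^<]+)</td>~';

$result = preg_match($pattern, $contents, $categorymatches);

if ($result === 1) {
    $message = $categorymatches[1];
} elseif ($result === 0) {
    // No match: $categorymatches is an empty array, so every offset is undefined.
    $message = 'no match';
} else {
    // false means the regex engine itself failed; preg_last_error() gives the code.
    $message = 'regex error ' . preg_last_error();
}

echo $message, "\n"; // prints "no match"
```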


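Most likely a charset issue: if the script file and the downloaded page use different encodings, the Cyrillic bytes in the pattern will not match the bytes in the page. Also, if you want curl_exec() to return the transfer as a string, the CURLOPT_RETURNTRANSFER option needs to be set.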

 

$ch = curl_init("http://www.test.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
$transf = curl_exec($ch);
curl_close($ch);
if($transf !== false)
    var_dump($transf);
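
To illustrate the charset point, here is a rough sketch of normalising the page to UTF-8 before matching. It rests on two assumptions that the thread does not confirm: that the script file is saved as UTF-8, and that the remote page is served as Windows-1251. It also needs the mbstring extension. The downloaded bytes are simulated by re-encoding the sample row.

```php
<?php
// Assumption: this script file is saved as UTF-8, and the remote page is
// served as Windows-1251. Neither is confirmed in the thread.
$pattern = '~Тип</td><td[^>]*>([^<]+)</td>~u';

// Simulate the downloaded bytes by encoding the sample row as Windows-1251.
$contents = mb_convert_encoding(
    '<tr><td>Тип</td><td valign="top" align=left>Филми</td></tr>',
    'Windows-1251',
    'UTF-8'
);

// If the bytes are not valid UTF-8, convert them before matching.
if (!mb_check_encoding($contents, 'UTF-8')) {
    $contents = mb_convert_encoding($contents, 'UTF-8', 'Windows-1251');
}

$category = preg_match($pattern, $contents, $m) === 1 ? $m[1] : null;
var_dump($category);
```

In a real crawler the source encoding would come from the page's Content-Type header or meta tag rather than being hard-coded.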
