
Extracting an image src using a hyperlink as an anchor



Hello.

 

I'm trying to extract the image src value. The sample data I'm trying to extract the link from is:

<a name="poster" href="/rg/action-box-title/primary-photo/media/rm118396416/tt0811080" title="Speed Racer"><img border="0" alt="Speed Racer" title="Speed Racer" src="http://ia.media-imdb.com/images/M/MV5BMTA5MjgxMDE4OTVeQTJeQWpwZ15BbWU3MDgyNjc4NjE@._V1._SX94_SY140_.jpg" /></a>

 

The expected output is whatever sits between the quotation marks of the src attribute. The regex needs to anchor on the <a name="poster" text so that it picks up the correct image link.

 

This is the code I have so far, which extracts other data from an IMDB page. I'm trying to extract the Poster Image link as well:

 

<?php

//url
$imdbcode = $_GET['code'];
$url = 'http://www.imdb.com/title/'.$imdbcode.'/';

//get the page content
$imdb_content = get_data($url);

//parse for product name
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content = '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content
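function get_match($regex,$content)
{
// return an empty string instead of raising a notice when nothing matches
if (preg_match($regex,$content,$matches) && isset($matches[1])) {
return $matches[1];
}
return '';
}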

//gets the data from a URL
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}

?>

 

Cheers.

 

 

 

As a side note, I took a stab at it and came up with the following line, which I placed under the "//parse for product name" comment:

$poster = get_match('/<a name="\/poster" [^>]*><img [^> src="(.*)" /></a>/isU',$imdb_content);

I also added this under the first $content line:

$content.= '<h2>Poster</h2><p>'.$poster.'</p>';

That yielded the following error:

Warning: preg_match() [function.preg-match]: Unknown modifier '>' in /home/jurud/public_html/imdb.php on line 34

I'm guessing my regex isn't up to scratch.
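For anyone hitting the same warning: it usually means an unescaped delimiter inside the pattern. A minimal sketch of the cause and two possible fixes follows; the `poster.jpg` markup is a made-up stand-in, not taken from the IMDB page.

```php
<?php
// Hypothetical, shortened markup used only for this demonstration.
$html = '<img src="poster.jpg" /></a>';

// With "/" as the delimiter, the first unescaped "/" ends the pattern,
// so PCRE tries to read the following ">" as a modifier and fails.
// The @ operator only hides the warning for this demo.
$broken = @preg_match('/src="(.*)" /></a>/isU', $html, $m1);

// Fix 1: escape every "/" that belongs to the pattern itself.
$escaped = preg_match('/src="(.*)" \/><\/a>/isU', $html, $m2);

// Fix 2: pick a delimiter that does not occur in the pattern, such as "#".
$hashed = preg_match('#src="(.*)" /></a>#isU', $html, $m3);

var_dump($broken);   // false, because the pattern was rejected
echo $m2[1], "\n";   // captured src from the escaped pattern
echo $m3[1], "\n";   // captured src from the "#"-delimited pattern
```

Switching the delimiter tends to be easier to read when the pattern contains closing tags.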


When parsing XML/HTML, it's better to use dedicated classes/methods (if you have them) than to construct a regex from scratch.

$string = '<a name="poster" href="/rg/action-box-title/primary-photo/media/rm118396416/tt0811080" title="Speed Racer"><img border="0" alt="Speed Racer" title="Speed Racer" src="http://ia.media-imdb.com/images/M/MV5BMTA5MjgxMDE4OTVeQTJeQWpwZ15BbWU3MDgyNjc4NjE@._V1._SX94_SY140_.jpg" /></a>';


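// only parse when the poster anchor is present
if (strpos($string, '<a name="poster"') !== false) {
    // works here because this snippet happens to be well-formed XML
    $xml = new SimpleXMLElement($string);
    echo $xml->img['src'];
}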

Thank you both for your replies.

 

@DarkWater: I didn't write this code, and I have absolutely no experience with regular expressions, but I figured out what you meant by escaping all the slashes, so I came up with this:

$poster = get_match('/<a name="\/poster" [^>]*><img [^>]* src="(.*)" \/><\/a>/isU',$imdb_content);

 

That no longer shows any error messages, but it doesn't return anything either. So now I just need help working out the correct regex to identify what I'm looking for.
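One likely reason the pattern above returns nothing is the stray slash in name="\/poster": the sample markup has name="poster" with no slash. Below is a minimal sketch of a pattern anchored on that attribute, tested only against the sample snippet from the first post; the live page may be laid out differently, and the variable names are just for illustration.

```php
<?php
// Sample markup copied from the first post.
$html = '<a name="poster" href="/rg/action-box-title/primary-photo/media/rm118396416/tt0811080" title="Speed Racer"><img border="0" alt="Speed Racer" title="Speed Racer" src="http://ia.media-imdb.com/images/M/MV5BMTA5MjgxMDE4OTVeQTJeQWpwZ15BbWU3MDgyNjc4NjE@._V1._SX94_SY140_.jpg" /></a>';

// "#" is the delimiter, so the slashes in the markup need no escaping.
// The pattern finds the poster anchor, skips to the <img> inside it,
// and captures everything between the quotes of its src attribute.
$pattern = '#<a\s+name="poster"[^>]*>\s*<img\s[^>]*?\bsrc="([^"]*)"#i';

// Fall back to null when the anchor is not found.
$poster = preg_match($pattern, $html, $m) ? $m[1] : null;

echo $poster, "\n";
```

In the original script this would amount to passing the same pattern to get_match() with $imdb_content as the second argument.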

 

If it helps the page I'm parsing is: http://www.imdb.com/title/tt0811080/

 

@ghostdog74: I'm not sure how to fit your solution into the code I already have. However, thank you for your help.

Well, here's how I would fetch anything within quotes (double or single at that) in an src:

 

$str = '<a name="poster" href="/rg/action-box-title/primary-photo/media/rm118396416/tt0811080" title="Speed Racer"><img border="0" alt="Speed Racer" title="Speed Racer" src="http://ia.media-imdb.com/images/M/MV5BMTA5MjgxMDE4OTVeQTJeQWpwZ15BbWU3MDgyNjc4NjE@._V1._SX94_SY140_.jpg" /></a>';
preg_match('#src=["\']([^"\']+)["\']#', $str, $match);
echo $match[1];

 

Output:

http://ia.media-imdb.com/images/M/MV5BMTA5MjgxMDE4OTVeQTJeQWpwZ15BbWU3MDgyNjc4NjE@._V1._SX94_SY140_.jpg

  • 4 years later...

Is it possible to update this for the following markup?

 

<td rowspan="2" id="img_primary">
					    <div class="image">
<a href="/media/rm2761862144/tt0978762?ref_=tt_ov_i" > <img height="317"
width="214"
alt="Mary et Max. (2009) Poster"
title="Mary et Max. (2009)"
src="http://ia.media-imdb.com/images/M/MV5BMTQ1NDIyNTA1Nl5BMl5BanBnXkFtZTcwMjc2Njk3OA@@._V1_SY317_CR4,0,214,317_.jpg"
itemprop="image" />
</a>						    </div>
			    </td>

 

Currently I am using this preg_match:

 

preg_match('#<td rowspan="2" id="img_primary">[^"]+<div class="image"><a.*" ><img * src="(.*)" .*><\\/a><\\/div><\\/td>#isU', $text, $photo );

 

This does not work.

 

Thanks very much.

You are MUCH, MUCH better off using PHP's DOMDocument class. Scraping information like that is so much easier, and it is also easier to fix whenever the site changes something in its markup.

 

Google DOMDocument and you should receive everything you need.
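As a rough illustration of that suggestion, here is a minimal sketch using DOMDocument and DOMXPath on the table cell posted above. The surrounding `<table><tr>` wrapper is an assumption added so the fragment parses as it would inside a full page, and the XPath expression is only one possible way to reach the image.

```php
<?php
// The table cell from the post above, copied as-is.
$cell = <<<'HTML'
<td rowspan="2" id="img_primary">
<div class="image">
<a href="/media/rm2761862144/tt0978762?ref_=tt_ov_i" > <img height="317"
width="214"
alt="Mary et Max. (2009) Poster"
title="Mary et Max. (2009)"
src="http://ia.media-imdb.com/images/M/MV5BMTQ1NDIyNTA1Nl5BMl5BanBnXkFtZTcwMjc2Njk3OA@@._V1_SY317_CR4,0,214,317_.jpg"
itemprop="image" />
</a></div>
</td>
HTML;

// Assumption: wrap the cell in a table row, as it would be on a real page.
$html = '<table><tr>' . $cell . '</tr></table>';

// Real-world HTML is rarely valid, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Ask for the src attribute of the first <img> inside the img_primary cell.
$xpath = new DOMXPath($doc);
$src = $xpath->evaluate('string(//td[@id="img_primary"]//img/@src)');

echo $src, "\n";
```

Whitespace, attribute order and line breaks inside the tag do not matter here, which is where the regex attempt above runs into trouble.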


Yes, I recoded the old script afterwards.

 

So this really isn't a job for preg_match?

One line might be doable with regular expressions, but with the number of attributes and tags you're trying to match, regexes are not going to be adequate. So, no: this is not an issue for which you want to use preg_match().

HTML is a markup language, not a regular language, after all. ;)
