Extracting an image src using a hyperlink as an anchor

tmhai · October 29, 2008

Hello.

I'm trying to extract the image src value. The sample data I'm trying to extract the link from is:

<a name="poster" href="/rg/action-box-title/primary-photo/media/rm118396416/tt0811080" title="Speed Racer"><img border="0" alt="Speed Racer" title="Speed Racer" src="http://ia.media-imdb.com/images/M/MV5BMTA5MjgxMDE4OTVeQTJeQWpwZ15BbWU3MDgyNjc4NjE@._V1._SX94_SY140_.jpg" /></a>

The expected output is everything between the quotation marks of the src attribute. The regex needs to anchor on the <a name="poster" text so that it extracts the correct image link.

This is the code I have so far, which extracts other data from an IMDB page. I'm trying to extract the Poster Image link as well:

<?php

//url
$imdbcode = $_GET['code'];
$url = 'http://www.imdb.com/title/'.$imdbcode.'/';

//get the page content
$imdb_content = get_data($url);

//parse for product name
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content = '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content
function get_match($regex,$content)
{
//return the first capture group, or an empty string when nothing matches
if (preg_match($regex,$content,$matches)) {
return $matches[1];
}
return '';
}

//gets the data from a URL
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}

?>

Cheers.

As a side note, I took a stab at it and came up with the following line, which I placed under the "//parse for product name" comment:

$poster = get_match('/<a name="\/poster" [^>]*><img [^> src="(.*)" /></a>/isU',$imdb_content);

I also added this under the first instance of $content:

$content.= '<h2>Poster</h2><p>'.$poster.'</p>';

That yielded the following error:

Warning: preg_match() [function.preg-match]: Unknown modifier '>' in /home/jurud/public_html/imdb.php on line 34

I'm guessing my regex isn't up to scratch.

DarkWater · October 29, 2008

Since you used / as your delimiter, you need to escape all of the / in the regex, or use a different delimiter.
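The warning arises because PHP treats the first unescaped / as the end of the pattern and reads whatever follows as modifiers, hence "Unknown modifier '>'". A minimal sketch of both options follows; the $html string is a shortened, hypothetical stand-in for the sample markup, and example.com is a placeholder URL.

```php
<?php
// Hypothetical sample markup, shortened from the snippet in the first post.
$html = '<a name="poster" href="/x"><img alt="Speed Racer" src="http://example.com/poster.jpg" /></a>';

// Option 1: keep / as the delimiter and escape every literal slash in the pattern.
preg_match('/<img [^>]*src="([^"]*)" \/><\/a>/i', $html, $m1);

// Option 2: use # as the delimiter so literal slashes need no escaping.
preg_match('#<img [^>]*src="([^"]*)" /></a>#i', $html, $m2);

echo $m1[1], "\n";
echo $m2[1], "\n";
```

Both calls capture the same URL; the second pattern is usually easier to read when the text being matched is HTML.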

ghostdog74 · October 29, 2008

When parsing XML/HTML, it's better to use dedicated classes/methods (if you have them) than to construct a regex from scratch.

$string = '<a name="poster" href="/rg/action-box-title/primary-photo/media/rm118396416/tt0811080" title="Speed Racer"><img border="0" alt="Speed Racer" title="Speed Racer" src="http://ia.media-imdb.com/images/M/MV5BMTA5MjgxMDE4OTVeQTJeQWpwZ15BbWU3MDgyNjc4NjE@._V1._SX94_SY140_.jpg" /></a>';


// Only parse when the poster anchor is present; the fragment must be well-formed XML.
if (strpos($string, '<a name="poster"') !== FALSE) {
    $xml = new SimpleXMLElement($string);
    echo $xml->img['src'];
}

tmhai · October 29, 2008

Thank you both for your replies.

@DarkWater: I didn't write this code, and I have absolutely no experience with regular expressions, but I figured out what you mean by escaping all the slashes, so I came up with this:

$poster = get_match('/<a name="\/poster" [^>]*><img [^>]* src="(.*)" \/><\/a>/isU',$imdb_content);

That no longer shows any error messages, but it doesn't return anything either. So now I just need help figuring out the correct regex to identify what I'm looking for.

If it helps the page I'm parsing is: http://www.imdb.com/title/tt0811080/

@ghostdog74: I'm not sure how to implement your solution with the code I already have. However, thank you for your help.

nrg_alpha · October 29, 2008

Well, here's how I would fetch anything within quotes (double or single) in a src attribute:

$str = '<a name="poster" href="/rg/action-box-title/primary-photo/media/rm118396416/tt0811080" title="Speed Racer"><img border="0" alt="Speed Racer" title="Speed Racer" src="http://ia.media-imdb.com/images/M/MV5BMTA5MjgxMDE4OTVeQTJeQWpwZ15BbWU3MDgyNjc4NjE@._V1._SX94_SY140_.jpg" /></a>';
preg_match('#src=["\']([^"\']+)["\']#', $str, $match);
echo $match[1];

Output:

http://ia.media-imdb.com/images/M/MV5BMTA5MjgxMDE4OTVeQTJeQWpwZ15BbWU3MDgyNjc4NjE@._V1._SX94_SY140_.jpg
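Note that the pattern above returns the first src it finds anywhere in the string, which is fine for the one-line sample but may pick up a different image on a full page. Separately, tmhai's escaped pattern asks for name="\/poster", i.e. a leading slash that the sample markup does not contain, which would explain the empty result. A minimal sketch of anchoring the same idea on the <a name="poster" link follows. It assumes the <img> tag directly follows the opening <a> tag, as in the sample; the $html string and example.com URLs are hypothetical.

```php
<?php
// Hypothetical fragment: a decoy image first, then the poster link.
$html = '<img src="http://example.com/logo.png" />'
      . '<a name="poster" href="/x" title="Speed Racer">'
      . '<img border="0" alt="Speed Racer" src="http://example.com/poster.jpg" /></a>';

// Anchor on <a name="poster", skip the rest of that tag,
// then capture the src value of the <img> that follows it.
$pattern = '#<a\s+name="poster"[^>]*>\s*<img\b[^>]*\bsrc=["\']([^"\']+)["\']#i';

// Fall back to an empty string when the anchor is not found.
$poster = preg_match($pattern, $html, $match) ? $match[1] : '';
echo $poster, "\n";
```

The same $pattern could be passed to the get_match() helper from the first post in place of the earlier attempt, though it will stop matching if the page markup changes.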

jojo2a2a · February 8, 2013

Is it possible to update this for the following markup?

<td rowspan="2" id="img_primary">
					    <div class="image">
<a href="/media/rm2761862144/tt0978762?ref_=tt_ov_i" > <img height="317"
width="214"
alt="Mary et Max. (2009) Poster"
title="Mary et Max. (2009)"
src="http://ia.media-imdb.com/images/M/MV5BMTQ1NDIyNTA1Nl5BMl5BanBnXkFtZTcwMjc2Njk3OA@@._V1_SY317_CR4,0,214,317_.jpg"
itemprop="image" />
</a>						    </div>
			    </td>

At the moment I'm using this preg_match call:

preg_match('#<td rowspan="2" id="img_primary">[^"]+<div class="image"><a.*" ><img * src="(.*)" .*><\\/a><\\/div><\\/td>#isU', $text, $photo );

It doesn't work.

Many thanks.

Zane · February 8, 2013

You are MUCH, MUCH better off using PHP's DOMDocument class. Scraping information that way is so much easier, and it is also easier to fix whenever the site changes something in its markup.

Google DOMDocument and you should receive everything you need.
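As a rough illustration of the DOMDocument approach, here is a minimal sketch that pulls the src from the first <img> inside the element with id="img_primary". The $html string is a hypothetical fragment modelled on the markup posted above, with a placeholder example.com URL; on a real page you would pass the fetched page content to loadHTML() instead.

```php
<?php
// Hypothetical fragment modelled on the markup posted above.
$html = '<table><tr><td rowspan="2" id="img_primary"><div class="image">'
      . '<a href="/media/rm2761862144/tt0978762?ref_=tt_ov_i"> <img height="317" width="214" '
      . 'alt="Mary et Max. (2009) Poster" src="http://example.com/poster.jpg" itemprop="image" />'
      . '</a></div></td></tr></table>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// First <img> anywhere inside the element with id="img_primary".
$nodes = $xpath->query('//td[@id="img_primary"]//img');

// Fall back to an empty string when no matching image exists.
$src = $nodes->length > 0 ? $nodes->item(0)->getAttribute('src') : '';
echo $src, "\n";
```

Because the query depends only on the id and the tag name, it is unaffected by attribute order, line breaks, or extra whitespace, which are the things that tend to break a hand-written regex.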

jojo2a2a · February 14, 2013

Yes, I rewrote the old script afterwards.

So this really isn't a job for preg_match?

Christian F. · February 14, 2013

One line might be doable with regular expressions, but with the number of attributes and tags you're trying to match, regexps are not going to be adequate. So, no: this is not an issue for which you want to use preg_match().

HTML is a markup language, not a regular language, after all.
