Grab/scrape most relevant image from website


wannabe21


Hi,

 

This is my first post and I would like to kick it off with a question :)

 

I am trying to write a function that grabs the most relevant image from a website, but I have run into a few problems that I need a bit of help with. To get an idea of what I am trying to accomplish, I suggest reading the article on shareaholic. It describes in quite some depth what you should do to grab the most relevant image from a website. Most of the functionality described in that article already works in my own program:

  • Only the first 32 KB of each image is fetched to read the headers, so that I can calculate the width, height and aspect ratio. The resulting array is sorted by size, biggest on top, smallest at the bottom.
  • All OG meta tags and Twitter tags are found and stored in an array.
  • Todo: compile a list of the most commonly used DIV IDs for content; WordPress uses 'content', other CMSs use 'main', and so on.
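
To make the first bullet concrete, here is a rough sketch of the header-parsing step, written in Python for brevity. It is not my actual code: the function names are made up for illustration, it only handles PNG and GIF (JPEG needs a scan through the segment markers and is left out), and it assumes the partial download has already been done and is passed in as bytes.

```python
import struct


def image_size_from_header(data: bytes):
    """Return (width, height) for PNG or GIF bytes, or None if unrecognised."""
    # PNG: 8-byte signature, 4-byte chunk length, "IHDR", then width and
    # height as big-endian 32-bit integers at offsets 16 and 20.
    if (len(data) >= 24 and data[:8] == b"\x89PNG\r\n\x1a\n"
            and data[12:16] == b"IHDR"):
        return struct.unpack(">II", data[16:24])
    # GIF: 6-byte signature, then width and height as little-endian
    # 16-bit integers.
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    return None


def rank_by_area(candidates):
    """Sort (url, header_bytes) pairs by pixel area, biggest first.

    Images whose format is not recognised are skipped.
    """
    sized = []
    for url, data in candidates:
        size = image_size_from_header(data)
        if size is None:
            continue
        width, height = size
        sized.append({
            "url": url,
            "width": width,
            "height": height,
            "aspect": width / height if height else 0.0,
        })
    return sorted(sized, key=lambda img: img["width"] * img["height"],
                  reverse=True)
```

For both formats the dimensions sit in the first couple of dozen bytes, so far less than 32 KB is needed for them.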

With this information I should be able to grab the most relevant image from most websites, BUT...

 

The problem I run into is that the biggest image is not always the most relevant one. On top of that, some ad images on certain websites have the perfect aspect ratio and are quite big, so I get wrong results.

 

Has anyone here ever tried to do the same thing? If so, how did you work around 'my' problem? Perhaps my approach is completely wrong.

 

Thanks in advance,

 

W//

 

 

 

 


You're never gonna make a script that gets it right every time; there is no set way sites do it.

 

The best thing to do is figure it out for each site you are visiting and have your script use whichever pattern-matching method works for that one.

 

Such as the first image in a particular div, using a certain id, name or class.
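
A minimal sketch of that idea, in Python and using only the standard library parser. The class and function names are made up for this example, it matches on id only, and it assumes reasonably well-formed markup (unclosed tags such as a bare <p> would throw off the depth counter).

```python
from html.parser import HTMLParser


class FirstImageInContainer(HTMLParser):
    """Remember the src of the first <img> inside the element with a given id."""

    # Void elements have no closing tag, so they must not change the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0          # 0 means we are outside the container
        self.image_src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Still looking for the container itself.
            if attrs.get("id") == self.container_id and tag not in self.VOID:
                self.depth = 1
            return
        if tag == "img" and self.image_src is None:
            self.image_src = attrs.get("src")
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in self.VOID:
            self.depth -= 1


def first_image(html, container_id):
    """Return the src of the first image in the container, or None."""
    parser = FirstImageInContainer(container_id)
    parser.feed(html)
    return parser.image_src
```

Per site you would then only store the container id (for example 'content' or 'main') and call first_image(page_html, that_id).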

 

Downloading images for their sizes is horrible, even if it's just partial downloads.

 

Only consider an image if it's from their same domain, excluding any 3rd party linked images.

 

You can make a list to exclude domains/subdomains of advertising sites or anything else not desired.
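
Roughly like this, sketched in Python. The blocklist entries are placeholders rather than real ad hosts, and the function names are just for illustration. The same-domain check is a switch so it can be turned off for sites that serve their images from another host.

```python
from urllib.parse import urljoin, urlparse

# Placeholder blocklist; real entries would come from the sites you scrape.
BLOCKED_DOMAINS = {"ads.example.com", "adserver.example.net"}


def is_blocked(host, blocked=BLOCKED_DOMAINS):
    """True if host is a blocked domain or a subdomain of one."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_images(page_url, image_urls, same_domain_only=False):
    """Resolve image URLs against the page and drop unwanted hosts."""
    page_host = (urlparse(page_url).hostname or "").lower()
    kept = []
    for src in image_urls:
        absolute = urljoin(page_url, src)   # handles relative src values
        host = (urlparse(absolute).hostname or "").lower()
        if is_blocked(host):
            continue
        if same_domain_only and host != page_host:
            continue
        kept.append(absolute)
    return kept
```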

 

Certainly opengraph is not the best way to go about it.

Have a look at this parser script I made.

http://dynainternet.com/dynavid/og-oembed/?url=http%3A%2F%2Fforums.phpfreaks.com%2Ftopic%2F288157-grabscrape-most-relevant-image-from-website%2F

 

Taking this thread's page as an example, surely the image that opengraph gives for it isn't the most relevant one:

http://forums.phpfreaks.com/public/style_images/phpfreaks/meta_image.png


 

You mentioned that partially downloading images is horrible. Is there any other way to get their size?

 

I am aware that websites are not built uniformly; there is no standard for how content is structured. Also, excluding 3rd party links is not an option, because some websites pull their images from a completely different domain.
