wannabe21 Posted May 1, 2014

Hi,

This is my first post, and I would like to kick it off with a question. I am trying to write a function that grabs the most relevant image from a website, but I have run into a few problems that I need some help with.

To understand what I am trying to accomplish, I suggest reading the article on shareaholic, which describes in depth how to grab the most relevant image from a website. Most of the functionality described in that article already works in my own program:

- Only 32 KB of each image is fetched, enough to get the headers so that I can calculate the width, height, and aspect ratio.
- The array is sorted with the biggest images at the top and the smallest at the bottom.
- All OG meta tags and Twitter tags are found and stored in an array.
- Todo: compile a list of the most-used DIV IDs for content; WordPress uses 'content', other CMSs use 'main', etc.

With this information I should be able to grab the most relevant image from most websites, BUT...

The problem I encounter is that the biggest image is not always the most relevant one. On top of that, some ad images on certain websites are quite big and have the perfect aspect ratio, so I get wrong results.

Has anyone here ever tried to do the same thing? If so, how did you work around 'my' problem? Perhaps my approach is completely wrong.

Thanks in advance,
W//
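Editorial sketch of the header-reading step described in the post above. For some formats the dimensions sit at fixed offsets near the start of the file, so a partial download can be enough. This is a minimal Python illustration, not wannabe21's actual code: the function names `image_size_from_header` and `sort_largest_first` are hypothetical, and only PNG and GIF are handled. JPEG keeps its dimensions in a marker segment that has to be searched for, which is left out here.

```python
import struct


def image_size_from_header(data):
    """Return (width, height) from the leading bytes of a PNG or GIF, else None."""
    # PNG: 8-byte signature, then the IHDR chunk (4-byte length, 4-byte type),
    # followed by width and height as big-endian 32-bit unsigned integers.
    if len(data) >= 24 and data[:8] == b"\x89PNG\r\n\x1a\n" and data[12:16] == b"IHDR":
        width, height = struct.unpack(">II", data[16:24])
        return width, height
    # GIF: 6-byte signature, then logical screen width and height as
    # little-endian 16-bit unsigned integers.
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        width, height = struct.unpack("<HH", data[6:10])
        return width, height
    # Unknown or unsupported format (JPEG is deliberately not handled here).
    return None


def sort_largest_first(candidates):
    """Sort (url, (width, height)) pairs by pixel area, biggest first."""
    return sorted(candidates, key=lambda item: item[1][0] * item[1][1], reverse=True)
```

As the post itself points out, sorting by size alone will still rank a large ad banner above the real content image, so this only covers the measuring step.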
QuickOldCar Posted May 1, 2014

You're never going to make a script that gets it right every time; there is no set way sites do it. The best thing to do is figure it out for each site you are visiting and have your script use whichever pattern-matching method suits that one, such as the first image in a particular div, or one with a certain id, name, or class.

Downloading images for their sizes is horrible, even if it's just partial downloads.

Only consider an image if it's from the same domain, excluding any 3rd-party linked images. You can make a list to exclude domains/subdomains from advertising sites or anything else not desired.

Open Graph is certainly not the best way to go about it. Have a look at this parser script I made:
http://dynainternet.com/dynavid/og-oembed/?url=http%3A%2F%2Fforums.phpfreaks.com%2Ftopic%2F288157-grabscrape-most-relevant-image-from-website%2F

Using this thread's page as an example, surely this isn't the most relevant image, yet it is what Open Graph returns:
http://forums.phpfreaks.com/public/style_images/phpfreaks/meta_image.png
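Editorial sketch of the domain filtering suggested in the post above. This is a minimal Python illustration under stated assumptions: the entries in `BLOCKED_HOSTS` are placeholders rather than a vetted list of ad hosts, and the helper names (`is_blocked`, `is_same_site`, `filter_candidates`) are hypothetical.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical blocklist: placeholder hosts, not real advertising domains.
BLOCKED_HOSTS = {"ads.example.com", "adserver.example.net"}


def _host(url):
    """Lower-cased hostname of a URL, or an empty string if it has none."""
    return (urlparse(url).hostname or "").lower()


def is_blocked(image_url, blocked_hosts=BLOCKED_HOSTS):
    """True if the image host is a blocked host or a subdomain of one."""
    host = _host(image_url)
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


def is_same_site(image_url, page_url):
    """True if the image is served from exactly the same host as the page."""
    return _host(image_url) != "" and _host(image_url) == _host(page_url)


def filter_candidates(image_urls, page_url, same_site_only=False):
    """Resolve relative URLs against the page, then drop unwanted images."""
    kept = []
    for raw in image_urls:
        url = urljoin(page_url, raw)
        if is_blocked(url):
            continue
        if same_site_only and not is_same_site(url, page_url):
            continue
        kept.append(url)
    return kept
```

The `same_site_only` switch is off by default, so images hosted on a separate domain survive unless their host is on the blocklist.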
wannabe21 (Author) Posted May 1, 2014

You mentioned that partially downloading images is horrible. Is there any other way to get their size?

I am aware that websites are not uniformly built; there is no standard for how content is structured.

Also, excluding 3rd-party links is not an option, because some websites pull their images from a completely different domain.
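Editorial sketch relating to the size question above. One option that needs no image download is to read the `width` and `height` attributes that some pages declare on their `<img>` tags. The Python below is a minimal illustration, not a method proposed in the thread: the `ImgSizeCollector` class and `_to_int` helper are hypothetical names, many pages omit these attributes or size images through CSS, and the declared values may differ from the real file dimensions.

```python
from html.parser import HTMLParser


def _to_int(value):
    """Convert a plain pixel count such as "300" to int; anything else gives None."""
    if value is None:
        return None
    value = value.strip()
    return int(value) if value.isdecimal() else None


class ImgSizeCollector(HTMLParser):
    """Collect (src, width, height) for every <img> tag that has a src."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if not src:
            return
        # Missing or non-numeric sizes (for example "100%") are stored as None.
        self.images.append(
            (src, _to_int(attributes.get("width")), _to_int(attributes.get("height")))
        )
```

Images that come back with `None` for either dimension would still need another way of being measured, such as the partial download discussed earlier in the thread.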