Akkari Posted October 30, 2012

Hello everyone,

It's been a while since I last posted. I've successfully retrieved external HTML pages using cURL with the following code:

    $ch = curl_init("http://www.site.com/");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
    $content = curl_exec($ch);
    curl_close($ch);

I then tried adding the following check. I suspected there was more to it than this, but tried it anyway:

    if(strpos($content,"string_to_search_for") == false)
        echo "Not found.";
    else
        echo "Found.";

This returned "Not found" every time, even when the string was present in the page's source code.

On a side note, I will be using this to check hundreds, potentially thousands, of websites for that string in their source code. How long do you think execution would take for, say, 1000 URLs? And is there a better approach to speed things up?

Thanks!

Link to comment: https://forums.phpfreaks.com/topic/270094-looking-up-a-string-in-an-html-source/

sumpygump Posted October 30, 2012

Depending on the string you are searching for and its context within the site, strpos might not be the best solution, because white space and other differences in the string could lead to missed matches: for example, spelling typos, missing or different punctuation, extra spaces, or other white space including line breaks. A regex might be a better bet, but it still leaves room for false negatives.

What is the nature of the strings you are searching for?

This is one of the problems, if not the main problem, with the usefulness and reliability of "screen scraping."

ManiacDan Posted October 30, 2012

Are you sure the content is correct? Some sites will present different content to non-browsers like curl. Snoopy() is a PHP browser which can masquerade as a "real" browser.

Akkari (Author) Posted October 31, 2012

Thanks a lot for the response guys.

@sumpygump The string is "/images/" (without the quotes of course) which occurs within URLs referencing the images folder of the website. So usually it'll occur as part of an image URL displayed on the page.

@ManiacDan I think your suggestion might be useful in other situations so I'm looking it up now. However, in this particular situation I echoed out $content and it displayed the target site perfectly.

Appreciate your responses, guys!

haku Posted October 31, 2012

I don't have the solution to your problem, but note that you should generally use this:

    if(strpos($content,"string_to_search_for") === false)

Otherwise you could get a false reading if the string you are searching for is at the start of the string being searched. This is because strpos returns 0 when the match is at the start, and 0 evaluates to FALSE in a loose comparison. Using three = signs ensures that it's an actual FALSE, and not something that merely evaluates to FALSE.
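
A minimal sketch of the pitfall described above. The $content value here is made up for illustration (it is not the page from the original post) and deliberately places the needle at offset 0:

```php
<?php
// Hypothetical page content: the needle sits at the very start (offset 0).
$content = "/images/logo.png is referenced here";
$needle  = "/images/";

$pos = strpos($content, $needle);   // int 0, not false

// Loose comparison: 0 == false is true, so a real match is reported as missing.
$looseSaysMissing = ($pos == false);

// Strict comparison: only an actual boolean false means "not found".
$strictSaysMissing = ($pos === false);

echo $looseSaysMissing ? "loose: Not found.\n" : "loose: Found.\n";
echo $strictSaysMissing ? "strict: Not found.\n" : "strict: Found.\n";
```

With the needle anywhere other than offset 0, both comparisons would agree, which is why the loose form can appear to work on many pages and still be wrong on some.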

ManiacDan Posted October 31, 2012

strpos is case sensitive. Is the target string /Images/?

haku Posted October 31, 2012

That's a good point. It may be better to use:

    if(stripos($content,"string_to_search_for") === false)
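
To illustrate the difference, here is a small sketch using an invented snippet of markup (an assumption, not taken from the site in question) where the folder name is capitalised:

```php
<?php
// Hypothetical markup: the folder name is capitalised in the page source.
$content = '<img src="http://www.site.com/Images/banner.jpg">';

// Case-sensitive search: "/images/" does not match "/Images/".
$caseSensitive = strpos($content, "/images/");

// Case-insensitive search: matches and returns the offset of the match.
$caseInsensitive = stripos($content, "/images/");

var_dump($caseSensitive === false);    // true when the sensitive search found nothing
var_dump($caseInsensitive !== false);  // true when the insensitive search found a match
```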

Akkari (Author) Posted October 31, 2012

Thanks for the input, everyone. I managed to solve the issue by changing the website I was testing the script on. It turned out that the original test website, http://site.com, had a 3xx redirect to http://www.site.com, which made curl fail to retrieve the page.

I will open a new topic for that new issue; hopefully someone will be able to point me in the right direction.

Thanks!

ManiacDan Posted October 31, 2012

So when you echoed the content and it was correct...how did that happen?

Curl can be set to follow redirects with the same setopt you're already using.
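
One way this might look, as a sketch rather than a definitive fix: CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS are standard curl_setopt options, while the helper names fetchFollowingRedirects() and pageContains() and the redirect limit of 5 are assumptions made for this example.

```php
<?php
// Sketch only: fetchFollowingRedirects() is a hypothetical helper name.
// Returns the response body as a string, or false if the request failed.
function fetchFollowingRedirects(string $url, int $maxRedirects = 5)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow 3xx Location headers
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects);  // guard against redirect loops
    return curl_exec($ch);                               // handle is released when $ch goes out of scope
}

// Hypothetical helper: case-insensitive, strict check that also treats a
// failed fetch (false) as "not found" instead of searching a non-string.
function pageContains($content, string $needle): bool
{
    return is_string($content) && stripos($content, $needle) !== false;
}

// Usage (performs a real HTTP request, so it is left commented out here):
// $content = fetchFollowingRedirects("http://site.com/");
// echo pageContains($content, "/images/") ? "Found." : "Not found.";
```

Checking for a false return before searching matters here, since a failed request would otherwise look the same as a page that simply lacks the string.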