Looking Up A String In An Html Source?


Akkari


Hello there everyone,

 

It's been a while since I last posted. I've successfully retrieved external HTML pages using cURL with the following code:

 

$ch = curl_init("http://www.site.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);

 

I then added the following check. I suspected there was more to it than this, but tried it anyway:

 

if(strpos($content,"string_to_search_for") == false) echo "Not found."; else echo "Found.";

 

This returned "Not found." every time for me, even when the string was present in the page's source code.

 

On a side note, I will be using this to evaluate hundreds, potentially thousands, of websites to see if they have that string present in the source code. How long do you think execution time would be for, say, 1,000 URLs? And would there be a better approach to speed things up?

 

Thanks!


Depending on the string you search for and its context within the site, strpos might not be the best solution, because white space and other differences in the string could lead to no match: for example, spelling typos, missing or different punctuation, extra spaces, or other white space such as line breaks.

 

A regex might be a better bet, but it still leaves room for false negatives. What is the nature of the strings you are searching for? This is one of the problems, if not the main problem, with the usefulness and reliability of "screen scraping."
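As a rough sketch of the regex idea, the snippet below does a case-insensitive search for a folder name such as "/images/". The helper name references_images_folder and the sample markup are made up for illustration, and the pattern only tolerates differences in letter case, so other variations can still slip through.

```php
<?php
// Hypothetical helper: reports whether the HTML mentions an images folder.
function references_images_folder(string $html): bool
{
    // "~" is used as the delimiter so the slashes need no escaping;
    // the "i" flag makes the match case-insensitive.
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match('~/images/~i', $html) === 1;
}

// Made-up sample input with different capitalisation.
var_dump(references_images_folder('<img src="/Images/logo.png">'));
var_dump(references_images_folder('<p>no pictures here</p>'));
```

For a fixed, case-sensitive string, strpos (or stripos for case-insensitive searches) is usually enough and cheaper than a regex.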


Thanks a lot for the response guys.

 

@sumpygump

 

The string is "/images/" (without the quotes of course) which occurs within URLs referencing the images folder of the website. So usually it'll occur as part of an image URL displayed on the page.

 

@ManiacDan

 

I think your suggestion might be useful in other situations so I'm looking it up now. However, in this particular situation I echoed out $content and it displayed the target site perfectly.

 

Appreciate your responses, guys!


I don't have the solution to your problem, but note that you should generally use this:

if(strpos($content,"string_to_search_for") === false)

or else you could get a false reading if the string you are searching for is at the start of the string being searched. This is because strpos returns 0 when the match is at the start, and 0 evaluates to FALSE in a loose comparison. Using three = signs ensures that it's an actual FALSE, and not something that evaluates to FALSE.
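To make the difference concrete, here is a small sketch. The sample $content value is made up so that the needle sits at offset 0, which is the case where the two comparisons disagree.

```php
<?php
// Made-up sample input: the needle is at the very start of the haystack.
$content = "/images/logo.png";
$needle  = "/images/";

$pos = strpos($content, $needle);   // integer 0, not false

// Loose comparison: 0 == false is true, so a real match looks like "not found".
$looseSaysMissing = ($pos == false);

// Strict comparison: 0 === false is false, so the match is reported correctly.
$strictSaysMissing = ($pos === false);

var_dump($pos, $looseSaysMissing, $strictSaysMissing);
```

When the needle appears anywhere after the first character, both comparisons give the same answer, which is why the bug is easy to miss.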


Thanks for the input everyone.

 

I managed to solve the issue by changing the test website I was testing the script on. It turned out that the original test website http://site.com had a 3xx redirect to http://www.site.com, which made curl fail to retrieve the website. I will open a new topic for that new issue; hopefully someone will be able to point me in the right direction.
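For what it's worth, one common way to deal with a 3xx redirect is to ask cURL to follow it. The sketch below is only an assumption about what would help here, not something confirmed in this thread; the function name fetch_following_redirects and the redirect cap of 5 are made up for illustration.

```php
<?php
// Hypothetical helper: fetch a page body, following 3xx redirects.
function fetch_following_redirects(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow Location headers on 3xx responses
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);          // assumed cap to avoid redirect loops

    // curl_exec returns the body as a string, or false on failure.
    // The handle is released when $ch goes out of scope.
    return curl_exec($ch);
}
```

It could then be called as $content = fetch_following_redirects("http://site.com/"); and the result checked with === false before searching it.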

 

Thanks!

