
[SOLVED] HTML Extraction using PHP


cry of war


I wanted to try something new in my quest to learn PHP and the many uses it has.

 

I want to steal a website's information... (not as bad as it sounds, and only if it is even possible)

 

I want to be able to take a given URL selected from a SQL query result, extract the information from that website, and save it to the server, without having to look the website up again and retype its information every day.

 

Find website => insert website into server => run cron job or PHP file manually => store information from the website, such as pricing

 

Kind of like Google's bot, except that it doesn't index everything, only certain sites and selections.

 

And if you read this and don't know the answer, but do know that it can't be done, could you please tell me? It would be a great help.


This is definitely possible. I've written parsers for several websites that did not have RSS feeds or some other method for retrieving data. Depending on your server configuration, you can either use fopen() to grab the contents of the web page (this needs allow_url_fopen enabled), or you may have to use cURL (http://us3.php.net/curl). After you have grabbed the contents of the web page, you can use regular expressions to parse out the relevant info.
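To make the fetch-then-parse idea above concrete, here is a rough sketch. The function names fetch_page() and extract_prices(), the 10-second timeout, and the <span class="price"> markup are all made up for illustration; the real pattern depends entirely on the HTML of the site being read.

```php
<?php
// Hypothetical helper: fetch a page as a string, preferring cURL when it is loaded.
function fetch_page(string $url): ?string
{
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed limit: give up after 10 seconds
        $body = curl_exec($ch);
        return $body === false ? null : $body;
    }
    // Fallback: only works when allow_url_fopen is enabled in php.ini.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

// Hypothetical parser: collect every price from markup like <span class="price">$19.99</span>.
function extract_prices(string $html): array
{
    preg_match_all('~<span class="price">\s*\$?([0-9]+(?:\.[0-9]{2})?)\s*</span>~i', $html, $matches);
    return $matches[1]; // captured numbers only, without the dollar sign
}

// Usage (not run here because it needs network access):
// $html = fetch_page('http://www.blah.com');
// if ($html !== null) { print_r(extract_prices($html)); }
```

Keeping the fetch and the parse in separate functions means the fetching method can be swapped without touching the regex.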


So I would do something like:

 

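<?php
// fopen() returns a stream handle, not the page text, so read the stream into a string.
// The URL needs its scheme, and allow_url_fopen must be enabled.
$handle = fopen("http://www.blah.com", "r");
$info = stream_get_contents($handle);
fclose($handle);

preg_match("/the regex/", $info, $matches);
print_r($matches);
/* or insert into server */
?>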

 

 

The last question I have about this: how much of a slowdown would this cause on the server I'm trying to get the information from?


If cURL is available, it will be noticeably faster than fopen() or file_get_contents().

 

cURL was made for doing exactly what you want; fopen() and file_get_contents() were made for reading files on the server, so to speak.

 

As to the lag, it depends on the distance between your server and the other server, and on their connections. It is usually pretty quick, but a server in Texas connecting to a server in England, say, will be slower than one in Texas hitting a server in California.

 

As divadiva said, I would definitely use cURL over the latter, if possible.


www.php.net/curl

 

I would first run phpinfo(); to see if your server even allows cURL (some servers do not, for security reasons).

 

If it does, then I bet googling cURL and web page fetching will bring up some good results.

 

If you just want to get it done, and do not really care about saving a few seconds, I would just do it the way you know how and learn cURL at your convenience, when you need it.
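As an alternative to reading through the whole phpinfo() page, a script can check for itself which fetching methods are usable. This is only a sketch; the function name available_fetch_methods() is made up here.

```php
<?php
// Hypothetical helper: list which of the fetching methods from this thread work on this server.
function available_fetch_methods(): array
{
    $methods = [];
    if (extension_loaded('curl')) {
        $methods[] = 'curl';
    }
    // allow_url_fopen must be on for fopen()/file_get_contents() to open http:// URLs.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $methods[] = 'fopen';
    }
    return $methods;
}

// Usage: print_r(available_fetch_methods());
```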


Honestly, when I benchmarked fopen against cURL, cURL was usually a second or two quicker at retrieving the data from the web page. I was fetching Yahoo movie times and displaying them to users in my own format.

 

Just getting the initial data from Yahoo took about 2-3 seconds with fopen and about 1-2 seconds with cURL. It does save time, but if you need it working ASAP, go with what you know; you can always change how the file is fetched later with a few lines of code. The rest of the code should stay the same.
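If you want to repeat this kind of comparison on your own server, a small timing helper along these lines would do. It is a sketch only: the name average_seconds() and the default of three runs are my own choices, and the results will vary with the network.

```php
<?php
// Hypothetical helper: call a fetcher several times and return the average seconds per call.
function average_seconds(callable $fetcher, int $runs = 3): float
{
    $runs = max(1, $runs); // avoid dividing by zero
    $total = 0.0;
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        $fetcher();
        $total += microtime(true) - $start;
    }
    return $total / $runs;
}

// Usage (not run here because it needs network access):
// $viaFopen = average_seconds(function () { file_get_contents('http://www.blah.com'); });
// $viaCurl  = average_seconds(function () {
//     $ch = curl_init('http://www.blah.com');
//     curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//     curl_exec($ch);
// });
// printf("fopen wrappers: %.2fs, cURL: %.2fs\n", $viaFopen, $viaCurl);
```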

