cry of war Posted November 10, 2008

I wanted to try something new in my quest to learn PHP and the many uses it has. I want to steal a website's information... (not as bad as it sounds, and that's if it's even possible). I want to be able to take a given URL selected from a SQL table and extract information from that website, saving it to my server, without having to visit the website and retype its information by hand every day.

Find website => insert website into database => run a cron job or PHP file manually => store information from the website, like pricing and other details

Kinda like Google's bot program, but it doesn't index the whole web, only certain sites and sections. And if you read this and you don't know the answer, but you know this can't be done, could you please tell me; it would be a great help.
flyhoney Posted November 10, 2008

This is definitely possible. I've written parsers for several websites when they did not have RSS feeds or some other method for retrieving data. Depending on your server configuration, you can either use fopen() to grab the contents of the webpage, or you may have to use cURL (http://us3.php.net/curl). After you have grabbed the contents of the web page, you can use regular expressions to parse out the relevant info.
cry of war Posted November 10, 2008

so I would do something like:

<?php
// file_get_contents() returns the whole page as a string
// (fopen() by itself only gives you a file handle, not the contents)
$info = file_get_contents("http://www.blah.com");
preg_match('/the regex/', $info, $matches);
print_r($matches); /* or insert into the database */
?>

The last question I have about this is: how much of a load would this put on the server I'm trying to get the information from?
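And for the full thing I guess I would loop over the URLs from my database, something like this (the table and column names here are all made up, and the price pattern is just a placeholder):

<?php
// Rough sketch only -- the sites/prices tables and their columns are invented
$db = mysqli_connect('localhost', 'user', 'pass', 'mydb');

$result = mysqli_query($db, 'SELECT id, url FROM sites');
while ($row = mysqli_fetch_assoc($result)) {
    $html = file_get_contents($row['url']); // grab the raw page

    // placeholder pattern: grab the first thing that looks like $19.99
    if (preg_match('/\$(\d+\.\d{2})/', $html, $m)) {
        $price = mysqli_real_escape_string($db, $m[1]);
        mysqli_query($db, "INSERT INTO prices (site_id, price) VALUES ({$row['id']}, '$price')");
    }
}
?>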
divadiva Posted November 10, 2008

Use cURL and regular expressions. It can be of great help. Basic steps: 1) Load the HTML 2) Use regular expressions 3) Store the data in the database. Hope it works!!
flyhoney Posted November 10, 2008

Presumably, downloading the page using fopen() has the same effect on the server as viewing the page in a browser.
premiso Posted November 10, 2008

If cURL is available it will be a ton faster than fopen or file_get_contents. cURL was made for doing exactly what you want; fopen and file_get_contents were made for reading local files on a server, so to speak.

As to the lag, it depends on your server's distance to the other server and their connections. It is usually pretty quick, but say a server in Texas trying to connect to a server in England will be slower than Texas hitting a server in California. As divadiva said, I would definitely use cURL over fopen/file_get_contents, if possible.
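A bare-bones cURL fetch looks something like this (the URL is just an example):

<?php
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of echoing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // don't hang forever on a slow server
$html = curl_exec($ch);
curl_close($ch);
?>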
cry of war Posted November 10, 2008

ok, you say cURL, but how would I go about using that? I really don't know what it is... (a bit noobish aside from some PHP)
premiso Posted November 10, 2008

www.php.net/curl

I would first do a phpinfo(); to see if your server even allows cURL (some servers do not, for security). If it does, then I bet googling "cURL" and "webpage fetching" will bring up some good results.

If you just want to get it done, and do not really care about a few seconds of saved speed, I would just do it the way you know how and learn cURL at your convenience, when you need it.
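You can also test for it in code rather than reading through the phpinfo() output:

<?php
// Quick check for cURL support
if (function_exists('curl_init')) {
    echo 'cURL is available';
} else {
    echo 'no cURL -- fall back to fopen()/file_get_contents()';
}
?>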
cry of war Posted November 10, 2008

now when you say a few seconds, how many webpages are you talking about? the website I'm trying to index has 50,000+ items on different pages, always changing
premiso Posted November 10, 2008

Honestly, when I did my bench testing of fopen vs. cURL, cURL was usually a second or two quicker at retrieving the data from the webpage. I was fetching Yahoo movie times and displaying them to the users in our own format. Just getting the initial data from Yahoo with fopen took about 2-3 seconds depending; with cURL it took about 1-2 seconds.

It does save time, but if you need it working ASAP, go with what you know; you can always change how the file is fetched later on with a few lines of code. The rest of the code will/should stay the same.
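If you want rough numbers for your own setup, you can time the two approaches yourself (example URL; network conditions will dominate the results):

<?php
$url = 'http://www.example.com/';

// time file_get_contents()
$start = microtime(true);
$html = file_get_contents($url);
printf("file_get_contents: %.2f seconds\n", microtime(true) - $start);

// time cURL fetching the same page
$start = microtime(true);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
printf("cURL: %.2f seconds\n", microtime(true) - $start);
?>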
nitation Posted November 10, 2008

cURL is one of the solutions.
cry of war Posted November 10, 2008

alright, thank you guys for the quick responses. now it's just time to learn to make the regex work properly