yfede Posted August 23, 2015

Hi. I am trying to read a website with a PHP script, but I am not sure where to start. What it needs to do:

1) Log in to http://example.com/Login.asp with a username and password.

2) Read from a text file on my server the list of links it should visit, e.g. shop.example.com/Productdetail.asp?SID=126&ProductID=8638

3) Read different kinds of information, e.g. <label for="6267">740071</label>, where it reads the 740071 and saves it to a row in my MySQL database (e.g. fd_database / `fd_info` / reference). Another piece of information could be <strong class="example_thing">5,70 m</strong>, where it reads the 5,70 and saves it to fd_database / `fd_info` / reference, so both pieces of information end up in the same table.

After that, go back to 2) until there are no more links in the text file.

Thanks for reading.
QuickOldCar Posted August 23, 2015

You want to crawl and scrape a website. You should ask the site owners for permission, or see if there is a better way to obtain the data, such as an API or a feed.

1) Log in to http://example.com/Login.asp with username and password.

Use curl to connect and log in; pass the parameters and values with CURLOPT_POSTFIELDS. There are examples at the curl link.

2) Read from a text file on my server the list of links it should visit, e.g. shop.example.com/Productdetail.asp?SID=126&ProductID=8638

The best way is to scrape all href links on a page and store them in a database, then scrape more from each of those links until no more can be found. You can use parse_url or some sort of string match to determine whether a link belongs to their domain. If you are doing one site and know the patterns for pagination, or the exact links, then it's best to write something that does just that.

Store all URLs in a database, use a unique constraint, and mark them visited, such as 1/0 for true/false. You then fetch the earliest unvisited entry by timestamp until none are left. To sum it up: you initially visit a page; every time the script runs it scrapes all links on that page, plus the data you want from that particular page, then fetches a new URL from the database. Using a text file is inefficient, and you also have to remove duplicates yourself.

3) Read different kinds of information, e.g. <label for="6267">740071</label> and <strong class="example_thing">5,70 m</strong>, and save both to the same table.

This is called parsing. You can grab the raw HTML from curl and use various methods; some are better tailored to specific items.

As for the scraping and parsing side:

curl (to me the best way to connect; it can also follow redirects)
file_get_contents (fast and easy; you can create a stream context, but it is still limited in what you can do and will fail a lot)
preg_match or preg_match_all
simplehtmldom
DOM
SimpleXML

You will also have to fix relative URLs and detect and convert/replace character, language and document encoding.
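A minimal sketch of the login step with curl, assuming the form posts fields named username and password to Login.asp (the real field names and any hidden tokens depend on the site's login form, so check the form HTML first); the cookie jar keeps the session for the later product-page requests.

```php
<?php
// Log in with curl and keep the session cookie for later requests.
// NOTE: the field names "username" and "password" are assumptions;
// check the actual <form> on http://example.com/Login.asp.
function login($username, $password)
{
    $ch = curl_init('http://example.com/Login.asp');
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(array(
            'username' => $username,
            'password' => $password,
        )),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,               // follow the redirect after login
        CURLOPT_COOKIEJAR      => '/tmp/cookies.txt', // save the session cookie
        CURLOPT_COOKIEFILE     => '/tmp/cookies.txt', // send it on later requests
    ));
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // the page returned after logging in, or false on failure
}
```

Reusing the same cookie file on every later curl request is what keeps you logged in while fetching the product pages.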
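A sketch of the URL queue described above, assuming a hypothetical table named url_queue (the table and column names are made up for illustration); the unique constraint plus INSERT IGNORE drops duplicates automatically, and the visited flag replaces the text file.

```php
<?php
// Hypothetical queue table:
//   CREATE TABLE url_queue (
//       id       INT AUTO_INCREMENT PRIMARY KEY,
//       url      VARCHAR(500) NOT NULL UNIQUE,
//       visited  TINYINT(1) NOT NULL DEFAULT 0,
//       added_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
//   );
$pdo = new PDO('mysql:host=localhost;dbname=fd_database;charset=utf8', 'user', 'pass');

// Queue a URL; the UNIQUE constraint + INSERT IGNORE silently skips duplicates.
function queueUrl(PDO $pdo, $url)
{
    $stmt = $pdo->prepare('INSERT IGNORE INTO url_queue (url) VALUES (?)');
    $stmt->execute(array($url));
}

// Fetch the oldest URL that has not been visited yet, or null when the queue is empty.
function nextUrl(PDO $pdo)
{
    $row = $pdo->query(
        'SELECT id, url FROM url_queue WHERE visited = 0 ORDER BY added_at ASC LIMIT 1'
    )->fetch(PDO::FETCH_ASSOC);
    return $row ?: null;
}

// Mark a URL as done so it is not fetched again.
function markVisited(PDO $pdo, $id)
{
    $stmt = $pdo->prepare('UPDATE url_queue SET visited = 1 WHERE id = ?');
    $stmt->execute(array($id));
}
```

The main loop then becomes: take nextUrl(), fetch the page with curl, queue any new links you find, scrape the data, and markVisited() before moving on.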
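And a sketch of the parsing step with DOMDocument/DOMXPath, assuming the product page really contains the <label for="6267"> and <strong class="example_thing"> elements from the example markup; the fd_info column names (reference, length_m) are placeholders for whatever the actual schema uses.

```php
<?php
// Parse one product page and store both values in the same fd_info row.
function scrapeProduct(PDO $pdo, $html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);          // suppress warnings from messy real-world HTML
    $xpath = new DOMXPath($doc);

    // <label for="6267">740071</label>  ->  "740071"
    $label = $xpath->query('//label[@for="6267"]')->item(0);
    $reference = $label ? trim($label->textContent) : null;

    // <strong class="example_thing">5,70 m</strong>  ->  "5.70"
    $strong = $xpath->query('//strong[@class="example_thing"]')->item(0);
    $length = null;
    if ($strong) {
        // strip the unit and convert the decimal comma to a dot
        $length = str_replace(',', '.', trim(str_replace('m', '', $strong->textContent)));
    }

    // Column names are assumptions; adjust them to the real fd_info table.
    $stmt = $pdo->prepare('INSERT INTO fd_info (reference, length_m) VALUES (?, ?)');
    $stmt->execute(array($reference, $length));
}
```

If the label's for attribute changes per product, match on something more stable (a surrounding element or a class) instead of the hard-coded 6267.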
yfede (Author) Posted August 24, 2015

Hi, I have already asked the owner, but they don't have an API I can use :/ But thank you for the help, I will try to start from here.
CroNiX Posted August 24, 2015

Did the owner give you (written) permission to take their copyrighted material and use it for your own purposes?
yfede (Author) Posted August 24, 2015

Yes, I have an agreement with the owner of the site.