Read website to mysql


yfede

Hi. 

 

I am trying to read a website with a PHP script, but I am not sure where to start.

 

What it needs to do:

 

1) Log in to http://example.com/Login.asp with a username and password.

 

2) Read, from a text file on my server, the list of links it should visit, e.g. shop.example.com/Productdetail.asp?SID=126&ProductID=8638

 

3) Read different kinds of information, e.g. <label for="6267">740071</label>, where it extracts the 740071 and saves it to a row in my MySQL database, e.g.

fd_database.`fd_info`.reference

Another piece of information could be <strong class="example_thing">5,70 m</strong>, where it extracts the 5,70 and saves it to fd_database.`fd_info`.reference as well,

so both values end up in the same table.

 

After that, go back to 2) until there are no more links in the text file.

 

Thanks for reading :)

 


You want to crawl and scrape a website.

You should ask the site owner's permission, or see if there is a better way to obtain the data, such as an API or a feed.

 

1) Log in to http://example.com/Login.asp with a username and password.

Use curl to connect and log in, passing the form parameters and values with CURLOPT_POSTFIELDS.

There are examples at the curl link.
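A minimal sketch of that login step, assuming the form field names ("username" and "password") and the hypothetical helper name `login_with_curl` — check the actual `<form>` on Login.asp for the real field names:

```php
<?php
// Sketch: log in via curl and keep the session cookie for later requests.
// The field names 'username'/'password' are assumptions, not from the real site.
function login_with_curl(string $loginUrl, string $user, string $pass, string $cookieJar)
{
    $ch = curl_init($loginUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query([
            'username' => $user,   // assumed field name
            'password' => $pass,   // assumed field name
        ]),
        CURLOPT_COOKIEJAR      => $cookieJar,  // session cookie is saved here...
        CURLOPT_COOKIEFILE     => $cookieJar,  // ...and sent on later requests
        CURLOPT_FOLLOWLOCATION => true,        // follow the post-login redirect
        CURLOPT_RETURNTRANSFER => true,        // return the html instead of printing it
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
```

Reusing the same cookie jar file on every later curl request is what keeps you logged in across the product pages.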

 

2) Read, from a text file on my server, the list of links it should visit, e.g. shop.example.com/Productdetail.asp?SID=126&ProductID=8638

The usual approach is to scrape all href links on a page and store them in a database; from each of those links you scrape more, until no new links are found.

You can use parse_url or some sort of string match to determine whether a link is from their domain.

If you are scraping just one site and know the patterns for pagination or the exact links, then it's best to write something that does just that.

Store all URLs in a database, use a unique constraint, and mark them visited with a 1/0 flag for true/false.

You then fetch the unvisited URL with the earliest timestamp, until there are none left.

 

To sum it up: you initially visit a page; every time the script runs it scrapes all the links on that page, plus the data you want from that particular page, then fetches a new URL from the database.
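The URL queue described above can be sketched like this; SQLite is used here only so the example is self-contained (swap the DSN and use INSERT IGNORE for MySQL), and the table and column names are illustrative:

```php
<?php
// Sketch of the url queue: unique constraint for deduplication, 1/0 visited flag.
$db = new PDO('sqlite::memory:');
$db->exec("CREATE TABLE urls (
    url      TEXT PRIMARY KEY,    -- unique constraint: no duplicate urls
    visited  INTEGER DEFAULT 0,   -- 1/0 flag for true/false
    added_at INTEGER
)");

// Queue a url; the unique constraint silently rejects duplicates.
function queue_url(PDO $db, string $url): void
{
    // For MySQL, use INSERT IGNORE instead of SQLite's INSERT OR IGNORE.
    $stmt = $db->prepare("INSERT OR IGNORE INTO urls (url, added_at) VALUES (?, ?)");
    $stmt->execute([$url, time()]);
}

// Fetch the oldest unvisited url and mark it visited; null when none are left.
function next_url(PDO $db): ?string
{
    $row = $db->query("SELECT url FROM urls WHERE visited = 0
                       ORDER BY added_at ASC LIMIT 1")->fetch(PDO::FETCH_ASSOC);
    if (!$row) {
        return null;
    }
    $db->prepare("UPDATE urls SET visited = 1 WHERE url = ?")->execute([$row['url']]);
    return $row['url'];
}
```

Each run of the crawler calls next_url(), scrapes that page, and calls queue_url() for every link it finds; duplicates are dropped by the constraint rather than by your code.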

 

Using a text file is inefficient, and you would also have to remove duplicates yourself.

 

3) Read different kinds of information, e.g. <label for="6267">740071</label>, where it extracts the 740071 and saves it to a row in my MySQL database, e.g.

fd_database.`fd_info`.reference

Another piece of information could be <strong class="example_thing">5,70 m</strong>, where it extracts the 5,70 and saves it to fd_database.`fd_info`.reference as well,

so both values end up in the same table.

This is called parsing: you grab the raw HTML from curl and extract the data with various methods; some are better tailored to specific items.
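A sketch of parsing the two example snippets from the question with PHP's built-in DOM extension (one of the methods listed below); the HTML here is just the fragments from the post wrapped in a page:

```php
<?php
// Sketch: extract the reference number and the length from the example markup.
$html = '<html><body>
    <label for="6267">740071</label>
    <strong class="example_thing">5,70 m</strong>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Grab the text inside the <label> (the reference number).
$reference = trim($xpath->query('//label[@for="6267"]')->item(0)->textContent);

// Grab "5,70 m" and strip the unit, keeping only the number.
$raw    = trim($xpath->query('//strong[@class="example_thing"]')->item(0)->textContent);
$length = trim(str_replace('m', '', $raw));   // "5,70"

// $reference and $length can now be inserted into the same row of fd_info.
```

On the real pages the attribute values will differ per product, so you would build the XPath expressions from the patterns you find in the page source.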

 

As for the scraping data aspect:

curl (to me the best method to connect; it can also follow redirects)

file_get_contents (fast and easy; you can create a stream context, but it is still limited in what you can do and will fail a lot)

preg_match or preg_match_all

simplehtmldom

dom

simplexml

 

You will also have to fix relative URLs, and determine and convert/replace character, language, and document encodings.
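Fixing relative URLs can be sketched with a small helper like the hypothetical `resolve_url` below; it handles only the common cases (absolute URLs, root-relative and plain relative paths), while a production crawler would also need ../ segments and query/fragment rules:

```php
<?php
// Sketch: resolve a scraped relative href against the page url it came from.
function resolve_url(string $base, string $href): string
{
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;                       // already absolute
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if ($href !== '' && $href[0] === '/') {
        return $origin . $href;             // root-relative: /path
    }
    // Plain relative: replace the last path segment of the base url.
    $dir = preg_replace('#/[^/]*$#', '/', $parts['path'] ?? '/');
    return $origin . $dir . $href;
}
```

For example, a scraped href of "Productdetail.asp?ProductID=2" on http://shop.example.com/list/index.asp resolves to http://shop.example.com/list/Productdetail.asp?ProductID=2.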

