Read website to mysql


yfede

Hi. 

 

I am trying to read a website with a PHP script, but I am not sure where to start.

 

What it needs to do:

 

1) Log in to http://example.com/Login.asp with a username and password.

 

2) Read, from a text file on my server, the list of links it should visit, e.g. shop.example.com/Productdetail.asp?SID=126&ProductID=8638

 

3) Read different kinds of information, e.g. <label for="6267">740071</label>, where it extracts the 740071 and saves it to a row in my MySQL database, e.g.

fd_database.`fd_info`.reference

Another piece of information could be <strong class="example_thing">5,70 m</strong>, where it extracts the 5,70 and saves it to fd_database.`fd_info`.reference as well,

so both values end up in the same table.

 

After that, go back to 2) until there are no more links in the text file.

 

Thanks for reading :)

 


You want to crawl and scrape a website.

You should ask the site owner's permission, or see if there is a better way to obtain the data, such as an API or a feed.

 

1) Log in to http://example.com/Login.asp with a username and password.

Use curl to connect and log in, passing the form parameters and values with CURLOPT_POSTFIELDS.

There are examples at the curl link.
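A minimal sketch of that login step, assuming the form field names ("username" and "password") and the hypothetical helper name `login_with_curl` — check the actual `<form>` on Login.asp for the real field names:

```php
<?php
// Sketch: log in via curl and keep the session cookie for later requests.
// The field names 'username'/'password' are assumptions, not from the real site.
function login_with_curl(string $loginUrl, string $user, string $pass, string $cookieJar)
{
    $ch = curl_init($loginUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query([
            'username' => $user,   // assumed field name
            'password' => $pass,   // assumed field name
        ]),
        CURLOPT_COOKIEJAR      => $cookieJar,  // session cookie is saved here...
        CURLOPT_COOKIEFILE     => $cookieJar,  // ...and sent on later requests
        CURLOPT_FOLLOWLOCATION => true,        // follow the post-login redirect
        CURLOPT_RETURNTRANSFER => true,        // return the html instead of printing it
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
```

Reusing the same cookie jar file on every later curl request is what keeps you logged in across the product pages.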

 

2) Read, from a text file on my server, the list of links it should visit, e.g. shop.example.com/Productdetail.asp?SID=126&ProductID=8638

The usual approach is to scrape all href links on a page and store them in a database; from each of those links you scrape more, until no new links are found.

You can use parse_url or some sort of string match to determine whether a link is from their domain.

If you are scraping just one site and know the patterns for pagination or the exact links, then it's best to write something that does just that.

Store all URLs in a database, use a unique constraint, and mark them visited with a 1/0 flag for true/false.

You then fetch the unvisited URL with the earliest timestamp, until there are none left.

 

To sum it up: you initially visit a page; every time the script runs it scrapes all the links on that page, plus the data you want from that particular page, then fetches a new URL from the database.
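The URL queue described above can be sketched like this; SQLite is used here only so the example is self-contained (swap the DSN and use INSERT IGNORE for MySQL), and the table and column names are illustrative:

```php
<?php
// Sketch of the url queue: unique constraint for deduplication, 1/0 visited flag.
$db = new PDO('sqlite::memory:');
$db->exec("CREATE TABLE urls (
    url      TEXT PRIMARY KEY,    -- unique constraint: no duplicate urls
    visited  INTEGER DEFAULT 0,   -- 1/0 flag for true/false
    added_at INTEGER
)");

// Queue a url; the unique constraint silently rejects duplicates.
function queue_url(PDO $db, string $url): void
{
    // For MySQL, use INSERT IGNORE instead of SQLite's INSERT OR IGNORE.
    $stmt = $db->prepare("INSERT OR IGNORE INTO urls (url, added_at) VALUES (?, ?)");
    $stmt->execute([$url, time()]);
}

// Fetch the oldest unvisited url and mark it visited; null when none are left.
function next_url(PDO $db): ?string
{
    $row = $db->query("SELECT url FROM urls WHERE visited = 0
                       ORDER BY added_at ASC LIMIT 1")->fetch(PDO::FETCH_ASSOC);
    if (!$row) {
        return null;
    }
    $db->prepare("UPDATE urls SET visited = 1 WHERE url = ?")->execute([$row['url']]);
    return $row['url'];
}
```

Each run of the crawler calls next_url(), scrapes that page, and calls queue_url() for every link it finds; duplicates are dropped by the constraint rather than by your code.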

 

Using a text file is inefficient, and you would also have to remove duplicates yourself.

 

3) Read different kinds of information, e.g. <label for="6267">740071</label>, where it extracts the 740071 and saves it to a row in my MySQL database, e.g.

fd_database.`fd_info`.reference

Another piece of information could be <strong class="example_thing">5,70 m</strong>, where it extracts the 5,70 and saves it to fd_database.`fd_info`.reference as well,

so both values end up in the same table.

This is called parsing: you grab the raw HTML from curl and extract the data with various methods; some are better tailored to specific items.
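A sketch of parsing the two example snippets from the question with PHP's built-in DOM extension (one of the methods listed below); the HTML here is just the fragments from the post wrapped in a page:

```php
<?php
// Sketch: extract the reference number and the length from the example markup.
$html = '<html><body>
    <label for="6267">740071</label>
    <strong class="example_thing">5,70 m</strong>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Grab the text inside the <label> (the reference number).
$reference = trim($xpath->query('//label[@for="6267"]')->item(0)->textContent);

// Grab "5,70 m" and strip the unit, keeping only the number.
$raw    = trim($xpath->query('//strong[@class="example_thing"]')->item(0)->textContent);
$length = trim(str_replace('m', '', $raw));   // "5,70"

// $reference and $length can now be inserted into the same row of fd_info.
```

On the real pages the attribute values will differ per product, so you would build the XPath expressions from the patterns you find in the page source.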

 

As for the scraping data aspect:

curl (to me the best method to connect; it can also follow redirects)

file_get_contents (fast and easy; you can create a stream context, but it is still limited in what you can do and will fail a lot)

preg_match or preg_match_all

simplehtmldom

dom

simplexml

 

You will also have to fix relative URLs, and determine and convert/replace character, language, and document encodings.
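Fixing relative URLs can be sketched with a small helper like the hypothetical `resolve_url` below; it handles only the common cases (absolute URLs, root-relative and plain relative paths), while a production crawler would also need ../ segments and query/fragment rules:

```php
<?php
// Sketch: resolve a scraped relative href against the page url it came from.
function resolve_url(string $base, string $href): string
{
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;                       // already absolute
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if ($href !== '' && $href[0] === '/') {
        return $origin . $href;             // root-relative: /path
    }
    // Plain relative: replace the last path segment of the base url.
    $dir = preg_replace('#/[^/]*$#', '/', $parts['path'] ?? '/');
    return $origin . $dir . $href;
}
```

For example, a scraped href of "Productdetail.asp?ProductID=2" on http://shop.example.com/list/index.asp resolves to http://shop.example.com/list/Productdetail.asp?ProductID=2.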

