Scraping a password protected external site


I need to extract data from one of our suppliers, but their system is password protected. If I could work the login information into the file_get_contents() call, I could get the data with no problem.

 

Any help would be great. Here is the source of the page where I need to pass in the login information.

 

===============================================

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
    <HEAD>
        <title>Welcome to xxxxxxxxxxxxxx.com. Please Login...</title>
        <meta content="Microsoft Visual Studio .NET 7.1" name="GENERATOR">
        <meta content="Visual Basic .NET 7.1" name="CODE_LANGUAGE">
        <meta content="JavaScript" name="vs_defaultClientScript">
        <meta content="http://schemas.microsoft.com/intellisense/ie5" name="vs_targetSchema">
        <script language="JavaScript">
        function setFocus()
              {
                    if (Login.txtUserID.value == '')
                        {
                            Login.txtUserID.focus();
                        }
                    else
                        {
                            Login.txtPassword.focus();
                        }
                }
        </script>

    </HEAD>
    <body link="#e00ee0" bgColor="#ffffff" onload="setFocus();" MS_POSITIONING="GridLayout">
        <form name="Login" method="post" action="https://www.xxxxxxxxxx.com/Catalog/series_detail.asp?varGroup=20&varSeries=14953?hcu=1&hs=1&hm=1" id="Login">
<input type="hidden" name="__VIEWSTATE" value="FdJ0ceC8UWuw3Tw+c2uk+WrgGSSaMeEhJL36a7EtIQdRABdjOpSAApMnrAp6uKmaT9bE5Uq5FqUeMhKPC5N4SvCYRu6arNDFw9Iu3/8aUH/0OlLUkJj3Ub0bEmCoEJ5axcNr1gY/j6wcc+vYXNcCvaP4U4nIVwAEEn48Kwt1EGBLrv6iVPvkkKMt1ADAvtR3RGQmncS3tcmxXs9EiN1V+Niqq1lt4s2v" />

I doubt you can successfully scrape a password-protected page with file_get_contents; you will probably have to use cURL. Also, it looks like the target page is in ASP, which will likely make your life a lot more difficult. You will probably have to make a cURL request to the login page, capturing the cookies as you do so, then post the login details along with the cookies, then request the page(s) that you actually want to scrape.
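
One likely source of that extra difficulty: the form in the first post carries a hidden __VIEWSTATE field, and ASP.NET pages generally expect it to be sent back with the login POST. Below is a minimal sketch of pulling the hidden fields out of the login page HTML, assuming PHP with the DOM extension enabled. extract_hidden_fields is a made-up helper name, not something from the supplier's site.

```php
<?php
// Hypothetical helper: collect the name/value pairs of every hidden <input>
// in a page so they can be posted back together with the credentials.
function extract_hidden_fields(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}
```

The returned array can be merged with the username and password fields before posting, so the server sees the same set of fields a browser would send.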

Just to add: it is *possible* that the site is coded to accept login variables on the URL. There are some that do that with the username and some with the password as well. However, you would have to know if they are allowed and what the parameter names are.

 

Here is what you could try. View the source of the login page and see what the field names are for the username and the password. Then add those names, along with their respective values, to the URL and see if that gives you access to the page without already being logged in.
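
A quick way to test this idea from PHP is sketched below. The field names txtUserID and txtPassword are only a guess taken from the setFocus() script in the first post (the actual input elements are not shown there), and url_with_login and the example URL are made up for illustration.

```php
<?php
// Hypothetical helper: append login parameters to a URL, using "&" when
// the URL already has a query string and "?" otherwise.
function url_with_login(string $url, array $params): string
{
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    // http_build_query() URL-encodes the names and values.
    return $url . $separator . http_build_query($params);
}

// Assumed field names; check the real login form before relying on them.
$url = url_with_login(
    'https://supplier.example/Catalog/series_detail.asp?varGroup=20',
    ['txtUserID' => 'me', 'txtPassword' => 'p@ss word']
);
// $html = file_get_contents($url); // then check whether the data or the login form came back
```

If the site only reads credentials from the POST body this will simply return the login page again. Keep in mind that credentials placed in a URL tend to end up in server logs.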

  • 1 month later...

I doubt you can successfully scrape a password-protected page with file_get_contents; you will probably have to use cURL. Also, it looks like the target page is in ASP, which will likely make your life a lot more difficult. You will probably have to make a cURL request to the login page, capturing the cookies as you do so, then post the login details along with the cookies, then request the page(s) that you actually want to scrape.

QFT (quoted for truth).

This may be useful:

http://www.askapache.com/htaccess/sending-post-form-data-with-php-curl.html

 

Try this:

1. Install the Live HTTP Headers Firefox plugin.

2. Log in with your browser.

3. Find the first POST request and look at its contents. It will be in this format:

username=yourusername&password=yourpassword&otherstuff......

 

4. The POST above is what you need to replicate in your PHP script using cURL.

 

5. Log in with cURL from PHP (with cURL cookies enabled), then try to download the file you are after. It should work because you have now authenticated yourself to the server you are downloading from.
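
The steps above could be sketched roughly like this, assuming PHP with the cURL extension. The function names are made up, and the URLs and field names have to come from what you actually captured in step 3. Nothing here is specific to the supplier's site.

```php
<?php
// Hypothetical helper: turn a captured POST body ("a=1&b=2") into an array
// and override selected fields, e.g. with the real credentials.
function fields_from_captured_post(string $body, array $overrides = []): array
{
    parse_str($body, $fields);
    return array_merge($fields, $overrides);
}

// Run one request on a shared cURL handle. A null $postFields means GET.
function curl_fetch($ch, string $url, ?array $postFields = null): string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($postFields === null) {
        curl_setopt($ch, CURLOPT_HTTPGET, true);
    } else {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

// Log in, then fetch the protected page on the same handle so the session
// cookies set by the login response are sent along automatically.
function login_and_download(string $loginUrl, array $loginFields, string $targetUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow the redirect many logins issue
        CURLOPT_COOKIEFILE     => '',    // empty string turns on in-memory cookie handling
    ]);
    curl_fetch($ch, $loginUrl, $loginFields);
    return curl_fetch($ch, $targetUrl);
}
```

Usage would be along the lines of login_and_download($loginUrl, fields_from_captured_post($capturedBody, ['username' => '...', 'password' => '...']), $targetUrl), where $capturedBody is the string seen in Live HTTP Headers. Reusing one handle keeps the cookies in memory, so no cookie file is needed for a single run.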

phoenixx, your HTML sample is missing a few details (like the username/password fields!), and without more details we can't help you with targeted answers. Can we see the full HTML? Even better than that would be a trace of the HTTP headers/content from the requests and responses when logging in and accessing the password-protected page.

 

For what it's worth, you almost certainly (barring anything really crazy) could do what you want with file_get_contents() even if some folks in this thread say otherwise.
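
As a rough illustration of how that could look without cURL: file_get_contents() accepts a stream context, which can carry a POST body and a Cookie header. This is only a sketch under assumptions. The helper names are made up, and the login URL, field names and cookie handling all depend on what the real site does.

```php
<?php
// Hypothetical helper: build a "Cookie:" header value from raw response
// header lines, keeping only the name=value part of each Set-Cookie line.
function cookies_from_headers(array $headers): string
{
    $pairs = [];
    foreach ($headers as $line) {
        if (stripos($line, 'Set-Cookie:') === 0) {
            $cookie  = trim(substr($line, strlen('Set-Cookie:')));
            $pairs[] = explode(';', $cookie, 2)[0]; // drop path, HttpOnly, etc.
        }
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: GET (or POST, when $postFields is given) a URL via
// file_get_contents() and return [body, response header lines].
function http_request(string $url, ?array $postFields = null, string $cookies = ''): array
{
    $headers = [];
    if ($cookies !== '') {
        $headers[] = 'Cookie: ' . $cookies;
    }
    // Redirects are not followed, so cookies set by a login redirect stay visible.
    $http = ['method' => 'GET', 'follow_location' => 0, 'ignore_errors' => true];
    if ($postFields !== null) {
        $http['method']  = 'POST';
        $http['content'] = http_build_query($postFields);
        $headers[]       = 'Content-Type: application/x-www-form-urlencoded';
    }
    if ($headers !== []) {
        $http['header'] = implode("\r\n", $headers);
    }
    // The "http" options also apply to https:// URLs.
    $body = file_get_contents($url, false, stream_context_create(['http' => $http]));
    if ($body === false) {
        throw new RuntimeException("Request to $url failed");
    }
    // PHP 8.4+ has http_get_last_response_headers(); older versions fill in
    // the local variable $http_response_header instead.
    $responseHeaders = function_exists('http_get_last_response_headers')
        ? (http_get_last_response_headers() ?? [])
        : ($http_response_header ?? []);
    return [$body, $responseHeaders];
}
```

The flow would then be: call http_request() with the login fields, pass the returned headers to cookies_from_headers(), and call http_request() again for the protected page with that cookie string.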
