I'm scraping websites for the first time. It's working, but whenever part of the markup is dynamic I end up writing a lot of extra lines. If I could match something like <div id="{wildcard for any value}">, it would cut down the code needed to collect the data.

Here is an example of the HTML I'm scraping:

<div id="productWrapper">
  <div id="hih_2126_348" class="descriptionDetails" data-product-id="22133545">
    <div class="desc" id="hih_3_266_401">
      <h1 id="hih_3_266_4441">
        <span data-title="true" id="hih_1_1466_99"> ProductNameHere </span>
      </h1>

So every page I scrape potentially has different id values on the div, h1 and span tags.

The id value can vary in characters, length, symbols, etc.

Is there a way to scrape between, for example, <span data-title="true" id="{wildcard to allow for any value/text here}"> and </span>?

The PHP functions I'm currently using to scrape are below.

Any help would be awesome.

Thanks.

// Defining the basic cURL function
function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => true,  // Return the webpage data instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // Follow 'Location' HTTP headers
        CURLOPT_AUTOREFERER => true,     // Automatically set the referer when following 'Location' headers
        CURLOPT_CONNECTTIMEOUT => 120,   // Maximum time (in seconds) to wait while connecting
        CURLOPT_TIMEOUT => 120,          // Maximum time (in seconds) the whole request may take
        CURLOPT_MAXREDIRS => 10,         // Maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
        CURLOPT_URL => $url,             // The URL passed into the function
    );
    $ch = curl_init();                   // Initialising cURL
    curl_setopt_array($ch, $options);    // Applying the options assigned above
    $data = curl_exec($ch);              // Returns the page body, or false on failure
    curl_close($ch);                     // Closing cURL
    return $data;                        // Returning the data from the function
}

// Defining the basic scraping function
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);          // Stripping all data from before $start
    if ($data === false) {
        return '';                           // $start was not found, so there is nothing to scrape
    }
    $data = substr($data, strlen($start));   // Stripping $start
    $stop = stripos($data, $end);            // Getting the position of $end
    if ($stop === false) {
        return '';                           // $end was not found
    }
    return substr($data, 0, $stop);          // Everything before $end
}


Edited by seany123

Use the DOM extension and DOMXPath to process the HTML. XPath queries can target elements by any attribute, which gives you more flexibility than string matching and makes extracting the data easier.

For example (untested):

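libxml_use_internal_errors(true); // suppress warnings from imperfect real-world HTML
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the span by its stable data-title attribute; its id is ignored
$query = $xpath->query('//div[@class="descriptionDetails"]//span[@data-title="true"]');
$data = trim($query->item(0)->textContent);

Read up on XPath and how to write queries. Mine above may not be exactly right; it's been a while since I used it.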
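
A slightly fuller sketch of the same idea, run against the sample markup from the question. The helper name xpath_first_text is made up for illustration, and the closing tags missing from the posted fragment are filled in here so the snippet is self-contained. The point is that the query names only the attributes that stay stable, so the changing id values never come into it.

```php
<?php
// Sample markup from the question; the ids are arbitrary and differ per page.
// (The closing </div> tags are an assumption, the posted fragment was cut off.)
$html = <<<HTML
<div id="productWrapper">
  <div id="hih_2126_348" class="descriptionDetails" data-product-id="22133545">
    <div class="desc" id="hih_3_266_401">
      <h1 id="hih_3_266_4441">
        <span data-title="true" id="hih_1_1466_99"> ProductNameHere </span>
      </h1>
    </div>
  </div>
</div>
HTML;

// Hypothetical helper: trimmed text of the first node matching $query, or null.
function xpath_first_text(string $html, string $query): ?string
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // invalid query or no matching node
    }
    return trim($nodes->item(0)->textContent);
}

// Neither query mentions the generated ids on the inner elements.
$name = xpath_first_text($html, '//div[@id="productWrapper"]//span[@data-title="true"]');
$productId = xpath_first_text($html, '//div[@class="descriptionDetails"]/@data-product-id');
```

In a real scraper, $html would be whatever the curl() function from the question returns, after checking that it is not false.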
  • 4 weeks later...

To directly answer your question: with regex you can use a negated character class to match anything that is not a quote. You didn't actually post your regex code, so here is a simple example matching the relevant part:

preg_match('~<span data-title="true" id="[^"]*">(.*?)</span>~s', $content, $match); // $match[1] holds the text between the tags
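
If you do stay with regex, a rough sketch of a reusable helper might look like the following. The function name span_title_text is hypothetical, and the pattern assumes that no attribute value contains a ">" character and that the span has no nested span inside it. It lets the id (and any other attributes) be anything, in any order, as long as data-title="true" is present.

```php
<?php
// Hypothetical helper: text of the first <span> carrying data-title="true",
// regardless of its id or the order of its attributes. Returns null if none matches.
function span_title_text(string $content): ?string
{
    // [^>]* skips over any other attributes without leaving the opening tag;
    // (.*?) lazily captures up to the first closing </span>.
    $pattern = '~<span\b[^>]*\bdata-title="true"[^>]*>(.*?)</span>~is';
    if (preg_match($pattern, $content, $match) !== 1) {
        return null;
    }
    // Drop any inline tags and decode entities such as &amp;
    return trim(html_entity_decode(strip_tags($match[1])));
}
```

Called as span_title_text($data) on the page source, it returns the product name for the markup in the question, or null when the page has no such span.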

However, I agree with kicken about using a DOM parser. Depending on what exactly you are looking to grab, you can sometimes get away with parsing HTML with regex, but in general regex alone cannot fully and reliably parse HTML. Regular expressions match regular languages (hence the name), and a regular language cannot keep track of arbitrarily deep nesting. HTML elements nest to any depth, so HTML is at least a context-free language rather than a regular one. To parse it reliably you need a combination of regex, loops, conditions, tokens, etc.: basically, all the things that make up a DOM parser, which is exactly why we have DOM parsers.
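
A small illustration of the nesting problem, using made-up markup (the ids "outer" and "inner" are just for the demo). The lazy regex has no notion of which closing tag belongs to which opening tag, so it stops at the first one it meets, while the DOM parser tracks the nesting.

```php
<?php
$html = '<div id="outer"><div id="inner">first</div> second</div>';

// Regex attempt: the lazy match ends at the first </div>, which closes the INNER div.
preg_match('~<div id="outer">(.*?)</div>~s', $html, $m);
$regexResult = $m[1]; // '<div id="inner">first' (cut short)

// DOM parser: understands nesting, so the outer element's full text is available.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($dom);
$domResult = $xpath->query('//div[@id="outer"]')->item(0)->textContent; // 'first second'
```

Making the regex greedy does not fix this; it then overshoots as soon as a sibling div follows the outer one.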

Edited by .josh