using wildcard to scrape between changing div id

seany123 · July 7, 2016

I'm currently doing website scraping for the first time. I'm getting results, but whenever part of the markup is dynamic I end up writing a lot of extra lines. If I could do something like <div id="{wildcard for any value}">, it would cut down the code needed to collect the data.

Here is an example of the HTML I'm scraping:

<div id="productWrapper">
  <div id="hih_2126_348" class="descriptionDetails" data-product-id="22133545">
    <div class="desc" id="hih_3_266_401">
      <h1 id="hih_3_266_4441">
        <span data-title="true" id="hih_1_1466_99"> ProductNameHere </span>
      </h1>

So every page I try to scrape potentially has different id values on the div, h1 and span tags.

The id value can vary in characters, length, symbols, etc.

Is there a way to scrape between, for example, <span data-title="true" id="{wildcard to allow for any value/text here}"> and </span>?

The PHP functions I'm currently using to scrape are below.

Any help would be awesome.

Thanks.

// Defining the basic cURL function
function curl($url) {
    // Assigning cURL options to an array
    $options = Array(
        CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer when following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120,   // Setting the amount of time (in seconds) before the connection attempt times out
        CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );
    $ch = curl_init();  // Initialising cURL
    curl_setopt_array($ch, $options);   // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch);    // Closing cURL
    return $data;   // Returning the data from the function
}

// Defining the basic scraping function
function scrape_between($data, $start, $end){
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start));  // Stripping $start
    $stop = stripos($data, $end);   // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data to scrape
    return $data;   // Returning the scraped data from the function
}

Edited July 7, 2016 by seany123

kicken · July 7, 2016

Use the DOM extension and DOMXPath to process the HTML. XPath queries let you target elements by structure and attributes, which is more flexible than string matching and makes extracting the data easier.

For example (untested):

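$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the page source, e.g. the return value of curl($url)
$xpath = new DOMXPath($dom);

// Target the span by its stable attribute rather than its changing id
$query = $xpath->query('//div[@class="descriptionDetails"]//span[@data-title="true"]');
$data = trim($query->item(0)->textContent);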

Read up on XPath and how to write queries. Mine above may not be exactly right; it's been a while since I used it.
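
As a rough sketch of how this could look against the HTML posted above: the helper name extract_product_name and the $sample string are made up for illustration, and the query assumes that the productWrapper id and the data-title="true" attribute stay the same from page to page while the other ids change.

```php
<?php
// Hypothetical helper: returns the product name, or null if nothing matches.
function extract_product_name(string $html): ?string
{
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match on the stable attributes and ignore the changing id values entirely.
    $nodes = $xpath->query('//div[@id="productWrapper"]//span[@data-title="true"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Sample markup based on the snippet in the first post (closing tags added).
$sample = '<div id="productWrapper">'
    . '<div id="hih_2126_348" class="descriptionDetails" data-product-id="22133545">'
    . '<div class="desc" id="hih_3_266_401">'
    . '<h1 id="hih_3_266_4441">'
    . '<span data-title="true" id="hih_1_1466_99"> ProductNameHere </span>'
    . '</h1></div></div></div>';

echo extract_product_name($sample), "\n";
```

If you do need to match on part of an id, XPath 1.0 has starts-with() and contains(), e.g. //span[starts-with(@id, "hih_")], though that only helps if some part of the id really is stable.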

.josh · July 31, 2016

To directly answer your question: with regex you can use a negated character class to match anything that is not a quote. You didn't post any regex code, so here is a simple example matching the relevant part:

preg_match('~<span id="[^"]*">~',$content,$match);
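
Adapted to the span in your sample, with a capture group for the text, a sketch might look like the following. The function name scrape_span_text is made up, and the pattern assumes data-title comes before id (as in your HTML) and that the span contains no nested span.

```php
<?php
// Hypothetical helper: returns the text of the data-title span, or null if no match.
function scrape_span_text(string $content): ?string
{
    // [^"]* accepts any id value; (.*?) lazily captures up to the first </span>.
    // Flags: i = case-insensitive, s = let . match newlines.
    $pattern = '~<span\s+data-title="true"\s+id="[^"]*"\s*>(.*?)</span>~is';
    if (preg_match($pattern, $content, $match) === 1) {
        return trim($match[1]);
    }
    return null;
}

$content = '<h1 id="hih_3_266_4441">'
    . '<span data-title="true" id="hih_1_1466_99"> ProductNameHere </span>'
    . '</h1>';

echo scrape_span_text($content), "\n";
```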

However, I agree with kicken about using a DOM parser. Depending on what exactly you are looking to grab, you can sometimes get away with parsing HTML with regex, but in general regex alone cannot fully and reliably parse HTML. Regular expressions match regular languages (hence the name), which have no way to keep track of arbitrarily deep nesting. HTML's nested tags need at least a context-free grammar to describe, so HTML is not regular. To parse HTML reliably you need a combination of regex, loops, conditions, tokens, etc.; basically all the things that make up a DOM parser, which is exactly why we have DOM parsers.
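
A quick illustration of where regex falls over with nesting (the markup is invented for the example): a lazy match stops at the first closing tag it sees, while the DOM parser knows which closing tag belongs to which opening tag.

```php
<?php
$html = '<div id="outer">before <div id="inner">nested</div> after</div>';

// The lazy regex stops at the first </div>, which actually closes the inner div.
preg_match('~<div id="outer">(.*?)</div>~s', $html, $m);
$regexResult = $m[1];

// The DOM parser pairs the tags correctly, so the outer div includes everything.
$dom = new DOMDocument();
$dom->loadHTML($html);
$outer = (new DOMXPath($dom))->query('//div[@id="outer"]')->item(0);
$domResult = $outer->textContent;

echo $regexResult, "\n"; // before <div id="inner">nested
echo $domResult, "\n";   // before nested after
```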

Edited July 31, 2016 by .josh
