seany123 Posted July 7, 2016 (edited)

I'm currently doing website scraping for the first time. Although I'm having some success, I'm finding that when part of the markup is dynamic I have to write a lot of additional lines. If I could do something like <div id="{wildcard for any value}">, it would reduce the lines required to collect the data.

Here is an example of the HTML I'm scraping:

<div id="productWrapper">
<div id="hih_2126_348" class="descriptionDetails" data-product-id="22133545">
<div class="desc" id="hih_3_266_401">
<h1 id="hih_3_266_4441">
<span data-title="true" id="hih_1_1466_99">ProductNameHere</span>
</h1>

So every page I try to scrape potentially has a different id value on the div, h1 and span tags. The id value can vary in characters, length, symbols, etc.

Is there a way to scrape between, for example, <span data-title="true" id="{wildcard to allow for any value/text here}"> and </span>?

The PHP functions I'm currently using to scrape are below. Any help would be awesome, thanks.
// Defining the basic cURL function
function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => TRUE,  // Return the webpage data instead of printing it
        CURLOPT_FOLLOWLOCATION => TRUE,  // Follow 'Location' HTTP headers
        CURLOPT_AUTOREFERER    => TRUE,  // Automatically set the referer when following 'Location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120,   // Maximum time (in seconds) to wait while connecting
        CURLOPT_TIMEOUT        => 120,   // Maximum time (in seconds) the whole cURL request may take
        CURLOPT_MAXREDIRS      => 10,    // Maximum number of redirections to follow
        CURLOPT_USERAGENT      => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
        CURLOPT_URL            => $url,  // Setting cURL's URL option with the $url variable passed into the function
    );

    $ch = curl_init();                 // Initialising cURL
    curl_setopt_array($ch, $options);  // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch);            // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch);                   // Closing cURL
    return $data;                      // Returning the data from the function
}

// Defining the basic scraping function
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);          // Stripping all data from before $start
    $data = substr($data, strlen($start));   // Stripping $start
    $stop = stripos($data, $end);            // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop);         // Stripping all data from after and including the $end of the data to scrape
    return $data;                            // Returning the scraped data from the function
}

Edited July 7, 2016 by seany123
kicken Posted July 7, 2016

Use the DOM extension and DOMXPath to process the HTML. Using XPath queries to target the elements will provide more flexibility and easier extraction of the data. For example (untested):

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$query = $xpath->query('//div[@class="productDetails"]/span/text()');
$data = $query->item(0)->textContent;

Read up on XPath and how to do queries. Mine above may not be exactly right; it's been a while since I used it.
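As a rough sketch of how that suggestion might look against the markup from the question: the snippet below selects the span by its stable data-title attribute, so the changing id values never need to be matched. The helper name first_text and the sample $html string are assumptions made for illustration, not part of the thread, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Sample markup modelled on the question; the id values are arbitrary and are never matched.
$html = <<<'HTML'
<div id="productWrapper">
  <div id="hih_2126_348" class="descriptionDetails" data-product-id="22133545">
    <div class="desc" id="hih_3_266_401">
      <h1 id="hih_3_266_4441">
        <span data-title="true" id="hih_1_1466_99">ProductNameHere</span>
      </h1>
    </div>
  </div>
</div>
HTML;

// Hypothetical helper (not from the thread): returns the trimmed text of the
// first node matching the XPath $query, or null when nothing matches.
function first_text(string $html, string $query): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }

    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

// Select by the attribute that stays constant rather than by the dynamic id.
$name = first_text($html, '//span[@data-title="true"]');

// Attributes can be selected the same way with the @ axis.
$productId = first_text($html, '//div[@class="descriptionDetails"]/@data-product-id');
```

In practice $html would be the string returned by the curl() function from the first post. Checking for null before using the result is worth doing, since a page with a different layout simply yields no match.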
.josh Posted July 31, 2016 (edited)

To directly answer your question: with regex you can use a negated character class to match anything that is not a quote. You didn't actually post your regex code, so here is a simple example matching the relevant part:

preg_match('~<span data-title="true" id="[^"]*">~', $content, $match);

However, I agree with kicken about using a DOM parser. Depending on what exactly you are looking to grab, you can sometimes get away with parsing HTML with regex, but in general regex alone cannot fully and reliably parse HTML. Regular expressions (regex) parse regular languages (hence the name), meaning the input follows a pattern that can be matched without keeping track of nesting. HTML is a context-free language rather than a regular one: tags can nest to arbitrary depth, which a regular expression cannot track. To reliably parse HTML you need a combination of regex, loops, conditions, tokens, etc.; basically, all the things that make up a DOM parser, which is exactly why we have DOM parsers.

Edited July 31, 2016 by .josh
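For completeness, here is a minimal sketch of the regex route applied to the span from the question, with a capture group added so the text between the tags is returned. The $content string and the exact pattern are assumptions for illustration; the pattern relies on data-title appearing before id and on double-quoted attributes, which is exactly the kind of fragility described above.

```php
<?php
// Sample input modelled on the question's markup.
$content = '<h1 id="hih_3_266_4441"><span data-title="true" id="hih_1_1466_99">ProductNameHere</span></h1>';

// [^"]* acts as the wildcard: any run of characters that are not a double quote.
// (.*?) lazily captures everything up to the first </span>.
// Flags: i = case-insensitive, s = let . match newlines too.
$pattern = '~<span\s+data-title="true"\s+id="[^"]*"\s*>(.*?)</span>~is';

if (preg_match($pattern, $content, $match)) {
    $title = trim($match[1]); // $match[1] holds the first capture group
} else {
    $title = null;            // no span with that shape was found
}
```

If the site ever reorders the attributes or nests another span inside the title, this pattern stops matching or captures the wrong text, whereas an XPath query on the data-title attribute would keep working.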