Guldstrand Posted December 11, 2009
Can someone please help with creating a regexp for the following html-code!?

<div class="item1"><b class="txt_grey">Svensk titel:</b></div>
<div class="item2">Familjen Macahan</div>
<div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
<div class="item1"><b class="txt_grey">Originaltitel:</b></div>
<div class="item2">How the West Was Won</div>
<div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
<div class="item1"><b class="txt_grey">Genre:</b></div>
<div class="item2"><a href="ds.php?red=prod_category.php&&arg=genre@@@tvserie,,,lang@@@se,,,subsite@@@movies,,,">TV-serie</a><a href="ds.php?red=prod_category.php&&arg=genre@@@,,,lang@@@se,,,subsite@@@movies,,,"></a></div>
<div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
<div class="item1"><b class="txt_grey">Underkategori:</b></div>
<div class="item2"> Äventyr<br>Kult (60-80-tal)<br> </div>
<div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
<div class="item1"><b class="txt_grey">Produktionsland:</b></div>
<div class="item2"> <a href="ds.php?red=prod_category.php&&arg=genre@@@world,,,cont@@@land_USA,,,lang@@@se,,,subsite@@@movies,,,">USA</a> </div>
<div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
<div class="item1"><b class="txt_grey">Inspelningsår:</b></div>
<div class="item2">1977-1979</div>
<div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
<div class="item1"><b class="txt_grey">Skådespelare:</b></div>
<div class="item2"> <a href="ds.php?red=ds_person.php&&arg=id@@@19550,,,lang@@@se,,,subsite@@@movies,,,">James Arness</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@6256,,,lang@@@se,,,subsite@@@movies,,,">Bruce Boxleitner</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@4393,,,lang@@@se,,,subsite@@@movies,,,">Eva Marie Saint</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@29166,,,lang@@@se,,,subsite@@@movies,,,">Kathryn Holcomb</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@29167,,,lang@@@se,,,subsite@@@movies,,,">William Kirby Cullen</a><br><a href="ds.php?red=ds_person.php&&arg=id@@@29168,,,lang@@@se,,,subsite@@@movies,,,">Vicki Schreck</a> </div>
<div class="clfix"></div>
</div>
<div class="ds_spec_380_2">
<div class="item1"><b class="txt_grey">Åldersgräns:</b></div>
<div class="item2"> 15 år.<br> </div>

I need to get the following:
# Svensk titel
# Originaltitel
# Genre
# Underkategori
# Produktionsland
# Inspelningsår
# Skådespelare
# Åldersgräns

Thanks in advance...
salathe Posted December 11, 2009
Does this really need to be a job for regex, or would you be open to considering other methods of retrieving that information?
Guldstrand Posted December 11, 2009 Author
No, it doesn't have to be regexp. If there is a faster and/or better way to parse the info, I will go for that.
nozai Posted December 11, 2009
I would look into an XML parser--simplexml_load_file, perhaps?
Guldstrand Posted December 11, 2009 Author
This isn't an XML document.
cags Posted December 11, 2009
Is it or is it not HTML? If it is, then it's a markup language and as such can be parsed like one. Certainly if there's anything more you need to do, then using some kind of document model would be the way to go, but since you asked, a simple regular expression for the pattern would be...

preg_match_all('#<div class="item1"><b class="txt_grey">([^:]+):</b></div>#u', $input, $out);

But I'm not saying that's necessarily the right way to go.
nozai Posted December 11, 2009
Touché, simplexml won't read it directly, but try:

$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
$doc->loadHTML($text);
$xml = simplexml_import_dom($doc);

where $text contains your HTML document.
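For anyone who wants to follow the DOM route further, here is a minimal sketch of how the label/value pairs could be pulled out with DOMDocument and DOMXPath. It is only an illustration: the function name extractSpecs() and the XPath queries are assumptions based on the markup quoted in the first post, and the prepended XML declaration is a common workaround to make loadHTML() treat the input as UTF-8.

```php
<?php
// Sketch only: extractSpecs() is a hypothetical helper, and the XPath
// queries assume the item1/item2 structure shown in the first post.
function extractSpecs(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // The prepended declaration hints that the markup is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $data = array();
    foreach ($xpath->query('//div[@class="item1"]') as $label) {
        // The value is the next sibling div with class "item2".
        $value = $xpath->query('following-sibling::div[@class="item2"][1]', $label)->item(0);
        if ($value === null) {
            continue;
        }
        // "Genre:" becomes "Genre".
        $key = rtrim(trim($label->textContent), ':');
        // Each non-empty text node between <a>/<br> tags is one entry.
        $items = array();
        foreach ($xpath->query('.//text()', $value) as $text) {
            $piece = trim($text->nodeValue);
            if ($piece !== '') {
                $items[] = $piece;
            }
        }
        $data[$key] = $items;
    }
    return $data;
}
```

Called as extractSpecs($text), this should give an array keyed by label (Svensk titel, Genre, and so on), each holding a list of values. Because it walks text nodes rather than splitting on a literal tag, it does not care whether the page writes <br> or <br/>.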
Guldstrand Posted December 11, 2009 Author
In reply to cags: Thanks, but I need to get the values for the words/text listed in my first post.
cags Posted December 11, 2009
It returns these...

Svensk titel
Originaltitel
Genre
Underkategori
Produktionsland
Inspelningsår
Skådespelare
Åldersgräns

...which as far as I can tell is exactly what you asked for.
thebadbad Posted December 12, 2009
Assuming the source looks more or less strictly like that for every film/series, here's a way to do it:

<?php
// $html holds the source code
preg_match_all('~<div class="item1"><b class="txt_grey">([^:]+):</b></div>\s*<div class="item2">(.*?)</div>~is', $html, $matches, PREG_SET_ORDER);
$data = array();
foreach ($matches as $arr) {
    // $arr[1] is the label, $arr[2] the raw markup of the value
    $data[$arr[1]] = explode('<br>', $arr[2]);
    foreach ($data[$arr[1]] as $key => &$value) {
        $value = trim(strip_tags($value));
        if ($value == '') {
            unset($data[$arr[1]][$key]);
        }
    }
    unset($value); // drop the reference left over from the inner loop
}
echo '<pre>' . print_r($data, true) . '</pre>';
?>

Output:

Array
(
    [Svensk titel] => Array
        (
            [0] => Familjen Macahan
        )
    [Originaltitel] => Array
        (
            [0] => How the West Was Won
        )
    [Genre] => Array
        (
            [0] => TV-serie
        )
    [Underkategori] => Array
        (
            [0] => Äventyr
            [1] => Kult (60-80-tal)
        )
    [Produktionsland] => Array
        (
            [0] => USA
        )
    [Inspelningsår] => Array
        (
            [0] => 1977-1979
        )
    [Skådespelare] => Array
        (
            [0] => James Arness
            [1] => Bruce Boxleitner
            [2] => Eva Marie Saint
            [3] => Kathryn Holcomb
            [4] => William Kirby Cullen
            [5] => Vicki Schreck
        )
    [Åldersgräns] => Array
        (
            [0] => 15 år.
        )
)
Guldstrand Posted December 13, 2009 Author
I'm really grateful for your help. It seems that I can't get your code to work; I'm only getting this:

Array ( )

This is the code I'm using:

$html = 'http://www.discshop.se/shop/ds_produkt.php?lang=&id=76317&lang=se&subsite=movies&&ref=';
preg_match_all('~<div class="item1"><b class="txt_grey">([^:]+):</b></div>\s*<div class="item2">(.*?)</div>~is', $html, $matches, PREG_SET_ORDER);
$data = array();
foreach ($matches as $arr) {
    $data[$arr[1]] = explode('<br>', $arr[2]);
    foreach ($data[$arr[1]] as $key => &$value) {
        $value = trim(strip_tags($value));
        if ($value == '') {
            unset($data[$arr[1]][$key]);
        }
    }
}
echo '<pre>' . print_r($data, true) . '</pre>';

The info I'm after is at the bottom of the page above. (see screen) [attachment deleted by admin]
thebadbad Posted December 13, 2009
You need to read the contents of the file into the variable. Either use file_get_contents() or cURL. And I noticed that the site uses <br/> instead of <br> for line breaks, so you'll have to change that in the code (the parameter for explode() inside the foreach loop).
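The two points above (load the page contents first, and allow for <br/>) could be sketched roughly like this. The helper names fetchPage() and splitOnBreaks() are made up for illustration, and swapping explode() for preg_split() is a substitution of mine so that <br>, <br/> and <br /> are all accepted; file_get_contents() on a URL also assumes allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: returns the page source or throws if the fetch fails.
function fetchPage(string $url): string
{
    $html = @file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}

// Hypothetical helper: splits a captured value on any <br> variant,
// strips the remaining tags and drops empty pieces.
function splitOnBreaks(string $fragment): array
{
    $clean = array();
    foreach (preg_split('~<br\s*/?>~i', $fragment) as $part) {
        $part = trim(strip_tags($part));
        if ($part !== '') {
            $clean[] = $part;
        }
    }
    return $clean;
}
```

With these in place, $html = fetchPage($url); replaces the line that assigned the URL string, and splitOnBreaks($arr[2]) could stand in for the explode() call plus the inner cleanup loop.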
Guldstrand Posted December 13, 2009 Author
Now it works much better. Thanks!
Guldstrand Posted December 16, 2009 Author
Hi again... Is there a quick and easy way to show the output in a nicer/better way?
thebadbad Posted December 16, 2009
Sure, you can e.g. output the data in a table (in its simplest form in this example):

<?php
echo '<table>';
foreach ($data as $key => $val) {
    // one row per label; multiple values are joined with line breaks
    echo "\n\t<tr>\n\t\t<td>$key</td><td>" . implode('<br />', $val) . "</td>\n\t</tr>";
}
echo "\n</table>";
?>
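Since the values come from a scraped page, it may be worth escaping them before echoing. The following is a small variation on the table idea, not the code from the post: renderTable() is a hypothetical name, and it assumes $data has the label => list-of-values shape built earlier in the thread.

```php
<?php
// Sketch: builds the table as a string and escapes labels and values
// with htmlspecialchars() so stray markup in the data is shown as text.
function renderTable(array $data): string
{
    $rows = '';
    foreach ($data as $label => $values) {
        $cells = array_map('htmlspecialchars', $values);
        $rows .= "\n\t<tr>\n\t\t<th>" . htmlspecialchars($label) . '</th><td>'
            . implode('<br />', $cells) . "</td>\n\t</tr>";
    }
    return '<table>' . $rows . "\n</table>";
}
```

Usage would simply be echo renderTable($data);. Returning a string instead of echoing directly makes it easier to reuse the same markup elsewhere.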
Guldstrand Posted December 16, 2009 Author
Wow, thanks. I can't thank you enough. One last thing though: how do I get the image (both thumb and full size) and the price (299:-) WITHOUT the ":-" sign, from the same URL? And is it possible to have more regexps in the same "instance", or do I need to create a whole new one? Here is the HTML:

<img style="vertical-align: bottom;" src="http://www.discshop.se/shop/img/omslag/front_normal/7/76317.jpg" class="reflected" alt="Familjen Macahan - Säsong 1 (4-disc)" border="0" height="170" hspace="0" vspace="0" width="120"><canvas width="120" height="34" style="height: 34px; width: 120px;"></canvas></div></a></div>
<a href="javascript:void(0);" onclick="window.open('coverview.php?id=76317&side=front','bakom','height=350,width=500,status=no,toolbar=no,directories=no,menubar=no,location=no,resizable=yes,scrollbars=no');">Visa stor framsida</a><br>
</div>
<div style="margin-bottom: 10px;">
<div style="margin-bottom: 5px;">
<span class="price " style="margin-bottom: 5px;"><span class="price_normal">299:-</span>
thebadbad Posted December 16, 2009
I would probably do that with two separate patterns:

<?php
// get image link(s); [^>]* allows other attributes (e.g. style) before src
if (preg_match('~<img\b[^>]*\bsrc="http://www\.discshop\.se/shop/img/omslag/front_normal/([^"]+)"~i', $html, $match)) {
    // build image URLs
    $thumb = 'http://www.discshop.se/shop/img/omslag/front_normal/' . $match[1];
    $full = 'http://www.discshop.se/shop/img/omslag/front_large/' . $match[1];
}
// get price (everything before the ":-")
if (preg_match('~<span class="price "[^>]*><span\b[^>]*>([^:]+):~i', $html, $match)) {
    $price = $match[1];
}
?>
Guldstrand Posted December 26, 2009 Author
Hi again... What if I want to show (parse) search results from multiple sites?

Site 1 HTML-output: <th colspan="5">Filmtitel - 10 träffar</th>
Site 2 HTML-output: <div style="margin: 8px; float: right; color: rgb(102, 102, 102);"><b>10 träffar</b></div>
Site 3 HTML-output: <span id="ctl00_ContentPlaceHolder1_m_nbrofHitsOnWhat">Din sökning på "scrubs" resulterade i 8 träffar</span>
Site 4 HTML-output: Visar 1 - 2 av <strong>2</strong> annonser
Site 5 HTML-output: <strong>Dvd- & vhs-filmer (1)</strong>
Site 6 HTML-output: <span id="SearchResultMessage"> 9 objekt hittade för "scrubs" i kategorin DVD & Videofilmer</span>
Guldstrand Posted December 29, 2009 Author
*bump*
cags Posted December 29, 2009
It's impossible unless you provide a specific set of rules for each individual site. There is no possible way a computer could work out which part is the correct number from a single pattern.
Guldstrand Posted December 29, 2009 Author
Yes, I know that I somehow need to write a regexp for each site. But after that, how can I show all the results in the best/easiest way?
cags Posted December 29, 2009
The same way you're currently doing it, only adding all the results in. Regardless of the pattern used, I assume you are going to be collecting the same information from each one, i.e. bus number, place, etc. Therefore you will have the same number of capture groups. You can simply use array_merge to combine all the result sets into one large array, then loop through that array in the same manner you are currently using. If you find that the capture groups aren't in the same place, you can perhaps use named capture groups so that the information can still be easily iterated through.
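A rough sketch of that idea, with one pattern per site, a named capture group so every site yields the same field, and array_merge to build one combined list. The site keys (site1, site2, site4) and the patterns are assumptions derived only from the sample snippets posted above, and collectHits() is a hypothetical name; real pages may well need different rules.

```php
<?php
// One pattern per site; each exposes the hit count as the named group "hits".
$rules = array(
    'site1' => '~<th colspan="5">[^<]*?(?<hits>\d+) träffar</th>~u',
    'site2' => '~<b>(?<hits>\d+) träffar</b>~u',
    'site4' => '~av <strong>(?<hits>\d+)</strong> annonser~u',
);

// $pages maps a site key to the HTML already fetched for that site.
function collectHits(array $rules, array $pages): array
{
    $all = array();
    foreach ($pages as $site => $html) {
        if (!isset($rules[$site])) {
            continue; // no rule written for this site yet
        }
        preg_match_all($rules[$site], $html, $matches, PREG_SET_ORDER);
        $rows = array();
        foreach ($matches as $m) {
            $rows[] = array('site' => $site, 'hits' => (int) $m['hits']);
        }
        // Append this site's rows to the combined result set.
        $all = array_merge($all, $rows);
    }
    return $all;
}
```

The combined array can then be looped over exactly like the single-site result, for example to print one table row per site. Because every rule uses the same group name, it does not matter where in each pattern the number appears.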