phoenixx Posted January 2, 2009

I'm using the following line as an example of what I'm scraping (there are multiple instances on the page).

Scraping:

<table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table>

Output should be: 2120T-TOP-N and ESTELLE DINING TABLE TOP (W/2X18"L) respectively.

Here's the code I'm using:

preg_match_all('/<table cellspacing=0>.*?<tr><td valign=top; nowrap> <strong> .*?<\/strong>.*?<\/strong> <\/td><td valign=top>.([^"]*)<\/td><\/tr><\/table>/is',$data2,$out2);
$a = array_combine($out2[1], $out2[2]);
foreach($a as $b=>$c){
    echo "<b>Product Number: </b>" . $b . " | <b>Description: </b>" . $c . "<br>";
}

As you might guess, I'm getting the following error:

Warning: array_combine() expects parameter 2 to be array, null given in /home/xxxxxx/public_html/sandbox/scraper.php on line 33
Warning: Invalid argument supplied for foreach() in /home/xxxxxx/public_html/sandbox/scraper.php on line 34

Any help would be greatly appreciated. I will reward you with a full night's sleep before you leave this earth.

Link to comment: https://forums.phpfreaks.com/topic/139235-i-must-need-sleep-or-another-blue-monster-easy-regex/
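A note on the warning in the post above: with its default flags (PREG_PATTERN_ORDER), preg_match_all() fills index 0 with the full matches and index N with capturing group N. The posted pattern contains a single capturing group, so $out2[2] is never set. The sketch below illustrates this with a cut-down sample string and pattern; both are hypothetical stand-ins, not the poster's real data, and the guard is only one possible way to avoid the call.

```php
<?php
// Hypothetical cut-down sample; the real page has more markup around it.
$data2 = '<td><strong> 2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP</td>';

// One capturing group only, like the pattern in the question.
preg_match_all('/<td valign=top>([^<]*)<\/td>/is', $data2, $out2);

// $out2[0] holds full matches, $out2[N] holds capturing group N,
// so the number of groups is the array size minus one.
$groupCount = count($out2) - 1;

// Guard before array_combine() instead of passing a missing index.
if (isset($out2[1], $out2[2])) {
    $pairs = array_combine($out2[1], $out2[2]);
} else {
    $pairs = array();
    echo "Pattern has $groupCount capturing group(s); two are needed.\n";
}
```

Adding a second pair of parentheses around the SKU part of the pattern would give array_combine() the two arrays it expects.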
nrg_alpha Posted January 2, 2009

Here's my take on it:

$str = '<table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table>';
preg_match_all('#<td\b(?:(?!>).*?(2120T-TOP-N)).*?<td\b(?:(?!>).*?(ESTELLE DINING TABLE TOP \(W/2X18"L\)))#s', $str, $matches);
for($i = 1, $total = count($matches); $i < $total; $i++){
    echo $matches[$i][0] . '<br />';
}

Output:

2120T-TOP-N
ESTELLE DINING TABLE TOP (W/2X18"L)
phoenixx Posted January 2, 2009

The data I need to pull is dynamic. That code is only useful if I already know what the value of the output is going to be. Here's an example (there are thousands of pages on the site I'm scraping). Each SKU and each description needs to be a separate value. Each of those pages looks just like this, with different SKU numbers and data on it.

Qty  SKU #
1    2120T-TOP-N      ESTELLE DINING TABLE TOP (W/2X18"L)
1    2120T-LEG-N      DOUBLE PEDESTAL TABLE BASE 31.5"H
2    2120A-ASSEMBLED  ESTELLE ARM CHAIR 44"H
4    2120S-ASSEMBLED  ESTELLE SIDE CHAIR 44"H
1    2120-B-N         ESTELLE BUFFET 33"H
1    2120-H-N         ESTELLE HUTCH 56 3/4"H
nrg_alpha Posted January 2, 2009

Yeah, I realised I should have included what I am about to say in my initial post, but it timed out, so I could not edit it. Is your SKU always going to start with 212?

By the way, for future reference, please provide multiple sample results like this in your initial post. It helps others solve the issue much more easily, as I can only work with what you give me. Not many people know how to ask about or explain regex problems and solutions properly (case in point: this thread).
phoenixx Posted January 2, 2009

Here's a sample of the full code I'm scraping. No, unfortunately there are no data consistencies between pages or items on a page other than the HTML structure. Sorry about the confusion. I appreciate the help.

<p class="smallTitles" align="center">SETD2120 ESTELLE DINING</p><table width="625" border="0" cellspacing="0" cellpadding="0"><tr><td colspan="3" align="middle" valign="middle"><img src="/wc/Pictures/2120.jpg"></td></tr><tr><td width="300" align="left" valign="top">Bookmatched cherry veneer double pedestal table with hand carved solid wood apron. Oversized back chair with brass nailheads, hand carved solid wood arms and pillaster legs.</td><td width="22" align="left" valign="top"> </td><td width="350" align="left" valign="top"><span class="catalogDesc"><Strong>SETD2120</strong> ESTELLE DINING<br /><br /><br />Qty SKU # <br /><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120T-LEG-N</strong> </td><td valign=top>DOUBLE PEDESTAL TABLE BASE 31.5"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 2<strong> 2120A-ASSEMBLED</strong> </td><td valign=top>ESTELLE ARM CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 4<strong> 2120S-ASSEMBLED</strong> </td><td valign=top>ESTELLE SIDE CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120-B-N</strong> </td><td valign=top>ESTELLE BUFFET 33"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120-H-N</strong> </td><td valign=top>ESTELLE HUTCH 56 3/4"H</td></tr></table></td></span></td></tr></table><p> </p>
nrg_alpha Posted January 2, 2009

Wow, OK. This makes for quite the slippery slope to navigate. I'll cut and paste this string and reformat it to make heads or tails of it, and see what I come up with...
nrg_alpha Posted January 3, 2009

OK, so here is what I came up with (explanations to follow):

error_reporting(E_ALL); // keep this here to check that your code doesn't cough up warnings and / or errors.
$str = <<<DATA
<p class="smallTitles" align="center">SETD2120 ESTELLE DINING</p><table width="625" border="0" cellspacing="0" cellpadding="0"><tr><td colspan="3" align="middle" valign="middle"><img src="/wc/Pictures/2120.jpg"></td></tr><tr><td width="300" align="left" valign="top">Bookmatched cherry veneer double pedestal table with hand carved solid wood apron. Oversized back chair with brass nailheads, hand carved solid wood arms and pillaster legs.</td><td width="22" align="left" valign="top"> </td><td width="350" align="left" valign="top"><span class="catalogDesc"><Strong>SETD2120</strong> ESTELLE DINING<br /><br /><br />Qty SKU # <br /><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120T-LEG-N</strong> </td><td valign=top>DOUBLE PEDESTAL TABLE BASE 31.5"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 2<strong> 2120A-ASSEMBLED</strong> </td><td valign=top>ESTELLE ARM CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 4<strong> 2120S-ASSEMBLED</strong> </td><td valign=top>ESTELLE SIDE CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120-B-N</strong> </td><td valign=top>ESTELLE BUFFET 33"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120-H-N</strong> </td><td valign=top>ESTELLE HUTCH 56 3/4"H</td></tr></table></td></span></td></tr></table><p> </p>
DATA;
$arr = array();
// split the markup at every opening <td ...> tag
$arrTemp = preg_split('#<td[^>]*>#', $str);
for($i = 0, $total = count($arrTemp); $i < $total; $i++){
    // collapse runs of spaces, trim, then strip the remaining tags
    $arrTemp[$i] = strip_tags(trim(preg_replace('#(?: )+#', ' ', $arrTemp[$i])));
    // remove trailing characters that are not word characters, brackets or #
    $arrTemp[$i] = preg_replace('~[^\w()#]+$~', '', $arrTemp[$i]);
    if(!empty($arrTemp[$i])){
        $arr[] = $arrTemp[$i];
    }
}
echo '<pre>'.print_r($arr, true);

Output (via print_r):

Array
(
    [0] => SETD2120 ESTELLE DINING
    [1] => Bookmatched cherry veneer double pedestal table with hand carved solid wood apron. Oversized back chair with brass nailheads, hand carved solid wood arms and pillaster legs
    [2] => SETD2120 ESTELLE DININGQty SKU #
    [3] => 1 2120T-TOP-N
    [4] => ESTELLE DINING TABLE TOP (W/2X18"L)
    [5] => 1 2120T-LEG-N
    [6] => DOUBLE PEDESTAL TABLE BASE 31.5"H
    [7] => 2 2120A-ASSEMBLED
    [8] => ESTELLE ARM CHAIR 44"H
    [9] => 4 2120S-ASSEMBLED
    [10] => ESTELLE SIDE CHAIR 44"H
    [11] => 1 2120-B-N
    [12] => ESTELLE BUFFET 33"H
    [13] => 1 2120-H-N
    [14] => ESTELLE HUTCH 56 3/4"H
)

What I've done is have a look at that string and, for my own sake, partially reformat it so I could see what is going on. The only pattern I can see from a structural standpoint is that all the 'meat and potatoes' you seek is found within the <td> tags. To save some vertical space, I placed your sample string into a heredoc as-is (instead of the partially formatted version I worked from).

I used preg_split on everything, with the opening <td ...> tags as my pattern. For each piece, I replace all duplicate spaces with a single literal one, then trim the result and strip all tags via strip_tags(). After all was said and done, I was still left with gaping holes (presumably from the removed tags), so I used preg one last time to remove anything at the end that is not a word character, a bracket or the # character. If you look at the output above, some of those items legitimately end with non-word characters such as ) or #. Those would have been wiped out had I simply stripped the ending with \W, so I had to 'protect' them.

If, while scraping, you find any other trailing characters being wiped out, simply add them to the [^\w()#] part of the last preg_replace line. Since it is a character class, you will not need to escape things like brackets and dots. If you add a dash, however, make sure it is the very first or the very last character in the class!

And lastly, I put all the non-empty array elements into the final $arr variable.

Now, to display them in pairs, you could add the following code:

for($i = 0, $total = count($arr); $i < $total; $i++){
    echo $arr[$i];
    if($i % 2){
        echo '<br />';
    } else {
        echo '<br />----------------------------------<br />';
    }
}

This checks whether the index $i modulo 2 equals 1 (true). If so, it echoes a plain line break after the entry; if not, it echoes the separator line instead. (I was going to do this with the ternary operator, but opted not to. Not really important.)

Things to be mindful of:

Poorly structured tables in the code you scrape can wreak havoc with the results.

If the content you seek is lumped together within a single <td> tag, I cannot differentiate it, so you will not get correct results if this occurs.

An odd total number of entries in the table code will leave an 'odd one out' in the listing, if in fact you do care about grouping them in clumps of 2 (obviously, how do you do this with 13 or 15 entries? The above example shows this).

Well, that's about it for me. It's all in your hands now. Hope this is what you were looking for.

Cheers
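As a possible follow-up to the flat list produced above, the entries could be grouped by looking for cells that look like "quantity SKU" and treating the next entry as the description, rather than relying on odd/even positions. This is only a sketch under assumptions: the sample $arr below is a shortened copy of the posted output, and the rule "digits, whitespace, then a SKU without spaces" is a guess about the data that may not hold on every page.

```php
<?php
// Shortened copy of the flat list from the posted print_r output.
$arr = array(
    'SETD2120 ESTELLE DINING',
    'SETD2120 ESTELLE DININGQty SKU #',
    '1 2120T-TOP-N',
    'ESTELLE DINING TABLE TOP (W/2X18"L)',
    '1 2120T-LEG-N',
    'DOUBLE PEDESTAL TABLE BASE 31.5"H',
);

$items = array();
// Stop one short of the end so $arr[$i + 1] always exists.
for ($i = 0, $total = count($arr); $i < $total - 1; $i++) {
    // Assumption: a quantity/SKU cell is digits, whitespace, then a SKU with no spaces.
    if (preg_match('/^(\d+)\s+(\S+)$/', $arr[$i], $m)) {
        $items[$m[2]] = array(
            'qty'         => (int) $m[1],
            'description' => $arr[$i + 1],
        );
        $i++; // skip the description entry that was just consumed
    }
}

print_r($items);
```

Entries that do not look like a quantity/SKU cell (the title and the long product blurb) are simply skipped, so an odd number of leading entries no longer shifts the pairing.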
sasa Posted January 3, 2009

try

<?php
$test = '<p class="smallTitles" align="center">SETD2120 ESTELLE DINING</p><table width="625" border="0" cellspacing="0" cellpadding="0"><tr><td colspan="3" align="middle" valign="middle"><img src="/wc/Pictures/2120.jpg"></td></tr><tr><td width="300" align="left" valign="top">Bookmatched cherry veneer double pedestal table with hand carved solid wood apron. Oversized back chair with brass nailheads, hand carved solid wood arms and pillaster legs.</td><td width="22" align="left" valign="top"> </td><td width="350" align="left" valign="top"><span class="catalogDesc"><Strong>SETD2120</strong> ESTELLE DINING<br /><br /><br />Qty SKU # <br /><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120T-LEG-N</strong> </td><td valign=top>DOUBLE PEDESTAL TABLE BASE 31.5"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 2<strong> 2120A-ASSEMBLED</strong> </td><td valign=top>ESTELLE ARM CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 4<strong> 2120S-ASSEMBLED</strong> </td><td valign=top>ESTELLE SIDE CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120-B-N</strong> </td><td valign=top>ESTELLE BUFFET 33"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120-H-N</strong> </td><td valign=top>ESTELLE HUTCH 56 3/4"H</td></tr></table></td></span></td></tr></table><p> </p>';
$pattern = '~<table cellspacing=0><tr><td valign=top; nowrap>.*?<strong>( )*([^<]+)</strong>[^<]+</td><td valign=top>([^<]+)</td></tr></table>~';
preg_match_all($pattern, $test, $out);
$out = array_combine($out[2], $out[3]);
print_r($out);
?>
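A variation on the regex approach above, sketched with named groups and PREG_SET_ORDER so each table row arrives as one record that also carries the quantity. The pattern here is an illustrative loosening, not the one posted in the thread: it assumes the cells contain plain whitespace exactly as the sample appears here, and it would need adjusting if the real pages use entities or extra attributes in other places.

```php
<?php
// Two rows copied from the sample markup in this thread.
$html = '<table cellspacing=0><tr><td valign=top; nowrap> 1<strong> 2120T-TOP-N</strong> </td>'
      . '<td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table>'
      . '<table cellspacing=0><tr><td valign=top; nowrap> 2<strong> 2120A-ASSEMBLED</strong> </td>'
      . '<td valign=top>ESTELLE ARM CHAIR 44"H</td></tr></table>';

// Named groups; [^<]+? keeps each capture inside its own tag,
// and the surrounding \s* trims leading and trailing whitespace.
$pattern = '~<td[^>]*>\s*(?<qty>\d+)\s*<strong>\s*(?<sku>[^<]+?)\s*</strong>\s*</td>'
         . '\s*<td[^>]*>\s*(?<desc>[^<]+?)\s*</td>~i';

$rows = array();
// PREG_SET_ORDER returns one array per match instead of one array per group.
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $rows[] = array(
            'qty'  => (int) $m['qty'],
            'sku'  => $m['sku'],
            'desc' => $m['desc'],
        );
    }
}

print_r($rows);
```

Because nothing is combined unless the pattern matched, a page without any matching rows simply yields an empty $rows array.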