I must need sleep or another Blue Monster - Easy Regex


phoenixx

I'm using the following line as an example of what I'm scraping (there are multiple instances on the page)

 

Scraping:

<table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table>

 

Output should be:

2120T-TOP-N and ESTELLE DINING TABLE TOP (W/2X18"L) respectively.

 

Here's the code I'm using:

preg_match_all('/<table cellspacing=0>.*?<tr><td valign=top; nowrap> <strong>  .*?<\/strong>.*?<\/strong> <\/td><td valign=top>.([^"]*)<\/td><\/tr><\/table>/is',$data2,$out2);
$a = array_combine($out2[1], $out2[2]);
foreach($a as $b=>$c){
echo "<b>Product Number: </b>" . $b . "    |    <b>Description: </b>" . $c . "<br>";
}

 

As you might guess I'm getting the following error:

Warning: array_combine() expects parameter 2 to be array, null given in /home/xxxxxx/public_html/sandbox/scraper.php on line 33

 

Warning: Invalid argument supplied for foreach() in /home/xxxxxx/public_html/sandbox/scraper.php on line 34

 

 

Any help would be greatly appreciated.  I will reward you with a full night's sleep before you leave this earth.

Link to comment
Share on other sites

Here's my take on it:

 

$str = '<table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table>';
preg_match_all('#<td\b(?:(?!>).*?(2120T-TOP-N)).*?<td\b(?:(?!>).*?(ESTELLE DINING TABLE TOP \(W/2X18"L\)))#s', $str, $matches);
for($i = 1, $total = count($matches); $i < $total; $i++){
   echo $matches[$i][0] . '<br />';
}

 

Output:

2120T-TOP-N
ESTELLE DINING TABLE TOP (W/2X18"L)

Link to comment
Share on other sites

The data I need to pull is dynamic.  That code is only useful if I already know what the value of the output is going to be.  Here's an example (and there are thousands of pages in the site I'm scraping):  Each SKU and each description need to be a separate value.  Each page (of the thousands) is just like this with different sku numbers and data on it.

 

Qty SKU #

  1  2120T-TOP-N ESTELLE DINING TABLE TOP (W/2X18"L)

  1  2120T-LEG-N DOUBLE PEDESTAL TABLE BASE 31.5"H

  2  2120A-ASSEMBLED ESTELLE ARM CHAIR 44"H

  4  2120S-ASSEMBLED ESTELLE SIDE CHAIR 44"H

  1  2120-B-N ESTELLE BUFFET 33"H

  1  2120-H-N ESTELLE HUTCH 56 3/4"H

Link to comment
Share on other sites


Yeah, I realised I should have included what I am about to ask in my initial post, but the edit window timed out, so I could not edit it.

Is your SKU always going to start with 212?

 

By the way, for future reference, please provide multiple sample results like this in your initial post, as this will help others solve the issue much more easily (I can only work with what you give me). Not many people know how to ask about or explain regex problems and solutions properly (case in point: this thread).

Link to comment
Share on other sites

Here's a sample of the full code I'm scraping.  No, unfortunately there are no data consistencies between pages or items on a page other than the html structure.  Sorry about the confusion.  I appreciate the help.

 

<p class="smallTitles" align="center">SETD2120 ESTELLE DINING</p><table width="625" border="0" cellspacing="0" cellpadding="0"><tr><td colspan="3" align="middle" valign="middle"><img src="/wc/Pictures/2120.jpg"></td></tr><tr><td width="300" align="left" valign="top">Bookmatched cherry veneer double pedestal table with hand carved solid wood apron.  Oversized back chair with brass nailheads, hand carved solid wood arms and pillaster legs.</td><td width="22" align="left" valign="top"> </td><td width="350" align="left" valign="top"><span class="catalogDesc"><Strong>SETD2120</strong> ESTELLE DINING<br /><br /><br />Qty SKU # <br /><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120T-LEG-N</strong> </td><td valign=top>DOUBLE PEDESTAL TABLE BASE 31.5"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   2<strong>  2120A-ASSEMBLED</strong> </td><td valign=top>ESTELLE ARM CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   4<strong>  2120S-ASSEMBLED</strong> </td><td valign=top>ESTELLE SIDE CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120-B-N</strong> </td><td valign=top>ESTELLE BUFFET 33"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120-H-N</strong> </td><td valign=top>ESTELLE HUTCH 56 3/4"H</td></tr></table></td></span></td></tr></table><p> </p>

Link to comment
Share on other sites


Wow.. ok.. this makes for quite the slippery slope to navigate.. I'll cut and paste this string and reformat it to make heads or tails of it, and see what I come up with...

Link to comment
Share on other sites

Ok, so here is what I came up with (explanations to follow):

 

error_reporting(E_ALL); // keep this here to check that your code doesn't cough up warnings and / or errors.
$str = <<<DATA
<p class="smallTitles" align="center">SETD2120 ESTELLE DINING</p><table width="625" border="0" cellspacing="0" cellpadding="0"><tr><td colspan="3" align="middle" valign="middle"><img src="/wc/Pictures/2120.jpg"></td></tr><tr><td width="300" align="left" valign="top">Bookmatched cherry veneer double pedestal table with hand carved solid wood apron.  Oversized back chair with brass nailheads, hand carved solid wood arms and pillaster legs.</td><td width="22" align="left" valign="top"> </td><td width="350" align="left" valign="top"><span class="catalogDesc"><Strong>SETD2120</strong> ESTELLE DINING<br /><br /><br />Qty SKU # <br /><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120T-LEG-N</strong> </td><td valign=top>DOUBLE PEDESTAL TABLE BASE 31.5"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   2<strong>  2120A-ASSEMBLED</strong> </td><td valign=top>ESTELLE ARM CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   4<strong>  2120S-ASSEMBLED</strong> </td><td valign=top>ESTELLE SIDE CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120-B-N</strong> </td><td valign=top>ESTELLE BUFFET 33"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120-H-N</strong> </td><td valign=top>ESTELLE HUTCH 56 3/4"H</td></tr></table></td></span></td></tr></table><p> </p>
DATA;

$arr = array();
$arrTemp = preg_split('#<td[^>]*>#', $str);
for($i = 0, $total = count($arrTemp); $i < $total; $i++){
   $arrTemp[$i] = strip_tags(trim(preg_replace('#(?: )+#', ' ', $arrTemp[$i])));
   $arrTemp[$i] = preg_replace('~[^\w()#]+$~', '', $arrTemp[$i]);
   if(!empty($arrTemp[$i])){
      $arr[] = $arrTemp[$i];
   }
}
echo '<pre>'.print_r($arr, true);

 

Output (via print_r):

Array
(
    [0] => SETD2120 ESTELLE DINING
    [1] => Bookmatched cherry veneer double pedestal table with hand carved solid wood apron.  Oversized back chair with brass nailheads, hand carved solid wood arms and pillaster legs
    [2] => SETD2120 ESTELLE DININGQty SKU #
    [3] => 1 2120T-TOP-N
    [4] => ESTELLE DINING TABLE TOP (W/2X18"L)
    [5] => 1 2120T-LEG-N
    [6] => DOUBLE PEDESTAL TABLE BASE 31.5"H
    [7] => 2 2120A-ASSEMBLED
    [8] => ESTELLE ARM CHAIR 44"H
    [9] => 4 2120S-ASSEMBLED
    [10] => ESTELLE SIDE CHAIR 44"H
    [11] => 1 2120-B-N
    [12] => ESTELLE BUFFET 33"H
    [13] => 1 2120-H-N
    [14] => ESTELLE HUTCH 56 3/4"H
)

 

What I've done is take a look at that string and, for my own sake, partially reformat it so I could see what is going on. The only pattern I can see from a structural standpoint is that all the 'meat and potatoes' you seek seems to be found within the <td> tags.

So to save some vertical space, I simply placed your sample string into a heredoc (instead of the partially formatted version I worked from).

I then used preg_split to break everything apart, using the opening <td ...> tags as my pattern. Within each piece, every run of consecutive non-breaking spaces is replaced with a single literal space, and the result is trimmed and then stripped of all tags via strip_tags().

After all was said and done, I was still left with gaps and stray characters at the end of some pieces (presumably left behind by the removed tags), so I used preg_replace one last time to remove anything from the end that is not a word character, a bracket or the # character. If you look at the above output, some of those items end with non-word characters such as ) or #. Had I simply stripped the ending with \W+, those would have been removed too, so I had to 'protect' them from being wiped out. If, in your scraping, you find other characters being wiped off the end, simply add them to the [^\w()#] part of the last preg_replace line. Since it is a character class, you will not need to escape brackets, dots and such; if you add a dash, however, make sure it is the very first character after the ^ or the very last. Lastly, I put all the non-empty array elements into the final $arr variable.
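To make the end-of-string cleanup concrete, here is a minimal sketch wrapped in a hypothetical helper, cleanCell() (the name and the sample inputs are made up for illustration, they are not from the scraped site). It assumes you also want to keep a trailing dash, so the dash is placed last in the character class:

```php
<?php
// Hypothetical helper: strip tags, then drop trailing characters that are
// not word characters, brackets, '#' or '-' (the dash sits last in the class
// so it is treated as a literal dash, not as a range).
function cleanCell($cell)
{
    $text = strip_tags(trim($cell));
    return preg_replace('~[^\w()#-]+$~', '', $text);
}

echo cleanCell('<strong>Qty SKU #</strong> <br />'), "\n"; // Qty SKU #
echo cleanCell('carved solid wood arms. '), "\n";            // carved solid wood arms
echo cleanCell('ITEM 12- ..'), "\n";                          // ITEM 12-
```

Note that the trailing full stop in the second call is removed, which is the same reason the description in the print_r output above loses its final dot.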

 

Now, to display them in pairs, you could add the following code:

for($i = 0, $total = count($arr); $i < $total; $i++){
   echo $arr[$i];
   if($i % 2){
      echo '<br />';
   } else {
      echo '<br />----------------------------------<br />';
   }
}

 

This checks whether the index leaves a remainder of 1 when divided by 2, i.e. whether it is odd (true). After an odd-indexed entry it echoes a plain line break; after an even-indexed entry it echoes a line break followed by the separator line. (I was going to do this in ternary operator notation, but opted not to; not really important.)
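For what it's worth, the ternary version mentioned above could look something like this. It is only a sketch: the sample array is made up (with a heading at index 0 so the SKUs land on odd indexes, as in the real output), and it builds the markup in a hypothetical $html variable before echoing it:

```php
<?php
// Same pairing loop written with the ternary operator: odd indexes get a
// plain line break, even indexes get the line break plus the separator.
$arr = array('Heading', 'SKU-A', 'Description A', 'SKU-B', 'Description B'); // made-up sample data
$separator = '<br />----------------------------------<br />';

$html = '';
foreach ($arr as $i => $value) {
    $html .= $value . (($i % 2) ? '<br />' : $separator);
}
echo $html;
```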

 

Things to be mindful of..

 

  • Poorly structured tables in the code you scrape can wreak havoc with the results.
  • If the content you seek is lumped together within a single <td> tag, I cannot differentiate it, so you will not get correct results if this occurs.
  • An odd total of entries within the table code will leave an 'odd one out' in the listing if you do care about grouping them in pairs (how do you pair up 13 or 15 entries? The above example, with its 15 entries, shows this).
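One possible way to deal with the 'odd one out' is to pair the cells with array_chunk() and set aside any incomplete group instead of guessing. This is a rough sketch only: the sample cells are made up, and it assumes the list already starts at the first SKU cell (for example after dropping the lead-in entries with array_slice()):

```php
<?php
// Made-up sample cells with an odd count: the last SKU has no description.
$cells = array('1 SKU-A', 'Description A', '2 SKU-B', 'Description B', '1 SKU-C');

$pairs = array();
$leftover = array();
foreach (array_chunk($cells, 2) as $chunk) {
    if (count($chunk) === 2) {
        $pairs[] = array('sku' => $chunk[0], 'description' => $chunk[1]);
    } else {
        $leftover[] = $chunk[0]; // incomplete group: report it rather than guess
    }
}

print_r($pairs);
print_r($leftover);
```

Anything that ends up in $leftover is a hint that the page did not follow the expected two-cells-per-row layout and probably needs a look by hand.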

 

Well, that's about it for me. It's all in your hands now. Hope this is what you were looking for.

 

Cheers

Link to comment
Share on other sites

try

<?php
$test = '<p class="smallTitles" align="center">SETD2120 ESTELLE DINING</p><table width="625" border="0" cellspacing="0" cellpadding="0"><tr><td colspan="3" align="middle" valign="middle"><img src="/wc/Pictures/2120.jpg"></td></tr><tr><td width="300" align="left" valign="top">Bookmatched cherry veneer double pedestal table with hand carved solid wood apron.  Oversized back chair with brass nailheads, hand carved solid wood arms and pillaster legs.</td><td width="22" align="left" valign="top"> </td><td width="350" align="left" valign="top"><span class="catalogDesc"><Strong>SETD2120</strong> ESTELLE DINING<br /><br /><br />Qty SKU # <br /><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120T-TOP-N</strong> </td><td valign=top>ESTELLE DINING TABLE TOP (W/2X18"L)</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120T-LEG-N</strong> </td><td valign=top>DOUBLE PEDESTAL TABLE BASE 31.5"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   2<strong>  2120A-ASSEMBLED</strong> </td><td valign=top>ESTELLE ARM CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   4<strong>  2120S-ASSEMBLED</strong> </td><td valign=top>ESTELLE SIDE CHAIR 44"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120-B-N</strong> </td><td valign=top>ESTELLE BUFFET 33"H</td></tr></table><table cellspacing=0><tr><td valign=top; nowrap>   1<strong>  2120-H-N</strong> </td><td valign=top>ESTELLE HUTCH 56 3/4"H</td></tr></table></td></span></td></tr></table><p> </p>';
$pattern = '~<table cellspacing=0><tr><td valign=top; nowrap>.*?<strong>( )*([^<]+)</strong>[^<]+</td><td valign=top>([^<]+)</td></tr></table>~';
preg_match_all($pattern, $test, $out);
$out = array_combine($out[2], $out[3]);
print_r($out);
?>
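
One thing to watch with array_combine(): if the same SKU appears twice on a page, the later description overwrites the earlier one, because array keys are unique. A possible variation, sketched here under assumptions, collects the rows with PREG_SET_ORDER instead. The two sample rows and their SKUs are made up, plain spaces stand in for whatever padding the real page uses, and the pattern is a loosened take on the one above rather than a drop-in replacement:

```php
<?php
// Two made-up rows in the same shape as the sample markup, sharing one SKU.
$html = '<table cellspacing=0><tr><td valign=top; nowrap> 1<strong> SKU-A</strong> </td><td valign=top>FIRST ITEM</td></tr></table>'
      . '<table cellspacing=0><tr><td valign=top; nowrap> 2<strong> SKU-A</strong> </td><td valign=top>SECOND ITEM</td></tr></table>';

$pattern = '~<td valign=top; nowrap>.*?<strong>\s*([^<]+)</strong>[^<]*</td><td valign=top>([^<]+)</td>~';

// PREG_SET_ORDER groups the captures per match: $row[1] is the SKU,
// $row[2] is the description.
preg_match_all($pattern, $html, $rows, PREG_SET_ORDER);

$items = array();
foreach ($rows as $row) {
    $items[] = array('sku' => trim($row[1]), 'description' => trim($row[2]));
}
print_r($items);
```

Both rows survive here, whereas array_combine() would leave a single entry for SKU-A.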

Link to comment
Share on other sites
