
[SOLVED] Trick or Treat. Instead of Candy Can I Get Some Help With This Expression?


Modernvox


Just wondering if I am using the proper expression here for the following string:

sale-k9zsn-1446270798@craigslist.org //This is the string I am trying to copy.

 

'#<sale-([a-z0-9]?/\d{10}.html)">#'

 

That pattern doesn't work because a character class such as [a-z0-9] matches a single character: it checks whether the character at the current position is one of those listed inside the square brackets. So [a-z0-9]? means: at the current location in the source string, check whether that one character is a-z or 0-9, and treat it as optional. After the pattern matches sale-k (the k is what the character class matches), it moves on from the character class to the rest of the pattern, which is a forward slash followed by 10 digits and .html. Since there is no forward slash after sale-k, the pattern ultimately fails. Garethp provided the solution you seek.
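To make the quantifier point concrete, here is a minimal sketch. It assumes a hyphen in place of the forward slash so that only the ? versus + difference is being tested; the patterns are illustrative and not the ones from the thread.

```php
<?php
// Sample string from the question.
$str = 'sale-k9zsn-1446270798@craigslist.org';

// [a-z0-9]? consumes at most one character, so the hyphen would have to
// sit right after "sale-k" (or right after "sale-"). It does not.
$optional = preg_match('#sale-[a-z0-9]?-\d{10}#', $str);

// [a-z0-9]+ consumes the whole run "k9zsn" before the hyphen is tried.
$oneOrMore = preg_match('#sale-[a-z0-9]+-\d{10}#', $str, $m);

var_dump($optional);   // int(0)
var_dump($oneOrMore);  // int(1)
echo $m[0], "\n";      // sale-k9zsn-1446270798
```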

 

On a side note, I noticed that your pattern contains < and >. I can only assume you require these because there might be instances of sale-.... elsewhere that fit the pattern but aren't enclosed in < and >? If not, you can omit them from the pattern altogether. Yet another solution could involve not using captures at all:

 

Example:

<?php
$str = '<sale-k9zsn-1446270798@craigslist.org>';
preg_match('#<sale-\K[a-z0-9]+-\d+@craigslist\.org(?=>)#', $str, $match);
echo $match[0]; // Output: k9zsn-1446270798@craigslist.org
?>

 

I have to amend the \K blog article I wrote, chiefly because it turns out \K is an escape sequence and not an assertion after all, but the article still reflects the benefit of this rarely used sequence. By combining it with the lookahead assertion (?=>) at the end, we eliminate the need to capture using parentheses and simply use what is ultimately kept within the base pattern match ($match[0]).
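For comparison, here is a small sketch that runs a capturing pattern and the \K pattern side by side on the same string. The capturing pattern is my own assumption of what the equivalent would look like, not one posted in the thread.

```php
<?php
$str = '<sale-k9zsn-1446270798@craigslist.org>';

// Capturing version: the wanted text ends up in group 1.
preg_match('#<sale-([a-z0-9]+-\d+@craigslist\.org)>#', $str, $withCapture);

// \K version: everything matched before \K is dropped from the reported
// match, and the lookahead (?=>) checks for ">" without consuming it.
preg_match('#<sale-\K[a-z0-9]+-\d+@craigslist\.org(?=>)#', $str, $withK);

echo $withCapture[0], "\n"; // <sale-k9zsn-1446270798@craigslist.org>
echo $withCapture[1], "\n"; // k9zsn-1446270798@craigslist.org
echo $withK[0], "\n";       // k9zsn-1446270798@craigslist.org
```

Both routes give the same text; the only difference is which index of the result array holds it.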


lol..Yeah I'm stuck on stupid!

 

<sale-([a-z0-9]?/\d{10}@craigslist.org)">#  I mean!

 

Here are a few questions. You can either answer them or just give them some thought on the way to arriving at a solution.

 

  • The angled brackets, are they supposed to be part of the string you are matching, or delimiters around the regular expression?
  • Do you need to capture the "k9zsn-1446270798@craigslist.org" bit or the entire email address?
  • You don't need the forward slash before the 10-digit number sequence do you? You do instead need a dash/hyphen there?
  • Do you need to match zero-or-one character for the "k9zsn" part, or more-than-one? Currently you're doing the former where you probably want the latter: [a-z0-9]+
  • Do you want to match "any non-newline" character between "craigslist" and "org", or literally a dot? Currently you do the former, where you want the latter: craigslist\.org
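The last question, about the dot, is easy to check with a quick sketch. The look-alike string below is made up for the test; preg_quote() is shown as one way to escape literal text without doing it by hand.

```php
<?php
$unescaped = '#craigslist.org#';
$escaped   = '#craigslist\.org#';

// A made-up string with "X" where the dot should be.
$lookalike = 'sale-k9zsn-1446270798@craigslistXorg';

var_dump(preg_match($unescaped, $lookalike)); // int(1): "." accepted the "X"
var_dump(preg_match($escaped, $lookalike));   // int(0): "\." needs a literal dot

// preg_quote() escapes regex metacharacters in literal text.
$quoted = preg_quote('craigslist.org', '#');
echo $quoted, "\n"; // craigslist\.org
```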

 

 

As for the \K and lookahead approach, I think this may be over-complicating things for the moment. That said, do let me know if/when you revise the article. :shy:


If you are referring to the pattern being over-complicated, it's merely an alternative solution that doesn't require capturing (assuming the ultimate goal is to fetch 'k9zsn-1446270798@craigslist.org'). While \K and the lookahead assertion add more to the pattern, once the OP understands them it isn't that complicated, imo. Granted, there's no real harm in using a capture either. Many ways to skin a cat, they say.

 

And yes, I'll get around to amending the article (not sure when.. but it's a matter of when, not if) and will let you know.


Hey Salathe' How have you been? 

I will answer your questions as well as respond to others who have joined in this thread to help me out.

  • I need to capture the whole email address.
  • All email addresses begin with <sale, then have a combination of 5 characters and 10 numbers, and end with craigslist.org.

I have included two screenshots of what I have so far: the email addresses and the links to open.

Here is my current screen/progress:

http://i266.photobucket.com/albums/ii246/Pencilman_2008/links-3.jpg

 

Here is a sample of what a few email links look like:

job-brhfx-1446929970@craigslist.org

job-gqfvk-1446491529@craigslist.org

 

Now, not to confuse anyone, but I want to open the links one by one and check whether the email address is within each. I haven't managed this yet because, as you can see in my screenshot, there is just one link open and I don't know how to close it and open the next one. I don't even know if this is possible to do.

 

I have been trying this for 2 weeks straight.

 


Well, you have said in more than one place that the e-mail starts with <sale, yet in your example it doesn't. If the string you're trying to match doesn't have a less-than sign at the start, the pattern will not match. In fact, the examples you gave don't even begin with sale; they begin with job... It's going to be practically impossible for us to come up with an accurate pattern without knowing exactly what you're trying to match.


Yeah. This project has been all over the place thus far. Confusing even myself.

All emails start with sale:

sale-cd46s-1448112725@craigslist.org

sale-as7r3-1448111272@craigslist.org

sale-cftfz-1448110233@craigslist.org  //You get the idea.

 

My 2nd foreach loop is doing nothing because $link is not an array.

I currently have a list of links (as I have provided above on Photobucket) and "one" open link at the bottom of my screen. This open link shows the "Golden Egg", I mean the email address I want to put in my DB.

I believe I have the correct regex, but not the right preg_match parameters?

 

Thanks again,

 

Oh yeah, even after I capture this email, how will I open the next link and grab that email?


No, you do not have the correct regex. As far as I can tell there is no 'less than' (<) character anywhere in the text you're trying to match, so there shouldn't be one in the regular expression.

 

Ok. So I was off by one character... I eliminated that <. That doesn't solve the problem anyhow.

I get this error when running my code: Warning: preg_match() expects parameter 2 to be string, array given in C:\xampp\htdocs\test5.php on line 35

 



<?php

    function curlURL($url) { 
        $curl = curl_init(); 
        curl_setopt($curl, CURLOPT_URL, $url); 
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); 
        curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2'); 
        $output = curl_exec($curl); 
            return $output; 
    } 
     
   $curlResults = curlURL("http://southcoast.craigslist.org/sss/"); 
   $pattern = '#<a href="(/[a-z]{3}/\d{10}\.html)">#';
   preg_match_all( $pattern, $curlResults, $matches);

echo "<pre>\n";
echo "Links:\n\n";
foreach ($matches[1] as $link):
   echo "\t" . '<a href="' . $link . '" target="_BLANK">' . $link . '</a>' . "\n";
endforeach;
echo '</pre>';
echo file_get_contents("http://southcoast.craigslist.org".$link);
$pattern = '~sale-([a-z0-9]+-\d+@craigslist\.org)~';  //This is the attempted match for the email
preg_match( $pattern, $matches);
foreach ($matches[1] as $address):

   $dbx= mysql_connect("localhost", "root", "");   //include before any database implematation
   if (!$dbx)
   {
      die('Could not connect: ' . mysql_error());
   }
   
   mysql_select_db("craigslist", $dbx);
   mysql_query("INSERT INTO `addresses` (`sale_items`) VALUES ('$address')") or mysql_error();
      
   mysql_close($dbx);
endforeach;
?>
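The warning comes from the call preg_match( $pattern, $matches): the second argument has to be the string to search, and here it is the array left over from preg_match_all. A minimal sketch of the argument order follows, using a made-up HTML string in place of a downloaded page:

```php
<?php
// Stand-in for the HTML of one listing page (made up for this sketch).
$page = '<a href="mailto:sale-cd46s-1448112725@craigslist.org">reply</a>';

$pattern = '~sale-[a-z0-9]+-\d+@craigslist\.org~';
$address = null;

// preg_match($pattern, $subject, $matches): the subject string comes second,
// and the array that receives the result comes third.
if (preg_match($pattern, $page, $found) === 1) {
    $address = $found[0]; // the whole match; no foreach is needed for one hit
    echo $address, "\n";  // sale-cd46s-1448112725@craigslist.org
}
```

Note that preg_match stops at the first hit, so each element of the result array is a string rather than a list. Looping over it with foreach, as the posted code does with $matches[1], would not work even with the arguments in the right order.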


I believe this basically does everything you're after...

 

$links = fetch_links("http://southcoast.craigslist.org/sss/");
$sets = array();
foreach($links as $link) {
   $sets[] = fetch_email($link);
}

foreach($sets as $set) {
   // sql / mailing / whatever you want to do for each set
   // as an example

   echo "Link = " . $set['link'] ", E-mail = " . $set['emai'] . "<br/>";
}

function fetch_links($page_url) {
   $pattern = '#<a href="(/[a-z]{3}/\d{10}\.html)">#';
   $page = file_get_contents($page_url);
   preg_match_all($pattern, $page, $matches);
   return $matches[1];
}

function fetch_email($page_link) {
   $pattern = '#(sale-[a-z0-9]+-\d+@craigslist\.org)#';
   $page = file_get_contents("http://southcoast.craigslist.org" . $page_link);
   preg_match($pattern, $page, $out);
   return array('link'=>$page_link, 'email'=>$out[1]);
}
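One thing to watch in fetch_email: if a page has no matching address (a deleted posting, say), $out[1] will not exist. Here is a hedged sketch of a guard, using a hypothetical helper named extract_email that works on HTML already in hand, so it can be tried without touching the network:

```php
<?php
// Hypothetical helper (not from the thread): takes page HTML as a string.
function extract_email($html) {
    $pattern = '#sale-[a-z0-9]+-\d+@craigslist\.org#';
    if (preg_match($pattern, $html, $out) === 1) {
        return $out[0]; // whole match, "sale-" prefix included
    }
    return null; // no address on this page
}

echo extract_email('reply to sale-as7r3-1448111272@craigslist.org'), "\n";
// sale-as7r3-1448111272@craigslist.org

var_dump(extract_email('<p>This posting has been deleted.</p>'));
// NULL
```

The calling loop can then skip any result that comes back as null.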


I sure am grateful, but damn, you changed everything. I am trying to use cURL. I will look at this for a while, even though it is spitting out errors as of now.


Yes, in your original code you had one bit of cURL, but you also used file_get_contents. As far as I could tell you weren't using a single cURL feature that makes it advantageous over the simpler file_get_contents. If needed, the cURL can be added back in, but it'll add about 10 lines and not really improve anything.
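For anyone who does want cURL back, a rough sketch of what a drop-in fetch helper might look like is below. The function name fetch_page and the timeout values are assumptions for illustration; explicit timeouts are one thing cURL makes straightforward to set.

```php
<?php
// Hypothetical helper; the timeout values are arbitrary picks for the sketch.
function fetch_page($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);    // seconds allowed to connect
    curl_setopt($curl, CURLOPT_TIMEOUT, 15);          // seconds allowed for the whole request
    $body = curl_exec($curl);                         // false on failure
    return $body === false ? null : $body;
}
```

It could then stand in for file_get_contents inside fetch_links and fetch_email, with a null check on the result.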

 

I didn't check the code very thoroughly; I just refactored it to make more sense and be a lot simpler. I didn't use any methodology that wasn't in your original code, so I don't see why you should get any errors (bar spelling mistakes/typos).

 

Basic breakdown. The fetch_links function basically takes a URL you pass to it and uses the regex we came up with to fetch links from that page. That function is called with the first line of code. Then we simply loop through these links, passing them to fetch_email. The fetch_email function simply uses the other pattern we came up with to fetch the e-mail address from the page. It then returns it in a nicely formatted associative array in order to allow you to loop through it easier, associating the link with the email.

 

Edit: Just spotted a typo: in the following part I mistyped email, so it will throw a notice (Undefined index) on each iteration of the loop.

 

echo "Link = " . $set['link'] ", E-mail = " . $set['emai'] . "<br/>";

 


Actually, the error I am getting is Parse error: syntax error, unexpected T_STRING in C:\xampp\htdocs\ok.php on line 26, which is here:

 echo "Link = " . $set['link'] ", E-mail = " . $set['email'] . "<br/>";


Like I said, probably just typos. That line is also missing a full stop (the . operator) after $set['link'] to concatenate it. If you are getting the word Array in the database, then you are attempting to insert an array into the database.
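For reference, here is the line with both fixes applied (the dot after $set['link'] and the 'email' key spelled out). The $set array is sample data standing in for one element of $sets.

```php
<?php
// Sample data standing in for one element of $sets.
$set = array(
    'link'  => '/fuo/1449448316.html',
    'email' => 'sale-f9vrq-1449448316@craigslist.org',
);

// Each piece is joined with ".", including the one after $set['link'].
$line = "Link = " . $set['link'] . ", E-mail = " . $set['email'] . "<br/>";
echo $line, "\n";
```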


Yeah I added the dot in there, but what is happening now is the script is loading.....loading...and loading some more. It never stops loading? 

Is it something to do with the DB statements I have added?

 

<?php

$links = fetch_links("http://southcoast.craigslist.org/sss/");
$sets = array();
foreach($links as $link) {
   $sets[] = fetch_email($link);
}

foreach($sets as $set) {
   $dbx= mysql_connect("localhost", "root", "");   //include before any database implematation


if (!$dbx)
{
die('Could not connect: ' . mysql_error());
}

mysql_SELECT_db("craigslist", $dbx);
mysql_Query("INSERT INTO addresses (sale_items)
VALUES ('$set')");

mysql_close($dbx);


echo "Link = " . $set['link'] . ", E-mail = " . $set['email'] . "<br/>";

}

function fetch_links($page_url) {
   $pattern = '#<a href="(/[a-z]{3}/\d{10}\.html)">#';
   $page = file_get_contents($page_url);
   preg_match_all($pattern, $page, $matches);
   return $matches[1];
}

function fetch_email($page_link) {
   $pattern = '#(sale-[a-z0-9]+-\d+@craigslist\.org)#';
   $page = file_get_contents("http://southcoast.craigslist.org" . $page_link);
   preg_match($pattern, $page, $out);
   return array('link'=>$page_link, 'email'=>$out[1]);
}

?>


a.) It's likely to take a while to run: it needs to download a page for every link on the listing, and there are 100 of them. That means it has to download 100 pages and process the information.

 

b.) There is no point putting the mysql_connect statement inside the loop. This will slow down the whole process and you only need to do it once, so move it outside of the loop (same goes for the if statement checking for connection and also selecting the db).

 

c.) You are trying to insert $set into the database, and $set is an array. As the example output showed, you need to insert a single element of it instead: $set['link'] for the link, or $set['email'] for the e-mail address.

 

d.) mysql_close is fairly redundant, since the resources will be cleaned up by PHP automatically. It certainly doesn't need to be done for every link.

 

$dbx = mysql_connect("localhost", "root", "") or trigger_error("ERROR: " . mysql_error(), E_USER_ERROR);
mysql_select_db("craigslist", $dbx);

foreach($sets as $set) {
   $sql = "INSERT INTO addresses (sale_items) VALUES ('".$set['link']."')";
   mysql_query($sql) or trigger_error("SQL: $sql, ERROR: " . mysql_error(), E_USER_ERROR);
}


That worked out well, but why is it showing links as well? I thought we were telling the script to extract just the email address. On another note, it's actually only copying the link to the DB instead of the email: Link = /fuo/1449448316.html

It's spitting out this:

Link = /fuo/1449448316.html, E-mail = sale-f9vrq-1449448316@craigslist.org

Link = /atq/1449441550.html, E-mail = sale-bmp4h-1449441550@craigslist.org

Link = /for/1449434550.html, E-mail = sale-yggcb-1449434550@craigslist.org

Link = /for/1449440349.html, E-mail = sale-9tpdj-1449440349@craigslist.org

Link = /fuo/1449439768.html, E-mail = sale-amqpb-1449439768@craigslist.org

Link = /ele/1449438357.html, E-mail = sale-syt7a-1449438357@craigslist.org

Link = /rvs/1449425290.html, E-mail = sale-fszmj-1449425290@craigslist.org
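On the two points raised here: the links show up because the echo line prints both $set['link'] and $set['email'], and the link lands in the table because the INSERT in the snippet uses $set['link']. Swapping in $set['email'] should store the address instead. Below is a sketch of that using PDO with a prepared statement rather than the mysql_* functions, which were removed in PHP 7. The helper name save_emails and the connection details are assumptions; the table and column names are taken from the thread.

```php
<?php
// Hypothetical helper: $pdo is an already-open PDO connection.
function save_emails(PDO $pdo, array $sets) {
    // Prepare once, execute per row; the placeholder keeps values out of the SQL text.
    $stmt = $pdo->prepare('INSERT INTO addresses (sale_items) VALUES (?)');
    $saved = 0;
    foreach ($sets as $set) {
        if (empty($set['email'])) {
            continue; // this page had no address
        }
        $stmt->execute(array($set['email'])); // the e-mail, not $set['link']
        $saved++;
    }
    return $saved;
}

// Assumed connection details, mirroring the local setup used in the thread:
// $pdo = new PDO('mysql:host=localhost;dbname=craigslist', 'root', '');
// echo save_emails($pdo, $sets), " addresses saved\n";
```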
