[SOLVED] Small string..can't get the right combo to pull it in.


Admittedly, your initial post doesn't explain things fully. Are you specifically looking for ".html" at the end? Could it be any file extension? Is the number of directories always the same, or can it differ on a case-by-case basis? Will the file name always contain only numbers, or a mix of letters and/or numbers?

 

It's helpful to provide multiple samples of what you are sifting through, demonstrating a variety of strings, explaining what will be consistent, what might change, and what exactly you are trying to match / capture.

 

Going by the info provided so far, this is what I 'assume' is what you are looking for (using preg_match as an example):

 

$str = '/mcy/1435258204.html';
preg_match('#^/[a-z]+/[0-9]+\.html$#i', $str, $match);
echo $match[0]; // Output: /mcy/1435258204.html

 

But again, without much explanation, the conditions aren't clear, to be honest. You can read more about helpful suggestions here.
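Since the pattern above is only a guess at the format, it may help to check whether preg_match actually found something before reading $match[0]. A minimal sketch of that guard follows; the second sample string is made up here purely to show the non-matching case.

```php
<?php
// The first sample is from the post above; the second is a hypothetical
// non-matching value added for illustration.
$samples = ['/mcy/1435258204.html', '/mcy/not-a-number.html'];

$results = [];
foreach ($samples as $str) {
    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match('#^/[a-z]+/[0-9]+\.html$#i', $str, $match) === 1) {
        $results[$str] = $match[0];
    } else {
        $results[$str] = null;
    }
}

foreach ($results as $input => $found) {
    echo $input . ' => ' . ($found ?? 'no match') . "\n";
}
```

Comparing against 1 rather than relying on truthiness keeps the "no match" and "error" cases out of the success branch.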

 

The / is always at the beginning and the html is always at the end. There is a vertical list with more, which I will tackle once I can actually wrap my head around the regex. I just finished reading O'Reilly's Mastering Regular Expressions vol. 2, but the biggest problem I am having is knowing how to wrap the regex statement in general. I am seeing folks use different characters and it is confusing the hell out of me.

 

Is it (....) is it ~...~ is it "...."  is it '...'

 

There must be a standard character to close this function.

The if statement is enclosed in { ... }.

 

You know what i'm saying?

In PCRE regular expressions the pattern must be enclosed between delimiters. A large selection of characters can serve as delimiters, with alphanumeric characters being the biggest exception. Generally speaking, people just choose a character that is unlikely to appear in their pattern, as this reduces the amount of escaping required. In your original post you used the tilde character, whereas nrg_alpha used the hash. It really makes no great difference. Whilst the opening and closing delimiters are generally the same character, I started a recent thread that discussed the fact that you can also use a couple of 'sets' as the delimiters, such as {} and <>. The characters that come after the closing delimiter are what are called pattern modifiers.
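To make the point about delimiters concrete, here is a small sketch that writes one and the same pattern with four different delimiter choices and runs each against the sample string from earlier in the thread. The trailing i on each pattern is a pattern modifier (case-insensitive matching).

```php
<?php
$str = '/mcy/1435258204.html';

// The same pattern, wrapped in four different delimiters.
$patterns = [
    '~^/[a-z]+/\d+\.html$~i',    // tilde
    '#^/[a-z]+/\d+\.html$#i',    // hash
    '{^/[a-z]+/\d+\.html$}i',    // bracket-style pair
    '/^\/[a-z]+\/\d+\.html$/i',  // slash: the inner slashes must be escaped
];

$outcomes = [];
foreach ($patterns as $pattern) {
    // 1 means the pattern matched, 0 means it did not.
    $outcomes[] = preg_match($pattern, $str);
}

var_export($outcomes);
```

All four behave identically; the only practical difference is how much escaping the pattern body needs.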


Love you, Cags, but you just confused the shit out of me. I'm going to go read Mastering Regular Expressions again :wtf:

With regards to delimiters, the only thing to remember is that they can be any non-whitespace, non-alphanumeric ASCII character other than a backslash (or null byte, apparently). You can read up on delimiters here.

 

With regards to (...), "...", '...' etc.. I'm not sure I follow.. perhaps posting a small portion of code you are trying to use with regards to regex will help out.

 

NOTE: cags basically cut and pasted what I linked to in the PHP manual.. D'oh!

I'd like to add that I don't want to be limited to grabbing one link, so is creating the $str variable necessary for this?

There are about 50 links on each page; I just used the first link as an example.

Cags helped me out with a similar preg to match email addresses. This time I'm attempting to grab some links that I can open to grab that email address (the one Cags assisted me with :)

No, $str is only an example I used. If you want multiple links, you could use preg_match_all. Granted, when parsing HTML it's typically wiser to use DOM for this kind of thing (but that's an entirely different ball of wax).
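For what the DOM route might look like, here is a rough sketch using PHP's DOMDocument. The HTML fragment and the final filtering pattern are assumptions made up for illustration, not taken from the page being scraped.

```php
<?php
// Hypothetical HTML fragment for illustration.
$html = '<p><a href="/mcy/1435866184.html">First</a></p>'
      . '<p><a href="/mcy/1435864882.html">Second</a></p>'
      . '<p><a href="/mcy/">Category</a></p>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Keep only hrefs that end in digits followed by ".html".
    if (preg_match('#/\d+\.html$#', $href)) {
        $links[] = $href;
    }
}

print_r($links);
```

The parser takes care of attribute order, quoting and whitespace, so the regex only has to judge the href value itself.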

There must be a standard character to close this function.

Generally forward slashes (/), though not if they occur within the pattern (common with parsing URIs or HTML). In the latter case, common alternatives are tilde (~) or hash/pound (#).

 

E.g.

/foobar\.html/i
/\/foo\/bar\.html/  <-- ugly
~/foo/bar\.html~

I can't edit my previous post to provide an example of grabbing the links that you want, so it'll have to be a double-post (sorry if you guys frown on that!).

 

$html    = file_get_contents($url);
$pattern = '#<a href="(/mcy/\d{10}\.html)">#';

preg_match_all($pattern, $html, $matches);

echo "Links:\n";
foreach ($matches[1] as $link) {
    echo $link . "\n";
}

 

Will output something like (shortened to save scrolling):

 

Links:
/mcy/1435866184.html
/mcy/1435864882.html
/mcy/1435864500.html
...
/mcy/1435673391.html
/mcy/1435671439.html

 

Why thank you, SIR Salathe! Your time is, and always is, much appreciated, as well as your wisdom. By the way, how about sending some of that wisdom this way in the form of, let's say, a "brain swap"? ;D

 

Oh yeah, the mcy is not included in every string I want, so I need to 86 that part.

I'm not too sure I'd be up for a brain swap (though I'm sure yours is a lovely brain) but keep on posting questions and I'll keep posting replies (and maybe some answers).  ;D

You sure? I hear brain swapping is in! 

 

 

$html    = file_get_contents($url); 

 

Output = the text "Links:" ONLY, without the actual links.

 

<?php
function curlURL($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2');
    $output = curl_exec($curl);
    return $output;
}

$url = "http://southcoast.craigslist.org/tls/1432616932.html";
$html    = file_get_contents($url);
$pattern = '#<a href="(/mcy/\d{10}\.html)">#';

preg_match_all($pattern, $html, $matches);

echo "Links:\n";
foreach ($matches[1] as $link) {
    echo $link . "\n";
}


This is not working for me.

Show us the string you are trying to match (including any surrounding text).

 

Here are the links (only) that I am trying to grab:

 

<p><a href="/pts/1436241251.html">1989 Jeep Wrangler parts - $600 -</a><font size="-1"> (bergen county)</font> <span class="p"> pic</span> <<<i><a href="/pts/">auto parts</a></i></p>

 

<p><a href="/vgm/1436241144.html">Xbox 360 / Wii / PSP/iPhone Flashing - $30 -</a><font size="-1"> (Roselle)</font> <<<i><a href="/vgm/">video gaming</a></i></p>

<p><a href="/emd/1436239956.html">OVER 200 CASSETTE'S ROCK COUNTRY - $50 -</a><font size="-1"> (RANDOLPH)</font> <<<i><a href="/emd/">cds / dvds / vhs</a></i></p>

<p><a href="/bfs/1436240970.html">Business for Sale  - $195000 -</a><font size="-1"> (Newark)</font> <<<i><a href="/bfs/">business/commercial</a></i></p>

 

<p><a href="/ele/1436240954.html">SONY 27-inch TRINITRON Flat Screen (Great TV!) - $150 -</a><font size="-1"> (Belleville)</font> <span class="p"> pic</span> <<<i><a href="/ele/">electronics</a></i></p>

<p><a href="/hsh/1436240849.html">Wooden Doorway Gate - $9 -</a><font size="-1"> (Denville, NJ)</font> <span class="p"> pic</span> <<<i><a href="/hsh/">household items</a></i></p>

 

<p><a href="/pts/1436240687.html">Oldsmobile 1965-1966 gasket set -</a><font size="-1"> (Pequannock)</font> <span class="p"> pic</span> <<<i><a href="/pts/">auto parts</a></i></p>

<p><a href="/cto/1436240412.html">1994 Volvo 940 Sedan - $1200 -</a><font size="-1"> (Springfield, NJ)</font> <span class="p"> pic</span> <<<i><a href="/cto/">cars & trucks - by owner</a></i></p>

 

<p><a href="/hsh/1436239243.html">Pet Travel Kennel - $18 -</a><font size="-1"> (Parsippany, NJ)</font> <span class="p"> pic</span> <<<i><a href="/hsh/">household items</a></i></p>

 


foreach ($matches[1] as $link) {  //This part of the code is confusing me, why the [1] in there?

 

 

 

 

preg_match_all fills $matches with a multidimensional array. $matches[0] will contain all strings that match the entire pattern. So for example in the string you just provided $matches[0][0] will contain...

 

<a href="/pts/1436241251.html">

 

$matches[1] contains an array of all substrings matched by the first capture group (the content inside the first set of parentheses/brackets), so using your example again... $matches[1][0] contains...

 

/pts/1436241251.html
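The layout of $matches can be checked with a quick sketch like the one below. It uses two of the sample lines posted above (shortened) and assumes a pattern in which the section name is generalised to any three lowercase letters, so that /pts/ and /vgm/ both match.

```php
<?php
// Two of the sample lines from earlier in the thread, shortened.
$html = '<p><a href="/pts/1436241251.html">1989 Jeep Wrangler parts - $600 -</a></p>'
      . '<p><a href="/vgm/1436241144.html">Xbox 360 / Wii / PSP/iPhone Flashing - $30 -</a></p>';

// Assumed pattern: any three-letter section instead of a literal "mcy".
preg_match_all('#<a href="(/[a-z]{3}/\d{10}\.html)">#', $html, $matches);

echo $matches[0][0] . "\n"; // full match for the first link
echo $matches[1][0] . "\n"; // capture group 1 for the first link
echo $matches[1][1] . "\n"; // capture group 1 for the second link
```

Looping over $matches[1] therefore yields just the link paths, without the surrounding <a href="..."> text.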

The solution provided by salathe has a literal mcy in the string, which those links don't. You'd need to use something more like...

 

'#<a href="(/[a-z]{3}/\d{10}\.html)">#'

 

Ok, so / for the beginning (why not use ^?)

Then [a-z] is pretty self-explanatory

Then you have {3} (not sure I understand this one?)

Then you have /\ and I imagine \d{10} represents the 10 digits?

Finally you have \ but why?


Ok, I get the {3}, which is for 3 characters.

#<a href="(/[a-z]{3}/\d{10}\.html)">#

 

# - opening delimiter

<a href=" - literal string ie find this exact pattern

( - start a new capture group

/ - literal forward slash (as all the links start with a forward slash)

[a-z]{3} - 3 letters

/ - literal forward slash

\d{10} - a 10 digit number

\. - a full stop character (the backslash escapes it as it is a special character)

html - another literal string

) - close capture group

"> - yet more literal characters

# - ending delimiter
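The same breakdown can be kept inside the pattern itself. This is a sketch using the x (extended) modifier, which was not used anywhere in the thread: it makes PCRE ignore unescaped whitespace and treat # as the start of a comment. For that reason the delimiter is switched to ~ and the literal space in "<a href" is escaped.

```php
<?php
// Same pattern as above, annotated in place (x modifier is an addition here).
$pattern = '~
    <a\ href="      # literal text; the space is escaped because x mode skips whitespace
    (               # start capture group 1
      /[a-z]{3}     # slash, then exactly three lowercase letters
      /\d{10}       # slash, then exactly ten digits
      \.html        # literal dot followed by html
    )               # end capture group 1
    ">              # literal closing quote and angle bracket
~x';

// One of the sample lines from the thread, shortened.
$line = '<p><a href="/hsh/1436240849.html">Wooden Doorway Gate - $9 -</a></p>';

$found = preg_match($pattern, $line, $m);
echo $m[1] . "\n";
```

Note that the comments must not contain the delimiter character or a single quote, since either would end the pattern or the PHP string early.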

 

file_get_contents($url): not sure I like this method. It seems a bit much, considering it shows an error stating that it can't be empty?

 

Isn't it easier just to say $curlResults = curlURL("http://newjersey.craigslist.org/sss/");

  preg_match_all('#<a href="(/[a-z]{3}/\d{10}\.html)">#', $curlResults, $out);

  echo $out[1][0];
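Either way of fetching the page works, as long as the result is checked before it reaches the regex. Below is a rough sketch, not a definitive implementation: extractLinks is a hypothetical helper name, and the $curlResults string is a stand-in built from one of the sample lines so the example runs without a network call.

```php
<?php
/**
 * Hypothetical helper: pull every /xxx/1234567890.html link out of a page.
 * Returns an empty array when the download failed or nothing matched.
 */
function extractLinks($html): array
{
    // file_get_contents() and curl_exec() both return false on failure.
    if (!is_string($html) || $html === '') {
        return [];
    }
    preg_match_all('#<a href="(/[a-z]{3}/\d{10}\.html)">#', $html, $out);
    return $out[1];
}

// Stand-in for the value returned by curlURL(...).
$curlResults = '<p><a href="/ele/1436240954.html">SONY 27-inch TRINITRON</a>'
             . ' <i><a href="/ele/">electronics</a></i></p>';

// echo $out[1][0] would print only the first link; looping prints them all.
foreach (extractLinks($curlResults) as $link) {
    echo $link . "\n";
}
```

The "can't be empty" message from file_get_contents usually points at $url not being set at the time of the call, rather than at a problem with the function itself.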
