[SOLVED] Small string..can't get the right combo to pull it in.

Modernvox · October 24, 2009

/mcy/1435258204.html

Tried ('~^/[a-z0-9][^/]($html)~'

Can you get it?

Daniel0 · October 24, 2009

What are you trying to do? Extract the number?

Modernvox · October 24, 2009

No it's a link. I want the link.

nrg_alpha · October 24, 2009

Admittedly, your initial post doesn't explain things fully. Are you specifically looking for ".html" at the end? Could it be any file extension? Is the number of directories the same, or can it differ on a case-by-case basis? Will the file name always contain only numbers, or a mix of letters and/or numbers?

It's helpful to provide multiple samples of what you are sifting through, demonstrating a variety of strings, explaining what will be consistent, what might change, and what exactly you are trying to match / capture.

Going by the info provided so far, this is what I 'assume' is what you are looking for (using preg_match as an example):

$str = '/mcy/1435258204.html';
preg_match('#^/[a-z]+/[0-9]+\.html$#i', $str, $match);
echo $match[0]; // Output: /mcy/1435258204.html

But again, without much explanation, the conditions aren't clear, to be honest. You can read more about helpful suggestions here.

Modernvox · October 24, 2009

The / is always at the beginning and the html is always at the end. There is a vertical list with more, which I will tackle after I can actually wrap my head around the regex. I just finished reading O'Reilly's Mastering Regular Expressions vol. 2, but the biggest problem I am having is knowing how to wrap the regex statement in general. It seems I am seeing folks use different characters and it is confusing the hell out of me.

Is it (...), is it ~...~, is it "...", or is it '...'?

There must be a standard character to close this function.

The if statement is enclosed in {...}

You know what I'm saying?

cags · October 24, 2009

In PCRE regular expressions the pattern must be enclosed between delimiters. The delimiters can be chosen from a large selection of characters, with alphanumeric characters being the biggest exception. Generally speaking, people just choose a character that is unlikely to appear in their pattern, as this reduces the amount of escaping required. In your original post you used the tilde character, whereas nrg_alpha used the hash. It really makes no great difference. Whilst generally speaking the delimiters should be the same character, I started a recent thread that discussed the fact that you can also use a couple of 'sets' as the delimiters, such as {} and <>. The characters that are included after the closing delimiter are what are called pattern modifiers.
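
To make the delimiter and modifier point concrete, here is a minimal sketch. It reuses the sample string and pattern from earlier in the thread; the bracket-style variant and the $patterns / $results names are only illustrative.

```php
<?php
// The same pattern written with three different delimiter choices.
$str = '/mcy/1435258204.html';

$patterns = [
    '~^/[a-z]+/[0-9]+\.html$~i',  // tilde delimiters, "i" modifier after the closing one
    '#^/[a-z]+/[0-9]+\.html$#i',  // hash delimiters, same modifier
    '{^/[a-z]+/[0-9]+\.html$}i',  // bracket-style pair: opens with "{", closes with "}"
];

$results = [];
foreach ($patterns as $pattern) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $results[] = preg_match($pattern, $str);
}

print_r($results); // every entry should be 1
```

The quotes around each pattern are just PHP string syntax; the delimiters live inside the string, and anything after the closing delimiter is read as a modifier.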

Modernvox · October 24, 2009

Love you, cags, but you just confused the shit out of me. I'm going to go read Mastering Regular Expressions again :wtf:

nrg_alpha · October 24, 2009

With regards to delimiters, the only thing to remember is that they can be any non-alphanumeric, non-whitespace ASCII character other than a backslash (or a null byte, apparently). You can read up on delimiters here.

With regards to (...), "...", '...' etc., I'm not sure I follow. Perhaps posting a small portion of the code you are trying to use with the regex will help out.

NOTE: cags basically cut and pasted what I linked to in the PHP manual. D'oh!

Modernvox · October 24, 2009

I'd like to add that I don't want to be limited to grabbing one link, so is creating the $str variable necessary for this?

There are about 50 links on each page; I just used the first link as an example.

Cags helped me out with a similar preg to match email addresses. This time I'm attempting to grab some links that I can open to grab that email address (the one cags assisted me with).

nrg_alpha · October 24, 2009

No, $str is only an example I used. If you want multiple links, you could use preg_match_all. Granted, when parsing HTML it's typically wiser to use DOM for this kind of thing (but that's an entirely different ball of wax).
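
For comparison, a rough sketch of the DOM route using PHP's DOMDocument. The $html value is a made-up snippet in the shape of the listing markup posted later in this thread, and the filter pattern is an assumption about which hrefs are wanted.

```php
<?php
// Hypothetical sample markup standing in for a fetched listing page.
$html = '<p><a href="/pts/1436241251.html">Jeep parts</a> '
      . '<a href="/pts/">auto parts</a></p>';

// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely perfectly formed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Keep only hrefs shaped like /xxx/1234567890.html.
    // ^ and $ are fine here because $href is the whole attribute value.
    if (preg_match('#^/[a-z]{3}/\d{10}\.html$#', $href)) {
        $links[] = $href;
    }
}

print_r($links);
```

The parser finds the anchors regardless of attribute order or spacing, and the regex only has to validate one short string at a time.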

salathe · October 24, 2009

Modernvox wrote: "There must be a standard character to close this function."

Generally forward slashes (/), though not if they occur within the pattern (common with parsing URIs or HTML). In the latter case, common alternatives are tilde (~) or hash/pound (#).

E.g.

/foobar\.html/i
/\/foo\/bar\.html/  <-- ugly
~/foo/bar\.html~
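
A small sketch of why the delimiter choice matters for readability. The $subject string is invented for illustration; preg_quote() is the standard PHP helper that escapes regex metacharacters plus an optional delimiter.

```php
<?php
$subject = 'see /foo/bar.html for details';

// Slash delimiters: every "/" inside the pattern has to be escaped.
$withSlashes = '/\/foo\/bar\.html/';
// Tilde delimiters: the slashes can stay as they are.
$withTildes = '~/foo/bar\.html~';

preg_match($withSlashes, $subject, $a);
preg_match($withTildes, $subject, $b);

// preg_quote() can do the escaping when it is told which delimiter is in use.
$quoted = '/' . preg_quote('/foo/bar.html', '/') . '/';

var_dump($a[0] === $b[0]);          // both patterns match the same text
var_dump($quoted === $withSlashes); // the generated pattern equals the hand-escaped one
```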

salathe · October 24, 2009

I can't edit my previous post to provide an example of grabbing the links that you want, so it'll have to be a double-post (sorry if you guys frown on that!).

$html    = file_get_contents($url);
$pattern = '#<a href="(/mcy/\d{10}\.html)">#';

preg_match_all($pattern, $html, $matches);

echo "Links:\n";
foreach ($matches[1] as $link) {
    echo $link . "\n";
}

Will output something like (shortened to save scrolling):

Links:
/mcy/1435866184.html
/mcy/1435864882.html
/mcy/1435864500.html
...
/mcy/1435673391.html
/mcy/1435671439.html

Modernvox · October 24, 2009

Why, thank you, SIR Salathe! Your time is always much appreciated, as is your wisdom. By the way, how about sending some of that wisdom this way in the form of, let's say, a "brain swap"?

Oh yeah, the mcy is not included in every string I want, so I need to 86 that part.

salathe · October 24, 2009

I'm not too sure I'd be up for a brain swap (though I'm sure yours is a lovely brain) but keep on posting questions and I'll keep posting replies (and maybe some answers).

Modernvox · October 24, 2009

You sure? I hear brain swapping is in!

$html = file_get_contents($url);

Output: only the text "Links:", without the actual links.

<?php
function curlURL($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2');
    $output = curl_exec($curl);
    return $output;
}

$url = "http://southcoast.craigslist.org/tls/1432616932.html";
$html    = file_get_contents($url);
$pattern = '#<a href="(/mcy/\d{10}\.html)">#';

preg_match_all($pattern, $html, $matches);

echo "Links:\n";
foreach ($matches[1] as $link) {
    echo $link . "\n";
}

Modernvox · October 24, 2009

This is not working for me.

cags · October 24, 2009

Show us the string you are trying to match (including any surrounding text).

Modernvox · October 24, 2009

Here's the markup. I am trying to grab the links only:

<a href="/pts/1436241251.html">1989 Jeep Wrangler parts - $600 -</a> (bergen county) pic <<<a href="/pts/">auto parts</a>

<a href="/vgm/1436241144.html">Xbox 360 / Wii / PSP/iPhone Flashing - $30 -</a> (Roselle) <<<a href="/vgm/">video gaming</a>

<a href="/emd/1436239956.html">OVER 200 CASSETTE'S ROCK COUNTRY - $50 -</a> (RANDOLPH) <<<a href="/emd/">cds / dvds / vhs</a>

<a href="/bfs/1436240970.html">Business for Sale - $195000 -</a> (Newark) <<<a href="/bfs/">business/commercial</a>

<a href="/ele/1436240954.html">SONY 27-inch TRINITRON Flat Screen (Great TV!) - $150 -</a> (Belleville) pic <<<a href="/ele/">electronics</a>

<a href="/hsh/1436240849.html">Wooden Doorway Gate - $9 -</a> (Denville, NJ) pic <<<a href="/hsh/">household items</a>

<a href="/pts/1436240687.html">Oldsmobile 1965-1966 gasket set -</a> (Pequannock) pic <<<a href="/pts/">auto parts</a>

<a href="/cto/1436240412.html">1994 Volvo 940 Sedan - $1200 -</a> (Springfield, NJ) pic <<<a href="/cto/">cars & trucks - by owner</a>

<a href="/hsh/1436239243.html">Pet Travel Kennel - $18 -</a> (Parsippany, NJ) pic <<<a href="/hsh/">household items</a>

Modernvox · October 24, 2009

foreach ($matches[1] as $link) { // This part of the code is confusing me. Why the [1] in there?

cags · October 24, 2009

The solution provided by salathe has a literal mcy in the pattern, which those links don't. You'd need to use something more like...

'#<a href="(/[a-z]{3}/\d{10}\.html)">#'

cags · October 24, 2009

preg_match_all fills $matches with a multidimensional array. $matches[0] will contain all strings that matched the entire pattern. So, for example, in the string you just provided, $matches[0][0] will contain...

<a href="/pts/1436241251.html">

$matches[1] contains an array of all substrings captured by the first capture group (the content inside the first set of parentheses), so using your example again, $matches[1][0] contains...

/pts/1436241251.html
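
The layout of $matches can be checked with a short sketch. The two anchors below are copied from the sample markup posted above and trimmed to the link part; $count is just an illustrative name for the return value.

```php
<?php
$html = '<a href="/pts/1436241251.html">1989 Jeep Wrangler parts - $600 -</a> '
      . '<a href="/vgm/1436241144.html">Xbox 360 / Wii / PSP/iPhone Flashing - $30 -</a>';

$pattern = '#<a href="(/[a-z]{3}/\d{10}\.html)">#';

// The return value is the number of full matches; the details land in $matches.
$count = preg_match_all($pattern, $html, $matches);

echo $count, "\n";          // number of full matches
echo $matches[0][0], "\n";  // first full match, including <a href="...">
echo $matches[1][0], "\n";  // capture group 1 of the first match
echo $matches[1][1], "\n";  // capture group 1 of the second match
```

Looping over $matches[1] therefore walks the captured links only, which is why the earlier foreach uses the [1] index.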

Modernvox · October 24, 2009

OK, so / for the beginning (why not use ^?)

Then [a-z] is pretty self-explanatory.

Then you have {3} (not sure I understand this one?)

Then you have /\ and I imagine \d{10} represents the 10 digits?

Finally you have \ but why?

Modernvox · October 24, 2009

OK, I get the {3}, which is for 3 characters.

cags · October 24, 2009

#<a href="(/[a-z]{3}/\d{10}\.html)">#

# - opening delimiter

<a href=" - literal string, i.e. find this exact text

( - start a new capture group

/ - literal forward slash (as all the links start with a forward slash)

[a-z]{3} - 3 letters

/ - literal forward slash

\d{10} - a 10 digit number

\. - a full stop character (the backslash escapes it as it is a special character)

html - another literal string

) - close capture group

"> - yet more literal characters

# - ending delimiter
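
The same breakdown can be written into the pattern itself with the x (extended) modifier, which ignores whitespace and allows # comments. This is only a sketch: the tilde delimiter is swapped in because # starts a comment in x mode, and $line is a made-up subject. It also shows why ^ is not used in the full pattern: the link sits in the middle of the HTML, not at the start of the subject.

```php
<?php
$pattern = '~
    <a\ href="        # literal text; the space is escaped because x mode ignores whitespace
    (                 # start capture group 1
        /[a-z]{3}     # slash followed by exactly three lowercase letters
        /\d{10}       # slash followed by exactly ten digits
        \.html        # literal ".html" (the dot is escaped)
    )                 # end capture group 1
    ">                # closing quote and bracket of the tag
~x';

$line = '<p><a href="/pts/1436241251.html">1989 Jeep Wrangler parts</a></p>';

preg_match($pattern, $line, $m);
echo $m[1], "\n"; // the captured link

// ^ anchors to the start of the subject, so it only fits when the
// subject is the bare link rather than a whole line of HTML.
$anchored = '~^/[a-z]{3}/\d{10}\.html$~';
var_dump(preg_match($anchored, $line)); // int(0)
var_dump(preg_match($anchored, $m[1])); // int(1)
```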

Modernvox · October 24, 2009

file_get_contents($url): not sure I like this method. Seems a bit much, considering it shows an error stating it can't be empty?

Isn't it easier just to say $curlResults = curlURL("http://newjersey.craigslist.org/sss/");

preg_match_all('#<a href="(/[a-z]{3}/\d{10}\.html)">#', $curlResults, $out);

echo $out[1][0];
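
Either way of fetching works; a "cannot be empty" error from file_get_contents() usually suggests that $url was not set at the point of the call, though that is a guess without seeing the full script. Below is a minimal sketch that keeps fetching and extracting separate. fetchPage() and extractListingLinks() are hypothetical helper names, not functions from the thread.

```php
<?php
// Fetch a page with cURL; returns the body, or null if the request failed.
function fetchPage(string $url): ?string
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($curl); // false on failure when RETURNTRANSFER is set
    return $body === false ? null : $body;
}

// Pull every /xxx/1234567890.html link out of a chunk of HTML.
function extractListingLinks(string $html): array
{
    preg_match_all('#<a href="(/[a-z]{3}/\d{10}\.html)">#', $html, $out);
    return $out[1]; // capture group 1 of every match
}

// Usage sketch (performs a network request, so it is left commented out):
// $html = fetchPage('http://newjersey.craigslist.org/sss/');
// if ($html !== null) {
//     foreach (extractListingLinks($html) as $link) {
//         echo $link, "\n";
//     }
// }
```

Checking for a failed fetch before matching avoids the situation where only the "Links:" heading is printed and it is unclear whether the request or the pattern was at fault.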
