how to do this regex ?

jjk2 · March 30, 2009

i am kinda lost on how i can approach this.

basically i have many file paths like this

crazy.com/main/videos/something/popular/index.html

crazy.com/latest/news/odds/home.jpg

crazy2.com/funny/world/politics/welcome.html

another.com/news/business/index.html

how can i get only the things in bold ?

also, the filenames differ dynamically.

sasa · March 30, 2009

preg_match('#/.*/#', $input, $output);
print_r($output);
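
For reference, a minimal sketch of what this pattern returns for the four sample paths from the first post. The helper name extractDirs is made up for illustration and is not part of the thread. The greedy .* runs from the first slash to the last one, so the match keeps both the leading and the trailing slash, and an entry with fewer than two slashes does not match at all.

```php
<?php
// Hypothetical helper (not from the thread): returns the directory part
// matched by the pattern above, or null when the input has fewer than two slashes.
function extractDirs(string $input): ?string
{
    // Greedy .* spans from the first "/" to the last "/" in the string.
    return preg_match('#/.*/#', $input, $output) ? $output[0] : null;
}

$paths = array(
    'crazy.com/main/videos/something/popular/index.html',
    'crazy.com/latest/news/odds/home.jpg',
    'crazy2.com/funny/world/politics/welcome.html',
    'another.com/news/business/index.html',
);

foreach ($paths as $path) {
    echo extractDirs($path), "\n";
}
// Prints:
// /main/videos/something/popular/
// /latest/news/odds/
// /funny/world/politics/
// /news/business/
```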

Salkcin · March 30, 2009

Even though the above regex does what you ask, here's another one:

preg_match('#(?<=crazy\.com).*?(?=(?:index|welcome\.html)|home\.jpg)#', $string, $match);
print_r($match);

nrg_alpha · March 30, 2009

There are a few issues with your suggestion, however...

a) That's probably more work than sasa's method. While I don't advocate .* too often, it does have its uses, and if the url entries are checked by themselves (not nested within some large block of text), sasa's method is more likely to be faster. Granted, I haven't tested the speed difference between yours and sasa's. I'm going on the assumption that positive lookbehind and lookahead assertions cost more than some minor .* backtracking, although admittedly I could be wrong on this.

b) Your pattern requires specific domains, (?<=crazy\.com) [so what happens with crazy2.com or another.com?], along with specific ending file names (such as index or welcome.html, for example). The following code illustrates these issues:

$arr = array('crazy.com/main/videos/something/popular/index.html','crazy.com/latest/news/odds/home.jpg','crazy2.com/funny/world/politics/welcome.html','another.com/news/business/index.html');
foreach ($arr as $val) {
    // Print the url when the pattern matches it, otherwise a notice.
    echo preg_match('#(?<=crazy\.com).*?(?=(?:index|welcome\.html)|home\.jpg)#', $val) ? $val . "<br />\n" : 'Url format not found using regex pattern...' . "<br />\n";
}

output:

crazy.com/main/videos/something/popular/index.html
crazy.com/latest/news/odds/home.jpg
Url format not found using regex pattern...
Url format not found using regex pattern...

Point being, I think the idea is to be able to match the directories of any url (so the regex pattern needs to be flexible), which sasa's is.
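
The speed question raised in point a) was left untested in the thread. A rough way to check it could look like the sketch below. The function name timePattern and the iteration count are arbitrary choices made for this example, only the two crazy.com entries are used so that both patterns actually match, and the numbers will vary with the PHP and PCRE versions, so no result is claimed here.

```php
<?php
// Rough timing sketch; timePattern and the iteration count are
// illustrative choices, not something taken from the thread.
function timePattern(string $pattern, array $subjects, int $iterations = 100000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        foreach ($subjects as $subject) {
            preg_match($pattern, $subject);
        }
    }
    return microtime(true) - $start; // elapsed seconds
}

// Only entries that both patterns can match, to keep the comparison fair.
$subjects = array(
    'crazy.com/main/videos/something/popular/index.html',
    'crazy.com/latest/news/odds/home.jpg',
);

$greedy = timePattern('#/.*/#', $subjects);
$lookaround = timePattern('#(?<=crazy\.com).*?(?=(?:index|welcome\.html)|home\.jpg)#', $subjects);

printf("greedy: %.4fs, lookaround: %.4fs\n", $greedy, $lookaround);
```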

laffin · March 30, 2009

Yes, but it fails the condition of capturing the path only.

So even though Salkcin's method is a bit longer, it does what was requested.

But you may want to look at the parse_url function and play with that instead of using regex.

nrg_alpha · March 30, 2009

I wasn't illustrating the path capturing so much as the restriction on domain names and file names that need to be found within the pattern in the first place. sasa's is more flexible. And yes, parse_url would be even better (again, assuming that the url in question is by itself and not embedded within a string).
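
A minimal sketch of the parse_url route, under the assumption stated above that each url stands by itself. The helper name urlDirectory is hypothetical. Because the sample entries have no scheme, parse_url would treat the whole string as a path, so the sketch prepends "http://" before parsing; that prefix is an assumption about the input, not something from the thread.

```php
<?php
// Hypothetical helper: assumes each entry is a bare url without a scheme,
// as in the sample paths, so "http://" is prepended before parsing.
function urlDirectory(string $url): ?string
{
    $path = parse_url('http://' . $url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // malformed url, or no path component at all
    }
    // Drop the file name, keeping everything up to and including the last "/".
    return substr($path, 0, strrpos($path, '/') + 1);
}

echo urlDirectory('crazy.com/main/videos/something/popular/index.html'), "\n";
// Prints: /main/videos/something/popular/
echo urlDirectory('another.com/news/business/index.html'), "\n";
// Prints: /news/business/
```

One difference from the regex approach: parse_url separates out the query string first, so a slash inside something like ?x=1/2 does not end up in the directory part.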
