Strip down ANY url to just the domain?


JohnnyDoomo

I've searched long and hard and I can't find anything that works.

 

I'm new to php and can't program much, but every tutorial I follow on this doesn't do it how I need.

 

I need something that can take ANY type of looking url and simply return domain.com

 

Every tutorial I look at seems to have problems if a domain is submitted that looks like any of the following:

 

google.com

www.google.com

https://google.com

http://www.google.com

http://www.google.co.uk

http://www.google.co.uk/blah/blah/blah

http://subdomain.google.co.uk/blah/blah/blah

http://www.google.com/blah/blah/blah.php?arg=value#anchor

 

 

Any piece of code I find and test out, it screws up on one or the other. Can anybody please help me with something that is actually somewhat intelligent? I feel like I've only seen tutorials written by programmers that have no idea what I'm trying to get.

 

I'm looking for something that can take a url, no matter how it is written, process it, take it down to its simplest form, and make it look like domain.com.

 

I don't know much about php, but I've learned that this parse_url command IMO is shit for what I'm trying to do. Every tutorial that tries to help me with a few lines gets it wrong on one of the above domains.

 

I don't know much about if statements, but I'm at the point I feel like I have to learn that just to write out dozens of statements to remove everything.

 

Please help!


So check if the first part is http:// and if not add it...

 

Yeah, I was just writing some code for that when I realized there is a bigger problem. The 'host' index for parse_url() returns the entire host name. A host name can have multiple subdomains, and in at least some instances the ending has more than one part, such as .co.uk. So a URL of 'http://sub1.sub2.google.co.uk' would return 'sub1.sub2.google.co.uk'. How would you programmatically know which of those are subdomains? I don't know if .uk is the only one that allows for a "sub" TLD, but if so you could code a special case for that and have logic such as:

 

If does not end in UK:

 - Return everything after second to last dot (if there are at least 2), else return entire string

If does end in UK:

 - Return everything after third to last dot (if there are at least 3), else return entire string
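For illustration, here is a small sketch of what parse_url() hands back for the kinds of input discussed above. The variable names are made up for this example. The point is only that 'host' holds the full host name, and that a URL without a scheme has no 'host' entry at all.

```php
<?php
// With a scheme, 'host' holds the full host name, subdomains included.
$withScheme = parse_url('http://sub1.sub2.google.co.uk/blah');
echo $withScheme['host'], "\n";      // sub1.sub2.google.co.uk

// Without a scheme there is no 'host' key; the whole string lands in 'path'.
$noScheme = parse_url('www.google.com/blah');
var_dump(isset($noScheme['host']));  // bool(false)
echo $noScheme['path'], "\n";        // www.google.com/blah
```

This is why the scheme has to be added before parsing, and why the host still needs trimming afterwards.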


Hmm . . . I went ahead and coded around the missing 'http' problem (guess I need to update my PHP install) and I wrote a short script that seems to work the same as that linked script with far fewer lines of code. Not guaranteeing it 100%, but it worked for all the sample values of the OP and the additional testing I did:

 

 

<?php

function returnDomainName($url)
{
    //If does not begin with http, add it
    if(strtolower(substr($url, 0, 4)) != 'http')
    {
        $url = 'http://' . $url;
    }
    //Attempt to get components
    $components = parse_url($url);
    //If parsing failed or no host was found, return false
    if(!$components || empty($components['host'])) { return false; }
    //Determine how many parts are needed based on .uk at the end
    $partCount = (strtolower(strrchr($components['host'], '.')) != '.uk') ? 2 : 3;
    //Explode based on dots
    $partsAry = explode('.', $components['host']);
    //Implode the last $partCount parts back with a dot
    $domain = implode('.', array_slice($partsAry, -1*$partCount));
    return $domain;
}

//Array of test values
$urlList = array(
    'google.com',
    'www.google.com',
    'https://google.com',
    'http://www.google.cds',
    'http://www.google.co.uk',
    'http://www.google.co.uk/blah/blah/blah',
    'http://sub1.sub2.google.co.uk:443',
    'http://subdomain.google.com/blah/blah/blah',
    'http://www.google.com?rg=value#anchor'
    );

//Test loop
foreach($urlList as $url)
{
    echo "URL: $url<br>";
    echo "Domain: " . returnDomainName($url);
    echo "<br><br>";
}

?>

 

Output

 

URL: google.com
Domain: google.com

URL: www.google.com
Domain: google.com

URL: https://google.com
Domain: google.com

URL: http://www.google.cds
Domain: google.cds

URL: http://www.google.co.uk
Domain: google.co.uk

URL: http://www.google.co.uk/blah/blah/blah
Domain: google.co.uk

URL: http://sub1.sub2.google.co.uk:443
Domain: google.co.uk

URL: http://subdomain.google.com/blah/blah/blah
Domain: google.com

URL: http://www.google.com?rg=value#anchor
Domain: google.com
Edited by Psycho

I don't know if the .uk is the only one that allows for a "sub" TLD


It's not; there are a bunch. What it boils down to is how accurate you want to be. Only a few are common (in my experience), and you could easily code a few special cases for those. A little digging around on Wikipedia led me to a list of possible multi-level domains, if you wanted to use it to be more accurate.
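A minimal sketch of the "few special cases" approach, building on the function posted earlier in the thread. The function name registrableDomain() and the short $twoLevelSuffixes list are assumptions made for this example. The list is deliberately tiny and would need extending for real use.

```php
<?php
// Hypothetical helper: reduce a URL to its registrable domain,
// using a small hand-maintained list of two-level suffixes.
function registrableDomain($url)
{
    // Illustrative list only; many more two-level suffixes exist.
    $twoLevelSuffixes = array('co.uk', 'org.uk', 'com.au', 'co.nz');

    // parse_url() needs a scheme before it will recognise the host
    if (strpos($url, '://') === false) {
        $url = 'http://' . $url;
    }
    $components = parse_url($url);
    if (!$components || empty($components['host'])) {
        return false;
    }

    $parts = explode('.', strtolower($components['host']));
    // The last two labels, e.g. "co.uk" or "google.com"
    $lastTwo = implode('.', array_slice($parts, -2));
    // Keep three labels when the last two form a known suffix, else two
    $partCount = in_array($lastTwo, $twoLevelSuffixes, true) ? 3 : 2;

    return implode('.', array_slice($parts, -$partCount));
}

echo registrableDomain('http://sub1.sub2.google.co.uk:443'), "\n"; // google.co.uk
echo registrableDomain('www.google.com.au/blah'), "\n";            // google.com.au
echo registrableDomain('https://google.com'), "\n";                // google.com
```

Keeping the suffixes in an array means a new special case is one more entry in the list, not another branch in the code.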


 


 

Thanks for your help Psycho! This is working!

 

Can you tell me what code to add to make it handle both co.uk and com.au domains? (These seem the most popular of extensions, and probably the only two I've actually visited domains on.)

 

For those wondering about these types of domain extensions, I came across a large list of them: http://www.quackit.com/domain-names/country_domain_extensions.cfm


Can you tell me what code to add to make it handle both co.uk and com.au domains? (These seem the most popular of extensions, and probably the only two I've actually visited domains on.)

 

I could, but I won't. I love helping people, but at some point you need to teach a man to fish rather than giving him a fish. You should be able to see where I implemented logic to handle co.uk. There's only one line to worry about which uses two functions and the ternary operator. You should be able to break down what that line is doing and figure out how to change it for multiple scenarios. Of course, you will need to break it out to multiple lines in a normal if/else condition rather than using the ternary operator.

 

So, give it a try and post back if you run into problems showing the code you have.

Edited by Psycho

@psycho

 

While that covers all of the examples, I would tend to code the scheme check a little more generically. As it is, it will not cover mailto:, ftp: or others.

    //If does not begin with a scheme, add it
    if(strpos($url, '://') === false)
    {
        $url = 'http://' . $url;
    }
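To make the difference concrete, here is a small side-by-side sketch of the two checks. The helper names addSchemeByPrefix() and addSchemeBySeparator() are hypothetical and exist only for this comparison.

```php
<?php
// Original check: looks only at the first four characters
function addSchemeByPrefix($url)
{
    return (strtolower(substr($url, 0, 4)) != 'http') ? 'http://' . $url : $url;
}

// Generic check: any "scheme://" prefix counts
function addSchemeBySeparator($url)
{
    return (strpos($url, '://') === false) ? 'http://' . $url : $url;
}

// An ftp URL gets a second scheme glued on by the prefix check
echo addSchemeByPrefix('ftp://ftp.example.com'), "\n";    // http://ftp://ftp.example.com
echo addSchemeBySeparator('ftp://ftp.example.com'), "\n"; // ftp://ftp.example.com

// A bare host that happens to start with "http" is skipped by the prefix check
echo addSchemeByPrefix('httpbin.org'), "\n";              // httpbin.org
echo addSchemeBySeparator('httpbin.org'), "\n";           // http://httpbin.org
```

Note that the '://' test only recognises schemes written with a double slash. A mailto: address has no '//', so it would still get 'http://' prepended and would need its own handling.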