
How To Check For Duplicate URL Submission


vanleurth


Hello everybody!

 

I'm putting together a web app similar to Digg and was wondering if there is a function or code example I can use to keep users from submitting the same URL more than once.

 

For example:

Right now a user can submit any of these:

1. http://www.example.com?post01

2. http://example.com?post01

3. www.example.com?post01

 

I want the web app to check whether the link has already been submitted before accepting it, so duplicate submissions are rejected.

 

Any ideas ?

 

Thank you,

 

V.


What I did was strip off the protocol (http://, https://), the www. and so on, and also remove the trailing slash. In my case I use the result as my titles, but you can use it just for checking purposes.

 

There is actually a lot to this. You need to lowercase just the domain part in case they capitalize it.

 

The check runs when the form is submitted, so they can type the URL in any way: http://aol.com, http://www.aol.com, http://www.aol.com/, aol.com or anything similar can be entered and will all end up as the same value. I then resolve them through curl.

 

Then you get URLs such as http://mysite.com, which could be the exact same page as http://mysite.com/index.html or http://mysite.com/index.php or http://mysite.com/index.asp and on and on. That's why I try to let curl resolve them. JavaScript redirects aren't too pleasant, but you should be able to follow any normal redirects.
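
As a rough illustration of letting curl follow normal redirects, here is a minimal sketch. The function name resolveFinalUrl(), the 5 second timeout and the limit of 5 redirects are assumptions made for this example, not the code running on the site mentioned in this thread. It only follows HTTP redirects; it will not see JavaScript redirects, and it will not tell you that / and /index.php are the same page unless the server actually redirects one to the other.

```php
<?php
// Hypothetical helper: follow normal HTTP redirects and return the final URL.
// Returns null when the URL is empty, curl is unavailable, or the request fails.
function resolveFinalUrl($url, $timeout = 5) {
    $url = trim($url);
    if ($url == '' || !function_exists('curl_init')) {
        return null;
    }
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,     // do not echo the response
        CURLOPT_NOBODY         => true,     // HEAD request; some servers only answer GET properly
        CURLOPT_FOLLOWLOCATION => true,     // follow Location: redirects
        CURLOPT_MAXREDIRS      => 5,        // assumed limit, stops redirect loops
        CURLOPT_TIMEOUT        => $timeout, // assumed timeout in seconds
    ));
    $ok = curl_exec($ch);
    // CURLINFO_EFFECTIVE_URL is the last URL curl ended up requesting
    return ($ok === false) ? null : curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}
```

You would call it on the submitted URL first, for example resolveFinalUrl('http://mysite.com'), and then clean whatever comes back. If it returns null, fall back to the URL as the user typed it.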

 

I've been working on my login system, so you can't browse my index right now, but the system I described works for me and took me a great deal of time to figure out. I did leave the non-login areas live though, like the add page.

So try a URL in any form and you will see that it will not create a duplicate.

http://dynaindex.com/add


I had some time while waiting for some huge folders to transfer, so I thought I'd be nice and write up one function to clean the URLs and another to check whether they are the same.

 

So the concept is to eliminate all the stuff that would make them look different even though they ultimately go to the same or a similar URL.

 

That would include any protocol, the www., a trailing slash, a # at the end and a ? at the end.

You will find that the www. and the trailing slash sometimes are required and sometimes are not, because the website owners did not allow for both forms. It's best to use curl to try to resolve the URLs first. But then the stored URL would be different from the one the user entered if there was a normal redirect.

 

Lowercase the domain name only. The path after it can be case-sensitive, so leave it alone.

 

Here's the function file, compareurl.php:

<?php
function cleanUrl($input_url) {
    $input_url = trim($input_url);
    if ($input_url == '') {
        //catch this in the caller and redirect on empty value somewhere
        throw new InvalidArgumentException("EMPTY URL VALUE");
    }
    //add a protocol if there is none, so parse_url() can find the host
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $input_url)) {
        $input_url = "http://$input_url";
    }
    $parts = parse_url($input_url);
    if ($parts === false || empty($parts['host'])) {
        throw new InvalidArgumentException("INVALID URL VALUE");
    }
    $host = strtolower($parts['host']);//lowercase host area
    if (substr($host, 0, 4) == 'www.') {
        //remove www. if exists (ltrim() strips a set of characters, not a prefix)
        $host = substr($host, 4);
    }
    $user = isset($parts['user']) ? $parts['user'] : '';//users account
    $pass = isset($parts['pass']) ? $parts['pass'] : '';//users password
    $path = isset($parts['path']) ? $parts['path'] : '';//the file location or path
    //add the ? and # back to the front, but only when the query or fragment is not empty
    $query = (isset($parts['query']) && $parts['query'] != '') ? '?' . $parts['query'] : '';
    $fragment = (isset($parts['fragment']) && $parts['fragment'] != '') ? '#' . $parts['fragment'] : '';
    //all combined minus protocol and port, if want query or fragment gone remove it
    $cleaned_url = "$host$user$pass$path$query$fragment";
    $cleaned_url = rtrim($cleaned_url, '/?#');//remove any end slash, ? or # that is left
    return $cleaned_url;
}

function compareUrl($url1, $url2) {
    //TRUE when both clean to the same value, FALSE otherwise
    return cleanUrl($url1) == cleanUrl($url2);
}
?>

 

Here are some example URLs and usage:

<?php
//usage
//compareUrl() requires 2 variables to check against

include('compareurl.php');

//sample url's
$url1 = "http://www.site.com/mail/";
$url2 = "HTTP://SITE.com/mail?";
$url3 = "site.com/mail/";
$url4 = "https://site.com/mail?";
$url5 = "site.com/mail/";
$url6 = "mysite.com";
$url7 = "http://site.com?";
$url8 = "http://site.com/index.php?";
$url9 = "HTTP://SITE.COM/index.php/?";
$url10 = "http://site.com/index.php";

//check url 1 versus 2
if (compareUrl($url1,$url2) == TRUE) {
echo "$url1 and $url2 are the same <br />";//reject insert code
} else {
echo "$url1 and $url2 are different <br />";//accept insert code
}

//check url 3 versus 4
if (compareUrl($url3,$url4) == TRUE) {
echo "$url3 and $url4 are the same <br />";
} else {
echo "$url3 and $url4 are different <br />";
}

//check url 5 versus 6
if (compareUrl($url5,$url6) == TRUE) {
echo "$url5 and $url6 are the same <br />";
} else {
echo "$url5 and $url6 are different <br />";
}

//check url 6 versus 7
if (compareUrl($url6,$url7) == TRUE) {
echo "$url6 and $url7 are the same <br />";
} else {
echo "$url6 and $url7 are different <br />";
}

//check url 8 versus 9
if (compareUrl($url8,$url9) == TRUE) {
echo "$url8 and $url9 are the same <br />";
} else {
echo "$url8 and $url9 are different <br />";
}

//check url 9 versus 10
if (compareUrl($url9,$url10) == TRUE) {
echo "$url9 and $url10 are the same <br />";
} else {
echo "$url9 and $url10 are different <br />";
}
?>
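
To tie this back to the original question, one way to reject a duplicate at submit time is to keep each cleaned URL as a lookup key and refuse anything whose key already exists. The sketch below uses a plain array in place of the database. addIfNew() and the small stand-in normalizer are made up for this example; in practice you would pass in the cleaning function from compareurl.php, and in a real app you would more likely store the cleaned value in its own column with a unique index.

```php
<?php
// Hypothetical helper: records $url under its normalized key and reports
// whether it was new. $normalize is any callable that maps a URL to its key.
function addIfNew(array &$seen, $url, $normalize) {
    $key = call_user_func($normalize, $url);
    if (isset($seen[$key])) {
        return false; // duplicate: reject the insert
    }
    $seen[$key] = $url; // first submission wins
    return true;
}

// Stand-in normalizer for this demo only (it does not lowercase the host);
// swap in the cleaning function from compareurl.php.
$normalize = function ($url) {
    $url = preg_replace('#^[a-z][a-z0-9+.-]*://#i', '', trim($url)); // drop protocol
    $url = preg_replace('#^www\.#i', '', $url);                      // drop www.
    return rtrim($url, '/?#');                                       // drop trailing / ? #
};

$seen = array();
var_dump(addIfNew($seen, 'http://www.example.com?post01', $normalize)); // bool(true)
var_dump(addIfNew($seen, 'http://example.com?post01', $normalize));     // bool(false)
var_dump(addIfNew($seen, 'www.example.com?post01', $normalize));        // bool(false)
```

The three URLs are the ones from the first post. Only the first is accepted, because all three reduce to the same key.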
