feedback on a function


jonsjava

What could I add to this to make it more complete?

<?php

/**
* Anti-spam function. It compares the input against a list of common
* spam words/phrases and returns an array containing the number of
* phrases found, the phrases themselves, and a score for the input.
*
* @param string $input the data you want to check
* @return array
*/
function stopSpam($input){
$count = 0;
$data = "'hidden' assets
-online
3.28
4u
accept credit cards
act now! don't hesitate!
additional income
addresses on cd
adipex
advicer
all natural
amazing stuff
apply online
as seen on
auto email removal
avoid bankruptcy
baccarrat
be amazed
be your own boss
being a member
big bucks
bill 1618
billing address
blackjack
bllogspot
booker
brand new pager
bulk email
buy direct
buying judgments
byob
cable converter
call free
call now
calling creditors
can't live without
cancel at any time
cannot be combined with any other offer
car-rental-e-site
car-rentals-e-site
carisoprodol
cash bonus
cashcashcash
casino
casinos
cell phone cancer scam
cents on the dollar
chatroom
check or money order
cialis
click below
click here link
click to remove
click to remove mailto
collect child support
compare rates
compete for your business
confidentially on all orders
congratulations
consolidate debt and credit
coolcoolhu
coolhu
copy accurately
copy dvds
credit bureaus
credit card offers
credit-card-debt
credit-report-4u
cures baldness
cwas
cyclen
cyclobenzaprine
dating-e-site
day-trading
dear email
dear friend
dear somebody
debt-consolidation
debt-consolidation-consultant
dig up dirt on friends
direct email
direct marketing
discreetordering
discusses search engine listings
do it today
don't delete
drastically reduced
duty-free
dutyfree
earn per week
easy terms
eliminate bad credit
email harvest
email marketing
equityloans
expect to earn
fantastic deal
fast viagra delivery
financial freedom
find out anything
fioricet
flowers-leading-site
for free
for instant access
for just $
free access
free cell phone
free consultation
free dvd
free grant money 	
free hosting
free installation
free investment
free leads
free membership
free money
free offer
free preview
free priority mail
free quote
free sample
free trial
free website
freenet
freenet-shopping
full refund
gambling-
get it now
get paid
get started now
gift certificate
great offer
guarantee
hair-loss
have you been turned down?
health-insurancedeals-4u
hidden assets
holdem
holdempoker
holdemsoftware
holdemtexasturbowilson
home employment
homeequityloans
homefinance
hotel-dealse-site
hotele-site
hotelse-site
human growth hormone
if only it were that easy
in accordance with laws
incest
increase sales
increase traffic
insurance
insurance-quotesdeals-4u
insurancedeals-4u
investment decision
it's effective
join millions of americans
jrcreations
laser printer
levitra
limited time only
long distance phone offer
lose weight spam
lower interest rates
lower monthly payment
lowest price
luxury car
macinstruct
mail in order form
marketing solutions
mass email
meet singles
member stuff
message contains disclaimer
mlm
money back
money making
month trial offer
more internet traffic
mortgage rates
mortgage-4-u
mortgagequotes
multi level marketing
name brand
new customers only
new domain extensions
nigerian
no age restrictions
no catch
no claim forms
no cost
no credit check
no disappointment
no experience
no fees
no gimmick
no inventory
no investment
no medical exams
no middleman
no obligation
no purchase necessary
no questions asked
no selling
no strings attached
not intended
off shore
offer expires
offers coupon
offers extra cash
offers free (often stolen) passwords
once in lifetime
one hundred percent free
one hundred percent guaranteed
one time mailing
online biz opportunity
online biz opportunity 	
online pharmacy
online-gambling
onlinegambling-4u
only $
opportunity
opt in
order now
order status
orders shipped by priority mail
ottawavalleyag
outstanding values
ownsthis
palm-texas-holdem-game
paxil
penis
pennies a day
people just leave money laying around
pharmacy
phentermine
please read
poker-chip
potential earnings
poze
print form signature
print out and fax
produced and sent out
profits
promise you ...!
pure profit
pussy
real thing
refinance home
removal instructions
remove in quotes
remove subject
removes wrinkles
rental-car-e-site
reply remove subject
requires initial investment
reserves the right
reverses aging
ringtones
risk free
roulette 
round the world
s 1618
safeguard notice
satisfaction guaranteed
save $
save big money
save up to
score with babes
search engine listings
section 301
see for yourself
sent in compliance
serious cash
serious only
shemale
shoes
shopping spree
sign up free today
slot-machine
social security number
special promotion
stainless steel
stock alert
stock pick
stop snoring
strong buy
stuff on sale
subject to credit
supplies are limited
take action now
terms and conditions
texas-holdem
the best rates
the following form
they keep your money -- no refund!
they're just giving it away
this isn't junk
this isn't spam
thorcarlson
top-e-site
top-site
tramadol
trim-spa
ultram
university diplomas
unlimited
unsecured credit/debt
urgent
us dollars
vacation offers
valeofglamorganconservatives
viagra
viagra and other drugs
vioxx
wants credit card
we hate spam
we honor all
weekend getaway
what are you waiting for?
while supplies last
while you sleep
who really wins?
why pay more?
will not believe your eyes
winner
winning
work at home
xanax
you are a winner
you have been selected
your income
zolus";
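// Normalise the list: split on any line ending, trim stray whitespace,
// and drop empty or duplicate entries.
$data_array = preg_split('/\R/', strtolower($data));
$data_array = array_unique(array_filter(array_map('trim', $data_array), 'strlen'));
// Initialise the result so every key exists even when nothing matches.
$array = array('item' => array(), 'total_found' => 0, 'score' => 0);
foreach ($data_array as $value){
	// stripos() is case-insensitive; compare against false explicitly.
	if (stripos($input, $value) !== false){
		$array['item'][] = $value;
		$count++;
	}
}
$total_spam_vars = count($data_array);
$score = ($count / $total_spam_vars) * 1000;
$array['total_found'] = $count;
$array['score'] = $score;
return $array;
}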

// $input is the text to check, e.g. the body of a submitted contact form.
$spam_score = stopSpam($input);
print_r($spam_score);


Wouldn't it make more sense to count the number of occurrences of each of the phrases? For example, data containing 10 occurrences of one of the phrases must surely be more likely to be spam than data containing just 1 occurrence?
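
As a rough sketch of that suggestion, PHP's substr_count() can report how many times each phrase occurs instead of only whether it occurs. The helper name countPhraseOccurrences() and the sample phrases are made up for illustration; it assumes every phrase is a non-empty string.

```php
<?php
// Hypothetical helper: count how often each phrase occurs in $input.
// Assumes every entry in $phrases is a non-empty string.
function countPhraseOccurrences($input, array $phrases) {
    $haystack = strtolower($input);
    $counts = array();
    foreach ($phrases as $phrase) {
        // substr_count() counts non-overlapping occurrences.
        $n = substr_count($haystack, strtolower($phrase));
        if ($n > 0) {
            $counts[$phrase] = $n;
        }
    }
    return $counts;
}

$counts = countPhraseOccurrences(
    'Free money! Act now for FREE MONEY.',
    array('free money', 'viagra')
);
print_r($counts); // 'free money' is counted twice, 'viagra' is absent
```

The per-phrase counts could then feed into the score, for example by summing them rather than counting distinct phrases.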

So what you are saying is that if I had this input

spam spam spam spam spam spam spam spam

I should give it more weight than

this is not spam

assuming that the measured word is "spam". Well, that would work, but the phrases I have outlined are pretty much guaranteed to come from spammers, so a phrase showing up even once is a good indicator. The weight system is there to see whether the input has more than one "obvious spam" phrase in it. If it does, the input is weighed as more "loaded with spam" than input with just one phrase. This is to keep people from being blocked from saying stuff like "I received an e-mail from your site that said 'act now! don't hesitate!', and I just want you to know that I don't appreciate receiving spam". I want people to be able to fill out the contact form, and even quote possible spam, without being weighed so heavily that my spam filter blocks them from contacting me.
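
A minimal sketch of how that weighting might be turned into a decision: block only when several distinct phrases matched, so a message quoting a single spam phrase still gets through. The helper isProbablySpam() and the cut-off of 3 are assumptions, not something taken from the function above; only the 'total_found' key of stopSpam()'s result is used.

```php
<?php
// Hypothetical helper; $minPhrases = 3 is an assumed cut-off, tune it to taste.
function isProbablySpam(array $result, $minPhrases = 3) {
    return $result['total_found'] >= $minPhrases;
}

// Hand-built results in the shape stopSpam() returns (score omitted here).
$quoted = array(
    'item' => array("act now! don't hesitate!"),
    'total_found' => 1,
);
$loaded = array(
    'item' => array('viagra', 'cialis', 'online pharmacy', 'order now'),
    'total_found' => 4,
);

var_dump(isProbablySpam($quoted)); // bool(false)
var_dump(isProbablySpam($loaded)); // bool(true)
```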


but the phrases I have outlined are pretty much guaranteed to come from spammers
Well I've certainly both received and sent emails in the last week that contain more than one of these phrases, and that aren't spam.

 

"terms and conditions", negotiating a software license for use in a product

"billing address", e-mailed receipt for a purchase by credit card

"insurance", e-mailed confirmation that a renewal for my car insurance online had been received

 

Your weighting system would flag them as probable, but not definite spam, while the dozen or so e-mails offering me cheap cia1is would probably be accepted purely because of a single character change.


but the phrases I have outlined are pretty much guaranteed to come from spammers
Well I've certainly both received and sent emails in the last week that contain more than one of these phrases, and that aren't spam.

 

"terms and conditions", negotiating a software license for use in a product

"billing address", e-mailed receipt for a purchase by credit card

"insurance", e-mailed confirmation that a renewal for my car insurance online had been received

 

Your weighting system would flag them as probable, but not definite spam, while the dozen or so e-mails offering me cheap cia1is would probably be accepted purely because of a single character change.

...and that's a question I should have posed before. Any ideas on how to convert numbers that sit between letters into the letters they stand in for? Well, you get the idea: change cia1is to cialis.
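
One possible way to do that conversion is a fixed substitution table applied with strtr() before the phrase check. The helper name normalizeLookalikes() and the table itself are assumptions; which digit stands for which letter is a guess, and a digit like 1 can mean either "l" or "i".

```php
<?php
// Hypothetical helper: map common digit look-alikes back to letters.
// The substitution table is an assumption; extend or trim it as needed.
function normalizeLookalikes($input) {
    $map = array(
        '0' => 'o',
        '1' => 'l',
        '3' => 'e',
        '4' => 'a',
        '5' => 's',
        '7' => 't',
        '@' => 'a',
    );
    return strtr(strtolower($input), $map);
}

echo normalizeLookalikes('cheap cia1is'), "\n"; // cheap cialis
echo normalizeLookalikes('v1agra'), "\n";       // vlagra (1 was meant as "i" here)
```

Note that this would also rewrite list entries that legitimately contain digits, such as "4u" or "bill 1618", so it is probably safer to check both the raw and the normalised input rather than only the normalised one.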


Well, you could decide which numbers are likely to represent which characters and perform a conversion. But there are so many different ways in which someone could modify the text slightly: misspellings, underscores, dashes, etc. A potential solution would be to use the Levenshtein distance, though it might be too slow.
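
A rough sketch of the Levenshtein idea, using PHP's built-in levenshtein(): each word of the input is compared with each single-word spam term, and anything within a small edit distance counts as a hit. The helper name fuzzyMatches() and the tolerance of 1 are assumptions, and multi-word phrases are not handled.

```php
<?php
// Hypothetical helper: find input words within $maxDistance edits of a spam word.
// Only single-word terms are handled; $maxDistance = 1 is an assumed tolerance.
function fuzzyMatches($input, array $spamWords, $maxDistance = 1) {
    // Split the lowercased input into words made of letters and digits.
    $words = preg_split('/[^a-z0-9]+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);
    $hits = array();
    foreach ($words as $word) {
        foreach ($spamWords as $spam) {
            if (levenshtein($word, $spam) <= $maxDistance) {
                $hits[$word] = $spam; // input word => spam term it resembles
                break;
            }
        }
    }
    return $hits;
}

$hits = fuzzyMatches('Cheap cia1is and viaagra', array('cialis', 'viagra'));
print_r($hits); // cia1is => cialis, viaagra => viagra
```

The cost grows with the number of input words times the number of spam terms, which is where the speed concern comes from. Short terms such as "mlm" would also match innocent words one edit away, so a minimum word length may be worth adding.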

 

I'd probably just use someone else's spam filter and let them do the hard work :P
