How to spam-protect your Python-based blog with a Bayesian filter.
Like several other people, I ran into trouble with spammers using my comment system to post spam and backlinks (some of them even using rather funny tricks).
I'm already using a good email spam filter, SpamBayes, so I decided to test Bayesian filtering on the spam on this blog too.
I decided to give Reverend a try:
from reverend.thomas import Bayes

SPAM_DB = 'spam.bayes'
guesser = Bayes()

# Load the spam DB, or create it on the first run
try:
    guesser.load(SPAM_DB)
except IOError:
    print("Creating a new spam filter database")
    guesser.save(SPAM_DB)

def train_spam(text):
    # Teach the filter that this text is spam, then persist the DB
    guesser.train('spam', text)
    guesser.save(SPAM_DB)

def train_ham(text):
    # Teach the filter that this text is legitimate, then persist the DB
    guesser.train('ham', text)
    guesser.save(SPAM_DB)

# Try to guess the spam / ham ratio of a text
def guess(text):
    spam = 0
    ham = 0
    # guesser.guess() yields (pool name, score) pairs
    for pool, score in guesser.guess(text):
        if pool == 'ham':
            ham = score
        elif pool == 'spam':
            spam = score
    return (ham, spam)
A small and really simple module, no? The next step is simply to add ‘spam’ and ‘ham’ attributes to your comment posts, plus two methods to train a comment as spam or as ham. And of course, only display comments that have a good ham/spam ratio (> 1). This took me about one hour to implement…
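As a rough illustration of that step, here is a minimal sketch of a comment object carrying the two scores. The `Comment` class, its attribute names, and `visible_comments` are hypothetical, not the code running on this blog; `score()` expects any function that returns a `(ham, spam)` pair, such as the `guess()` function above.

```python
class Comment(object):
    """Hypothetical comment record; the names here are assumptions."""

    def __init__(self, text, ham=0.0, spam=0.0):
        self.text = text
        self.ham = ham
        self.spam = spam

    def score(self, guess):
        # Store the (ham, spam) pair returned by a guess(text) function
        self.ham, self.spam = guess(self.text)

    def is_visible(self):
        # ham/spam > 1 is the same test as ham > spam, without the risk
        # of dividing by zero. Assumption: unscored comments (0, 0)
        # stay hidden until they have been scored.
        return self.ham > self.spam


def visible_comments(comments):
    # Keep only the comments the filter considers more ham than spam
    return [c for c in comments if c.is_visible()]
```

The training methods would then just pass `comment.text` to `train_spam()` or `train_ham()` and rescore the comment afterwards.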
After a week of training, this is working very well: not a single false positive, and it has filtered every spam since the first training runs. As I get around 20 spam posts per day, this is quite good news ;)
Enjoy Bayesian filtering!
admin November 17th, 2006
Thanks, Reverend Jkx :)
So, now your comments RSS feed is usable, right? Because it still seems to contain some spam in there (article 239)..
Yes, my comment RSS feed is still full of spam. I need to apply the filter there too. Right now, I’m using it to check that everything is OK. I will switch soon.
Bye ..
What happens to the spam database file when two people submit a comment at the same time? Is there a way to prevent it from getting corrupted?
This depends on how you plug this into your webapp, but you can easily protect the write with a lock.
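For instance, here is a minimal sketch of such a lock, assuming the webapp handles requests in several threads of a single process (separate processes would need a file lock instead). The `train_locked` helper is hypothetical; it only relies on the `train()` and `save()` methods of the guesser used in the post.

```python
import threading

# One lock shared by every thread of the web application.
# Assumption: a single process; this does not protect against
# several processes writing the same file.
_db_lock = threading.Lock()


def train_locked(guesser, pool, text, path):
    # Train and save while holding the lock, so that two requests
    # never write the database file at the same time.
    with _db_lock:
        guesser.train(pool, text)
        guesser.save(path)
```

With this in place, `train_spam(text)` would become `train_locked(guesser, 'spam', text, SPAM_DB)`.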
Spam filtering may reduce the amount of spam for a short while, but you can’t say that it is an ultimate solution to spamming. The reason is that spammers are aware of these filtering techniques, whether it is filtering with BogoFire or some other tool. There are many websites that provide information on anti-spamming solutions, but most of this information is either irrelevant or not useful. I have recently visited a website that I would like to suggest:
Anti-Spam Solutions Website
How is the filter working out these days? Is this approach still worth implementing? Thanks, Mike