Like several other people, I ran into a problem with spammers using my comment system to post spam and backlinks (some of it quite funny).
I'm already using a good email spam filter, SpamBayes, so I decided to try Bayesian filtering for the spam on this blog too.
I decided to give Reverend a try:
    from reverend.thomas import Bayes

    SPAM_DB = 'spam.bayes'
    guesser = Bayes()

    # load the spam DB, or create it on the first run
    try:
        guesser.load(SPAM_DB)
    except IOError:
        print("Creating a new spam filter database")
        guesser.save(SPAM_DB)

    def train_spam(text):
        guesser.train('spam', text)
        guesser.save(SPAM_DB)

    def train_ham(text):
        guesser.train('ham', text)
        guesser.save(SPAM_DB)

    # return the (ham, spam) scores of a text
    def guess(text):
        spam = 0
        ham = 0
        # guesser.guess() yields (pool name, probability) pairs
        for pool, score in guesser.guess(text):
            if pool == 'ham':
                ham = score
            if pool == 'spam':
                spam = score
        return (ham, spam)
A small and really simple module, no? The next step is simply to add 'spam' and 'ham' attributes to your comment post, plus two methods to train the filter on a comment as spam or as ham. And of course, only display comments with a good ham/spam ratio (> 1). This took me about one hour to implement…
After a week of training, this is working very well: not a single false positive, and it has filtered every spam since the first training sessions. As I get around 20 spam posts per day, this is quite good news ;)
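To give an idea of what the comment side might look like, here is a minimal sketch. It is not the actual code of this blog: the `Comment` class, its method names, and `fake_guess` are all hypothetical, and the filter functions are passed in as arguments so the example runs without Reverend installed. In practice you would pass the `guess`, `train_spam` and `train_ham` functions from the module above.

```python
class Comment(object):
    """Hypothetical comment model; all names here are illustrative only."""

    def __init__(self, text):
        self.text = text
        self.ham = 0.0
        self.spam = 0.0

    def score(self, guess):
        # guess is any callable returning a (ham, spam) pair for a text
        self.ham, self.spam = guess(self.text)

    def mark_spam(self, train_spam):
        # feed this comment to the filter as a spam example
        train_spam(self.text)

    def mark_ham(self, train_ham):
        # feed this comment to the filter as a ham example
        train_ham(self.text)

    def is_visible(self):
        # ham / spam > 1 is the same test as ham > spam for non-negative
        # scores, and it avoids dividing by zero when spam is 0
        return self.ham > self.spam


def visible_comments(comments):
    # keep only the comments the filter considers more ham than spam
    return [c for c in comments if c.is_visible()]


def fake_guess(text):
    # stand-in for the real Bayesian filter, used only for this demo
    if 'casino' in text.lower():
        return (0.1, 0.9)
    return (0.8, 0.2)


comments = [Comment("Nice post, thanks!"), Comment("Cheap CASINO backlinks here")]
for c in comments:
    c.score(fake_guess)
print([c.text for c in visible_comments(comments)])
```

One consequence of this choice: a comment that has not been scored yet has ham and spam both at 0, so it stays hidden until the filter (or you) has looked at it. Whether that is what you want depends on how much you trust a fresh database.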
Enjoy Bayesian filtering!