How to spam-protect your Python-based blog with a Bayesian filter.

Like several other people, I ran into spammers using my comment system to post spam and backlinks (some of them even using funny stuff).

I'm already using a good email spam filter, SpamBayes, so I decided to try Bayesian filtering for the spam on this blog too.

I decided to give Reverend a try:

from reverend.thomas import Bayes

SPAM_DB='spam.bayes'
guesser = Bayes()

# load the spam DB
try:
    guesser.load(SPAM_DB)
except IOError:
    print("Creating a new spam filter database")
    guesser.save(SPAM_DB)

def train_spam(text):
    guesser.train('spam',text)
    guesser.save(SPAM_DB)

def train_ham(text):
    guesser.train('ham',text)
    guesser.save(SPAM_DB)

# Try to guess the ham / spam scores of a text.
# guesser.guess() returns a list of (pool, probability) pairs; a pool
# with no matching tokens is left out of the list, hence the 0 defaults.
def guess(text):
    scores = dict(guesser.guess(text))
    return (scores.get('ham', 0), scores.get('spam', 0))

A small and really simple module, no? The next step is simply to add ‘spam’ and ‘ham’ attributes to your comment posts, plus two methods to train a comment as spam or as ham. And of course, only display comments that have a good ham/spam ratio (> 1). This took me about an hour to implement…
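The comment-side plumbing described above might look roughly like the sketch below. The `Comment` class, its attribute names, and the `score` and `is_visible` helpers are hypothetical (the blog's own code is not shown here), and `guess` stands for any function returning a `(ham, spam)` pair like the one in the module. Keeping an unscored comment (both values 0) hidden is an assumption of this sketch, not something the module dictates.

```python
class Comment(object):
    """Hypothetical comment record; only the fields the filter needs."""

    def __init__(self, text):
        self.text = text
        self.ham = 0.0   # ham score from the last guess
        self.spam = 0.0  # spam score from the last guess


def score(comment, guess):
    """Store the (ham, spam) pair returned by guess() on the comment."""
    comment.ham, comment.spam = guess(comment.text)
    return comment


def is_visible(comment):
    """Show a comment only when its ham/spam ratio is above 1.

    Comparing ham > spam is the same test, without a division by zero
    when spam is 0. A comment with no score at all (0, 0) stays hidden.
    """
    return comment.ham > comment.spam


if __name__ == "__main__":
    # Stand-in for the module's guess(); a real one would call Reverend.
    def fake_guess(text):
        return (0.1, 0.9) if "casino" in text else (0.8, 0.2)

    comments = [score(Comment(t), fake_guess)
                for t in ("Nice post, thanks", "cheap casino backlinks")]
    print([c.text for c in comments if is_visible(c)])
```

The two training methods would then just call `train_spam(comment.text)` or `train_ham(comment.text)` and re-score the comment.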

After a week of training, this is working very well: not a single false positive, and it has filtered every spam since the first training runs. As I get around 20 spam posts per day, this is quite good news ;)

Enjoy Bayesian filtering!

9 thoughts on “How to spam-protect your Python-based blog with a Bayesian filter.”

Thanks, Reverend Jkx :)

So, now your comments RSS feed is usable, right? Because it still seems to contain some spam in there (article 239)..

Yes, my comment RSS is still full of spam. I need to apply the filter there too. Right now, I’m using it to check that everything is OK. I will switch soon.

Bye ..

What happens to the spam database file when two people submit a comment at the same time? Is there a way to prevent it from getting corrupted?

This depends on how you plug this into your webapp, but you can easily protect the write with a lock.
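One way to do that, sketched under the assumption that all requests are served by threads of a single Python process, is to wrap the train-and-save step in a `threading.Lock`. `locked_train` is a hypothetical helper, and `guesser` is any object with `train(pool, text)` and `save(path)` methods like the ones used in the module above. With several worker processes this lock would not be enough, and a file-level lock (for example `fcntl.flock` on Unix) would be needed instead.

```python
import threading

# One lock shared by every request handler in this process.
_db_lock = threading.Lock()


def locked_train(guesser, pool, text, path):
    """Train and save as a single step, so two saves never interleave."""
    with _db_lock:
        guesser.train(pool, text)
        guesser.save(path)
```

The module's `train_spam(text)` could then become `locked_train(guesser, 'spam', text, SPAM_DB)`, and likewise for `train_ham`.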

How is the filter working out these days? Is this approach still worth implementing? Thanks, Mike

Spam Filtering may reduce the number of spam for a short while but you cant say that it is an ultimate solution to Spamming. The reason is that the Spammers are aware of these filtering techniques whether it is Filtering with BogoFire or some other. There are many websites available that are providing the information on Anti-Spamming Solutions but most of this information is either irrelevant or not useful. I have recently visited a website that I would like to suggest

Anti-Spam Solutions Website

I’ve been trying to implement Reverend as part of an rss feed aggregator I am working on — instead of filtering spam, it will be filtering by interest level (i.e. the “ham” to “spam” ratio reflects how interesting it thinks the article will be to the reader). Unfortunately, I can’t get it working properly — training the database works, as the database size keeps increasing, but the “guess” function returns either (0,0) or None depending on whether I indent the return or not (this result occurs even when I copy and paste previously classified text verbatim). Is there some “magic size” that the database has to grow to before it starts actually classifying entries, or am I missing something completely different? (I’ve tried first with your code converted into a class, and second by just copying and pasting your code).

*sigh* ignore the last comment — I figured it out (I was passing the word “description” rather than the string description). I need to stop coding at 3 AM local.

Jkx@home
