Bayesian self-learning in Kerio Connect

Detecting spam on behalf of the final recipient of an email raises several problems. Understanding these problems helps explain what Bayesian self-learning is and how it fits into Kerio's solution for spam protection.

Terminology

Spam is a message the recipient considers an unsolicited junk email.
Ham is a message the recipient considers to be not spam.
False Positive is a ham message that is incorrectly marked as spam.
False Negative is a spam message that is incorrectly marked as ham.

SpamAssassin

SpamAssassin uses static rule sets to determine if a message is spam.

A fixed set of rules cannot accurately define spam for everybody. SpamAssassin may capture most spam with such rules, but it will always produce some false positives and false negatives.
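The limits of a fixed rule set can be illustrated with a minimal sketch. The rule names, patterns, scores, and threshold below are hypothetical and chosen only for illustration; they are not actual SpamAssassin rules or defaults.

```python
import re

# Hypothetical rules for illustration only; these are NOT real SpamAssassin rules.
# Each rule is (name, compiled pattern, score added when the pattern matches).
RULES = [
    ("MENTIONS_FREE_MONEY", re.compile(r"free money", re.IGNORECASE), 3.0),
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 2.0),
    ("URGENT_WORD", re.compile(r"\burgent\b", re.IGNORECASE), 1.5),
]

THRESHOLD = 5.0  # assumed cut-off: a total score at or above this is marked as spam


def score_message(text):
    """Return the sum of the scores of all rules whose pattern matches the text."""
    return sum(score for _name, pattern, score in RULES if pattern.search(text))


def is_spam(text):
    """Mark a message as spam when its total rule score reaches the threshold."""
    return score_message(text) >= THRESHOLD


# A typical spam message trips all three rules and is caught.
print(is_spam("URGENT: free money!!!"))
# A mutated spam message ("Fr3e m0ney") matches no rule: a false negative.
print(is_spam("Fr3e m0ney waiting for you"))
# A legitimate message can trip the same rules: a false positive.
print(is_spam("Urgent!!! Your question about free money market funds"))
```

The last two calls show why static rules alone are not enough: the same rule set misses mutated spam and flags legitimate mail.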

Spam also mutates: its content changes over time. Unless the rules in SpamAssassin change too, more and more spam gets through. Constant rule updates are therefore necessary to keep blocking as much spam as possible.

Bayesian filtering

Recipients can train the Bayes database to recognize messages as spam or ham. The filter breaks messages into small pieces called tokens and determines which tokens occur mostly in spam messages, and which tokens occur mostly in ham messages.
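The token-based learning described above can be sketched roughly as follows. This is a simplified illustration, not the filter used by Kerio Connect or SpamAssassin: the class name `TokenCounts`, the tokenization rule, and the smoothing formula are all assumptions made for the example.

```python
import re
from collections import Counter


class TokenCounts:
    """Hypothetical store of how many spam and ham messages contained each token."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    @staticmethod
    def tokenize(text):
        # Simplifying assumption: a token is a lowercase run of letters,
        # digits, or apostrophes. Each token counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, is_spam):
        """Record the tokens of one message the recipient marked as spam or ham."""
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_messages += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_messages += 1

    def spam_probability(self, token):
        """Estimate (0..1) how strongly a token indicates spam, with add-one smoothing."""
        spam_rate = (self.spam_tokens[token] + 1) / (self.spam_messages + 2)
        ham_rate = (self.ham_tokens[token] + 1) / (self.ham_messages + 2)
        return spam_rate / (spam_rate + ham_rate)


db = TokenCounts()
db.learn("Cheap pills, buy now", is_spam=True)
db.learn("Buy cheap watches", is_spam=True)
db.learn("Meeting notes attached", is_spam=False)
db.learn("Lunch meeting tomorrow", is_spam=False)

print(db.spam_probability("cheap"))    # seen only in spam
print(db.spam_probability("meeting"))  # seen only in ham
print(db.spam_probability("invoice"))  # never seen
```

Tokens seen mostly in spam score above 0.5, tokens seen mostly in ham score below it, and unknown tokens stay neutral at 0.5.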

The Bayes database must learn from many emails before it can function effectively. In general, it begins to work after it has learned at least 200 spam messages and 200 ham messages. End users must train the Bayes database enough to fight mutating spam effectively.
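As a small sketch of the training threshold, the check below uses the figure of 200 messages of each kind from the text. The function names are hypothetical, and how Kerio Connect tracks these counts internally is not covered here.

```python
MIN_SPAM = 200  # minimum learned spam messages, per the text above
MIN_HAM = 200   # minimum learned ham messages, per the text above


def bayes_ready(spam_learned, ham_learned):
    """Return True once both training counts have reached the minimum."""
    return spam_learned >= MIN_SPAM and ham_learned >= MIN_HAM


def remaining(spam_learned, ham_learned):
    """Return how many more (spam, ham) messages still need to be learned."""
    return max(0, MIN_SPAM - spam_learned), max(0, MIN_HAM - ham_learned)


print(bayes_ready(500, 199))  # plenty of spam, but one ham message short
print(remaining(150, 230))    # spam and ham messages still needed
```

Both counts matter: a database trained on plenty of spam but too little ham is still not ready.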