Spam Classification/Filtering

From: Dave Hall <linux_at_no.spam.please>
Date: Tue Sep 16 2003 - 17:04:02 CST

On Tue, Sep 16, 2003 at 03:54:41PM -0600, Les Klassen Hamm wrote:
> Dave,
> What kind of system load does it generate while filtering?
> Does it slow notably when filtering email with attachments?
> I guess that depends on how beefy your mail server is, but I'd be interested to
> know.

Spambayes is designed to work on a per-user basis since everyone's spam and
ham is a little different. So far I'm just using it for my mailbox and a
couple of others I use for mailing lists. It processes probably 300-500
messages a day typically and the load is not noticeable. My server is an
800MHz slot-A Athlon with 512MB of RAM installed (top reports about 3/4
of that is free and swap hasn't been touched). The server runs Apache,
Zope, qmail, MySQL, Postgres (inactive) and tinydns/dnscache. The normal
load average is about 0.30.

As for attachments, I'm not sure how it handles those. If it's base64,
it would be a single token and that could make the db very large so I would
guess it ignores them. I don't know about text or HTML, I think those are
parsed. I'm too lazy to search the mailing list archives to confirm this.
HTML is tokenized into its readable form, basically converted to text so
all the tricks with comments and weird hacks are moot.
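
To illustrate why comment tricks stop working once HTML is reduced to its
readable text, here is a minimal sketch using Python's standard html.parser.
This is not the spambayes tokenizer; the class and function names
(VisibleText, html_to_tokens) are made up for the example, and it does not
filter out script or style contents.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes only; tags and comments are silently dropped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Comments go to handle_comment,
        # which does nothing by default, so they never reach self.parts.
        self.parts.append(data)


def html_to_tokens(html):
    """Hypothetical helper: strip markup, then split on whitespace."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).lower().split()


# A word broken up with a comment and a tag comes back out whole.
tokens = html_to_tokens("V<!-- x -->ia<b>gra</b> now")
print(tokens)  # ['viagra', 'now']
```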

The main resource spambayes needs is space for each user's db, primarily
hard drive space to store it. It can run with either a Berkeley-style db,
which takes a little longer for training but is much faster when
classifying, or a python pickle file, which is faster for training but
slower (and probably more memory intensive) for classifying. My db is about
2.5MB presently.
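
The difference between the two storage styles can be sketched roughly as
follows. This is only an illustration of the access patterns, not
spambayes's own storage code: the standard dbm module stands in for a
Berkeley-style db (which backend it uses depends on the Python build), and
the token counts and file names are invented.

```python
import dbm
import os
import pickle
import tempfile

# Made-up data: token -> (spam count, ham count)
counts = {"viagra": (40, 0), "meeting": (1, 25)}

workdir = tempfile.mkdtemp()

# Pickle style: one file holds the whole dictionary, so reading any
# token means loading every token into memory first.
pickle_path = os.path.join(workdir, "tokens.pck")
with open(pickle_path, "wb") as f:
    pickle.dump(counts, f)
with open(pickle_path, "rb") as f:
    all_counts = pickle.load(f)

# Keyed-db style: each token is its own record, written one at a time,
# and a single token can be looked up without reading the rest.
db_path = os.path.join(workdir, "tokens")
with dbm.open(db_path, "c") as db:
    for token, pair in counts.items():
        db[token] = pickle.dumps(pair)
with dbm.open(db_path, "r") as db:
    one = pickle.loads(db["viagra"])

print(all_counts["meeting"], one)
```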

There are a couple of options to integrate with POP or IMAP clients that
are basically proxy servers. I suspect the performance would be similar or
perhaps a little better since they would already have python loaded.

I retrain in batch, so far just once. It takes about 5 minutes to train
on 2000 messages. That's mostly disk activity, on OpenBSD. Linux might
be a little faster, since ext2/3 usually performs better than ffs.
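
As a rough back-of-the-envelope figure, treating "about 5 minutes" as
exactly 300 seconds (an approximation), the training rate works out like
this:

```python
train_seconds = 5 * 60   # "about 5 minutes", taken as exactly 300 s
messages = 2000

per_message = train_seconds / messages   # seconds spent per message
per_second = messages / train_seconds    # messages trained per second

print(f"{per_message:.2f} s/message, {per_second:.1f} messages/s")
```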

| <- You must be smarter than this stick to ride
     the Internet		-Mike Handler