POPFile Revisited

Filed under:General — eric @ 5:16 am

Back on the first of November, I told you about POPFile, an email classifier/spam filter built on Bayesian statistics, math that dates back to the 1700s. Now that three months have gone by, I thought I’d let you know how it’s worked out.

I can safely say that this is the best thing that has ever happened to my email. You probably know that I’ve been on the ‘net for a long time, and I’ve never shied away from putting my email address out there. This means that I get a lot of spam. A Lot. And over the years, I’ve tried every spam filter at my disposal, and none were all that satisfactory. And then along came POPFile.

I’ve got quite a few email accounts, and for simplicity, they all eventually end up on the same computer. I try to have different accounts for different purposes, such as theater, farm, work, etc., but many of my friends overlap those purposes, so sorting by incoming address isn’t very reliable. POPFile handles that, too.

In POPFile, you create as many buckets as you want to sort your email. I’ve got six: farm, personal, spam, townandgown (my theater), work, and yahoolists (which actually is a general-purpose mailing list bucket). After you create your buckets, you feed them email as it comes in, and they quickly learn what emails to grab. The email then gets labeled (either by prepending the bucket name to the subject or adding a custom email header, depending on your email reader’s capabilities and your wishes). So, when you finally read it, it’s labeled and then sorted by your reader. Even with these six, somewhat fuzzily defined, buckets, POPFile has been very accurate:

Just over 98%, and most of the errors would have been hard for a human to classify correctly, such as when a co-worker sends me a theater-related email. It was over 99% until a new version of POPFile came out this week. It has a better algorithm, but it had to re-learn a few emails to get things straight again. So, out of over 9,000 emails coming in, 177 got mislabeled. A couple of those were “false positives” (real mail labeled as spam), but they are extremely rare. Here’s how many were sent to each bucket:

All those emails, and 41% of them were spam. That’s several thousand trash emails over the last few months wasting my resources. I’ve got my email reader automatically deleting them, so I never have to see them any more. Yay!

The system works by building a dictionary of words, assigning probabilities to each so it can compile an overall probability that a given piece of email goes in a specific bucket. It takes a surprisingly small dictionary to achieve these results:

My work bucket has over a million words in it, but almost all of them are from a humongous attachment containing computer code that I accidentally fed it the first day I was using the system. The others range from about 8,000 words up to 41,000 words. That’s really not very many feedings. And when it gets something wrong, I correct it (using a very simple web interface), and that refines the dictionary accordingly.
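
For the curious, here is a rough sketch in Python of how a word-dictionary classifier like this can work. It is not POPFile's actual code (POPFile itself is written in Perl). It is a bare-bones naive Bayes classifier, the general technique POPFile is built on. The class name, the method names, the tokenizer, and the add-one smoothing are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


class BucketClassifier:
    """Toy naive Bayes classifier: one word dictionary per bucket (illustrative only)."""

    def __init__(self, buckets):
        self.words = {b: Counter() for b in buckets}  # word counts per bucket
        self.messages = {b: 0 for b in buckets}       # messages fed to each bucket

    @staticmethod
    def tokenize(text):
        # Hypothetical tokenizer: lowercase runs of letters, digits, apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        """Feed one email to a bucket, growing that bucket's dictionary."""
        self.words[bucket].update(self.tokenize(text))
        self.messages[bucket] += 1

    def untrain(self, bucket, text):
        """Take one email back out of a bucket."""
        self.words[bucket].subtract(self.tokenize(text))
        self.words[bucket] = +self.words[bucket]  # unary + drops counts <= 0
        self.messages[bucket] = max(0, self.messages[bucket] - 1)

    def scores(self, text):
        """Return a log-probability score for each bucket."""
        tokens = self.tokenize(text)
        vocab = set().union(*self.words.values())
        total_messages = sum(self.messages.values())
        result = {}
        for bucket, counts in self.words.items():
            # Prior: how much of the mail so far went to this bucket.
            score = math.log(
                (self.messages[bucket] + 1) / (total_messages + len(self.words))
            )
            bucket_total = sum(counts.values())
            for tok in tokens:
                # Add-one smoothing so an unseen word never zeroes out a bucket.
                score += math.log((counts[tok] + 1) / (bucket_total + len(vocab) + 1))
            result[bucket] = score
        return result

    def classify(self, text):
        """Pick the bucket with the highest score."""
        s = self.scores(text)
        return max(s, key=s.get)

    def correct(self, text, wrong, right):
        """Fix a mislabeled email: move its words from one bucket to another."""
        self.untrain(wrong, text)
        self.train(right, text)


clf = BucketClassifier(["spam", "personal"])
clf.train("spam", "cheap pills buy now")
clf.train("spam", "win money now")
clf.train("personal", "dinner at the farm tonight")
clf.train("personal", "rehearsal moved to friday")
print(clf.classify("buy cheap pills"))
```

The `correct` method is the counterpart of clicking "that's wrong" in the web interface: the message's words come out of the wrong bucket's dictionary and go into the right one, so the next similar email is scored differently.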

POPFile is platform independent and requires only that you get your email through a POP account, so it won’t work on web-based systems like Yahoo mail, for instance. To make things even simpler for Windows users, there is an exe installer available that does everything but configure your email reader and set up buckets, and the step-by-step instructions will guide you from there.

You might not be getting as much spam as I am (though it probably feels like it sometimes), but if you get any at all, POPFile is worth it. It’s impossible for spammers to fool (unlike most of the other filters out there), so if enough people had systems like this in place, maybe the spammers would finally give it up.