Tuesday, October 14, 2003

Spam spam spam spam spam eggs sausage and spam

I delete, on a daily basis, about one hundred spam emails. Now don't get me wrong, I appreciate the fact that there are legions of marketroids out there who are clearly very concerned about my penis size, my breast size, my mortgage rates, and my access to photos of 'hawt m0ms'; I'm touched, really, to know that so many entrepreneurs are thinking of me. But sometimes I just wish to read mail from my friends, and I have to wade through endless spam to find it. Even worse, there are times when the nature of a specific message isn't clear, and I have to open up an email to determine its contents.

Now I'm not doing anything foolish like using Outlook for an email client, or reading web-based mail with Internet Explorer, so I am relatively insulated from the worst of the viruses and attachments that I receive each day. My mail client doesn't injudiciously run any code that comes along, and it also doesn't hide the true content of an email from me the way Outlook can. But while my computer is safe and free from infection, I find that I myself suffer from the viral memes that I am exposed to in the process of managing the spam. It has nothing to do with content; just the fact that so much of my time, attention, and energy are wasted on dealing with this issue gives me a sinking feeling... I already find but little joy in my dealings with the humans, and now I find that even from my sickbed I cannot avoid the worst of them. I have seen the world's digital face, and for the most part it isn't pretty.

Some bright folks have been bending their will toward the resolution of this problem. Paul Graham basically jumpstarted the whole discussion with A Plan for Spam, wherein he recommends Bayesian filtering as an effective technique for spam reduction. As I see it, there are two problems with this approach. First, the likelihood of false positives is non-zero, so you still must dig through every mail to check the accuracy of the filter. Second, Bayesian filters work by analyzing the 'spamminess' of specific words and/or 'tokens', which can be gotten around by deliberately mangling the spelling of words in the email. (A token can be a word, or it could be HTML code, or anything; for instance, the token '#FF0000', which represents the color red in HTML, is found more often in spam that is using colors to get your attention than it is in email from your friends.)
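To make the token idea concrete, here is a minimal sketch of per-token scoring, loosely in the spirit of Graham's essay but not a faithful copy of it. The function names, the tokenizer pattern, the 0.4 score for unseen tokens, and the 0.01/0.99 clamps are all assumptions chosen for illustration, and the sketch assumes both training sets are non-empty.

```python
import re
from collections import Counter

# Assumed tokenizer: words plus a few symbols, so that '#FF0000' survives as a token.
TOKEN_RE = re.compile(r"[A-Za-z0-9#$'-]+")

def tokenize(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]

def train(spam_msgs, ham_msgs):
    """Count token occurrences in hand-sorted spam and non-spam ("ham") messages."""
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    return spam_counts, ham_counts, len(spam_msgs), len(ham_msgs)

def token_spaminess(token, model, unknown=0.4):
    """Estimate how spam-like a single token is, clamped to [0.01, 0.99]."""
    spam_counts, ham_counts, n_spam, n_ham = model
    b, g = spam_counts[token], ham_counts[token]
    if b + g == 0:
        return unknown  # never-seen tokens get a mildly innocent score
    bad = min(1.0, b / n_spam)
    good = min(1.0, g / n_ham)
    return max(0.01, min(0.99, bad / (bad + good)))

def message_spaminess(text, model):
    """Combine the per-token scores into one probability-like number."""
    spam_product = 1.0
    ham_product = 1.0
    for token in set(tokenize(text)):
        p = token_spaminess(token, model)
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)

model = train(
    ["cheap mortgage rates click now", "cheap pills click #ff0000"],
    ["lunch on tuesday with friends", "here are the photos from the trip"],
)
print(message_spaminess("cheap mortgage click", model))  # close to 1
print(message_spaminess("ch3ap m0rtgage cl1ck", model))  # well below 0.5
```

The second call shows the weakness described above: every mangled word is a token the filter has never seen, so each one gets the neutral score and the message slides through, at least until the filter is retrained on the mangled spellings.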

There's also been a lot of talk about challenge-response systems, wherein any email that doesn't come from a known and trusted source is held in check while a reply is sent back telling the sender how to get themselves added to the "I'm legitimate" list. While effective, this is pretty damn annoying, and it roughly triples the traffic required to send one email. One of the basic premises of this approach is that the large-scale economics of spam distribution dictate that spam senders can't be bothered to play this silly little game: they never respond to the challenge email, and after some preset time the server can just throw the email away without you ever seeing it. But as spam detection becomes more sophisticated, spammers continue to up the ante, and it won't be long before this technique is rendered useless. It is also prone to errors... imagine if both people had the same system, bouncing challenge emails back and forth between the servers ad infinitum...

Blacklists are not useful, since spammers can camouflage themselves pretty effectively. Mail sent through open relays that do not require authentication can have zero truthful identifying characteristics, if the sender so chooses.

Spam of the future will likely be bulletproof... filtering that is based on content can be sidestepped if there is no objectionable content, From: headers can be forged, and a simple "Hi- here is the link I told you about" email is too innocuous to set off any alarms.

It seems to me that one way to defeat spam is to attack it at the server level. If an email shows up for me and ten other people in my domain (usually in an alphabetic list), there is a higher likelihood of that email being spam than there is that all of us belong to the same mailing list. (I can effectively block about 30 percent of my spam by filtering out any mail that contains the name of the user who precedes me alphabetically.) Blocking at the server level means that less time is wasted, less bandwidth is used, and less frustration is generated. But it requires an unlikely commitment from all ISPs.
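As a rough sketch of that heuristic, the check below counts how many distinct addresses in one domain appear in a message's To: and Cc: headers. The function names and the `my.server` domain are made up for illustration, and the default threshold of 11 (me plus ten others) is only an assumption taken from the example above, not a tuned value.

```python
from email import message_from_string
from email.utils import getaddresses

def local_recipients(raw_message, domain):
    """Return the distinct To:/Cc: addresses that belong to the given domain."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    addrs = [addr.lower() for _, addr in getaddresses(headers)]
    return sorted({a for a in addrs if a.endswith("@" + domain.lower())})

def looks_like_directory_blast(raw_message, domain, threshold=11):
    """Flag mail addressed to an unusually long run of users in one domain."""
    return len(local_recipients(raw_message, domain)) >= threshold

# Hypothetical message addressed to eleven users at the same domain.
raw = (
    "From: someone@example.com\n"
    "To: " + ", ".join(f"user{i}@my.server" for i in range(11)) + "\n"
    "Subject: hi\n"
    "\n"
    "Here is the link I told you about\n"
)
print(looks_like_directory_blast(raw, "my.server"))
```

A check like this only sees what the headers admit to, so it does nothing against spam that arrives one recipient at a time or via Bcc:.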

I would like to see a spam filter that has the following characteristics:

  • a whitelist, for known friends (always accept)
  • a simple blacklist (always reject)
  • two levels of filtering:

    • mail that is explicitly addressed to me in the To: or Cc: fields goes through one level of filtering
    • mail that is not explicitly addressed to me (could be a Bcc:) undergoes a more thorough filtering

  • most importantly: all components of the filter should be configurable with regular expressions and boolean logic

That last bit is key: many email clients have built-in filtering that is too dumbed down to be of any use against spam; these are more useful for moving emails into specific mailboxes than for critical examination of the validity of a message. A client (or even better, a server-side script working with procmail) that provided a logical framework for spam detection via tokens would be awesome... allowing us to make rules like [if (any recipient = me@my.server) and (any other recipient = someone_else@my.server who is not on my whitelist) then this is SPAM].
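The wish list above could be sketched roughly as follows. This is one possible shape under stated assumptions, not a finished filter: every name here (`classify`, `header_matches`, `sibling_rule`, the example addresses and patterns) is hypothetical. Rules are plain Python callables, so boolean logic comes from `and`, `or`, and `not`, and regular expressions come from the standard `re` module.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parseaddr

def recipients(msg):
    """All addresses named in the To: and Cc: headers, lowercased."""
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    return {addr.lower() for _, addr in getaddresses(headers) if addr}

def header_matches(header, pattern):
    """Build a rule that fires when a header matches a regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return lambda msg, rcpts: bool(regex.search(msg.get(header, "")))

def sibling_rule(me, whitelist):
    """The rule from the text: addressed to me AND to a same-domain stranger."""
    domain = me.split("@", 1)[1]
    def rule(msg, rcpts):
        others = {r for r in rcpts if r != me and r.endswith("@" + domain)}
        return me in rcpts and bool(others - whitelist)
    return rule

def classify(raw, me, whitelist, blacklist, light_rules, strict_rules):
    msg = message_from_string(raw)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in whitelist:
        return "accept"   # known friends always get through
    if sender in blacklist:
        return "reject"   # known offenders never do
    rcpts = recipients(msg)
    # Mail not explicitly addressed to me (possibly a Bcc:) gets the stricter rules too.
    rules = light_rules if me in rcpts else light_rules + strict_rules
    return "reject" if any(rule(msg, rcpts) for rule in rules) else "accept"

me = "me@my.server"
whitelist = {"friend@example.org", "pal@my.server"}
blacklist = {"spammer@example.com"}
light = [sibling_rule(me, whitelist)]
strict = [header_matches("Subject", r"mortgage|hawt")]

raw = "From: x@example.net\nTo: me@my.server, lee@my.server\nSubject: hi\n\nbody\n"
print(classify(raw, me, whitelist, blacklist, light, strict))  # reject
```

Keeping each rule as a small function means my definition of spam lives entirely in the two rule lists, which is the point: someone else can plug in different rules without touching the rest.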

This, then, brings up the crux of the issue: each person should be free to define spam differently, according to their tastes, needs, and desires. Giving people the ability to filter based on their personal definition of spam would be the first step.

I guess I better get busy writing my own filter...