A new approach to fighting spam?

Spam and the uncertainty that spam filters cause have dramatically reduced the effectiveness of one of the most popular uses of the Internet. Maybe it’s time for a different approach to filtering email.

Phones and email are about equally useful: given the choice between giving up their phone or giving up email, people are about evenly divided. When comparing email to other Internet technologies, however, it's no contest. Given the choice between giving up email and giving up browser-based web access, people cheerfully forgo the web in favor of email. The web may be nice to have, but email is a necessity, and most businesses really can’t function without it.

Unfortunately, almost all of today’s email traffic is spam, so it’s necessary to separate the spam from the legitimate messages before they reach users’ inboxes. If you don’t do that, users are quickly overwhelmed by the sheer volume of spam that they receive.

Spammers are clever, however, and quickly find a way to get around the latest updates to spam filtering software. That’s possible because filtering applications use the latest models of what spam looks like to help them decide whether or not to let a message pass. Once spammers learn what filters look for, they quickly invent a way to get their spam past the filters.

Maybe an entirely new approach to filtering email is needed, and the fact that most email is actually spam may be the insight that we need for this. In particular, instead of trying to identify spam, why not try to identify legitimate email messages instead?

Note that this is entirely different from white-listing. With white-listing, an approved list of names, domains or IP addresses is used to allow incoming email. Instead, this approach looks at the content of an email and tries to decide if a particular message is legitimate. White-listing doesn’t look for valid email messages. An entirely different model may be needed to do that.
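
To make the distinction concrete, here is a minimal Python sketch of the two kinds of check. Everything in it is an assumption made for illustration: the `Message` type, the `APPROVED_SENDERS` list, and especially the three "signs of a legitimate message" are hypothetical placeholders, not rules that any real filter is known to use.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str
    body: str


# Hypothetical approved list, as used in white-listing.
APPROVED_SENDERS = {"alice@example.com"}


def whitelist_allows(msg: Message) -> bool:
    """White-listing: the decision depends only on who sent the message."""
    return msg.sender.lower() in APPROVED_SENDERS


def legitimacy_signals(msg: Message) -> int:
    """Content-based check: count signs that a message is ordinary mail.

    The three signals below are made-up placeholders for illustration.
    """
    signals = 0
    if msg.subject.lower().startswith("re:"):     # looks like a reply
        signals += 1
    if len(msg.body.split()) >= 30:               # more than a few words
        signals += 1
    if not re.search(r"https?://", msg.body):     # contains no links
        signals += 1
    return signals


def content_allows(msg: Message, required: int = 2) -> bool:
    """Deliver only if enough legitimacy signals are present."""
    return legitimacy_signals(msg) >= required
```

In this sketch a message from an unknown sender can still be delivered if its content looks ordinary, and a message from an approved sender can still be held back if it does not. That is the sense in which the two approaches differ.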

Information security vendors have spent lots of time and effort over the past several years developing ways to identify spam. The benefits of this research have been temporary at best because spammers quickly learn to avoid the most recent versions of anti-spam filters. But while the arms race between spammers and anti-spam vendors has led to all sorts of unusual messages that are designed to pass filters undetected, the format of legitimate emails hasn’t changed much at all, and because of this, identifying legitimate emails may be a better strategy than identifying spam.

Looking for legitimate emails seems to be very simple to implement because it can be done with a minimal change to existing networks. All that’s needed is different logic on anti-spam filtering products. Everything else can stay the same. That seems much simpler than some of the alternatives that have been proposed. Simple is definitely good. It might even be effective.

I haven't heard of this approach being used. Maybe there's some obvious reason why it won't work.

  • Rob Adams

    Identifying legitimate messages is the same problem as identifying spam messages. The problem in both cases is that spam can look arbitrarily similar to legitimate messages.


  • Jim MacLeod

    Sorry, the core of your suggestion is already mainstream. The anti-spam statistical decision engines based on language processing, machine learning, sender reputation, etc., generally output a result which is essentially “percent chance this is spam” based on comparison between collections of “known-good” and “known-bad” emails. The hard part is bootstrapping the process – creating the collections of emails.
    Research focuses on categorizing spam because it’s a much easier task than categorizing natural language. Comparing the characteristics of spam versus not-spam, there are a lot more similarities between different spam emails than there are between not-spam emails. You suggest that the format of not-spam messages “hasn’t changed much” – but that assumes that there _is_ or ever was a standard format for email. If there were, spammers would of course copy it.
    Spam by its nature tends to have specific characteristics. Messages tend to be short, cover only 1 of about a dozen topics, and (these days) generally include a link to a site whose reputation is bad or which has only recently been registered. It’s fairly rare for an entirely new concept in spam to emerge – and once it does, if it’s effective, it’s copied fairly quickly, which means that most anti-spam vendors figure out how to kill it pretty quickly.
    Finally, there’s a wonderful paradox about spam: as spammers try to evade anti-spam filters, their messages increasingly tend to diverge from standard natural language – which makes it easier to identify a spam message on sight.
    Check out http://craphound.com/spamsolutions.txt for additional commentary on anti-spam research.
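
As a rough illustration of the "percent chance this is spam" score described in the comment above, here is a minimal naive Bayes sketch in Python. It is a toy under stated assumptions, not how any particular vendor's engine works: it uses word counts only, assumes equal prior odds of spam and not-spam, and the function names (`tokenize`, `train`, `spam_probability`) are made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase, whitespace-separated words."""
    return text.lower().split()


def train(messages):
    """Count how often each word appears in a collection of emails."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def spam_probability(text, good_counts, bad_counts):
    """Estimate P(spam | text) from known-good and known-bad word counts.

    Assumes equal priors; words seen in neither collection are ignored.
    """
    vocab = set(good_counts) | set(bad_counts)
    good_total = sum(good_counts.values())
    bad_total = sum(bad_counts.values())
    log_good = 0.0
    log_bad = 0.0
    for word in tokenize(text):
        if word not in vocab:
            continue
        # Add-one (Laplace) smoothing avoids taking the log of zero.
        log_good += math.log((good_counts[word] + 1) / (good_total + len(vocab)))
        log_bad += math.log((bad_counts[word] + 1) / (bad_total + len(vocab)))
    # Clamp the difference so math.exp cannot overflow on long messages.
    diff = max(min(log_good - log_bad, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))


good = train(["meeting notes attached for tomorrow", "lunch tomorrow with the team"])
bad = train(["cheap pills buy now", "buy cheap watches now"])
print(spam_probability("buy cheap pills", good, bad))
print(spam_probability("meeting tomorrow", good, bad))
```

With these tiny collections, the first message scores above 0.5 and the second below it. The sketch also shows why bootstrapping is the hard part: the score is only as good as the two collections it is trained on.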


  • Luther Martin

    My thinking here was really based on discussions that I had with anti-spam vendors at the RSA show this year. We were talking about how image spam managed to appear and bypass most spam filters, and I was told that this happened because the filters were looking for indicators of spam and not for indicators of a valid message.


