cs373

Writeups for CS 373 Defense against the Dark Arts, Winter 2018


Week 8: Spam

This week was titled “messaging security”, but it is actually all about spam detection.


Let’s start with some basic terminology.

Blocking spam

A lot of spam blocking actually occurs at the network or SMTP level, long before the email hits your inbox (or spam folder).

One of the main weapons against spam is the RBL, or Realtime Blackhole List: a blacklist of IP addresses known to be sending spam.

This is great because it blocks spam at the network level, which is cheap: you simply configure your routers to drop all packets coming from IPs on the RBL.

The downside is that IP addresses often change hands, and what was a spam IP yesterday might be a legitimate IP today, or vice versa. To keep up with the changes, you have to subscribe to an RBL from a company like Spamhaus, which reportedly owns most of the RBL market.

One of the ways Spamhaus gets the data for its RBLs is through PBLs, or Policy Block Lists. When an ISP allocates a range of dynamic IPs, it tells Spamhaus and that range goes on a blocklist (consumers aren’t allowed to run their own mail servers, you see).
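
To make the lookup mechanics concrete: RBLs are usually published over DNS (a “DNSBL”). Here’s a minimal sketch of a lookup; zen.spamhaus.org is Spamhaus’s real combined zone, but the helper function is my own invention, and note that Spamhaus restricts queries made through large public resolvers.

```python
import socket

def is_listed(ip, zone="zen.spamhaus.org"):
    """Check an IPv4 address against a DNS-based blocklist (DNSBL).

    The octets are reversed and prepended to the zone: checking
    1.2.3.4 means resolving 4.3.2.1.zen.spamhaus.org. Any A-record
    answer means "listed"; NXDOMAIN means "not listed".
    """
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        return False

print(is_listed("127.0.0.2"))  # conventional DNSBL test entry, always listed
```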


One step up from the network layer is the application layer.

SMTP (Simple Mail Transfer Protocol) is the protocol that governs how mail moves between servers. Email is a hop-by-hop business: you send your email to an outgoing SMTP server; it stores the message and forwards it to another server; and so on. Eventually the message reaches its destination, and you retrieve it from the final server using POP3 or IMAP.

When an SMTP server wants to block a spam message, it simply responds to the request with a 554 Denied code, which tells the sender to bugger off and not come back. It doesn’t generate a bounce message or anything; it just drops the message on the floor. This is the strongest response.

There are lesser error codes that can be used for lower-confidence spam, like 451 Temporary Error, which is also used for things like rate limiting. The server could also accept and store the message for later analysis, but if it later decides to block the message, it has to send a bounce.
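
To see what these codes look like from the sender’s side, here’s a hedged sketch using Python’s smtplib (the host and addresses are made up; a real spam-blocking server might reject at RCPT time or after the message body is sent):

```python
import smtplib

# Invented host and addresses, for illustration only.
msg = (b"From: alice@example.org\r\nTo: bob@example.com\r\n"
       b"Subject: hello\r\n\r\nMessage body.\r\n")

try:
    with smtplib.SMTP("mx.example.com") as server:
        server.sendmail("alice@example.org", ["bob@example.com"], msg)
except smtplib.SMTPRecipientsRefused as e:
    # Permanent rejection at RCPT time (e.g. 554): no bounce, just "go away".
    print("rejected:", e.recipients)
except smtplib.SMTPDataError as e:
    # Rejection after the message body was transmitted.
    if 400 <= e.smtp_code < 500:
        print("temporary failure (e.g. 451): queue and retry later")
    else:
        print("permanent failure:", e.smtp_code, e.smtp_error)
```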

Or it could just accept the message and let the recipient deal with it. These are the messages that end up in your spam filter.

It’s better to stop spam earlier, though.

For spam blocking to work at this level, classification has to be quick. McAfee’s spam classifier aims to work in 20ms or less, for example.

Detecting spam

There are two general methods for detecting spam: rule-based and Bayesian.

Rule-based

The rule-based approach is similar to what we discussed in week 3 for detecting viruses: manually written rules that try to match strings in an email against known phrases used in spam. Like AV rules, these can be either fixed strings or regexes.

We can also look at features of the email other than its contents. For example: does the email have an attachment? Does it contain a URL? Who is the sender?

We can construct even stronger rules by combining these sorts of features into “meta rules”. For example: if an email comes from Pfizer, has a subject containing “buy viagra”, contains an image (any image), contains a URL (any URL), and does not have an attachment, then it’s spam.
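
As a toy illustration, rules of both kinds might look something like this in Python (all rule content and field names are invented; real rule sets are far larger):

```python
import re

# Toy rules, invented for illustration.
SPAM_PHRASES = ["buy viagra", "act now", "you have won"]  # fixed strings
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)       # regex rule

def simple_rules_match(body):
    """True if any fixed-string rule matches the message body."""
    text = body.lower()
    return any(phrase in text for phrase in SPAM_PHRASES)

def meta_rule(msg):
    """A meta rule: fire only when several weak features co-occur."""
    return ("pfizer" in msg["sender"].lower()
            and "buy viagra" in msg["subject"].lower()
            and msg["has_image"]                        # any image at all
            and URL_RE.search(msg["body"]) is not None  # any URL at all
            and not msg["has_attachment"])
```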

Bayesian

That’s great but who wants to create a bunch of spam rules by hand? Nobody, that’s who.

Enter Bayes.

The Bayesian approach to spam filtering was popularized by Paul Graham in A Plan for Spam, an essay published in August 2002.

The basic idea is to use a statistical approach to automatically find words in the email that are highly associated with spam, instead of trying to find them by hand.

This idea became the SpamBayes project, an open-source Python program that used Bayesian filtering to flag spam. It enjoyed a few years of success, but it is now unmaintained. Its last release was in 2008.

Nowadays spammers have gotten more sophisticated, and the best filters are run by large companies like Google, which can draw on huge databases of spam messages to train more sophisticated machine-learning models.

Just like in URL classification last week, it is more important to have few false positives than to catch every spam message. Note, however, that because of the sheer volume of spam, a filter still has to catch the vast majority of it to be useful.

In which I ignore the lesson plan and do something else

We were supposed to come up with some home-grown rules for flagging spam messages, but since I already built a rule-based approach for URL classification last week, I decided to try the Bayesian approach this time.

We are interested in computing:

P(spam | word)

Which is read as “the probability that a message is spam, given that it contains the word word”.

Per Bayes’ Theorem,

P(spam | word) = P(word | spam) * P(spam) / P(word)

This says that the probability that a message is spam, given that it contains some word, is equal to the probability that the word appears in spam messages, times the prior probability that any message is spam, divided by the overall probability that the word appears in any message. That last term expands via the law of total probability:

P(word) = P(word | spam) * P(spam) + P(word | ham) * (1 - P(spam))

(P(spam) represents our prior probability that the message is spam, and P(spam | word) is the posterior probability after observing that it contains word. We’ll assume P(spam) is 0.5, even though the actual distribution is closer to 0.85, to avoid biasing the algorithm towards classifying messages as spam.)
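
To make the arithmetic concrete, here’s a small worked example with invented numbers (the helper function is mine, not from the lecture):

```python
def posterior(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Bayes' theorem for a single word, with an equal prior by default."""
    p_word = (p_word_given_spam * p_spam
              + p_word_given_ham * (1 - p_spam))
    return p_word_given_spam * p_spam / p_word

# A word seen in 4% of spam but only 0.1% of ham is strong evidence:
print(posterior(0.04, 0.001))  # ~0.976
```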


That’s just the probability after observing a single word. What we really want is to compute the probability after seeing every word in the message.

P(spam | word1, word2, word3, ...)

If we assume that each word is independent (a patently false assumption, but an important simplification of the problem; this independence assumption is the “naive” in naive Bayes), then we can combine the probabilities for each word via a simple formula.

Given two independent posterior probabilities p and q, the combined probability is

      p*q
----------------
p*q + (1-p)*(1-q)

Here’s the code if you want to see it: spam/bayes.py.
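
For reference, here’s a minimal sketch of that combination step (my naming, not necessarily what bayes.py uses):

```python
from functools import reduce

def combine(p, q):
    """Combine two independent spam probabilities into one."""
    return (p * q) / (p * q + (1 - p) * (1 - q))

def message_score(word_posteriors):
    """Fold all the per-word posteriors into a message-level score.

    Starting from 0.5 is neutral, since combine(0.5, q) == q.
    """
    return reduce(combine, word_posteriors, 0.5)

print(message_score([0.98, 0.95, 0.20]))  # ~0.996: the spammy words dominate
```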


I should note that the actual code deviates in a few ways from the algorithm as I described it above.

Results

By training my Bayesian filter on the full database of 100,000 messages, I was able to flag messages correctly in all but 420 cases: an accuracy rate of more than 99.5%. \o/

This, of course, is cheating, because you aren’t allowed to verify your algorithm on the same data set you used to train it. (Doing so is prone to overfitting and doesn’t give an accurate picture of how the algorithm really performs.)

So let’s try again.

After modifying the code to train on only a random sample of 50,000 messages (half of the full database), the number of misclassified messages skyrockets to about 4,500, for an accuracy rate of about 95.5%. /o\

This is not fantastic, but is still competitive.
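
For what it’s worth, the held-out split itself is only a couple of lines; this sketch assumes the corpus is a list of (text, is_spam) pairs, which is my framing rather than necessarily how bayes.py stores it:

```python
import random

# Hypothetical corpus, stubbed here; the real data set has
# 100,000 labeled messages.
messages = [("buy viagra now", True), ("meeting at noon", False)] * 50_000

random.seed(373)  # reproducible shuffle
random.shuffle(messages)
train = messages[: len(messages) // 2]     # fit word counts on this half only
held_out = messages[len(messages) // 2 :]  # score these, compare to labels
```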


In the recorded lectures, the class divided into three groups to try and classify the spam messages.

I want to repeat some of their methods here.

Lessons

Acknowledgements

This post is based on lectures given by Eric Peterson, Research Manager at McAfee.