Spam and OCR

It’s strange how the same techniques can be used to attack both sides of a problem. For a while now, the more sophisticated web spammers have been using OCR techniques to circumvent CAPTCHAs on websites in order to hijack free email accounts, submit comment spam on blogs, and cause similar kinds of mischief.

The more capable e-mail spammers seem to have figured out that anti-spam technologies, which normally use rule-based detection, Bayesian learning, or a combination of the two, are getting pretty good at filtering out the crap they send. As a result, a lot of the spam now being sent is image-based - and anti-spammers are now using OCR to fight back against this new tide.

As I’ve mentioned before, I have a huge spam problem on my personal e-mail account (~4,000/week) - due to a combination of bad luck and some naivety at a few points - and so I have a fairly highly-tuned SpamAssassin installation running at home, with plenty of custom rules and plugins. I’ve seen a rising amount of image spam on it, so I decided to give FuzzyOcr, a plugin for SpamAssassin, a try. So far, the results are pretty impressive. FuzzyOcr uses the open-source gocr program as its engine, and ties it into SpamAssassin with some logic of its own. The OCR is fairly CPU-intensive, so unlike most SpamAssassin plugins, it only kicks in if the message would otherwise score below a certain threshold. So far it has roughly halved the volume of spam that slips through into my inbox (previously ~40-50/day), which is a welcome improvement.
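To make the "only kicks in below a threshold" idea concrete, here is a minimal sketch in Python. It is not FuzzyOcr's actual code or configuration: the names `final_score`, `SPAM_THRESHOLD` and `fake_ocr_check`, and the score values, are all made up for illustration. The point is simply that the expensive check is skipped whenever the cheap rules have already pushed the message over the line.

```python
from typing import Callable, List

# Hypothetical cut-off: at or above this score, the cheap rules alone
# have already flagged the message, so OCR would be wasted effort.
SPAM_THRESHOLD = 5.0


def final_score(cheap_score: float,
                ocr_check: Callable[[], float],
                threshold: float = SPAM_THRESHOLD) -> float:
    """Add the OCR score only if the cheap rules left the message below the threshold."""
    if cheap_score >= threshold:
        # Already spam according to the cheap rules: skip the CPU-heavy step.
        return cheap_score
    # Still undecided: pay for the OCR pass and add whatever it scores.
    return cheap_score + ocr_check()


# Stand-in for the real OCR step; it records each call so we can see when it ran.
ocr_calls: List[int] = []


def fake_ocr_check() -> float:
    ocr_calls.append(1)
    return 4.0  # made-up score for "image contains spammy words"


already_spam = final_score(7.5, fake_ocr_check)  # OCR is skipped
borderline = final_score(2.0, fake_ocr_check)    # OCR runs once
```

With this arrangement the OCR cost is only paid for messages that the ordinary rules could not decide on their own, which is where image spam tends to end up.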

However, fun though they are as a challenge, technical approaches such as these always feel like fighting a losing battle. I might write a lengthier article on this at a later date, but I’d like to see ISPs take a far harder line with their peers that host spammers. There are also compelling economic solutions to the problem, mostly related to micro-payments for sending email. There are problems with those too (how do you roll them out gradually?), but you rarely see graphs of spam that have a downward trend - a solution to the spam problem would be most welcome.