Returning the Favour

September 14, 2004

After decades of scientists looking to computers and information technology (IT) to help them solve their research problems, it looks like the tables have turned: the scientific community is now helping the computer world in its fight against the ever-increasing problem of spam.

Bioinformaticians Isidore Rigoutsos and Tien Huynh at IBM's TJ Watson Research Centre have devised an anti-spam filter based on the way scientists analyse genetic sequences. The algorithm, named Chung-Kwei after the Feng Shui character (the ancient Chinese symbol of protection), works by automatically learning patterns of vocabulary in spam (unsolicited e-mail). Tests performed by the scientists showed the algorithm to be 96.56 percent effective at detecting spam, with a false-positive rate of only 0.066 percent, or one in six thousand.

The research, which started just over a year ago, is part of SpamGuru, a collaborative anti-spam filtering solution currently under development at IBM Research. It stemmed directly from another algorithm, Teiresias, which the bioinformaticians originally designed to search for recurring patterns in different DNA and amino acid sequences as indicators of the role of genetic structure and variation. The algorithm determines protein structure based on the way the amino acids are strung together. The Chung-Kwei algorithm mimics this process of DNA analysis by looking for long strings of characters that occur in spam e-mail but not in normal, non-spam e-mail (white e-mail). The research used 66,000 spam messages and 22,000 non-spam messages as training data for the algorithm.
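The idea of looking for strings that recur in spam but never appear in white e-mail can be sketched roughly as follows. This is only an illustration and is not the Teiresias or Chung-Kwei algorithm: it assumes fixed-length substrings as a stand-in for the variable-length patterns those algorithms discover, and the names `kgrams` and `discover_spam_patterns`, along with the parameters `k` and `min_support`, are hypothetical.

```python
from collections import Counter


def kgrams(text, k):
    """Return the set of all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def discover_spam_patterns(spam, white, k=6, min_support=2):
    """Collect substrings seen in at least min_support spam messages
    and in no white (non-spam) message.

    Simplifying assumption: fixed-length substrings stand in for the
    variable-length patterns a real pattern-discovery algorithm finds.
    """
    support = Counter()
    for message in spam:
        # Each substring is counted once per message, so the count is
        # the number of spam messages that contain it.
        support.update(kgrams(message.lower(), k))

    white_grams = set()
    for message in white:
        white_grams |= kgrams(message.lower(), k)

    return {gram for gram, count in support.items()
            if count >= min_support and gram not in white_grams}


# Tiny made-up corpora, purely for illustration.
spam = ["Buy cheap pills now", "cheap pills shipped fast"]
white = ["Meeting moved to noon", "Lunch is cheap today"]
patterns = discover_spam_patterns(spam, white)
print(sorted(patterns))
```

In this toy run the substring "cheap " is discarded because it also turns up in a white message, while fragments of "cheap pills" that appear in both spam messages are kept.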

White e-mail was distinguished from spam through a score-based system. The Teiresias algorithm was first run on a collection of spam e-mails to discover patterns that occur twice or more. Incoming e-mails were then processed to see whether they matched any of the collected patterns. Each incoming e-mail was scored according to how frequently these character strings arose: the more often they appeared, the higher the score, while e-mails with very few occurrences of spam-type strings received lower scores. This methodology is referred to as "guilt by association" and is commonly used in computational biology research and in a number of life science and computer security applications. The algorithm was also trained not to be deceived by the common techniques spammers have devised to beat such systems, such as replacing a character with a similar-looking one, for example using "$" instead of "S".
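A score-based check of this kind might look something like the sketch below. It is a simplified assumption of how such scoring could work, not IBM's implementation: the pattern set, the threshold of 2, the look-alike table (the article mentions only "$" for "S") and the function names `normalise`, `spam_score` and `is_spam` are all hypothetical.

```python
# Hypothetical look-alike table; only "$" for "S" comes from the article.
LOOKALIKES = str.maketrans({"$": "s", "@": "a"})


def normalise(text):
    """Lower-case the text and undo look-alike character substitutions."""
    return text.lower().translate(LOOKALIKES)


def spam_score(message, patterns):
    """Sum the non-overlapping occurrences of each known spam pattern."""
    text = normalise(message)
    return sum(text.count(pattern) for pattern in patterns)


def is_spam(message, patterns, threshold=2):
    """Flag a message whose score reaches the (assumed) threshold."""
    return spam_score(message, patterns) >= threshold


# Made-up patterns standing in for those learned from a spam collection.
patterns = {"cheap pills", "act now"}
print(spam_score("Cheap pill$ here, act now!", patterns))
print(is_spam("Lunch at noon?", patterns))
```

Normalising before matching is what lets "pill$" count as a hit for "pills"; a message that matches none of the collected patterns scores zero and is treated as white e-mail.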

Ever since the unravelling of the DNA (deoxyribonucleic acid) structure by James Watson and Francis Crick in 1953 (see Genome-IT, OBBeC.Com, April 2004), scientists have been seeking ways to analyse, interpret, store and manage information gathered from their research projects. Computers have proved to be the saving grace in helping researchers with this mission-critical and compute-intensive research. We have witnessed many areas of science where computers and IT have come to the rescue of scientists, including molecular visualisation of protein structures, data analysis and storage, and the mapping and sequencing of the human genome, as well as, more recently, their ever-increasing use in the healthcare community (see Information Technology and the National Health Service, OBBeC.Com, June 2004). So it is a refreshing change to see that, through biocomputing, life science researchers are returning the favour to the IT world.

 
