Proceedings - Spam 2006

TREC 2006 Spam Track Overview

Gordon V. Cormack

Abstract

TREC's Spam Track uses a standard testing framework that presents a set of chronologically ordered email messages to a spam filter for classification. In the filtering task, the messages are presented one at a time to the filter, which yields a binary judgement (spam or ham [i.e. non-spam]) which is compared to a human-adjudicated gold standard. The filter also yields a spamminess score, intended to reflect the likelihood that the classified message is spam, which is the subject of post-hoc ROC (Receiver Operating Characteristic) analysis. Two forms of user feedback are modeled: with immediate feedback the gold standard for each message is communicated to the filter immediately following classification; with delayed feedback the gold standard is communicated to the filter sometime later, so as to model a user reading email from time to time in batches. A new task - active learning - presents the filter with a large collection of unadjudicated messages, and has the filter request adjudication for a subset of them before classifying a set of future messages. Four test corpora - email messages plus gold standard judgements - were used to evaluate subject filters. Two of the corpora (the public corpora, one English and one Chinese) were distributed to participants, who ran their filters on the corpora using a track-supplied toolkit implementing the framework. Two of the corpora (the private corpora) were not distributed to participants; rather, participants submitted filter implementations that were run, using the toolkit, on the private data. Nine groups participated in the track, each submitting up to four filters for evaluation in each of the three tasks (filtering with immediate feedback; filtering with delayed feedback; active learning).
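
The immediate-feedback protocol described in this abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the track-supplied toolkit: `KeywordFilter`, `run_immediate_feedback`, and `roc_area` are hypothetical names, and the toy filter exists only so the loop has something to drive. The ROC area is computed here as the probability that a randomly chosen spam message receives a higher spamminess score than a randomly chosen ham message, with ties counted as one half.

```python
from typing import Iterable, List, Tuple

# (gold_is_spam, judgement_is_spam, spamminess_score)
Result = Tuple[bool, bool, float]


class KeywordFilter:
    """Hypothetical toy filter: each word keeps a spam-minus-ham count."""

    def __init__(self) -> None:
        self.word_weight = {}

    def classify(self, text: str) -> Tuple[bool, float]:
        score = float(sum(self.word_weight.get(w, 0) for w in text.lower().split()))
        return score > 0, score

    def train(self, text: str, is_spam: bool) -> None:
        delta = 1 if is_spam else -1
        for word in text.lower().split():
            self.word_weight[word] = self.word_weight.get(word, 0) + delta


def run_immediate_feedback(messages: Iterable[Tuple[str, bool]],
                           spam_filter: KeywordFilter) -> List[Result]:
    """Classify messages in order; reveal the gold label right after each one."""
    results = []
    for text, gold_is_spam in messages:
        judgement, score = spam_filter.classify(text)
        results.append((gold_is_spam, judgement, score))
        spam_filter.train(text, gold_is_spam)  # immediate feedback
    return results


def roc_area(results: List[Result]) -> float:
    """Fraction of (spam, ham) pairs ranked correctly; ties count one half."""
    spam = [s for gold, _, s in results if gold]
    ham = [s for gold, _, s in results if not gold]
    if not spam or not ham:
        raise ValueError("need at least one spam and one ham message")
    wins = 0.0
    for s in spam:
        for h in ham:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return wins / (len(spam) * len(ham))
```

Under delayed feedback, the call to `train` would instead be queued and applied some number of messages later; the rest of the loop would stay the same.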

Bibtex
@inproceedings{DBLP:conf/trec/Cormack06,
    author = {Gordon V. Cormack},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{TREC} 2006 Spam Track Overview},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/SPAM06.OVERVIEW.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Cormack06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

OSBF-Lua - A Text Classification Module for Lua: The Importance of the Training Method

Fidelis Assis

Abstract

OSBF-Lua is a C module for the Lua language which implements a Bayesian classifier enhanced with Orthogonal Sparse Bigrams (OSB) for feature extraction and Exponential Differential Document Count (EDDC) for feature selection. These two techniques, combined with the new training method introduced for TREC 2006, produce a highly accurate filter, yet very fast and economic in resources. OSBF-Lua is open source software available from http://osbf-lua.luaforge.net/. spamfilter.lua is a production-class anti-spam filter available in the same package.
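
As a rough illustration of the general idea behind orthogonal sparse bigrams, the sketch below pairs each token with each of the few tokens preceding it and records how many tokens were skipped in between. It is written in Python for brevity and is not taken from OSBF-Lua; the function name `osb_features`, the tuple representation, and the default window of 5 are assumptions made for this example. EDDC feature selection is not shown.

```python
from typing import List, Tuple

# (earlier_token, number_of_skipped_tokens, later_token)
OsbFeature = Tuple[str, int, str]


def osb_features(tokens: List[str], window: int = 5) -> List[OsbFeature]:
    """Pair every token with each earlier token inside a sliding window."""
    features = []
    for i, right in enumerate(tokens):
        for distance in range(1, window):
            j = i - distance
            if j < 0:
                break  # ran off the start of the message
            # distance - 1 is the number of tokens skipped between the pair
            features.append((tokens[j], distance - 1, right))
    return features
```

Each token contributes at most `window - 1` features, so the feature count grows linearly with message length rather than with the number of all possible word pairs.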

Bibtex
@inproceedings{DBLP:conf/trec/Assis06,
    author = {Fidelis Assis},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {OSBF-Lua - {A} Text Classification Module for Lua: The Importance of the Training Method},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/fidelis-assis.spam.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Assis06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Towards Practical PPM Spam Filtering: Experiments for the TREC 2006 Spam Track

Andrej Bratko, Bogdan Filipic, Blaz Zupan

Abstract

This paper summarizes our participation in the TREC 2006 spam track. We submitted a single filter for the evaluation, based on the Prediction by Partial Matching compression scheme, a method that performed well in the previous TREC evaluation. A major focus of our effort was to improve efficiency of the method, particularly in terms of memory consumption, in order to establish whether compression-based filters are in fact a viable solution for practical applications. Our system exhibited fair performance, despite the fact that the filtering techniques remained virtually unchanged from the previous evaluation. We did not investigate methods for tackling delayed user feedback. A very simple strategy of training on most recent examples was used for the active learning task, and found to work surprisingly well given its simplicity.

Bibtex
@inproceedings{DBLP:conf/trec/BratkoFZ06,
    author = {Andrej Bratko and Bogdan Filipic and Blaz Zupan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Towards Practical {PPM} Spam Filtering: Experiments for the {TREC} 2006 Spam Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/jozef-stefan.spam.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BratkoFZ06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Highly Scalable Discriminative Spam Filtering

Michael Brückner, Peter Haider, Tobias Scheffer

Abstract

This paper discusses several lessons learned from the SpamTREC 2006 challenge. We discuss issues related to decoding, preprocessing, and tokenization of email messages. Using the Winnow algorithm with orthogonal sparse bigram features, we construct an efficient, highly scalable incremental classifier, trained to maximize a discriminative optimization criterion. The algorithm easily scales to millions of training messages and millions of features. We address the composition of training corpora and discuss experiments that guide the construction of our SpamTREC entry. We describe our submission for the filtering tasks with periodical re-training and active learning strategies, and report on the evaluation on the publicly available corpora.
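
For readers unfamiliar with Winnow, the following is a minimal textbook-style sketch of its multiplicative, mistake-driven update on sparse binary features. It is not the authors' implementation: the class name, the promotion and demotion factors, and the choice of comparing the mean active weight against 1.0 are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable


class Winnow:
    """Mistake-driven linear classifier with multiplicative weight updates."""

    def __init__(self, promotion: float = 1.5, demotion: float = 0.5) -> None:
        self.weights: Dict[Hashable, float] = {}
        self.promotion = promotion  # assumed value, > 1
        self.demotion = demotion    # assumed value, < 1

    def score(self, features: Iterable[Hashable]) -> float:
        """Mean weight of the active features; unseen features weigh 1.0."""
        active = set(features)
        if not active:
            return 1.0
        return sum(self.weights.get(f, 1.0) for f in active) / len(active)

    def predict(self, features: Iterable[Hashable]) -> bool:
        """True means spam."""
        return self.score(features) > 1.0

    def update(self, features: Iterable[Hashable], is_spam: bool) -> None:
        """Change weights only when the current prediction is wrong."""
        active = set(features)
        if self.predict(active) == is_spam:
            return
        factor = self.promotion if is_spam else self.demotion
        for f in active:
            self.weights[f] = self.weights.get(f, 1.0) * factor
```

Because an update touches only the features present in the misclassified message, the cost per message is independent of the total number of features seen so far.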

Bibtex
@inproceedings{DBLP:conf/trec/BrucknerHS06,
    author = {Michael Br{\"{u}}ckner and Peter Haider and Tobias Scheffer},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Highly Scalable Discriminative Spam Filtering},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/humboldtu.spam.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BrucknerHS06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers

David Sculley, Gabriel Wachman, Carla E. Brodley

Abstract

Contemporary spammers commonly seek to defeat statistical spam filters through the use of word obfuscation. Such methods include character level substitutions, repetitions, and insertions to reduce the effectiveness of word-based features. We present an efficient method for combating obfuscation through the use of inexact string matching kernels, which were first developed to measure similarity among mutating genes in computational biology. Our system avoids the high classification costs associated with these kernel methods by working in an explicit feature space, and employs the Perceptron Algorithm using Margins for fast on-line training. No prior domain knowledge was incorporated into this system. We report strong experimental results on the TREC 2006 spam data sets and on other publicly available spam data, including near-perfect performance on the TREC 2006 Chinese spam data set. These results invite further exploration of the use of inexact string matching for spam filtering.
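
The combination described in this abstract, explicit inexact-match features feeding an on-line linear classifier, might be sketched as follows. This Python example uses character k-mers with one wildcard position as a simple stand-in for inexact matching, and a perceptron that updates whenever an example falls inside a margin. The feature map, the names `wildcard_kmers` and `MarginPerceptron`, and the values of `k`, `margin`, and `rate` are assumptions for illustration, not the authors' configuration.

```python
from collections import Counter
from typing import Dict

WILDCARD = "\x00"  # assumed placeholder character, unlikely to occur in message text


def wildcard_kmers(text: str, k: int = 4) -> Counter:
    """Count each k-mer plus every variant with one position masked."""
    feats: Counter = Counter()
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        feats[kmer] += 1
        for pos in range(k):
            feats[kmer[:pos] + WILDCARD + kmer[pos + 1:]] += 1
    return feats


class MarginPerceptron:
    """Perceptron that also updates on correct but low-margin examples."""

    def __init__(self, margin: float = 1.0, rate: float = 0.1) -> None:
        self.w: Dict[str, float] = {}
        self.margin = margin
        self.rate = rate

    def score(self, feats: Counter) -> float:
        return sum(self.w.get(f, 0.0) * v for f, v in feats.items())

    def update(self, feats: Counter, is_spam: bool) -> None:
        y = 1.0 if is_spam else -1.0
        if y * self.score(feats) <= self.margin:
            for f, v in feats.items():
                self.w[f] = self.w.get(f, 0.0) + self.rate * y * v
```

With this feature map, "viagra" and the obfuscated "v1agra" share several masked k-mers, so weight learned on one carries over to the other.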

Bibtex
@inproceedings{DBLP:conf/trec/SculleyWB06,
    author = {David Sculley and Gabriel Wachman and Carla E. Brodley},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/tuftsu.spam.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/SculleyWB06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

SVM-Based Spam Filter with Active and Online Learning

Qiang Wang, Yi Guan, Xiaolong Wang

Abstract

A realistic classification model for spam filtering should not only take account of the fact that spam evolves over time, but also that labeling a large number of examples for initial training can be expensive in terms of both time and money. This paper addresses the problem of separating legitimate emails from unsolicited ones with active and online learning algorithms, using a Support Vector Machine (SVM) as the base classifier. We evaluate its effectiveness using a set of goodness criteria on the TREC 2006 spam filtering benchmark datasets, and promising results are reported.

Bibtex
@inproceedings{DBLP:conf/trec/WangGW06,
    author = {Qiang Wang and Yi Guan and Xiaolong Wang},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {SVM-Based Spam Filter with Active and Online Learning},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/hit.spam.final.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WangGW06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

BUPT at TREC 2006: Spam Track

Zhen Yang, Wei Xu, Bo Chen, Weiran Xu, Jun Guo

Abstract

This report summarizes our participation in the TREC 2006 spam track, in which we consider the use of Bayesian models for the spam filtering task. First, our anti-spam filter, Kidult, is briefly introduced. We then use weighted adjustment of the separating hyperplane and a selective classifier ensemble to improve filtering performance. Finally, we summarize the relevant results from the official evaluation.

Bibtex
@inproceedings{DBLP:conf/trec/YangXCXG06,
    author = {Zhen Yang and Wei Xu and Bo Chen and Weiran Xu and Jun Guo},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{BUPT} at {TREC} 2006: Spam Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/beijing-upt.spam.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/YangXCXG06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Seven Hypothesis about Spam Filtering

William S. Yerazunis

Abstract

For TREC 2006, the CRM114 team considered several different hypotheses on the topic of spam filtering. The hypotheses were that: (1) spammers were changing tactics to successfully evade content-based spam filters; (2) a pretrained database of known spam and nonspam improves overall accuracy; (3) repeated training methods are more effective than single-pass Train Only Errors training; (4) KNN/Hyperspace classifiers are more effective than classical Bayesian or Markovian classifiers; (5) delaying feedback learning results in degraded filter accuracy; (6) bit-entropy filters are as good or better than tokenizing filters; and, after the fact, (7) 1-ROCA% is the best figure of merit for spam filters. Of these hypotheses, we found that spammers were not significantly able to evade content-based spam filters, that pretraining is probably not helpful, that repeated-pass training is not significantly helpful, that KNNs are of roughly equal accuracy to computationally and storage-equivalent Markov classifiers, that delayed feedback is only marginal in impacting filter accuracy, and that despite their highly counterintuitive design, bit-entropic filters are capable of similar or better accuracy to tokenizing filters. We also found a fascinating counter-correlation between 1-ROCA% and the final accuracy of a filter (the accuracy of the filter for the final 10% of the corpus).

Bibtex
@inproceedings{DBLP:conf/trec/Yerazunis06,
    author = {William S. Yerazunis},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Seven Hypothesis about Spam Filtering},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/crm.spam.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Yerazunis06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}