Proceedings - Spam 2005

TREC 2005 Spam Track Overview

Gordon V. Cormack, Thomas R. Lynam

Abstract

The robust retrieval track explores methods for improving the consistency of retrieval technology by focusing on poorly performing topics. The retrieval task in the track is a traditional ad hoc retrieval task where the evaluation methodology emphasizes a system's least effective topics. The 2005 edition of the track used 50 topics that had been demonstrated to be difficult on one document collection, and ran those topics on a different document collection. Relevance information from the first collection could be exploited in producing a query for the second collection, if desired. The main measure for evaluating system effectiveness is “gmap”, a variant of the traditional MAP measure that uses a geometric mean rather than an arithmetic mean to average individual topic results. As in previous years, the most effective retrieval strategy was to expand queries using terms derived from additional corpora. The relative difficulty of topics differed across the two document sets. Systems were also required to rank the topics by predicted difficulty. This task is motivated by the hope that systems will eventually be able to use such predictions to do topic-specific processing. This remains a challenging task. Since difficulty depends on more than the topic set alone, prediction methods that train on data from other test collections do not generalize well.

Bibtex
@inproceedings{DBLP:conf/trec/CormackL05,
    author = {Gordon V. Cormack and Thomas R. Lynam},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{TREC} 2005 Spam Track Overview},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/SPAM.OVERVIEW.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CormackL05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

CRM114 versus Mr. X: CRM114 Notes for the TREC 2005 Spam Track

Fidelis Assis, William S. Yerazunis, Christian Siefkes, Shalendra Chhabra

Abstract

This paper discusses the design decisions underlying the CRM114 Discriminator software, how it can be configured as a spam filter, and what we may glean from the preliminary TREC 2005 results. Unlike most other filters, CRM114 is not a fixed-purpose antispam filter; rather, it's a general purpose language meant to expedite the creation of text filters. The pluggable CRM114 architecture allows rapid prototyping and easy support of multiple classifier engines; rather than testing different cutoff parameters, the CRM114 TREC test set tested different classifier algorithms and learning protocols.

Bibtex
@inproceedings{DBLP:conf/trec/AssisYSC05,
    author = {Fidelis Assis and William S. Yerazunis and Christian Siefkes and Shalendra Chhabra},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{CRM114} versus Mr. {X:} {CRM114} Notes for the {TREC} 2005 Spam Track},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/crm.spam.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/AssisYSC05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

DBACL at the TREC 2005

L. A. Breyer

Abstract

The dbacl classifier is an open source command line tool which performs spam classification based on simple maximum entropy models; dbacl learns each class separately, and individual categories can be mixed and matched for classification. Here we present the simulation results obtained for TREC 2005, with an empirical comparison of several feature extraction methods. We also try to gain insight into their different performance characteristics, with limited success.

Bibtex
@inproceedings{DBLP:conf/trec/Breyer05,
    author = {L. A. Breyer},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{DBACL} at the {TREC} 2005},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/breyer.laird-spam.ps},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Breyer05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Spam Filtering Using Character-Level Markov Models: Experiments for the TREC 2005 Spam Track

Andrej Bratko, Bogdan Filipic

Abstract

This paper summarizes our participation in the TREC 2005 spam track, in which we consider the use of adaptive statistical data compression models for the spam filtering task. The nature of these models allows them to be employed as Bayesian text classifiers based on character sequences. We experimented with two different compression algorithms under varying model parameters. All four filters that we submitted exhibited strong performance in the official evaluation, indicating that data compression models are well suited to the spam filtering problem.
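
The idea of classifying by character-sequence models can be sketched as follows. This is a minimal illustration only: it uses a fixed-order character Markov model with add-one smoothing in place of the adaptive compression algorithms the abstract refers to, and the class name `CharMarkovModel`, the order of 2, and the toy training strings are all assumptions made for the example. A message is assigned to whichever class model would encode it in fewer bits.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """Order-k character model with add-one smoothing (hypothetical helper)."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.context_counts = defaultdict(int)  # occurrences of each context
        self.symbol_counts = defaultdict(int)   # occurrences of (context, next char)

    def train(self, text):
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.context_counts[context] += 1
            self.symbol_counts[(context, padded[i])] += 1

    def code_length(self, text):
        """Bits needed to encode `text` under this model (its cross-entropy)."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            num = self.symbol_counts.get((context, padded[i]), 0) + 1
            den = self.context_counts.get(context, 0) + self.alphabet_size
            bits -= math.log2(num / den)
        return bits


def classify(message, spam_model, ham_model):
    """Label a message by whichever class model compresses it better."""
    if spam_model.code_length(message) < ham_model.code_length(message):
        return "spam"
    return "ham"


# Toy data, invented for illustration.
spam = CharMarkovModel()
ham = CharMarkovModel()
for text in ["buy cheap pills now", "cheap pills cheap prices"]:
    spam.train(text)
for text in ["meeting moved to monday", "see you at the meeting"]:
    ham.train(text)

print(classify("cheap pills", spam, ham))
print(classify("monday meeting", spam, ham))
```

Comparing code lengths is equivalent to comparing class-conditional log-likelihoods, which is what lets a compression model act as a Bayesian classifier (here with equal class priors assumed).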

Bibtex
@inproceedings{DBLP:conf/trec/BratkoF05,
    author = {Andrej Bratko and Bogdan Filipic},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Spam Filtering Using Character-Level Markov Models: Experiments for the {TREC} 2005 Spam Track},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/jozef-stefan.bratko.spam.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BratkoF05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

York University at TREC 2005: SPAM Track

Wei Cao, Aijun An, Xiangji Huang

Abstract

We propose a variant of the k-nearest neighbor classification method, called instance-weighted k-nearest neighbor method, for adaptive spam filtering. The method assigns two weights, distance weight and correctness weight, to a training instance, and makes use of the two weights when classifying a new email. The correctness weight is also used in the maintenance of the training data to make the training data more adaptive to the changes of spam characteristics. We submitted 4 spam filters to the Spam Track. Two of the filters are purely based on the instance-weighted kNN method. The two other filters combine the kNN method with other spam filtering and classification techniques. We report the official results of our submissions on the Spam Track evaluation data sets.
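
A rough sketch of how two per-instance weights might enter a kNN vote is given below. The abstract does not define either weight, so everything specific here is an assumption: the Jaccard distance over token sets, the `1 - distance` form of the distance weight, the multiplicative correctness update, and the `step` and `floor` parameters are illustrative choices, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    features: frozenset        # tokens in the stored message
    label: str                 # "spam" or "ham"
    correctness: float = 1.0   # assumed: starts neutral, adjusted by feedback


def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def nearest(training, features, k):
    scored = [(jaccard_distance(inst.features, features), inst) for inst in training]
    scored.sort(key=lambda pair: pair[0])  # sort on distance only
    return scored[:k]


def classify(training, features, k=3):
    votes = {"spam": 0.0, "ham": 0.0}
    for dist, inst in nearest(training, features, k):
        distance_weight = 1.0 - dist  # assumed form: closer neighbours count more
        votes[inst.label] += distance_weight * inst.correctness
    return "spam" if votes["spam"] > votes["ham"] else "ham"


def update(training, features, true_label, k=3, step=0.5, floor=0.2):
    """Feedback step: reward neighbours that agreed with the true label,
    penalise those that did not, drop unreliable ones, then store the message."""
    for _, inst in nearest(training, features, k):
        inst.correctness *= (1 + step) if inst.label == true_label else (1 - step)
    training[:] = [inst for inst in training if inst.correctness >= floor]
    training.append(Instance(frozenset(features), true_label))


# Toy data, invented for illustration.
training = [
    Instance(frozenset({"cheap", "pills", "buy"}), "spam"),
    Instance(frozenset({"cheap", "watches"}), "spam"),
    Instance(frozenset({"meeting", "monday"}), "ham"),
    Instance(frozenset({"project", "meeting", "notes"}), "ham"),
]
prediction = classify(training, {"cheap", "pills"})
update(training, {"cheap", "pills"}, "spam")
print(prediction, len(training))
```

Pruning instances whose correctness falls below a floor is one way the training set could track changing spam characteristics, in the spirit of the maintenance step the abstract mentions.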

Bibtex
@inproceedings{DBLP:conf/trec/CaoAH05,
    author = {Wei Cao and Aijun An and Xiangji Huang},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {York University at {TREC} 2005: {SPAM} Track},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/yorku-huang.spam.pdf},
    timestamp = {Sun, 02 Oct 2022 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/CaoAH05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

DalTREC 2005 Spam Track: Spam Filtering Using N-gram-based Techniques

Vlado Keselj, Evangelos E. Milios, Andrew Tuttle, Singer Wang, Roger Zhang

Abstract

We briefly describe the DalTREC 2005 Spam submission. DalTREC is the TREC research project at Dalhousie University. Four packages were submitted and they resulted in a median performance. The results are interesting and may be seen as positive in light of the simplicity of our approaches.

Bibtex
@inproceedings{DBLP:conf/trec/KeseljMTWZ05,
    author = {Vlado Keselj and Evangelos E. Milios and Andrew Tuttle and Singer Wang and Roger Zhang},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{DalTREC} 2005 Spam Track: Spam Filtering Using N-gram-based Techniques},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/dalhousieu.spam.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/KeseljMTWZ05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

A TREC Along the Spam Track with SpamBayes

Tony Andrew Meyer

Abstract

This paper describes the SpamBayes submissions made to the Spam Track of the 2005 Text Retrieval Conference (TREC). SpamBayes is briefly introduced, but the paper focuses more on how the submissions differ from the standard installation. Unlike in the majority of earlier publications evaluating the effectiveness of SpamBayes, the fundamental ‘unsure’ range is discussed, and the method of removing the range is outlined. Finally, an analysis of the results of running the four submissions through the Spam Track ‘jig’ with the three private corpora and one public corpus is made.

Bibtex
@inproceedings{DBLP:conf/trec/Meyer05,
    author = {Tony Andrew Meyer},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {A {TREC} Along the Spam Track with SpamBayes},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/masseyu.spam.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Meyer05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

IBM SpamGuru on the TREC 2005 Spam Track

Richard B. Segal

Abstract

IBM Research is developing an enterprise-class anti-spam filter as part of our overall strategy of attacking the Spam problem on multiple fronts. Our anti-spam filter, SpamGuru, mirrors this philosophy by incorporating several different filtering technologies and intelligently combining their output to produce a single spamminess rating. The use of multiple algorithms improves the system's effectiveness and makes it more difficult for spammers to attack. While our overall performance was strong, our results did uncover some flaws and weaknesses in our existing implementation. Our latest code, with these weaknesses addressed as well as other enhancements, produces results on par with the best performing classifiers reported for TREC 2005 on the public corpus.

Bibtex
@inproceedings{DBLP:conf/trec/Segal05,
    author = {Richard B. Segal},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{IBM} SpamGuru on the {TREC} 2005 Spam Track},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/ibm-segal.spam.pdf},
    timestamp = {Sun, 29 Aug 2021 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/Segal05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Simple Language Models for Spam Detection

Egidio Terra

Abstract

For this year's Spam track we used classifiers based on language models. These models are used to compute the log-likelihood for each individual message and then classify it as either ham or spam. Different data sets were used to train these language models. Our approach is simple: we initially create simple unigram language models and smooth the probabilities of unseen tokens by means of the expected likelihood estimator with a small discount probability tuned on a training corpus.
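
The log-likelihood comparison can be illustrated with a small sketch. It assumes additive smoothing with a constant `delta`, where 0.5 is the classic expected likelihood estimator value; the tuned discount the abstract mentions is not given, so `delta`, the `UnigramModel` class, the vocabulary-size convention, and the toy messages are assumptions for illustration.

```python
import math
from collections import Counter


class UnigramModel:
    """Unigram language model with additive smoothing (hypothetical helper)."""

    def __init__(self, delta=0.5):
        self.delta = delta      # smoothing constant; 0.5 is the classic ELE value
        self.counts = Counter()
        self.total = 0

    def train(self, tokens):
        self.counts.update(tokens)
        self.total += len(tokens)

    def log_prob(self, token, vocab_size):
        num = self.counts[token] + self.delta
        den = self.total + self.delta * vocab_size
        return math.log(num / den)

    def log_likelihood(self, tokens, vocab_size):
        return sum(self.log_prob(t, vocab_size) for t in tokens)


def classify(tokens, spam_model, ham_model):
    # Shared vocabulary across both models, plus one slot for unseen tokens.
    vocab_size = len(set(spam_model.counts) | set(ham_model.counts)) + 1
    spam_ll = spam_model.log_likelihood(tokens, vocab_size)
    ham_ll = ham_model.log_likelihood(tokens, vocab_size)
    return "spam" if spam_ll > ham_ll else "ham"


# Toy data, invented for illustration.
spam_model = UnigramModel()
ham_model = UnigramModel()
for text in ["buy cheap pills now", "cheap pills cheap prices"]:
    spam_model.train(text.split())
for text in ["meeting moved to monday", "see you at the meeting"]:
    ham_model.train(text.split())

print(classify("cheap pills today".split(), spam_model, ham_model))
print(classify("meeting monday".split(), spam_model, ham_model))
```

Because unseen tokens such as "today" still receive a small nonzero probability, a single unknown word does not drive either log-likelihood to negative infinity.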

Bibtex
@inproceedings{DBLP:conf/trec/Terra05,
    author = {Egidio Terra},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Simple Language Models for Spam Detection},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/pontificiau.spam.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Terra05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

Shuhua Wang, Bin Wang, Hao Lang, Xueqi Cheng

Abstract

This paper introduces our work in the TREC 2005 SPAM track. Naïve Bayes and Littlestone's Winnow are chosen as our basic classifiers. In our investigation, we found that when the structures of Ham and Spam are very different, their feature distributions vary a lot. Thus the factor of structure is introduced into our filter. Besides textual word features, several other kinds of features are also considered in our filter. Our experimental results show that Winnow outperforms Naïve Bayes and that the multi-feature model outperforms the structure-based model.
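
For readers unfamiliar with Winnow, a minimal sketch of the mistake-driven multiplicative update follows. It shows the common variant that promotes by multiplying and demotes by dividing by a factor `alpha` (often called Winnow2), over binary word features with a threshold of half the vocabulary size. The abstract does not state which variant, features, or parameters the filter used, so those choices and the toy data are assumptions.

```python
class Winnow:
    """Winnow over binary features with multiplicative promotion and demotion."""

    def __init__(self, vocabulary, alpha=2.0):
        self.weights = {word: 1.0 for word in vocabulary}  # all weights start at 1
        self.threshold = len(self.weights) / 2             # assumed threshold: n / 2
        self.alpha = alpha

    def score(self, features):
        # Sum the weights of active features; out-of-vocabulary words are ignored.
        return sum(self.weights[f] for f in features if f in self.weights)

    def predict(self, features):
        """Return True for spam, False for ham."""
        return self.score(features) > self.threshold

    def update(self, features, is_spam):
        if self.predict(features) == is_spam:
            return  # mistake-driven: weights change only after an error
        factor = self.alpha if is_spam else 1.0 / self.alpha
        for f in features:
            if f in self.weights:
                self.weights[f] *= factor


# Toy data, invented for illustration.
vocabulary = ["cheap", "pills", "buy", "meeting", "monday", "notes"]
examples = [
    ({"cheap", "pills"}, True),
    ({"buy", "cheap"}, True),
    ({"meeting", "monday"}, False),
    ({"meeting", "notes"}, False),
]

model = Winnow(vocabulary)
for _ in range(3):  # a few passes over the toy data
    for features, is_spam in examples:
        model.update(features, is_spam)

print(model.predict({"cheap", "pills"}), model.predict({"meeting", "notes"}))
```

Unlike Naïve Bayes, which re-estimates counts from every message, this update touches only the weights of features present in a misclassified message.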

Bibtex
@inproceedings{DBLP:conf/trec/WangWLC05,
    author = {Shuhua Wang and Bin Wang and Hao Lang and Xueqi Cheng},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{CAS-ICT} at {TREC} 2005 {SPAM} Track: Using Non-Textual Information to Improve Spam Filtering Performance},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/chinese-acad-sci-bin.spam.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WangWLC05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

PRIS Kidult Anti-SPAM Solution at the TREC 2005 Spam Track: Improving the Performance of Naive Bayes for Spam Detection

Zhen Yang, Weiran Xu, Bo Chen, Jiani Hu, Jun Guo

Abstract

Spam has recently become a serious problem for both e-mail users and Internet Service Providers (ISPs). Solutions to the abuse of spam would be both technical and legal or regulatory. This paper reports our solution for the TREC 2005 spam track, in which we consider the use of a Naive Bayes spam filter for its desirable properties (simplicity, low time and memory requirements, etc.). We then propose approaches that modify Naive Bayes by simply introducing weights and a classifier assembly based on a dynamic threshold, which can help to improve the accuracy of a Naive Bayes spam classifier dramatically. Additionally, we discuss some steps that were previously thought to be necessary, such as a stop list, word stemming, feature selection, and class prior probabilities. The theoretical analysis implies these steps are not necessarily the best way to extend the Bayesian classifier, and this was also verified empirically. Many of these techniques appear to be counterintuitive but can be explained by the statistical properties of e-mail itself. Experimental results on the TREC 2005 spam track demonstrate the effectiveness of the proposed method.

Bibtex
@inproceedings{DBLP:conf/trec/YangXCHG05,
    author = {Zhen Yang and Weiran Xu and Bo Chen and Jiani Hu and Jun Guo},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{PRIS} Kidult Anti-SPAM Solution at the {TREC} 2005 Spam Track: Improving the Performance of Naive Bayes for Spam Detection},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/beijingu-of-pt.spam.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/YangXCHG05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

WIDIT in TREC 2005 HARD, Robust, and SPAM Tracks

Kiduk Yang, Ning Yu, Nicholas George, Aaron Loehrlein, David McCaulay, Hui Zhang, Shahrier Akram, Jue Mei, Ivan Record

Abstract

Web Information Discovery Tool (WIDIT) Laboratory at the Indiana University School of Library and Information Science participated in the HARD, Robust, and SPAM tracks in TREC-2005. The basic approach of WIDIT is to combine multiple methods as well as to leverage multiple sources of evidence. Our main strategies for the tracks were: query expansion and fusion optimization for the HARD and Robust tracks; and combination of probabilistic, rule-based, pattern-based, and blacklist email filters for the SPAM track.

Bibtex
@inproceedings{DBLP:conf/trec/YangYGLMZAMR05,
    author = {Kiduk Yang and Ning Yu and Nicholas George and Aaron Loehrlein and David McCaulay and Hui Zhang and Shahrier Akram and Jue Mei and Ivan Record},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{WIDIT} in {TREC} 2005 {HARD}, Robust, and {SPAM} Tracks},
    booktitle = {Proceedings of the Fourteenth Text REtrieval Conference, {TREC} 2005, Gaithersburg, Maryland, USA, November 15-18, 2005},
    series = {{NIST} Special Publication},
    volume = {500-266},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2005},
    url = {http://trec.nist.gov/pubs/trec14/papers/indianau-bloom.hard.robust.spam.pdf},
    timestamp = {Wed, 29 Jun 2022 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/YangYGLMZAMR05.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}