
Proceedings - Legal 2008

Overview of the TREC 2008 Legal Track

Douglas W. Oard, Björn Hedin, Stephen Tomlinson, Jason R. Baron

Abstract

TREC 2008 was the third year of the Legal Track, which focuses on evaluation of search technology for discovery of electronically stored information in litigation and regulatory settings. The track included three tasks: Ad Hoc (i.e., single-pass automatic search), Relevance Feedback (two-pass search in a controlled setting with some relevant and nonrelevant documents manually marked after the first pass) and Interactive (in which real users could iteratively refine their queries and/or engage in multi-pass relevance feedback). This paper describes the design of the three tasks and presents the official results.

Bibtex
@inproceedings{DBLP:conf/trec/OardHTB08,
    author = {Douglas W. Oard and Bj{\"{o}}rn Hedin and Stephen Tomlinson and Jason R. Baron},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Overview of the {TREC} 2008 Legal Track},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/OardHTB08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Iowa at TREC 2008 Legal and Relevance Feedback Tracks

Brian Almquist, Yelena Mejova, Viet Ha-Thuc, Padmini Srinivasan

Abstract

The University of Iowa team, coordinated by Padmini Srinivasan, participated in the legal discovery and relevance feedback tracks of TREC 2008. This is our second year participating in the legal track and our first year in the relevance feedback track.

Bibtex
@inproceedings{DBLP:conf/trec/AlmquistMHS08,
    author = {Brian Almquist and Yelena Mejova and Viet Ha{-}Thuc and Padmini Srinivasan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Iowa at {TREC} 2008 Legal and Relevance Feedback Tracks},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/uiowa.legal.rf.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/AlmquistMHS08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

CNIPA, FUB and University of Rome "Tor Vergata" at TREC 2008 Legal Track

Gianni Amati, Marco Bianchi, Mauro Draoli, Alessandro Celi, Giorgio Gambosi, Giovanni Stilo

Abstract

The TREC Legal track was introduced in TREC 2006 with the stated purpose of evaluating the efficacy of automated support for review and production of electronic records in the context of litigation, regulation and legislation. The 2008 TREC Legal track ran three tasks: (1) an automatic ad hoc task, (2) an automatic relevance feedback task, and (3) an interactive task. We took part only in the automatic ad hoc task and focused on the following issues:

1. Indexing. The CDIP test collection is characterized by a large number of unique terms due to OCR mistakes. We defined a term selection strategy to reduce the number of terms, as described in Section 2.

2. Querying. Analysis of past TREC Legal track results showed that the best retrieval strategy essentially returned a ranked list of the Boolean-retrieved documents. Consequently, we defined a strategy aimed at boosting the scores of documents satisfying the final negotiated Boolean query (see the sketch after this abstract). Furthermore, we defined a method for automatically constructing a weighted query from the request text, as reported in Section 3.

3. Estimation of the K value. We used a query performance prediction approach to estimate K values. The query weighting model that we adopted is described in Section 4.

Submitted runs and their evaluation are reported in Section 5.
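
The score-boosting idea in point 2 can be illustrated with a short sketch. This is a minimal illustration, not the authors' implementation: the function name, the additive boost constant, and the toy scores are all hypothetical.

# Illustrative sketch of score boosting for documents matching the
# negotiated Boolean query. All names and the boost constant are
# hypothetical; the paper's Section 3 describes the actual strategy.

def boost_boolean_matches(scored_docs, boolean_matches, boost=10.0):
    """scored_docs: dict mapping doc_id -> retrieval score.
    boolean_matches: set of doc_ids satisfying the Boolean query.
    Returns (doc_id, score) pairs sorted so that Boolean-matching
    documents are promoted ahead of non-matching ones."""
    boosted = {
        doc_id: score + (boost if doc_id in boolean_matches else 0.0)
        for doc_id, score in scored_docs.items()
    }
    return sorted(boosted.items(), key=lambda kv: kv[1], reverse=True)

# Tiny usage example with made-up scores:
ranking = {"d1": 2.1, "d2": 3.4, "d3": 1.7}
matches = {"d1", "d3"}
print(boost_boolean_matches(ranking, matches))
# d1 and d3 now outrank d2 despite d2's higher base score.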

Bibtex
@inproceedings{DBLP:conf/trec/AmatiBDCGS08,
    author = {Gianni Amati and Marco Bianchi and Mauro Draoli and Alessandro Celi and Giorgio Gambosi and Giovanni Stilo},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {CNIPA, {FUB} and University of Rome "Tor Vergata" at {TREC} 2008 Legal Track},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/cnipa.legal.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/AmatiBDCGS08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Where to Stop Reading a Ranked List?

Avi Arampatzis, Jaap Kamps

Abstract

We document our participation in the TREC 2008 Legal Track. This year we focused solely on selecting rank cut-offs for optimizing the given evaluation measure per topic.
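
As a rough illustration of per-topic cut-off selection, the sketch below computes the expected F1 at every rank and returns the maximizing K. It assumes, hypothetically, that an estimated probability of relevance is available for each ranked document and that the total number of relevant documents has been estimated separately; the paper derives such quantities from score distributions, so its actual method differs in detail.

# Minimal sketch of choosing a rank cut-off K that maximizes expected
# F1@K, given (hypothetical) per-rank relevance probabilities and an
# estimate of the number of relevant documents in the collection.

def best_cutoff(probs, total_relevant):
    """probs: list of estimated P(relevant) in rank order.
    total_relevant: estimated number of relevant docs overall.
    Returns (K, expected F1@K) maximizing expected F1."""
    best_k, best_f1 = 0, 0.0
    expected_rel = 0.0  # expected relevant docs among the top k
    for k, p in enumerate(probs, start=1):
        expected_rel += p
        precision = expected_rel / k
        recall = expected_rel / total_relevant
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best_k, best_f1 = k, f1
    return best_k, best_f1

# Usage with made-up probabilities for a 5-document ranking:
print(best_cutoff([0.9, 0.8, 0.4, 0.2, 0.1], total_relevant=3))
# -> (3, 0.7): reading past rank 3 is expected to hurt F1.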

Bibtex
@inproceedings{DBLP:conf/trec/ArampatzisK08,
    author = {Avi Arampatzis and Jaap Kamps},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Where to Stop Reading a Ranked List?},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/uamsterdam-kamps.legal.rev.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ArampatzisK08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

H5 at TREC 2008 Legal Interactive: User Modeling, Assessment & Measurement

Christopher Hogan, Dan Brassil, Shana M. Rugani, Jennifer Reinhart, Misti Gerber, Teresa Jade

Abstract

Treating information retrieval as a classification task has been shown to be among the most effective ways to achieve high performance. In this paper, we describe a hybrid human-computer system that addresses the problem of achieving high performance on IR tasks by systematically and replicably creating large numbers of document assessments. We demonstrate how User Modeling, Document Assessment and Measurement combine to provide a shared understanding of relevance, a means for representing that understanding to an automated system, and a mechanism for iterating on and correcting such a system so as to converge on a desired result.

Bibtex
@inproceedings{DBLP:conf/trec/HoganBRRGJ08,
    author = {Christopher Hogan and Dan Brassil and Shana M. Rugani and Jennifer Reinhart and Misti Gerber and Teresa Jade},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{H5} at {TREC} 2008 Legal Interactive: User Modeling, Assessment {\&} Measurement},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/h5.legal.rev.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/HoganBRRGJ08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Distributed EDLSI, BM25, and Power Norm at TREC 2008

April Kontostathis, Andrew Lilly, Raymond J. Spiteri

Abstract

This paper describes our participation in the TREC Legal competition in 2008. Our first set of experiments involved the use of Latent Semantic Indexing (LSI) with a small number of dimensions, a technique we refer to as Essential Dimensions of Latent Semantic Indexing (EDLSI). Because the experimental dataset is large, we designed a distributed version of EDLSI to use for our submitted runs. We submitted two runs using distributed EDLSI, one with k = 10 and another with k = 41, where k is the dimensionality reduction parameter for LSI. We also submitted a traditional vector space baseline for comparison with the EDLSI results. This article describes our experimental design and the results of these experiments. We find that EDLSI clearly outperforms traditional vector space retrieval using a variety of TREC reporting metrics. We also describe experiments that were designed as a follow-up to our TREC Legal 2007 submission. These experiments test weighting and normalization schemes as well as techniques for relevance feedback. Our primary intent was to compare the BM25 weighting scheme to our power normalization technique. BM25 outperformed all of our other submissions on the competition metric (F1 at K) for both the ad hoc and relevance feedback tasks, but power normalization outperformed BM25 in our ad hoc experiments when the 2007 metric (estimated recall at B) was used for comparison.
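
A minimal sketch of the EDLSI score combination may help: EDLSI combines a rank-k LSI similarity with the traditional vector-space similarity as a weighted sum, score = x * LSI_k + (1 - x) * VSM. The weighting factor x, the toy matrix sizes, and the function name below are illustrative, and the sketch is single-machine rather than distributed.

# Sketch of EDLSI scoring: a convex combination of truncated-SVD (LSI)
# similarity with ordinary vector-space similarity. The weighting
# factor x and matrix sizes here are illustrative; the submitted runs
# used k = 10 and k = 41 on a distributed index.
import numpy as np

def edlsi_scores(term_doc, query, k=10, x=0.2):
    """term_doc: terms x docs matrix; query: terms vector.
    Returns one combined score per document."""
    # Traditional vector-space scores (inner products).
    vsm = query @ term_doc
    # Rank-k LSI: project docs and query into the top-k singular space.
    U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, Sk, Vtk = U[:, :k], S[:k], Vt[:k, :]
    q_k = Uk.T @ query                 # query in LSI space
    docs_k = np.diag(Sk) @ Vtk         # docs in LSI space
    lsi = q_k @ docs_k
    return x * lsi + (1.0 - x) * vsm   # EDLSI combination

rng = np.random.default_rng(0)
A = rng.random((50, 20))               # toy 50-term, 20-doc collection
q = rng.random(50)
print(edlsi_scores(A, q, k=10).shape)  # one score per document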

Bibtex
@inproceedings{DBLP:conf/trec/KontostathisLS08,
    author = {April Kontostathis and Andrew Lilly and Raymond J. Spiteri},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Distributed EDLSI, BM25, and Power Norm at {TREC} 2008},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/ursinus-college.legal.rev.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/KontostathisLS08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

MultiText Legal Experiments at TREC 2008

Thomas R. Lynam, Gordon V. Cormack

Abstract

Our TREC 2008 effort used fusion IR methods identical to those used in our TREC 2007 effort; in addition, we used logistic regression to attempt to learn the optimal K value for the primary F1@K measure introduced at TREC 2008. We used the Wumpus search engine, combining several methods that have proven successful, including cover density ranking, Okapi BM25 ranking, and score-combination methods. Stepwise logistic regression was used to estimate K, using TREC 2007 results as training data.
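
For readers unfamiliar with fusion IR, the sketch below shows one standard score-fusion scheme, CombMNZ: normalize scores per run, sum them, and multiply by the number of runs that retrieved the document. It illustrates the general flavor of combining rankings only; it is not necessarily the exact combination implemented in Wumpus, and all scores shown are made up.

# CombMNZ-style score fusion across several runs (toy illustration).

def comb_mnz(runs):
    """runs: list of dicts mapping doc_id -> score, one per system.
    Scores are min-max normalized per run, summed, and multiplied by
    the number of runs that retrieved the document."""
    fused = {}
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0
        for doc, score in run.items():
            norm = (score - lo) / span
            total, hits = fused.get(doc, (0.0, 0))
            fused[doc] = (total + norm, hits + 1)
    return sorted(
        ((doc, total * hits) for doc, (total, hits) in fused.items()),
        key=lambda kv: kv[1], reverse=True,
    )

# Fusing a BM25-style run with a cover-density-style run (toy scores):
bm25 = {"d1": 12.0, "d2": 7.5, "d3": 3.1}
cover = {"d2": 0.9, "d3": 0.7, "d4": 0.4}
print(comb_mnz([bm25, cover]))  # d2 and d3 benefit from appearing in both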

Bibtex
@inproceedings{DBLP:conf/trec/LynamC08,
    author = {Thomas R. Lynam and Gordon V. Cormack},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {MultiText Legal Experiments at {TREC} 2008},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/uwaterloo-cormack.legal.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LynamC08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Experiments with the Negotiated Boolean Queries of the TREC 2008 Legal Track

Stephen Tomlinson

Abstract

We analyze the results of several experimental runs submitted for the TREC 2008 Legal Track. In the Ad Hoc task, we found that rank-based merging of vector results with the reference Boolean results produced a statistically significant increase in mean F1@K and Recall@B compared to just using the reference Boolean results. In the Relevance Feedback task, we found that the investigated relevance feedback technique, when merged with the reference Boolean results, produced some substantial increases in Recall@Br without any substantial decreases on individual topics.
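
A small sketch of rank-based merging may clarify the Ad Hoc result. Here two runs are merged by summed rank, with documents absent from a run penalized by a rank just past its end; this is one common realization, and the paper's exact merge rule may differ.

# Rank-based merging of a vector run with the reference Boolean run
# (illustrative merge rule: sum of ranks, lower is better).

def rank_merge(run_a, run_b):
    """run_a, run_b: lists of doc_ids in rank order.
    Returns doc_ids ordered by summed rank."""
    def ranks(run):
        return {doc: i for i, doc in enumerate(run, start=1)}
    ra, rb = ranks(run_a), ranks(run_b)
    docs = set(ra) | set(rb)
    penalty_a, penalty_b = len(run_a) + 1, len(run_b) + 1
    return sorted(
        docs,
        key=lambda d: ra.get(d, penalty_a) + rb.get(d, penalty_b),
    )

vector_run = ["d3", "d1", "d5"]
boolean_run = ["d1", "d2", "d3"]
print(rank_merge(vector_run, boolean_run))
# ['d1', 'd3', 'd2', 'd5']: docs found by both runs lead the merge.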

Bibtex
@inproceedings{DBLP:conf/trec/Tomlinson08,
    author = {Stephen Tomlinson},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Experiments with the Negotiated Boolean Queries of the {TREC} 2008 Legal Track},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/open-text.legal.rev.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Tomlinson08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Query Expansion for Noisy Legal Documents

Lidan Wang, Douglas W. Oard

Abstract

The vocabulary of the TREC Legal OCR collection is noisy and huge. Standard techniques for improving retrieval performance, such as content-based query expansion, are ineffective for such a collection. In our work, we focused on exploiting metadata using blind relevance feedback, iterative improvement from the reference Boolean run, and the effects of using terms from different topic fields for automatic query formulation. This paper describes our methodologies and results.
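
The blind relevance feedback step can be sketched generically: assume the top-ranked documents from a first pass are relevant, and expand the query with their most frequent new terms. The sketch below is a plain-text, Rocchio-flavored illustration with made-up tokens; the paper applies the idea to metadata fields of the OCR collection.

# Generic blind-relevance-feedback sketch: take terms from the top of
# the first-pass ranking and append the most frequent new ones to the
# query. Tokens and parameters here are hypothetical.
from collections import Counter

def expand_query(query_terms, top_docs, n_terms=5):
    """query_terms: list of strings; top_docs: list of token lists
    from the top-ranked first-pass documents. Returns the expanded
    query as a list of terms."""
    counts = Counter()
    for doc in top_docs:
        counts.update(doc)
    for term in query_terms:        # do not re-add original terms
        counts.pop(term, None)
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return query_terms + expansion

first_pass = [["tobacco", "memo", "nicotine"],
              ["nicotine", "marketing", "memo"]]
print(expand_query(["tobacco"], first_pass, n_terms=2))
# ['tobacco', 'memo', 'nicotine']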

Bibtex
@inproceedings{DBLP:conf/trec/WangO08,
    author = {Lidan Wang and Douglas W. Oard},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Query Expansion for Noisy Legal Documents},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/umd-cp.legal.rev.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WangO08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC 2008 at the University at Buffalo: Legal and Blog Track

Jianqiang Wang, Ying Sun, Omar Mukhtar, Rohini K. Srihari

Abstract

At TREC 2008, the team from the State University of New York at Buffalo participated in the Legal track and the Blog track. For the Legal track, we worked on the interactive search task using the Web-based Legacy Tobacco Document Library Boolean search system. Our experiment achieved reasonable precision but suffered significantly from low recall. These results, together with the appeal and adjudication results, suggest that the concept of document relevance in legal e-discovery deserves further investigation. For the Blog distillation task, our official runs were based on a reduced document model in which only text from the several most content-bearing fields was indexed. This approach yielded encouraging retrieval effectiveness while significantly decreasing the index size. We also studied query independence/dependence and link-based features for finding relevant feeds. For the Blog opinion and polarity tasks, we mainly investigated the usefulness of opinionated words contained in the SentiGI lexicon (see the sketch below). Our experimental results showed that the effectiveness of this technique is quite limited, indicating that more sophisticated techniques are needed.
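
The lexicon-based opinion scoring mentioned for the Blog opinion task can be sketched simply: count how many tokens of a post appear in an opinion-word lexicon and score the post by hit density. The lexicon contents and scoring function below are toy stand-ins, not the actual SentiGI lexicon or the team's scoring.

# Toy lexicon-based opinion scoring: fraction of tokens that match
# an opinion-word lexicon. The lexicon here is made up.

def opinion_score(tokens, lexicon):
    """Fraction of tokens appearing in the opinion lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in lexicon)
    return hits / len(tokens)

lexicon = {"great", "terrible", "love", "hate"}   # toy stand-in
post = "I love this phone but the battery is terrible".split()
print(opinion_score(post, lexicon))  # 2 hits / 9 tokens = 0.222...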

Bibtex
@inproceedings{DBLP:conf/trec/WangSMS08,
    author = {Jianqiang Wang and Ying Sun and Omar Mukhtar and Rohini K. Srihari},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{TREC} 2008 at the University at Buffalo: Legal and Blog Track},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/suny-buffalo.legal.blog.rev.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WangSMS08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Pitt@TREC08: An Initial Study of Collaborative Information Behavior in E-Discovery

Zhen Yue, Jon Walker, Yi-Ling Lin, Daqing He

Abstract

The University of Pittsburgh team participated in the interactive task of the Legal track in TREC 2008. We designed an experiment to investigate the collaborative information behavior (CIB) of a group of people working on the e-discovery task provided by the Legal track. Through the study, we identified three major characteristics of CIB in e-discovery: 1) communication among participants is frequent; 2) division of labor is common; and 3) "awareness" is important among participants. Based on these insights, we also propose a set of essential technologies and functions for retrieval systems that support CIB.

Bibtex
@inproceedings{DBLP:conf/trec/YueWLH08,
    author = {Zhen Yue and Jon Walker and Yi{-}Ling Lin and Daqing He},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Pitt@TREC08: An Initial Study of Collaborative Information Behavior in E-Discovery},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/upittsburgh.legal.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/YueWLH08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

RMIT University at TREC 2008: Legal Track

Ying Zhang, Falk Scholer, Andrew Turpin

Abstract

This paper reports on the participation of RMIT University in the 2008 TREC Legal Track Ad Hoc task. OCR errors can corrupt the document view formed by an information retrieval system and substantially hinder the successful retrieval of relevant documents for user queries. In previous research, the presence of errors in OCR text was observed to lead to unstable and unpredictable retrieval effectiveness. In this study, we investigate the effects of OCR error minimization (through de-hyphenation of terms and the removal of corrupted or "noise" terms) on retrieval performance; a sketch of both steps follows this abstract. Our results indicate that removing noise terms can lead to significant savings in index size.
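
The two clean-up steps named in the abstract can be sketched as follows. The de-hyphenation regex and the noise heuristics (term length and non-alphabetic ratio) are illustrative assumptions; the paper evaluates its own criteria.

# Sketch of OCR clean-up: rejoin terms hyphenated across line breaks,
# then drop implausible "noise" terms. Thresholds are illustrative.
import re

def dehyphenate(text):
    """Rejoin words split across lines, e.g. 'infor-\\nmation'."""
    return re.sub(r"(\w+)-[ \t]*\n[ \t]*(\w+)", r"\1\2", text)

def is_noise(term, max_len=20, max_nonalpha=0.5):
    """Flag terms that are implausibly long or mostly non-letters."""
    if len(term) > max_len:
        return True
    nonalpha = sum(1 for c in term if not c.isalpha())
    return nonalpha / len(term) > max_nonalpha

text = "The infor-\nmation was x7#q9z2 clear"
clean = dehyphenate(text)
tokens = [t for t in clean.split() if not is_noise(t)]
print(tokens)  # ['The', 'information', 'was', 'clear']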

Bibtex
@inproceedings{DBLP:conf/trec/ZhangST08,
    author = {Ying Zhang and Falk Scholer and Andrew Turpin},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{RMIT} University at {TREC} 2008: Legal Track},
    booktitle = {Proceedings of The Seventeenth Text REtrieval Conference, {TREC} 2008, Gaithersburg, Maryland, USA, November 18-21, 2008},
    series = {{NIST} Special Publication},
    volume = {500-277},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2008},
    url = {http://trec.nist.gov/pubs/trec17/papers/rmit.legal.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ZhangST08.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}