Proceedings - Spoken Document Retrieval 1998

Retrieval Of Broadcast News Documents With the THISL System

Dave Abberley, Steve Renals, Gary D. Cook, Anthony J. Robinson

Abstract

This paper describes the THISL system that participated in the TREC-7 evaluation, Spoken Document Retrieval (SDR) Track, and presents the results obtained, together with some analysis. The THISL system is based on the ABBOT speech recognition system and the thislIR text retrieval system. In this evaluation we were concerned with investigating the suitability for SDR of a recognizer running at less than ten times real time, the use of multiple transcriptions and word graphs, the effect of simple query expansion algorithms, and the effect of varying standard IR parameters.
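
As an illustration of the kind of simple query expansion the abstract refers to, here is a minimal sketch of blind relevance feedback in Python; the term-selection rule (most frequent terms in the top-ranked documents) and all parameter values are illustrative assumptions, not details taken from the paper.

    from collections import Counter

    def expand_query(query_terms, ranked_docs, n_docs=10, n_terms=5):
        """Blind relevance feedback: assume the top-ranked documents are
        relevant and append their most frequent non-query terms."""
        counts = Counter()
        for doc in ranked_docs[:n_docs]:      # each doc: a list of index terms
            counts.update(t for t in doc if t not in query_terms)
        expansion = [term for term, _ in counts.most_common(n_terms)]
        return list(query_terms) + expansion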

Bibtex
@inproceedings{DBLP:conf/trec/AbberleyRCR98,
    author = {Dave Abberley and Steve Renals and Gary D. Cook and Anthony J. Robinson},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {Retrieval Of Broadcast News Documents With the {THISL} System},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {128--137},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/thisl-trec7.pdf.gz},
    timestamp = {Wed, 07 Jul 2021 16:44:22 +0200},
    biburl = {https://dblp.org/rec/conf/trec/AbberleyRCR98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

INQUERY and TREC-7

James Allan, James P. Callan, Mark Sanderson, Jinxi Xu, Steven Wegmann

Abstract

This year the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts participated in only four of the tracks that were part of the TREC-7 workshop. We worked on ad-hoc retrieval, filtering, VLC, and the SDR track. This report covers the work done on each track in turn. We start with a discussion of IR tools that were broadly applied in our work.

Bibtex
@inproceedings{DBLP:conf/trec/AllanCSXW98,
    author = {James Allan and James P. Callan and Mark Sanderson and Jinxi Xu and Steven Wegmann},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {{INQUERY} and {TREC-7}},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {148--163},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/umass-trec7.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/AllanCSXW98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TNO TREC7 Site Report: SDR and Filtering

Rudie Ekkelenkamp, Wessel Kraaij, David A. van Leeuwen

Abstract

This paper reports on experiments in the SDR and filtering tracks carried out at TNO-TPD and TNO-TM. TNO-TPD is also a member of the TwentyOne consortium and as such participated in the Ad Hoc task and the CLIR track. Those experiments are discussed in a separate paper (cf. [Hiemstra and Kraaij98]) elsewhere in this volume.

Bibtex
@inproceedings{DBLP:conf/trec/EkkelenkampKL98,
    author = {Rudie Ekkelenkamp and Wessel Kraaij and David A. van Leeuwen},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {{TNO} {TREC7} Site Report: {SDR} and Filtering},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {455--462},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/tnotrec7.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/EkkelenkampKL98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

AT&T at TREC-7

Amit Singhal, John Choi, Donald Hindle, David D. Lewis, Fernando C. N. Pereira

Abstract

This year AT&T participated in the ad-hoc task and the Filtering, SDR, and VLC tracks. Most of our effort for TREC-7 was concentrated on the SDR and VLC tracks. On the filtering track, we tested a preliminary version of a text classification toolkit that we have been developing over the last year. In the ad-hoc task, we introduce a new tf-factor in our term weighting scheme and use a simplified retrieval algorithm. The same weighting scheme and algorithm are used in the SDR and VLC tracks. The results from the SDR track show that retrieval from automatic transcriptions of speech is quite competitive with retrieval from human transcriptions. Our experiments indicate that document expansion can be used to further improve retrieval from automatic transcripts. The results of the filtering track are in line with our expectations, given the early developmental stage of our classification software. The results of the VLC track do not support our hypothesis that retrieval lists from a distributed search can be effectively merged using only the initial part of the documents.
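
To make the document expansion idea concrete, here is a minimal sketch; the nearest-neighbour retrieval step is assumed to have been done elsewhere, and selecting expansion terms by summed frequency is an illustrative assumption rather than the paper's actual formula.

    from collections import Counter

    def expand_document(doc_terms, neighbours, n_terms=20):
        """Augment an errorful automatic transcript with the most frequent
        new terms from its nearest neighbours, i.e. the documents a
        retrieval engine ranks closest to the transcript itself."""
        counts = Counter()
        for nb in neighbours:                 # each neighbour: a list of terms
            counts.update(t for t in nb if t not in doc_terms)
        extra = [term for term, _ in counts.most_common(n_terms)]
        return list(doc_terms) + extra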

Bibtex
@inproceedings{DBLP:conf/trec/SinghalCHLP98,
    author = {Amit Singhal and John Choi and Donald Hindle and David D. Lewis and Fernando C. N. Pereira},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {AT{\&}T at {TREC-7}},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {186--198},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/att.pdf.gz},
    timestamp = {Fri, 30 Aug 2019 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/SinghalCHLP98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Experiments in Spoken Document Retrieval at CMU

Matthew Siegler, Adam L. Berger, Michael J. Witbrock, Alexander G. Hauptmann

Abstract

We describe our submission to the TREC-7 Spoken Document Retrieval (SDR) track, along with the underlying speech recognition and information retrieval engines. We present SDR evaluation results and a brief analysis. A few developments are also described in greater detail, including: a new probabilistic retrieval engine based on language models; a new TF-IDF-based weighting function that incorporates word error probability; and the use of a simple confidence estimate for word probability based on speech recognition lattices. Although improvements over a development test set were promising, the new techniques failed to yield significant gains on the evaluation test set.
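
A minimal sketch of a TF-IDF weight that incorporates word error probability, in the spirit of the abstract: each hypothesised occurrence contributes its lattice-derived confidence instead of a hard count of one. The exact weighting function in the paper is not reproduced here; this expected-count variant is an assumption.

    import math

    def confidence_tfidf(confidences, n_docs, doc_freq):
        """TF-IDF with tf replaced by an expected term count: the sum of
        per-occurrence confidences p(word correct) from the recogniser's
        lattice, each in [0, 1]."""
        expected_tf = sum(confidences)
        idf = math.log(n_docs / doc_freq)
        return expected_tf * idf

    # Three hypothesised occurrences with confidences 0.9, 0.6 and 0.3,
    # in a 1000-document collection where the term occurs in 50 documents:
    weight = confidence_tfidf([0.9, 0.6, 0.3], n_docs=1000, doc_freq=50)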

Bibtex
@inproceedings{DBLP:conf/trec/SieglerBWH98,
    author = {Matthew Siegler and Adam L. Berger and Michael J. Witbrock and Alexander G. Hauptmann},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {Experiments in Spoken Document Retrieval at {CMU}},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {264--270},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/CMU-TREC7-SDR.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/SieglerBWH98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC-7 Experiments at the University of Maryland

Douglas W. Oard

Abstract

The University of Maryland participated in three TREC-7 tasks: ad hoc retrieval, cross-language retrieval, and spoken document retrieval. The principal focus of the work was evaluation of merging techniques for cross-language text retrieval from mixed-language collections. The results show that biasing the merging strategy in favor of documents in the query language can be helpful. Ad hoc and spoken document retrieval results are also presented.
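
A minimal sketch of a merging strategy biased toward query-language documents; the multiplicative bias factor, and the assumption that scores are comparable across the per-language runs, are illustrative choices rather than details from the paper.

    def biased_merge(runs, query_lang, bias=1.2):
        """Merge per-language ranked lists of (doc_id, score, lang) tuples,
        boosting the scores of documents in the query language."""
        merged = []
        for run in runs:
            for doc_id, score, lang in run:
                if lang == query_lang:
                    score *= bias             # favour query-language documents
                merged.append((doc_id, score))
        merged.sort(key=lambda pair: pair[1], reverse=True)
        return merged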

Bibtex
@inproceedings{DBLP:conf/trec/Oard98,
    author = {Douglas W. Oard},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {{TREC-7} Experiments at the University of Maryland},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {477--481},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/umdfinal.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/Oard98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Experiments in Spoken Document Retrieval at DERA-SRU

Peter Nowell

Abstract

A small amount of internal funding allowed DERA-SRU to participate in the TREC-7 SDR evaluations for the first time this year. Since we had almost no experience of entering this or related NIST evaluations (e.g. ARPA HUB-4 LVCSR), there was a rather steep learning curve along with intense development of the experimental infrastructure. The intention was to generate a base for future participation and to build upon this using experience gained from related work on topic spotting. To this end, a straightforward (i.e. non-optimised) speech recogniser was used to generate transcripts, and retrieval was performed using the Okapi [6,9] search engine. Previous work on topic spotting [7] suggested that term expansion using a semantic network (in this case WordNet [2,3]) might be useful. This hypothesis appeared to be supported by preliminary work on TREC-6 SDR data, which yielded text (i.e. R1) results that were comparable with the best achieved elsewhere.
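
A minimal sketch of semantic-network term expansion using WordNet; the NLTK interface below is a modern stand-in chosen for illustration, not the machinery used in the paper.

    from nltk.corpus import wordnet as wn     # requires nltk.download('wordnet')

    def wordnet_expand(term, max_terms=5):
        """Collect synonyms of a query term from all of its WordNet synsets."""
        synonyms = set()
        for synset in wn.synsets(term):
            for lemma in synset.lemmas():
                name = lemma.name().replace('_', ' ')
                if name.lower() != term.lower():
                    synonyms.add(name)
        return sorted(synonyms)[:max_terms]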

Bibtex
@inproceedings{DBLP:conf/trec/Nowell98,
    author = {Peter Nowell},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {Experiments in Spoken Document Retrieval at {DERA-SRU}},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {298--307},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/pnstrec.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/Nowell98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC 7 Ad Hoc, Speech, and Interactive tracks at MDS/CSIRO

Michael Fuller, Marcin Kaszkiel, Dongki Kim, Corinna Ng, John Robertson, Ross Wilkinson, Mingfang Wu, Justin Zobel

Abstract

For the 1998 round of TREC, the MDS group, a long-term participant in the conference, joined with newcomer CSIRO. Together we completed runs in three tracks: ad hoc, interactive, and speech.

Bibtex
@inproceedings{DBLP:conf/trec/FullerKKNRWWZ98,
    author = {Michael Fuller and Marcin Kaszkiel and Dongki Kim and Corinna Ng and John Robertson and Ross Wilkinson and Mingfang Wu and Justin Zobel},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {{TREC} 7 Ad Hoc, Speech, and Interactive tracks at {MDS/CSIRO}},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {404--413},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/mds.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/FullerKKNRWWZ98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Text Retrieval via Semantic Forests: TREC7

Gregory D. Henderson, Patrick Schone, Thomas H. Crystal

Abstract

In the second year of the use of Semantic Forests in TREC, we have raised our 30-document average precision in the automatic Ad Hoc task to 27% from 19% last year. We also contributed a significant number of unique relevant documents to the judgement pool [3]. Our mean average precisions on the SDR task are roughly the median performance for that task [4]. The Semantic Forests algorithm was originally developed by Schone and Nelson [1] for labeling topics in text and transcribed speech. Semantic Forests uses an electronic dictionary to make a tree for each word in a text document. The root of the tree is the word from the document, the first branches are the words in the definition of the root word, the next branches are the words in the definitions of the words in the first branches, and so on. The words in the trees are tagged by part of speech and given weights based on statistics gathered during training. Finally, the trees are merged into a scored list of words. The premise is that words in common between trees will be reinforced and represent 'topics' present in the document. With minor modifications, queries are treated as documents. Seven major changes were made in developing this year's system from last year's. (1) A number of pre-processing steps which were performed last year (such as identifying multi-word units) were incorporated into Semantic Forests. (2) A part-of-speech tagger was added, allowing Semantic Forests to use this additional information. (3) Semantic Forests distinguishes between queries and documents this year, since our experiments indicated they needed to be treated differently. (4) Only the first three letters of words which do not occur in the dictionary are retained, instead of the entire word. (5) A parameter directs Semantic Forests to break each document into segments containing at most a set number of words, typically 500. (6) The algorithms used by Semantic Forests to assign and combine word weights have been improved. (7) Quasi-relevance feedback was implemented and evaluated.
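
A schematic sketch of the tree-building and merging steps described above: each word is expanded recursively through its dictionary definition, deeper words get smaller weights, and the per-word trees are merged so that shared words are reinforced. The depth cutoff, geometric decay, whitespace tokenisation of definitions and merging by summation are all illustrative assumptions, and the part-of-speech tagging and trained weights are omitted.

    from collections import Counter

    def semantic_tree(word, dictionary, depth=2, decay=0.5, weight=1.0):
        """Expand a word through its dictionary definition; words at each
        level of the tree weigh less than their parent."""
        scores = Counter({word: weight})
        if depth > 0:
            for w in dictionary.get(word, '').split():
                scores += semantic_tree(w, dictionary, depth - 1,
                                        decay, weight * decay)
        return scores

    def document_forest(words, dictionary):
        """Merge the per-word trees into one scored term list; words shared
        across trees are reinforced and approximate the document's topics."""
        total = Counter()
        for w in words:
            total += semantic_tree(w, dictionary)
        return total.most_common()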

Bibtex
@inproceedings{DBLP:conf/trec/HendersonSC98,
    author = {Gregory D. Henderson and Patrick Schone and Thomas H. Crystal},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {Text Retrieval via Semantic Forests: {TREC7}},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {516--527},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/nsa-rev.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/HendersonSC98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Spoken Document Retrieval For TREC-7 At Cambridge University

Sue E. Johnson, Pierre Jourlin, Gareth L. Moore, Karen Sparck Jones, Philip C. Woodland

Abstract

This paper presents work done at Cambridge University on the TREC-7 Spoken Document Retrieval (SDR) Track. The broadcast news audio was transcribed using a 2-pass gender-dependent HTK speech recogniser which ran at 50 times real time and gave an overall word error rate of 24.8%, the lowest in the track. The Okapi-based retrieval engine used in TREC-6 by the City/Cambridge University collaboration was supplemented by improving the stop-list, adding a bad-spelling mapper and stemmer exceptions list, adding word-pair information, integrating part-of-speech weighting on query terms and including some pre-search statistical expansion. The final system gave an average precision of 0.4817 on the reference and 0.4509 on the automatic transcription, with the R-precision being 0.4603 and 0.4330 respectively. The paper also presents results on a new set of 60 queries with assessments for the TREC-6 test document data used for development purposes, and analyses the relationship between recognition accuracy, as defined by a pre-processed term error rate, and retrieval performance for both sets of data.
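
A minimal sketch of a pre-processed term error rate of the kind used in the analysis: count mismatches between the reference and automatic transcripts after stopping and stemming, normalised by the number of reference terms. The exact preprocessing and normalisation in the paper may differ.

    from collections import Counter

    def term_error_rate(ref_terms, hyp_terms):
        """Term error rate for one document: absolute differences in term
        counts between the reference and hypothesis transcripts (both
        already stopped and stemmed), over the reference length."""
        ref, hyp = Counter(ref_terms), Counter(hyp_terms)
        errors = sum(abs(ref[t] - hyp[t]) for t in ref.keys() | hyp.keys())
        return errors / sum(ref.values())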

Bibtex
@inproceedings{DBLP:conf/trec/JohnsonJMJW98,
    author = {Sue E. Johnson and Pierre Jourlin and Gareth L. Moore and Karen Sparck Jones and Philip C. Woodland},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {Spoken Document Retrieval For {TREC-7} At Cambridge University},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {138--147},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/cuhtk-trec98-uspaper.pdf.gz},
    timestamp = {Wed, 05 May 2021 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/JohnsonJMJW98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}