Proceedings - Chemical 2009

Overview of the TREC 2009 Chemical IR Track

Mihai Lupu, Florina Piroi, Xiangji Huang, Jianhan Zhu, John Tait

Abstract

TREC 2009 was the first year of the Chemical IR Track, which focuses on evaluation of search techniques for discovery of digitally stored information on chemical patents and academic journal articles. The track included two tasks: Prior Art (PA) and Technical Survey (TS) tasks. This paper describes how we designed the two tasks and presents the official results of eight participating groups.

Bibtex
@inproceedings{DBLP:conf/trec/LupuPHZT09,
    author = {Mihai Lupu and Florina Piroi and Xiangji Huang and Jianhan Zhu and John Tait},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Overview of the {TREC} 2009 Chemical {IR} Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/CHEM09.OVERVIEW.pdf},
    timestamp = {Sun, 02 Oct 2022 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/LupuPHZT09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Strategies for Effective Chemical Information Retrieval

Suleyman Cetintas, Luo Si

Abstract

We participated in the technology survey and prior art search subtasks of the TREC 2009 Chemical IR Track. This paper describes the methods developed for these two tasks. For the technology survey task, we propose a method that constructs highly structured queries to do retrieval on different fields of chemical patents and documents in a weighted way. The proposed method i) enriches these structured queries with synonyms of the chemicals that have been identified, and ii) uses simple entity recognition to extract information for increasing or decreasing weights of some terms and to filter out documents from the ranked list. For the prior art search task, we propose an automated query generation method that uses all title words and selects sets of terms from the claims, abstract and description fields of query patents to transform a query patent into a search query. From the selected terms, chemical entities are extracted and synonyms for the identified chemical entities are included from PubChem. Then structured queries are formed to do retrieval over different fields of documents with different weights. Furthermore, a post-processing step is proposed that i) filters out some of the retrieved documents from the ranked list because of date constraints and ii) utilizes the IPC similarities between the query patent and its retrieved patents to re-rank the retrieved documents. Empirical results demonstrate the effectiveness of these methods in both tasks.

Bibtex
@inproceedings{DBLP:conf/trec/CetintasS09,
    author = {Suleyman Cetintas and Luo Si},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Strategies for Effective Chemical Information Retrieval},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/purdue.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CetintasS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Report on the TREC 2009 Experiments: Chemical IR Track

Julien Gobeill, Douglas Teodoro, Emilie Pasche, Patrick Ruch

Abstract

The goal of the first TREC Chemical track was to retrieve documents relevant to a given patent query, within a large collection of patents in chemistry. Regarding this objective, for the Prior Art subtask, our runs performed significantly better than runs submitted by other participating teams. Baseline retrieval methods achieved relatively poor performance (Mean Average Precision = 0.067). Query expansion, driven by chemical named entity recognition, resulted in some modest improvement (+2 to 3%). Filtering based on IPC codes did not result in any significant improvement. A re-ranking strategy based on claims only improved MAP by about 3%. The most effective gain was obtained by using patent citation patterns. Somewhat similar to feedback but restricted to citations, we used patents cited in the retrieved patents in order to boost the retrieval status value of the baseline run. This strategy led to a remarkable improvement (MAP 0.18, +168%). Nevertheless, as official topics were sampled from the collection disregarding their creation date, our strategy happened to exploit citations of patents which were patented after the topic itself. From a user perspective, such a setting is questionable. We think that future TREC-CHEM competitions should address this issue by using patents filed as recently as possible.

Bibtex
@inproceedings{DBLP:conf/trec/GobeillTPR09,
    author = {Julien Gobeill and Douglas Teodoro and Emilie Pasche and Patrick Ruch},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Report on the {TREC} 2009 Experiments: Chemical {IR} Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/bitem.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GobeillTPR09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Patent Retrieval in Chemistry Based on Semantically Tagged Named Entities

Harsha Gurulingappa, Bernd Müller, Roman Klinger, Heinz-Theodor Mevissen, Martin Hofmann-Apitius, Juliane Fluck, Christoph M. Friedrich

Abstract

This paper reports on the work that has been conducted by Fraunhofer SCAI for the TREC Chemistry (TREC-CHEM) track 2009. The team of Fraunhofer SCAI participated in two tasks, namely Technology Survey and Prior Art Search. The core of the framework is an index of 1.2 million chemical patents provided as a data set by TREC. For the technology survey, three runs were submitted based on semantic dictionaries and noun phrases. For the prior art search task, several fields were introduced into the index that contained normalized noun phrases, biomedical as well as chemical entities. Altogether, 36 runs were submitted for this task that were based on automatic querying with tokens, noun phrases and entities along with different search strategies.

Bibtex
@inproceedings{DBLP:conf/trec/GurulingappaMKMHFF09,
    author = {Harsha Gurulingappa and Bernd M{\"{u}}ller and Roman Klinger and Heinz{-}Theodor Mevissen and Martin Hofmann{-}Apitius and Juliane Fluck and Christoph M. Friedrich},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Patent Retrieval in Chemistry Based on Semantically Tagged Named Entities},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/scai.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GurulingappaMKMHFF09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

DUTIR at TREC 2009: Chemical IR Track

Song Jin, Zheng Ye, Hongfei Lin

Abstract

This paper presents the DUTIR submission to the TREC 2009 Chemical IR Track. This track included two tasks: Prior Art (PA) and Technical Survey (TS) tasks. We present a series of experiments on two text retrieval models, BM25 and Language Model for IR (LMIR). For the Prior Art task, we focused on formulating the queries from the query patents and on date filtering. Moreover, some traditional search techniques are used for the Technical Survey task.

Bibtex
@inproceedings{DBLP:conf/trec/JinYL09,
    author = {Song Jin and Zheng Ye and Hongfei Lin},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{DUTIR} at {TREC} 2009: Chemical {IR} Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/dalianu.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/JinYL09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC Blog and TREC Chem: A View from the Corn Fields

Yelena Mejova, Viet Ha-Thuc, Steven Foster, Christopher G. Harris, Robert J. Arens, Padmini Srinivasan

Abstract

The University of Iowa team participated in the blog track and the chemistry track of TREC-2009. This is our first year participating in the blog track as well as the chemistry track.

Bibtex
@inproceedings{DBLP:conf/trec/MejovaHFHAS09,
    author = {Yelena Mejova and Viet Ha{-}Thuc and Steven Foster and Christopher G. Harris and Robert J. Arens and Padmini Srinivasan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{TREC} Blog and {TREC} Chem: {A} View from the Corn Fields},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uiowa.BLOG.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/MejovaHFHAS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC Chemical IR Track 2009: A Distributed Dimensional Indexing Model for Chemical Patent Search

Jay Urbain, Ophir Frieder

Abstract

For the TREC-2009 Chemical IR Track, we explore development of a distributed information retrieval system based on a dimensional data model. The indexing model supports named entity identification and aggregation of term statistics at multiple levels of patent structure including individual words, sentences, claims, descriptions, abstracts, and titles. The system was deployed across 15 Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances and 15 Elastic Block Store (EBS) database shards to support efficient indexing and query processing of the relatively large index generated from indexing each individual word (sans stop words) in the 100G+ collection of chemical patent documents. The query processing algorithm for technology survey search and prior art search uses information extraction techniques and locally aggregated term statistics to help disambiguate candidate entities and terms in context. Query processing for prior art search automatically generates a structured query based on the relative distinctiveness of individual terms and candidate entity phrases from the query patent's claims, abstract, and title sections. For both the technology survey and prior art search, we evaluated several probabilistic retrieval functions for integrating statistics of retrieved named entities with term statistics at multiple levels of document structure to identify relevant patents.

Bibtex
@inproceedings{DBLP:conf/trec/UrbainF09,
    author = {Jay Urbain and Ophir Frieder},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{TREC} Chemical {IR} Track 2009: {A} Distributed Dimensional Indexing Model for Chemical Patent Search},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/milwaukee.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/UrbainF09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Formulating Simple Structured Queries Using Temporal and Distributional Cues in Patents

Le Zhao, James P. Callan

Abstract

Patent prior art retrieval aims to find related publications, especially patents, which may invalidate the patent. The task has its own characteristics because of the possible use of a whole patent as a query. This work focuses on the use of date fields and content fields of the query patent to formulate effective structured queries. Retrieval is performed on the collection of patents which also share the same structure as the query patent, mainly priority dates, application date, publication date and content fields. Unsurprisingly, results show that filtering using date information improves retrieval significantly. However, results also show that a careful choice of the date filter is important, given the multiple date fields present in a patent. The actual ranking query is constructed based on word distributions of title, claims and content fields of the query patent. The overall MAP of this citation finding task is still in the lower 0.1 range. An error analysis focusing on the lower performing topics finds that the citation finding task (given a publication, recommend citations, which is a setup very similar to this year's prior art evaluation) can be very different from the prior art task (finding patents that invalidate the query patent). It raises the concern that just the citations included in query patents can be a biased and incomplete set of relevance judgements for the prior art task.

Bibtex
@inproceedings{DBLP:conf/trec/ZhaoC09,
    author = {Le Zhao and James P. Callan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Formulating Simple Structured Queries Using Temporal and Distributional Cues in Patents},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/cmu.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ZhaoC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

York University at TREC 2009: Chemical Track

Jiashu Zhao, Xiangji Huang, Zheng Ye, Jianhan Zhu

Abstract

Our chemical experiments mainly focus on addressing three major problems in two chemical information retrieval tasks, the Technology Survey (TS) task and the Prior Art (PA) task. The three problems are: (1) how to deal with chemical terminology synonyms? (2) how to deal with chemical terminology abbreviations? (3) how to deal with long queries in the Prior Art (PA) task? In particular, we propose a query expansion algorithm for the TS task and a keyword-selection algorithm for the PA task. The Mean Average Precision (MAP) for our TS task run “york09ca07” using Algorithm 1 was 0.2519 and for our PA task run “york09caPA01” using Algorithm 2 was 0.0566. The evaluation results show that both algorithms are effective for improving retrieval performance.

Bibtex
@inproceedings{DBLP:conf/trec/ZhaoHYYZ09,
    author = {Jiashu Zhao and Xiangji Huang and Zheng Ye and Jianhan Zhu},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {York University at {TREC} 2009: Chemical Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/yorku.CHEM.pdf},
    timestamp = {Sun, 02 Oct 2022 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/ZhaoHYYZ09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}