Proceedings - Knowledge Base Acceleration 2014

Evaluating Stream Filtering for Entity Profile Updates in TREC 2012, 2013, and 2014

John R. Frank, Max Kleiman-Weiner, Daniel A. Roberts, Ellen M. Voorhees, Ian Soboroff

Abstract

The Knowledge Base Acceleration (KBA) track ran in TREC 2012, 2013, and 2014 as an entity-centric filtering evaluation. This track evaluates systems that filter a time-ordered corpus for documents and slot fills that would change an entity profile in a predefined list of entities. Compared with the 2012 and 2013 evaluations, the 2014 evaluation introduced several refinements, including high-quality community metadata from running Raytheon/BBN's Serif named entity recognizer, sentence parser, and relation extractor on 579,838,246 English documents in the corpus. We also expanded the query entities to be primarily long-tail entities that lacked Wikipedia profiles. We simplified the SSF scoring, and also added a third task component for highlighting creative systems that used the KBA data. A successful KBA system must do more than resolve the meaning of entity mentions by linking documents to the KB: it must also distinguish novel “vitally” relevant documents and slot fills that would change a target entity's profile. This combines thinking from natural language understanding (NLU) and information retrieval (IR). Filtering tracks in TREC have typically used queries based on topics described by a set of keyword queries or short descriptions, and annotators have generated relevance judgments based on their personal interpretation of the topic. For TREC 2014, we selected a set of filter topics based on people, organizations, and facilities in the region between Seattle, Washington, and Vancouver, British Columbia: 86 people, 16 organizations, and 7 facilities. Assessors judged ~30k documents, which included most documents that mention a name from a handcrafted list of surface form names of the 109 target entities. TREC teams were provided with all of the ground truth data divided into training and evaluation data. We present peak macro-averaged F_1 scores for all run submissions. High scoring systems used a variety of approaches, including feature engineering around linguistic structures, names of related entities, and various types of classifiers. Top scoring systems achieved F_1 scores in the high-50s. We present results for a baseline system that performs in the low-40s. We discuss key lessons learned that motivate future tracks at the end of the paper.
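
As a rough illustration of the headline metric (a minimal sketch, not the official KBA scorer; it assumes judged run output flattened into (entity, confidence, is_vital) tuples), peak macro-averaged F_1 sweeps a confidence cutoff over the run, computes per-entity F_1 at each cutoff, macro-averages across entities, and reports the maximum:

    from collections import defaultdict

    def peak_macro_f1(judged_run, cutoffs=range(0, 1001, 50)):
        """judged_run: iterable of (entity, confidence, is_vital) tuples,
        one per (entity, document) pair, with confidence in (0, 1000]
        as in KBA run files and is_vital the assessor judgment."""
        by_entity = defaultdict(list)
        for entity, conf, vital in judged_run:
            by_entity[entity].append((conf, vital))
        best = 0.0
        for cutoff in cutoffs:
            f1s = []
            for rows in by_entity.values():
                tp = sum(1 for c, v in rows if c >= cutoff and v)
                fp = sum(1 for c, v in rows if c >= cutoff and not v)
                fn = sum(1 for c, v in rows if c < cutoff and v)
                p = tp / (tp + fp) if tp + fp else 0.0
                r = tp / (tp + fn) if tp + fn else 0.0
                f1s.append(2 * p * r / (p + r) if p + r else 0.0)
            if f1s:
                best = max(best, sum(f1s) / len(f1s))
        return best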

Bibtex
@inproceedings{DBLP:conf/trec/FrankKRVS14,
    author = {John R. Frank and Max Kleiman{-}Weiner and Daniel A. Roberts and Ellen M. Voorhees and Ian Soboroff},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {Evaluating Stream Filtering for Entity Profile Updates in {TREC} 2012, 2013, and 2014},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/overview-kba.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/FrankKRVS14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

IRIT at TREC KBA 2014

Rafik Abbes, Karen Pinel-Sauvagnat, Nathalie Hernandez, Mohand Boughanem

Abstract

This paper describes the IRIT lab's participation in the Vital Filtering task (also known as Cumulative Citation Recommendation) of the TREC 2014 Knowledge Base Acceleration track. This task aims at identifying vital documents containing timely new information that could help a human update the profile of a target entity (e.g., the entity's Wikipedia page). In this work, we evaluate two factors that could signal vitality: the first uses a language model to learn vitality from a sample of vital documents, and the second leverages bursts of documents in the stream. The results obtained are presented and discussed.
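
The abstract does not spell out the burst factor; one simple way to operationalize it (a sketch, not necessarily the authors' formula) is to compare an entity's mention count in the newest time window against its trailing average:

    def burst_score(window_counts, history=7):
        """window_counts: chronological per-window mention counts for one
        entity. Returns the ratio of the newest count to the trailing
        average over the previous `history` windows; values well above
        1.0 suggest a burst of documents about the entity."""
        if len(window_counts) < 2:
            return 0.0
        past = window_counts[-(history + 1):-1]
        baseline = sum(past) / len(past)
        return window_counts[-1] / baseline if baseline else float(window_counts[-1])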

Bibtex
@inproceedings{DBLP:conf/trec/AbbesPHB14,
    author = {Rafik Abbes and Karen Pinel{-}Sauvagnat and Nathalie Hernandez and Mohand Boughanem},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {{IRIT} at {TREC} {KBA} 2014},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/pro-IRIT\_kba.pdf},
    timestamp = {Wed, 03 Feb 2021 08:31:24 +0100},
    biburl = {https://dblp.org/rec/conf/trec/AbbesPHB14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Use of Time-Aware Language Model in Entity Driven Filtering System

Vincent Bouvier, Patrice Bellot

Abstract

Tracking entities, so that new or important information about them is caught, is a real challenge and has many applications (e.g., information monitoring, marketing). We are interested in how to represent an entity profile that fulfills two purposes: (1) entity detection and disambiguation, and (2) novelty and importance quantification. We propose an entity profile that uses two language models. First, the Reference Language Model (RLM), which is mainly used for disambiguation. Second, we propose a formalization of a Time-Aware Language Model, which is used for novelty detection. To rank documents, we propose a semi-supervised classification approach that uses meta-features computed on documents using entity profiles and time series.
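
The paper gives its own formalization; as a minimal sketch of the underlying idea (the half-life and rarity threshold below are illustrative assumptions, not the paper's values), a unigram model whose counts decay exponentially with document age lets recent contexts dominate the profile, so tokens the decayed model explains poorly signal novelty:

    import math
    from collections import Counter

    class TimeAwareLM:
        """Unigram LM with exponentially decayed counts (illustrative)."""
        def __init__(self, half_life_days=30.0):
            self.decay = math.log(2) / half_life_days
            self.counts = Counter()
            self.last_update = None

        def update(self, tokens, now):
            # decay existing mass to the current timestamp, then add counts
            if self.last_update is not None:
                factor = math.exp(-self.decay * (now - self.last_update))
                for w in list(self.counts):
                    self.counts[w] *= factor
            self.counts.update(tokens)
            self.last_update = now

        def novelty(self, tokens):
            # fraction of tokens poorly explained by the decayed profile
            total = sum(self.counts.values()) or 1.0
            unseen = sum(1 for w in tokens if self.counts[w] / total < 1e-4)
            return unseen / len(tokens) if tokens else 0.0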

Bibtex
@inproceedings{DBLP:conf/trec/BouvierB14,
    author = {Vincent Bouvier and Patrice Bellot},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {Use of Time-Aware Language Model in Entity Driven Filtering System},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/pro-LSIS\_kba.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BouvierB14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Distributed Non-Parametric Representations for Vital Filtering: UW at TREC KBA 2014

Ignacio Cano, Sameer Singh, Carlos Guestrin

Abstract

Identifying documents that contain timely and vital information for an entity of interest, a task known as vital filtering, has become increasingly important with the availability of large document collections. To efficiently filter such large text corpora in a streaming manner, we need to compactly represent previously observed entity contexts, and quickly estimate whether a new document contains novel information. Existing approaches to modeling contexts, such as bag of words, latent semantic indexing, and topic models, are limited in several respects: they are unable to handle streaming data, do not model the underlying topic of each document, suffer from lexical sparsity, and/or do not accurately estimate temporal vitalness. In this paper, we introduce a word embedding-based non-parametric representation of entities that addresses the above limitations. The word embeddings provide accurate and compact summaries of observed entity contexts, further described by topic clusters that are estimated in a non-parametric manner. Additionally, we associate a staleness measure with each entity and topic cluster, dynamically estimating their temporal relevance. This approach of using word embeddings, non-parametric clustering, and staleness provides an efficient yet appropriate representation of entity contexts for the streaming setting, enabling accurate vital filtering.
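
As one concrete reading of the staleness idea (a sketch under assumed dynamics, not the paper's exact estimator), each entity or topic cluster can carry a value that rises toward 1.0 as time passes without a matching document and resets when one arrives:

    import math

    class Staleness:
        """Rises toward 1.0 with time since the last matching document."""
        def __init__(self, rate=0.1):
            self.rate = rate
            self.last_hit = None

        def value(self, now):
            if self.last_hit is None:
                return 1.0  # never observed: maximally stale
            return 1.0 - math.exp(-self.rate * (now - self.last_hit))

        def hit(self, now):
            # a document matched this entity/cluster: refresh it
            self.last_hit = now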

Bibtex
@inproceedings{DBLP:conf/trec/CanoSG14,
    author = {Ignacio Cano and Sameer Singh and Carlos Guestrin},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {Distributed Non-Parametric Representations for Vital Filtering: {UW} at {TREC} {KBA} 2014},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/pro-UW\_kba.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CanoSG14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

K2U at TREC 2014 KBA Track

Shun Kawahara, Kuniaki Uehara, Kazuhiro Seki

Abstract

Kobe University and Konan University (K2U) collaborated on the vital filtering task of the 2014 TREC KBA track. This paper describes our proposed system developed on the distributed and fault-tolerant realtime computation system, Apache Storm, and reports the results obtained for our submitted runs. The remainder of this paper is structured as follows: Section II briefly introduces Apache Storm and its components, Section III describes our proposed system for the vital filtering task, Section IV reports and discusses the results for our submitted runs, and Section V concludes this paper with a brief summary.
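
Storm topologies are typically written in Java; purely to illustrate the spout/bolt dataflow such a system might use (the stage names and document fields here are hypothetical, not from the paper), the pipeline can be pictured as chained Python generators:

    def spout(stream):
        """Plays the role of a Storm spout: emits documents in time order."""
        for doc in stream:
            yield doc

    def mention_filter_bolt(docs, surface_forms):
        """First bolt: keep documents mentioning a target surface form."""
        for doc in docs:
            for entity, names in surface_forms.items():
                if any(n in doc["text"] for n in names):
                    yield entity, doc

    def vitality_bolt(pairs, score, threshold=0.5):
        """Second bolt: score each (entity, doc) pair; emit vital ones."""
        for entity, doc in pairs:
            s = score(entity, doc)
            if s >= threshold:
                yield entity, doc["id"], s

    # wiring the stages mirrors a Storm topology definition:
    # results = vitality_bolt(mention_filter_bolt(spout(corpus), forms), scorer)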

Bibtex
@inproceedings{DBLP:conf/trec/KawaharaUS14,
    author = {Shun Kawahara and Kuniaki Uehara and Kazuhiro Seki},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {{K2U} at {TREC} 2014 {KBA} Track},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/pro-KobeU\_kba.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/KawaharaUS14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

SCU at TREC 2014 Knowledge Base Acceleration Track

Hung Nguyen, Yi Fang

Abstract

In this paper, we present the system we developed at Santa Clara University to address the SSF task in TREC KBA 2014. We used a pattern-matching method to extract slot values for entities of interest from relevant passages, improving the approach we used last year to enhance performance. Our system consists of the following steps: processing the filtered corpus, retrieving relevant passages, pattern matching, computing scores, and generating results.
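
A minimal sketch of regex-based slot extraction in this spirit (the slot names and patterns below are hypothetical, not the paper's actual ones):

    import re

    SLOT_PATTERNS = {
        "FounderOf": r"{entity}\s*,?\s+(?:the\s+)?founder of\s+([A-Z][\w&. ]+)",
        "EmployeeOf": r"{entity}\s+works (?:at|for)\s+([A-Z][\w&. ]+)",
    }

    def extract_slots(entity, passage):
        """Return (slot, value) pairs matched in a relevant passage."""
        fills = []
        for slot, template in SLOT_PATTERNS.items():
            pattern = template.format(entity=re.escape(entity))
            for m in re.finditer(pattern, passage):
                fills.append((slot, m.group(1).strip()))
        return fills

    # extract_slots("John Smith", "John Smith, founder of Acme Corp, said ...")
    # -> [("FounderOf", "Acme Corp")]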

Bibtex
@inproceedings{DBLP:conf/trec/NguyenF14,
    author = {Hung Nguyen and Yi Fang},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {{SCU} at {TREC} 2014 Knowledge Base Acceleration Track},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/pro-SCU\_kba.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/NguyenF14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

BUPT_PRIS at TREC 2014 Knowledge Base Acceleration Track

Yuanyuan Qi, Ye Xu, Dongxu Zhang, Weiran Xu

Abstract

This paper describes our system for the Vital Filtering and Streaming Slot Filling tasks of the TREC 2014 Knowledge Base Acceleration track. In the Vital Filtering task, the PRIS system focuses on query expansion and similarity calculation: it uses DBpedia as an external data source for query expansion and generates directional documents to calculate similarities with candidate citation-worthy documents. In the Streaming Slot Filling task, the BUPT_PRIS system uses a pattern-learning method for relation extraction and slot filling. Patterns for regular slots, which are mostly the same as the TAC-KBP slots, are learned from the KBP Slot Filling corpus; for slot types that KBP does not cover, training seeds are manually selected and a bootstrapping method is applied.
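
A bare-bones sketch of the query-expansion half (assuming related-entity names have already been pulled from DBpedia; the scoring here is plain cosine over bags of words, far simpler than a real system):

    import math
    from collections import Counter

    def expanded_query(entity_name, related_names):
        """Combine the entity and its related entities into one query."""
        terms = entity_name.lower().split()
        for name in related_names:
            terms.extend(name.lower().split())
        return Counter(terms)

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def score_document(query_vec, doc_text):
        return cosine(query_vec, Counter(doc_text.lower().split()))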

Bibtex
@inproceedings{DBLP:conf/trec/QiXZX14,
    author = {Yuanyuan Qi and Ye Xu and Dongxu Zhang and Weiran Xu},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {BUPT{\_}PRIS at {TREC} 2014 Knowledge Base Acceleration Track},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/pro-BUPT\_PRIS\_kba.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/QiXZX14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

The University of Illinois' Graduate School of Library and Information Science at TREC 2014

Garrick Sherman, Miles Efron, Craig Willis

Abstract

The University of Illinois' Graduate School of Library and Information Science (uiucGSLIS) participated in TREC's Federated Web (FedWeb) and Knowledge Base Acceleration (KBA) tracks in 2014. Specifically, we submitted runs for the FedWeb resource selection and KBA cumulative citation recommendation (CCR) tasks.

Bibtex
@inproceedings{DBLP:conf/trec/ShermanEW14,
    author = {Garrick Sherman and Miles Efron and Craig Willis},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {The University of Illinois' Graduate School of Library and Information Science at {TREC} 2014},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/pro-uiucGSLIS-federated-kba.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ShermanEW14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

BIT and Purdue at TREC-KBA-CCR Track 2014

Jingang Wang, Ning Zhang, Zhiwei Zhang, Dandan Song, Luo Si, Lejian Liao

Abstract

This report summarizes our participation in the KBA-CCR track at TREC 2014. Our submissions are generated in two steps: (1) filtering a collection of candidate documents from the stream corpus for a set of target entities; and (2) estimating the relevance levels between candidate documents and target entities. Three kinds of approaches are employed in the second step: query expansion, classification, and learning to rank. Query expansion is an unsupervised baseline that combines an entity and its related entities into a query to retrieve relevant documents. It performs considerably well in the vital + useful scenario; it is not difficult to filter a relevant document set from the stream corpus. In the vital-only scenario, however, supervised approaches are more powerful than query expansion at identifying vital documents for target entities. Our results reveal that learning-to-rank approaches are more suitable for CCR under the current evaluation methodology.
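
As an illustration of the learning-to-rank direction (a generic pairwise sketch with toy features, not the authors' model), pointwise relevance levels can be turned into preference pairs and fit with a linear model:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def pairwise_transform(X, y):
        """For docs i, j with y[i] > y[j], the difference X[i] - X[j]
        gets label 1 and the reverse pair gets label 0."""
        diffs, labels = [], []
        for i in range(len(y)):
            for j in range(len(y)):
                if y[i] > y[j]:
                    diffs.append(X[i] - X[j]); labels.append(1)
                    diffs.append(X[j] - X[i]); labels.append(0)
        return np.array(diffs), np.array(labels)

    rng = np.random.default_rng(0)
    X = rng.random((20, 5))            # toy per-document feature vectors
    y = rng.integers(0, 3, 20)         # toy levels: 0=other, 1=useful, 2=vital
    Xp, yp = pairwise_transform(X, y)
    ranker = LogisticRegression().fit(Xp, yp)
    scores = ranker.decision_function(X)  # rank documents by this score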

Bibtex
@inproceedings{DBLP:conf/trec/WangZZSSL14,
    author = {Jingang Wang and Ning Zhang and Zhiwei Zhang and Dandan Song and Luo Si and Lejian Liao},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {{BIT} and Purdue at {TREC-KBA-CCR} Track 2014},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/pro-BIT-Purdue\_kba.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WangZZSSL14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

WHU at TREC KBA Vital Filtering Track 2014

Chuan Wu, Wei Lu, Pengcheng Zhou, Xiaohua Feng

Abstract

This paper describes the WHU IRLAB participation in the Vital Filtering task of the TREC 2014 Knowledge Base Acceleration track. For this task, we implemented a system to detect vital documents that could be used by a human editor to update or create the profile of an entity. Our approach views the problem as a classification problem and uses the Stanford NLP toolkit to extract the necessary information. Various kinds of features are leveraged to classify documents into three classes, i.e., vital, useful, and non-useful (garbage or neutral). We submitted four runs using different combinations of features; the results are presented and discussed.
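
A minimal sketch of such a three-class setup (the feature set and toy values below are illustrative assumptions, not the paper's features):

    from sklearn.ensemble import RandomForestClassifier

    # per-document features: [mention_count, doc_length,
    #                         first_mention_offset, related_entity_count]
    X_train = [[5, 800, 40, 3], [4, 900, 10, 2],     # vital
               [2, 1200, 600, 1], [1, 700, 500, 1],  # useful
               [0, 300, -1, 0], [0, 2000, -1, 0]]    # non-useful
    y_train = ["vital", "vital", "useful", "useful",
               "non-useful", "non-useful"]

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.predict([[3, 850, 25, 2]]))  # e.g. ['vital']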

Bibtex
@inproceedings{DBLP:conf/trec/WuLZF14,
    author = {Chuan Wu and Wei Lu and Pengcheng Zhou and Xiaohua Feng},
    editor = {Ellen M. Voorhees and Angela Ellis},
    title = {{WHU} at {TREC} {KBA} Vital Filtering Track 2014},
    booktitle = {Proceedings of The Twenty-Third Text REtrieval Conference, {TREC} 2014, Gaithersburg, Maryland, USA, November 19-21, 2014},
    series = {{NIST} Special Publication},
    volume = {500-308},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2014},
    url = {http://trec.nist.gov/pubs/trec23/papers/pro-WHU\_IRGroup\_kba.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WuLZF14.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}