Runs - Million Query 2008

000cos

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: 000cos
  • Participant: neu
  • Track: Million Query
  • Year: 2008
  • Submission: 6/18/2008
  • Type: automatic
  • MD5: eaee6a666589c1c54311820d58c1f2da
  • Run description: Lemur cos (baseline run)

000klabs

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: 000klabs
  • Participant: neu
  • Track: Million Query
  • Year: 2008
  • Submission: 6/18/2008
  • Type: automatic
  • MD5: 78b30daa9d12709ccb5267a48bc70f30
  • Run description: Lemur KL "abs" (baseline run)

000okapi

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: 000okapi
  • Participant: neu
  • Track: Million Query
  • Year: 2008
  • Submission: 6/18/2008
  • Type: automatic
  • MD5: 2137d5596cd3a94704a39c0130b251f5
  • Run description: Lemur Okapi

000tfidfBM25

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: 000tfidfBM25
  • Participant: neu
  • Track: Million Query
  • Year: 2008
  • Submission: 6/18/2008
  • Type: automatic
  • MD5: b310c9eb1f154b10327b01acbd9d6967
  • Run description: Lemur tfidf "BM25" (baseline run)

000tfidfLOG

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: 000tfidfLOG
  • Participant: neu
  • Track: Million Query
  • Year: 2008
  • Submission: 6/18/2008
  • Type: automatic
  • MD5: da4ab663de9c918fdff41c76caf0628a
  • Run description: Lemur tfidf "LOG" (baseline run)

dxrun

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: dxrun
  • Participant: I3S_Group_of_ICT
  • Track: Million Query
  • Year: 2008
  • Submission: 6/12/2008
  • Type: automatic
  • MD5: 14d6b67dbef68e63bd27d7091b41fb9f
  • Run description: We use Wikipedia as a resource to identify entities in a query, then add term dependency features for each query. The term dependency features are actually ordered phrases. The Indri search engine is used for indexing and retrieval.
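
  A minimal sketch of how ordered-phrase (term dependency) features can be added to an Indri query. The entity phrases and helper name here are hypothetical stand-ins for the run's Wikipedia-based entity detection; only the Indri #combine and #1 (exact ordered window) operators are actual Indri query-language syntax.

      # Sketch: wrap detected entity phrases in Indri ordered-window operators.
      # The entity phrases are a stand-in for whatever the Wikipedia-based
      # detector produced; only the #combine / #1() syntax is real Indri.
      def build_indri_query(query_terms, entity_phrases):
          parts = list(query_terms)
          for phrase in entity_phrases:
              parts.append("#1(" + " ".join(phrase) + ")")   # exact ordered phrase
          return "#combine(" + " ".join(parts) + ")"

      print(build_indri_query(["barack", "obama", "senate", "record"],
                              [["barack", "obama"]]))
      # -> #combine(barack obama senate record #1(barack obama))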

hedge0

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: hedge0
  • Participant: neu
  • Track: Million Query
  • Year: 2008
  • Submission: 6/23/2008
  • Type: automatic
  • MD5: 6bcfb117d3f1d022dfef25fe955a7ea1
  • Run description: metasearch over Lemur retrieval runs

ind25QLnST08

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: ind25QLnST08
  • Participant: uMass
  • Track: Million Query
  • Year: 2008
  • Submission: 6/16/2008
  • Type: automatic
  • MD5: d4cb7ac37c12ec6cccc89ba548fdc226
  • Run description: Indri2.5, no stopping

indri25DM08

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: indri25DM08
  • Participant: uMass
  • Track: Million Query
  • Year: 2008
  • Submission: 6/19/2008
  • Type: automatic
  • MD5: f969758aad6b02356a6066b06a936857
  • Run description: dependency model, indri25

indriLowMu08

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: indriLowMu08
  • Participant: uMass
  • Track: Million Query
  • Year: 2008
  • Submission: 6/16/2008
  • Type: automatic
  • MD5: 1351a6695ed9f04d3e4e4ac6823f0516
  • Run description: Indri2.5, low mu, no stopping

indriQLST08

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: indriQLST08
  • Participant: uMass
  • Track: Million Query
  • Year: 2008
  • Submission: 6/13/2008
  • Type: automatic
  • MD5: 001f3a4bb21863aab271365770745409
  • Run description: Indri2.6, stopped, basic queries

lsi150dyn

Participants | Proceedings | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: lsi150dyn
  • Participant: ARSC08
  • Track: Million Query
  • Year: 2008
  • Submission: 6/17/2008
  • Type: automatic
  • MD5: 0e8446f1e8c4739259b98d8e303054dd
  • Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. Each topic was projected into the term vector space. The host collections were ranked relative to each topic-vector using a 150-rank LSI-reduced host-by-term matrix. The topic was run against the top 50 ranked hosts with Lucene and the ranked results from each host were merged using a standard result set merge algorithm.
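
  A minimal numpy sketch of the "big document" host ranking described above, assuming A is a toy host-by-term matrix (one row per host collection) and topic_vec a topic in the same term space; k=150 mirrors the run's LSI rank, and the real matrix (~400,000 terms) is of course far larger.

      import numpy as np

      def rank_hosts_lsi(A, topic_vec, k=150, top_n=50):
          # Truncated SVD of the host-by-term matrix, keeping the first k factors.
          U, s, Vt = np.linalg.svd(A, full_matrices=False)
          Uk, sk, Vk = U[:, :k], s[:k], Vt[:k]
          hosts_k = Uk * sk                 # host rows in the k-dim LSI space
          topic_k = Vk @ topic_vec          # fold the topic into the same space
          sims = hosts_k @ topic_k / (
              np.linalg.norm(hosts_k, axis=1) * np.linalg.norm(topic_k) + 1e-12)
          return np.argsort(-sims)[:top_n]  # indices of the top-ranked hosts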

lsi150stat

Participants | Proceedings | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: lsi150stat
  • Participant: ARSC08
  • Track: Million Query
  • Year: 2008
  • Submission: 6/13/2008
  • Type: automatic
  • MD5: 4a563b1e4cabf963fc3c37122cbb994d
  • Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. The 10,000 topics were represented as a "big topic" in the term space. The host collections were ranked relative to this static "big topic" using a 150-rank LSI-reduced host-by-term matrix; the 10,000 topics were run against the top 50 most relevant hosts and the results merged into a single ranked list.

LucDeflt

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: LucDeflt
  • Participant: ibm-haifa
  • Track: Million Query
  • Year: 2008
  • Submission: 6/16/2008
  • Type: automatic
  • MD5: 50b3fe2d5bbb82f30bd64c42e020d762
  • Run description: Lucene run with default settings: (1) default similarity, i.e., default doc length normalization = 1/sqrt(num tokens) and default tf = sqrt(term freq); (2) a single field for all searchable text; (3) no special query parts (no phrases and no proximity scoring); (4) default OR operator.
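
  For reference, a small sketch of the two defaults the description calls out for Lucene's classic similarity (circa Lucene 2.x): tf = sqrt(term frequency) and document length normalization = 1/sqrt(number of tokens). The rest of Lucene's scoring (idf, coord, query norm) is omitted, so this illustrates those two factors only, not the full formula.

      import math

      def default_tf(term_freq):
          return math.sqrt(term_freq)          # default tf = sqrt(term freq)

      def default_length_norm(num_tokens):
          return 1.0 / math.sqrt(num_tokens)   # default norm = 1/sqrt(num tokens)

      # e.g. a term occurring 4 times in a 10,000-token document contributes
      # default_tf(4) * default_length_norm(10000) = 2 * 0.01 = 0.02
      # before idf and query weighting are applied.
      print(default_tf(4) * default_length_norm(10000))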

LucLpTfS

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: LucLpTfS
  • Participant: ibm-haifa
  • Track: Million Query
  • Year: 2008
  • Submission: 6/15/2008
  • Type: automatic
  • MD5: 28ceacdc343b97c88c5efbe6fea92f6a
  • Run description: Lucene run with proximity and phrase scoring, doc length normalization by Lucene's sweet-spot similarity, and tf normalization by average term frequency.

mpiimq0801

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: mpiimq0801
  • Participant: mpi-d5
  • Track: Million Query
  • Year: 2008
  • Submission: 6/16/2008
  • Type: automatic
  • MD5: 332fb27c57069b84bda953d18b999947
  • Run description: standard BM25, no stemming, standard stopword removal
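
  A minimal scorer for the "standard BM25" the description refers to; k1 and b are the common defaults (1.2 and 0.75), not values reported by the participants, and the collection statistics are passed in as plain Python dicts.

      import math

      def bm25(query_terms, doc_tf, doc_len, avg_doc_len, df, num_docs,
               k1=1.2, b=0.75):
          score = 0.0
          for term in query_terms:
              if term not in doc_tf or term not in df:
                  continue
              tf = doc_tf[term]
              idf = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
              norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
              score += idf * norm
          return score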

neumsfilt

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: neumsfilt
  • Participant: neu
  • Track: Million Query
  • Year: 2008
  • Submission: 6/18/2008
  • Type: automatic
  • MD5: c1443723f7d45bc5a1aefdc76ac65f2d
  • Run description: Baseline + commercial search engine (MS Live). Phase 1: run each query through MS Live search and download the top 10 documents and snippets (documents were always downloaded as HTML; for PDFs the cached version was downloaded; in some cases documents were skipped because they were unavailable). Phase 2: for each query, let feedback_text = query + snippets + text of the top 10 documents from MS Live; for each word compute score = feedback_prob * ( log(feedback_prob) - k*log(collection_prob) ), where feedback_prob is the probability of the word in the feedback text (from the search engine). Phase 3: run Okapi for the query terms; for each document that Okapi considers, if it contains more than some threshold of the words identified as interesting from the MS Live feedback set, push that document up by giving it an additional score of +1000. Uses the Lucene toolkit.
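
  A sketch of the feedback-term scoring and boosting described above. The term-score formula follows the description verbatim; k, the overlap threshold, and the probability estimates are placeholders, since the run's actual values are not given.

      import math

      def term_score(feedback_prob, collection_prob, k=1.0):
          # score = P_fb(w) * ( log P_fb(w) - k * log P_coll(w) )
          return feedback_prob * (math.log(feedback_prob)
                                  - k * math.log(collection_prob))

      def boosted_score(okapi_score, doc_terms, interesting_terms, min_overlap=3):
          # Push the document up by +1000 if it contains more than min_overlap
          # of the terms identified as interesting from the MS Live feedback set.
          overlap = len(set(doc_terms) & set(interesting_terms))
          return okapi_score + (1000.0 if overlap > min_overlap else 0.0)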

neuMSRF

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: neuMSRF
  • Participant: neu
  • Track: Million Query
  • Year: 2008
  • Submission: 6/18/2008
  • Type: automatic
  • MD5: 66389c972f777502c348275180e8c1c0
  • Run description: Uses MS Live on the WWW to generate a query "context", like pseudo-relevance feedback (50 terms selected). Then a tfidf formula is used only on the feedback terms (the original query terms are not used) against the GOV2 index. Uses the Lucene toolkit.
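
  A sketch of scoring with feedback terms only, as described: the feedback terms stand in for whatever the MS Live-derived context actually contained, and the log-tf.idf weighting shown is a generic formula, not necessarily the exact one the run used.

      import math

      def score_with_feedback_terms(feedback_terms, doc_tf, df, num_docs):
          score = 0.0
          for term in feedback_terms:        # original query terms are NOT used
              if term in doc_tf and term in df:
                  tf = 1.0 + math.log(doc_tf[term])
                  idf = math.log(num_docs / df[term])
                  score += tf * idf
          return score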

neustbl

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: neustbl
  • Participant: neu
  • Track: Million Query
  • Year: 2008
  • Submission: 6/18/2008
  • Type: automatic
  • MD5: 5a1f6290225c8718c4ef38ecf1e0bdb5
  • Run description: BM25 on the Lucene toolkit, with modifications. Query preprocessing: remove stop words, stem, remove numbers (unless the query becomes empty), remove one-letter words, and remove words with a document frequency of zero. Let intersection_list be the list of documents that contain all query words. If size_of(intersection_list) < threshold, sort the query words by document frequency (ascending) and keep taking words from the sorted list until the sum of their document frequencies exceeds the threshold; let union_list be the list of documents containing at least one word from that list. doc_list = intersection_list + union_list. Rank the documents in doc_list with Okapi BM25 and output the top 1000. Uses the Lucene toolkit.
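
  A sketch of the candidate-list construction described above: start from the AND set (documents containing all query words); if it is too small, add an OR set built from the rarest query words until their document frequencies sum past the threshold. Postings are plain Python sets here and the threshold value is a placeholder.

      def candidate_docs(query_words, postings, df, threshold=1000):
          sets = [postings[w] for w in query_words if w in postings]
          if not sets:
              return set()
          intersection = set.intersection(*sets)   # docs containing all query words
          if len(intersection) >= threshold:
              return intersection
          # take the rarest words until their doc frequencies exceed the threshold
          chosen, total = [], 0
          for w in sorted(query_words, key=lambda w: df.get(w, 0)):
              chosen.append(w)
              total += df.get(w, 0)
              if total > threshold:
                  break
          union = set().union(*(postings[w] for w in chosen if w in postings))
          return intersection | union   # then rank with Okapi BM25, keep top 1000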

sabmq08a1

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: sabmq08a1
  • Participant: sabir.buckley
  • Track: Million Query
  • Year: 2008
  • Submission: 6/17/2008
  • Type: automatic
  • MD5: e58a66919f2e7151a15d310367e64edd
  • Run description: Very basic SMART Lnu.ltu run

sabmq08b1

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: sabmq08b1
  • Participant: sabir.buckley
  • Track: Million Query
  • Year: 2008
  • Submission: 6/19/2008
  • Type: automatic
  • MD5: eee52d82e1c8ed340d9cb66fcf177ccd
  • Run description: SMART blind feedback, top 25 docs, add 20 terms.
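
  A Rocchio-style sketch of the blind feedback described: treat the top 25 retrieved documents as relevant and add the 20 highest-weighted expansion terms to the query. The term weighting here is a simple frequency centroid, not SMART's actual weighting scheme.

      from collections import Counter

      def expand_query(query_terms, top_docs, n_docs=25, n_terms=20):
          centroid = Counter()
          for doc_terms in top_docs[:n_docs]:   # assume the top docs are relevant
              centroid.update(doc_terms)
          expansion = [t for t, _ in centroid.most_common()
                       if t not in query_terms][:n_terms]
          return list(query_terms) + expansion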

txrun

Participants | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: txrun
  • Participant: I3S_Group_of_ICT
  • Track: Million Query
  • Year: 2008
  • Submission: 6/16/2008
  • Type: automatic
  • MD5: be60dacd3dcf5556056ce7b9e3cfe970
  • Run description: We use Wikipedia as a resource to identify entities in a query, then expand each query with ten terms (if there are any) based on Wikipedia. The term selection procedure makes use of the semantic similarity measure proposed by Resnik (1996). Terms are ranked in descending order according to the similarity between a term and the whole query. The Indri search engine is used for indexing and retrieval.
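
  A sketch of the expansion step described above: candidate terms drawn from Wikipedia are ranked by their similarity to the whole query and the top ten are appended. The similarity function is passed in as a stand-in for the Resnik information-content measure the group used.

      def expand_query(query_terms, candidate_terms, similarity, n_terms=10):
          # Rank candidates by similarity between the term and the whole query.
          ranked = sorted(candidate_terms,
                          key=lambda t: similarity(t, query_terms),
                          reverse=True)
          return list(query_terms) + ranked[:n_terms]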

vsmdyn

Participants | Proceedings | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: vsmdyn
  • Participant: ARSC08
  • Track: Million Query
  • Year: 2008
  • Submission: 6/23/2008
  • Type: automatic
  • MD5: 08b0d4871ea75d6cdd1fde575882be58
  • Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. Each topic was projected into the term vector space. The host collections were ranked relative to each topic-vector using vector space model cosines with the host-by-term matrix. The topic was run against the top 50 ranked hosts with Lucene and the ranked results from each host were merged using a standard result set merge algorithm.
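
  The vector space counterpart to the LSI sketch given for lsi150dyn: hosts are ranked by plain cosine similarity between the topic vector and each "big document" row of the host-by-term matrix, with no dimensionality reduction. A and topic_vec are again toy stand-ins for the real ~400,000-term data.

      import numpy as np

      def rank_hosts_vsm(A, topic_vec, top_n=50):
          sims = A @ topic_vec / (
              np.linalg.norm(A, axis=1) * np.linalg.norm(topic_vec) + 1e-12)
          return np.argsort(-sims)[:top_n]   # top cosine-ranked host indices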

vsmstat

Participants | Proceedings | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: vsmstat
  • Participant: ARSC08
  • Track: Million Query
  • Year: 2008
  • Submission: 6/13/2008
  • Type: automatic
  • MD5: 212b6b5dda93bbd140a29268f36462c8
  • Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. The 10,000 topics were represented as a "big topic" in the term space. The host collections were ranked relative to this static "big topic" using standard vector space model (VSM) cosines with the host-by-term matrix; the 10,000 topics were run against the top 50 most relevant hosts and the results merged into a single ranked list.

vsmstat07

Participants | Proceedings | Summary (mtc) | Summary (statAP) | Appendix

  • Run ID: vsmstat07
  • Participant: ARSC08
  • Track: Million Query
  • Year: 2008
  • Submission: 6/16/2008
  • Type: automatic
  • MD5: 90dee6ba36e6db583e49daecc93d00a2
  • Run description: This run tests the sensitivity of the static "big topic" vector space ranking (i.e., the restriction of which host collections are searched). In short, the hosts were chosen by ranking them with respect to a "big topic" built from the 10,000 TREC 2007 MQ topics using the vector space model, as in vsmstat. The TREC 2008 MQ topics were then run against this collection of hosts (which was tuned for the 2007 topics).