Runs - Million Query 2008
000cos
- Run ID: 000cos
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: eaee6a666589c1c54311820d58c1f2da
- Run description: Lemur cos (baseline run)
000klabs
- Run ID: 000klabs
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: 78b30daa9d12709ccb5267a48bc70f30
- Run description: Lemur KL "abs" (baseline run)
000okapi
- Run ID: 000okapi
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: 2137d5596cd3a94704a39c0130b251f5
- Run description: Lemur Okapi
000tfidfBM25
- Run ID: 000tfidfBM25
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: b310c9eb1f154b10327b01acbd9d6967
- Run description: Lemur tfidf "BM25" (baseline run)
000tfidfLOG
- Run ID: 000tfidfLOG
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: da4ab663de9c918fdff41c76caf0628a
- Run description: Lemur tfidf "LOG" (baseline run)
dxrun
- Run ID: dxrun
- Participant: I3S_Group_of_ICT
- Track: Million Query
- Year: 2008
- Submission: 6/12/2008
- Type: automatic
- MD5: 14d6b67dbef68e63bd27d7091b41fb9f
- Run description: We use Wikipedia as a resource to identify entities in a query, then add term dependency features for each query. The term dependency features are actually ordered phrases. The Indri search engine is used for indexing and retrieval.
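The ordered-phrase features map naturally onto Indri's #1() operator. A minimal sketch of how such a query might be assembled; the 0.8/0.2 mixture weights and the entity list are hypothetical, since the run description gives neither:

```python
def build_indri_query(terms, entities):
    """Combine bag-of-words evidence with ordered-phrase (#1) features
    for multi-word entities spotted in the query (Indri query language)."""
    bag = "#combine(" + " ".join(terms) + ")"
    phrases = ["#1(" + e + ")" for e in entities if " " in e]
    if not phrases:
        return bag
    dep = "#combine(" + " ".join(phrases) + ")"
    # Hypothetical 0.8/0.2 mixture; the run description gives no weights.
    return "#weight(0.8 " + bag + " 0.2 " + dep + ")"

print(build_indri_query(["french", "lick", "resort", "casino"],
                        ["french lick resort"]))
```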
hedge0
- Run ID: hedge0
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/23/2008
- Type: automatic
- MD5: 6bcfb117d3f1d022dfef25fe955a7ea1
- Run description: metasearch over Lemur retrieval runs
ind25QLnST08
- Run ID: ind25QLnST08
- Participant: uMass
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: d4cb7ac37c12ec6cccc89ba548fdc226
- Run description: Indri2.5, no stopping
indri25DM08
- Run ID: indri25DM08
- Participant: uMass
- Track: Million Query
- Year: 2008
- Submission: 6/19/2008
- Type: automatic
- MD5: f969758aad6b02356a6066b06a936857
- Run description: dependency model, indri25
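"Dependency model" presumably refers to a Metzler-and-Croft-style sequential dependence model, which adds exact-bigram (#1) and unordered-window (#uw8) features over adjacent query terms. A hedged sketch of that query template; the 0.85/0.10/0.05 weights are commonly cited defaults, not values reported for this run:

```python
def sdm_query(terms, w=(0.85, 0.10, 0.05)):
    """Sequential-dependence-model query in Indri syntax: unigrams,
    exact adjacent bigrams (#1), and unordered 8-term windows (#uw8)."""
    uni = "#combine(%s)" % " ".join(terms)
    if len(terms) < 2:
        return uni
    pairs = list(zip(terms, terms[1:]))
    od = "#combine(%s)" % " ".join("#1(%s %s)" % p for p in pairs)
    uw = "#combine(%s)" % " ".join("#uw8(%s %s)" % p for p in pairs)
    return "#weight(%g %s %g %s %g %s)" % (w[0], uni, w[1], od, w[2], uw)

print(sdm_query(["million", "query", "track"]))
```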
indriLowMu08
- Run ID: indriLowMu08
- Participant: uMass
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: 1351a6695ed9f04d3e4e4ac6823f0516
- Run description: Indri2.5, low mu, no stopping
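In Indri's query-likelihood model, mu is the Dirichlet smoothing parameter; a low mu shifts weight from the collection language model toward raw document term counts. A minimal sketch of the smoothed per-term estimate (Indri's default mu is 2500; the run's actual value is not given):

```python
import math

def dirichlet_logprob(tf, doc_len, coll_prob, mu=2500):
    """Dirichlet-smoothed query likelihood for one term:
    p(w|d) = (tf + mu * p(w|C)) / (|d| + mu)."""
    return math.log((tf + mu * coll_prob) / (doc_len + mu))

# Same term statistics under the default mu and a low mu
for mu in (2500, 300):
    print(mu, dirichlet_logprob(tf=5, doc_len=800, coll_prob=1e-5, mu=mu))
```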
indriQLST08
- Run ID: indriQLST08
- Participant: uMass
- Track: Million Query
- Year: 2008
- Submission: 6/13/2008
- Type: automatic
- MD5: 001f3a4bb21863aab271365770745409
- Run description: Indri2.6, stopped, basic queries
lsi150dyn
- Run ID: lsi150dyn
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/17/2008
- Type: automatic
- MD5: 0e8446f1e8c4739259b98d8e303054dd
- Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. Each topic was projected into the term vector space. The host collections were ranked relative to each topic-vector using a 150-rank LSI-reduced host-by-term matrix. The topic was run against the top 50 ranked hosts with Lucene and the ranked results from each host were merged using a standard result set merge algorithm.
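A minimal sketch of the rank-150 LSI host-ranking step, assuming a hosts-by-terms matrix and scikit-learn's TruncatedSVD; the run's term weighting and implementation are not specified:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

def rank_hosts_lsi(host_term, topic_vec, k=150, top_n=50):
    """Rank host 'big documents' against a topic in a k-dim LSI space."""
    svd = TruncatedSVD(n_components=k)
    hosts_lsi = normalize(svd.fit_transform(host_term))   # hosts x k
    topic_lsi = normalize(svd.transform(topic_vec.reshape(1, -1)))
    scores = hosts_lsi @ topic_lsi.ravel()                # cosine in LSI space
    return np.argsort(-scores)[:top_n]

# Toy example: 200 hosts, 1000 terms (the run used ~400,000 terms, k=150)
rng = np.random.default_rng(0)
A = rng.poisson(0.05, size=(200, 1000)).astype(float)
q = rng.poisson(0.05, size=1000).astype(float)
print(rank_hosts_lsi(A, q, k=20, top_n=5))
```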
lsi150stat
- Run ID: lsi150stat
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/13/2008
- Type: automatic
- MD5: 4a563b1e4cabf963fc3c37122cbb994d
- Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. The 10,000 topics were represented as a "big topic" in the term space. The host collections were ranked relative to this static "big topic" using a 150-rank LSI-reduced host-by-term matrix; the 10,000 topics were then run against the top 50 most relevant hosts and the results merged into a single ranked list.
LucDeflt
- Run ID: LucDeflt
- Participant: ibm-haifa
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: 50b3fe2d5bbb82f30bd64c42e020d762
- Run description: Lucene run with default settings: (1) default similarity, i.e., default doc-length normalization = 1/sqrt(num tokens) and default tf = sqrt(term freq); (2) a single field for all searchable text; (3) no special query parts (no phrases and no proximity scoring); (4) default OR operator.
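A sketch of the per-term score under Lucene's classic default similarity, using the normalizations named above; the idf form (1 + ln(N/(df+1))) and the squared idf follow Lucene's classic practical scoring function rather than anything stated in the run description:

```python
import math

def lucene_default_term_score(term_freq, doc_num_tokens, doc_freq, num_docs):
    """Classic Lucene term score: tf = sqrt(freq),
    lengthNorm = 1/sqrt(num tokens), idf = 1 + ln(N / (df + 1))."""
    tf = math.sqrt(term_freq)
    norm = 1.0 / math.sqrt(doc_num_tokens)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    # idf appears on both the query and document side, hence squared
    return tf * idf * idf * norm

print(lucene_default_term_score(term_freq=3, doc_num_tokens=400,
                                doc_freq=1200, num_docs=25_000_000))
```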
LucLpTfS
- Run ID: LucLpTfS
- Participant: ibm-haifa
- Track: Million Query
- Year: 2008
- Submission: 6/15/2008
- Type: automatic
- MD5: 28ceacdc343b97c88c5efbe6fea92f6a
- Run description: Lucene run with proximity and phrase scoring, doc length normalization by Lucene's sweet-spot similarity, and tf normalization by average term frequency.
mpiimq0801
- Run ID: mpiimq0801
- Participant: mpi-d5
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: 332fb27c57069b84bda953d18b999947
- Run description: standard BM25, no stemming, standard stopword removal
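For reference, a minimal sketch of the standard BM25 per-term score; k1 = 1.2 and b = 0.75 are the usual textbook defaults, not parameters reported for this run:

```python
import math

def bm25_term(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Standard BM25 contribution of one term to one document."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

print(bm25_term(tf=4, doc_len=900, avg_doc_len=1100,
                df=5000, num_docs=25_000_000))
```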
neumsfilt
- Run ID: neumsfilt
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: c1443723f7d45bc5a1aefdc76ac65f2d
- Run description: Baseline + commercial search engine (MS Live). Phase 1: run each query through MS Search and download the top 10 documents and snippets (documents were always downloaded as HTML; for PDFs the cached version was downloaded; in some cases documents were skipped because they were unavailable). Phase 2: for each query, let feedback_text = query + snippets + text of all top 10 documents from MS Search; for each word, compute score = feedback_prob * ( log(feedback_prob) - k*log(collection_prob) ), where feedback_prob = probability of the word in the feedback text (from the search engine). Phase 3: run Okapi for the query terms; for each document that Okapi considers, if it contains more than some threshold of words identified as interesting from the MS Live feedback set, push that document up by giving it an additional score of +1000. Uses the Lucene toolkit.
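A sketch of the Phase 2 term score exactly as written above; the constant k and the floor probability for words unseen in the collection are left free, since the description does not fix them:

```python
import math
from collections import Counter

def feedback_term_scores(feedback_text, collection_prob, k=1.0):
    """score(w) = p_fb(w) * (log p_fb(w) - k * log p_coll(w)),
    where p_fb is the word's probability in the feedback text."""
    words = feedback_text.lower().split()
    counts = Counter(words)
    total = len(words)
    scores = {}
    for w, c in counts.items():
        p_fb = c / total
        p_coll = collection_prob.get(w, 1e-9)  # floor for unseen words
        scores[w] = p_fb * (math.log(p_fb) - k * math.log(p_coll))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy usage with a hypothetical collection language model
coll = {"the": 0.05, "casino": 1e-5, "resort": 2e-5}
print(feedback_term_scores("the casino resort the the casino", coll)[:3])
```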
neuMSRF
- Run ID: neuMSRF
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: 66389c972f777502c348275180e8c1c0
- Run description: Uses MS Live on the WWW to generate a query "context", like pseudo-relevance feedback (50 terms selected). A tf-idf formula is then applied to the feedback terms only (the original query terms are not used) against the GOV2 index. Uses the Lucene toolkit.
neustbl
- Run ID: neustbl
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: 5a1f6290225c8718c4ef38ecf1e0bdb5
- Run description: BM25 on the Lucene toolkit, with modifications. Query preprocessing: remove stop words; stem; remove numbers unless the query becomes empty; remove one-letter words; remove words with a document frequency of zero. Candidate selection: intersection_list = the documents that contain all query words; if size_of(intersection_list) < threshold, sort the query words by document frequency (ascending) and keep taking words from the sorted list until the sum of their document frequencies exceeds the threshold; union_list = the documents that contain at least one word from that list; doc_list = intersection_list + union_list. Rank the documents in doc_list with Okapi BM25 and output the top 1000. Uses the Lucene toolkit.
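A sketch of the candidate-selection logic, assuming simple in-memory postings (a set of document IDs per term); the threshold value is illustrative:

```python
def candidate_docs(query_terms, postings, threshold=10000):
    """Intersection list first; if too small, widen with a union over the
    rarest terms until their summed document frequencies pass the threshold."""
    sets = [postings[t] for t in query_terms if t in postings]
    if not sets:
        return set()
    inter = set.intersection(*sets)
    if len(inter) >= threshold:
        return inter
    # Rarest terms first: accumulate until summed doc frequency > threshold
    union, df_sum = set(), 0
    for s in sorted(sets, key=len):
        union |= s
        df_sum += len(s)
        if df_sum > threshold:
            break
    return inter | union

# Toy postings; the real run then ranks doc_list with BM25, keeping the top 1000
postings = {"okapi": {1, 2, 3}, "bm25": {2, 3, 4, 5}, "lucene": {3, 9}}
print(sorted(candidate_docs(["okapi", "bm25", "lucene"], postings, threshold=4)))
```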
sabmq08a1
- Run ID: sabmq08a1
- Participant: sabir.buckley
- Track: Million Query
- Year: 2008
- Submission: 6/17/2008
- Type: automatic
- MD5: e58a66919f2e7151a15d310367e64edd
- Run description: very basic SMART lnu-ltu run
sabmq08b1
- Run ID: sabmq08b1
- Participant: sabir.buckley
- Track: Million Query
- Year: 2008
- Submission: 6/19/2008
- Type: automatic
- MD5: eee52d82e1c8ed340d9cb66fcf177ccd
- Run description: SMART blind feedback, top 25 docs, add 20 terms.
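SMART's blind feedback is Rocchio-style expansion; a minimal sketch under that assumption, with vectors as dicts of term weights and illustrative alpha/beta weights:

```python
from collections import Counter, defaultdict

def blind_feedback(query_vec, top_doc_vecs, n_terms=20, alpha=1.0, beta=0.5):
    """Rocchio-style expansion: add the n_terms highest-weighted terms
    from the centroid of the top-ranked documents to the query."""
    centroid = defaultdict(float)
    for d in top_doc_vecs:
        for t, w in d.items():
            centroid[t] += w / len(top_doc_vecs)
    new_terms = sorted(centroid, key=centroid.get, reverse=True)[:n_terms]
    expanded = Counter({t: alpha * w for t, w in query_vec.items()})
    for t in new_terms:
        expanded[t] += beta * centroid[t]
    return dict(expanded)

# The run used the top 25 documents and added 20 terms
q = {"jaguar": 1.0}
docs = [{"jaguar": 0.5, "car": 0.4}, {"jaguar": 0.6, "cat": 0.3, "car": 0.2}]
print(blind_feedback(q, docs, n_terms=2))
```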
txrun
- Run ID: txrun
- Participant: I3S_Group_of_ICT
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: be60dacd3dcf5556056ce7b9e3cfe970
- Run description: We use Wikipedia as a resource to identify entities in a query, then expand each query with ten terms (if there are any) based on Wikipedia. The term selection procedure makes use of the semantic similarity measure proposed by Resnik (1996). Terms are ranked in descending order by the similarity between each term and the whole query. The Indri search engine is used for indexing and retrieval.
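Resnik's measure scores two concepts by the information content of their most informative common subsumer in a taxonomy such as WordNet. A sketch using NLTK's implementation; mapping words to noun synsets and taking the max over sense pairs is a simplification, not the run's actual procedure:

```python
# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information-content counts

def resnik_sim(word1, word2):
    """Max Resnik similarity over noun senses of the two words."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            best = max(best, s1.res_similarity(s2, brown_ic))
    return best

print(resnik_sim("casino", "resort"))
```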
vsmdyn
- Run ID: vsmdyn
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/23/2008
- Type: automatic
- MD5: 08b0d4871ea75d6cdd1fde575882be58
- Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. Each topic was projected into the term vector space. The host collections were ranked relative to each topic-vector using vector space model cosines with the host-by-term matrix. The topic was run against the top 50 ranked hosts with Lucene and the ranked results from each host were merged using a standard result set merge algorithm.
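A minimal sketch of the plain-cosine variant of the host ranking (contrast with the LSI runs above: no dimensionality reduction, hosts are compared to the topic directly in term space):

```python
import numpy as np

def rank_hosts_cosine(host_term, topic_vec, top_n=50):
    """Rank host 'big documents' by cosine similarity to the topic vector."""
    H = host_term / (np.linalg.norm(host_term, axis=1, keepdims=True) + 1e-12)
    q = topic_vec / (np.linalg.norm(topic_vec) + 1e-12)
    return np.argsort(-(H @ q))[:top_n]

rng = np.random.default_rng(1)
A = rng.poisson(0.05, size=(200, 1000)).astype(float)  # toy hosts x terms
q = rng.poisson(0.05, size=1000).astype(float)         # toy topic vector
print(rank_hosts_cosine(A, q, top_n=5))
```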
vsmstat
- Run ID: vsmstat
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/13/2008
- Type: automatic
- MD5: 212b6b5dda93bbd140a29268f36462c8
- Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. The 10,000 topics were represented as a "big topic" in the term space. The host collections were ranked relative to this static "big topic" using standard vector space model (VSM) cosines with the host-by-term matrix; the 10,000 topics were run against the top 50 most relevant hosts and the results merged into a single ranked list.
vsmstat07
- Run ID: vsmstat07
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: 90dee6ba36e6db583e49daecc93d00a2
- Run description: This run tests the sensitivity of the static "big topic" vector space ranking to the restriction of the host collections searched. In short, the hosts were chosen by ranking them with respect to a "big topic" built from the 10,000 TREC 2007 MQ topics using the vector space model, as in vsmstat. The TREC 2008 MQ topics were then run against this collection of hosts (which was tuned for the 2007 topics).