Runs - Million Query 2008
000cos
- Run ID: 000cos
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: eaee6a666589c1c54311820d58c1f2da
- Run description: Lemur cos (baseline run)
000klabs
- Run ID: 000klabs
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: 78b30daa9d12709ccb5267a48bc70f30
- Run description: Lemur KL "abs" (baseline run)
000okapi
- Run ID: 000okapi
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: 2137d5596cd3a94704a39c0130b251f5
- Run description: Lemur Okapi
000tfidfBM25
- Run ID: 000tfidfBM25
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: b310c9eb1f154b10327b01acbd9d6967
- Run description: Lemur tfidf "BM25" (baseline run)
000tfidfLOG
- Run ID: 000tfidfLOG
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: da4ab663de9c918fdff41c76caf0628a
- Run description: Lemur tfidf "LOG" (baseline run)
dxrun
- Run ID: dxrun
- Participant: I3S_Group_of_ICT
- Track: Million Query
- Year: 2008
- Submission: 6/12/2008
- Type: automatic
- MD5: 14d6b67dbef68e63bd27d7091b41fb9f
- Run description: We use Wikipedia as a resource to identify entities in a query, then add term dependency features for each query. The term dependency features are actually ordered phrases. The Indri search engine is used for indexing and retrieval.
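The ordered-phrase features map naturally onto Indri's #1() operator. A minimal sketch of how such a query might be assembled; the 0.8/0.2 mixture weights and the entity list are hypothetical, since the run description gives neither:

```python
def build_indri_query(terms, entities):
    """Combine bag-of-words evidence with ordered-phrase (#1) features
    for multi-word entities spotted in the query (Indri query language)."""
    bag = "#combine(" + " ".join(terms) + ")"
    phrases = ["#1(" + e + ")" for e in entities if " " in e]
    if not phrases:
        return bag
    dep = "#combine(" + " ".join(phrases) + ")"
    # Hypothetical 0.8/0.2 mixture; the run description gives no weights.
    return "#weight(0.8 " + bag + " 0.2 " + dep + ")"

print(build_indri_query(["french", "lick", "resort", "casino"],
                        ["french lick resort"]))
```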
hedge0
- Run ID: hedge0
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/23/2008
- Type: automatic
- MD5: 6bcfb117d3f1d022dfef25fe955a7ea1
- Run description: metasearch over Lemur retrieval runs
ind25QLnST08
- Run ID: ind25QLnST08
- Participant: uMass
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: d4cb7ac37c12ec6cccc89ba548fdc226
- Run description: Indri2.5, no stopping
indri25DM08
- Run ID: indri25DM08
- Participant: uMass
- Track: Million Query
- Year: 2008
- Submission: 6/19/2008
- Type: automatic
- MD5: f969758aad6b02356a6066b06a936857
- Run description: dependency model, indri25
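"Dependency model" presumably refers to a Metzler-and-Croft-style sequential dependence model, which adds exact-bigram (#1) and unordered-window (#uw8) features over adjacent query terms. A hedged sketch of that query template; the 0.85/0.10/0.05 weights are commonly cited defaults, not values reported for this run:

```python
def sdm_query(terms, w=(0.85, 0.10, 0.05)):
    """Sequential-dependence-model query in Indri syntax: unigrams,
    exact adjacent bigrams (#1), and unordered 8-term windows (#uw8)."""
    uni = "#combine(%s)" % " ".join(terms)
    if len(terms) < 2:
        return uni
    pairs = list(zip(terms, terms[1:]))
    od = "#combine(%s)" % " ".join("#1(%s %s)" % p for p in pairs)
    uw = "#combine(%s)" % " ".join("#uw8(%s %s)" % p for p in pairs)
    return "#weight(%g %s %g %s %g %s)" % (w[0], uni, w[1], od, w[2], uw)

print(sdm_query(["million", "query", "track"]))
```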
indriLowMu08
- Run ID: indriLowMu08
- Participant: uMass
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: 1351a6695ed9f04d3e4e4ac6823f0516
- Run description: Indri2.5, low mu, no stopping
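In Indri's query-likelihood model, mu is the Dirichlet smoothing parameter; a low mu shifts weight from the collection language model toward raw document term counts. A minimal sketch of the smoothed per-term estimate (Indri's default mu is 2500; the run's actual value is not given):

```python
import math

def dirichlet_logprob(tf, doc_len, coll_prob, mu=2500):
    """Dirichlet-smoothed query likelihood for one term:
    p(w|d) = (tf + mu * p(w|C)) / (|d| + mu)."""
    return math.log((tf + mu * coll_prob) / (doc_len + mu))

# Same term statistics under the default mu and a low mu
for mu in (2500, 300):
    print(mu, dirichlet_logprob(tf=5, doc_len=800, coll_prob=1e-5, mu=mu))
```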
indriQLST08
- Run ID: indriQLST08
- Participant: uMass
- Track: Million Query
- Year: 2008
- Submission: 6/13/2008
- Type: automatic
- MD5: 001f3a4bb21863aab271365770745409
- Run description: Indri2.6, stopped, basic queries
lsi150dyn
- Run ID: lsi150dyn
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/17/2008
- Type: automatic
- MD5: 0e8446f1e8c4739259b98d8e303054dd
- Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. Each topic was projected into the term vector space. The host collections were ranked relative to each topic-vector using a 150-rank LSI-reduced host-by-term matrix. The topic was run against the top 50 ranked hosts with Lucene and the ranked results from each host were merged using a standard result set merge algorithm.
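A minimal sketch of the rank-150 LSI host-ranking step, assuming a hosts-by-terms matrix and scikit-learn's TruncatedSVD; the run's term weighting and implementation are not specified:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

def rank_hosts_lsi(host_term, topic_vec, k=150, top_n=50):
    """Rank host 'big documents' against a topic in a k-dim LSI space."""
    svd = TruncatedSVD(n_components=k)
    hosts_lsi = normalize(svd.fit_transform(host_term))   # hosts x k
    topic_lsi = normalize(svd.transform(topic_vec.reshape(1, -1)))
    scores = hosts_lsi @ topic_lsi.ravel()                # cosine in LSI space
    return np.argsort(-scores)[:top_n]

# Toy example: 200 hosts, 1000 terms (the run used ~400,000 terms, k=150)
rng = np.random.default_rng(0)
A = rng.poisson(0.05, size=(200, 1000)).astype(float)
q = rng.poisson(0.05, size=1000).astype(float)
print(rank_hosts_lsi(A, q, k=20, top_n=5))
```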
lsi150stat
- Run ID: lsi150stat
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/13/2008
- Type: automatic
- MD5: 4a563b1e4cabf963fc3c37122cbb994d
- Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. The 10,000 topics were represented as a "big topic" in the term space. The host collections were ranked relative to this static "big topic" using a 150-rank LSI-reduced host-by-term matrix; the 10,000 topics were then run against the top 50 most relevant hosts and the results merged into a single ranked list.
LucDeflt
- Run ID: LucDeflt
- Participant: ibm-haifa
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: 50b3fe2d5bbb82f30bd64c42e020d762
- Run description: Lucene run with default settings: (1) default similarity, i.e., default doc-length normalization = 1/sqrt(num tokens) and default tf = sqrt(term freq); (2) a single field for all searchable text; (3) no special query parts (no phrases and no proximity scoring); (4) default OR operator.
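A sketch of the per-term score under Lucene's classic default similarity, using the normalizations named above; the idf form (1 + ln(N/(df+1))) and the squared idf follow Lucene's classic practical scoring function rather than anything stated in the run description:

```python
import math

def lucene_default_term_score(term_freq, doc_num_tokens, doc_freq, num_docs):
    """Classic Lucene term score: tf = sqrt(freq),
    lengthNorm = 1/sqrt(num tokens), idf = 1 + ln(N / (df + 1))."""
    tf = math.sqrt(term_freq)
    norm = 1.0 / math.sqrt(doc_num_tokens)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    # idf appears on both the query and document side, hence squared
    return tf * idf * idf * norm

print(lucene_default_term_score(term_freq=3, doc_num_tokens=400,
                                doc_freq=1200, num_docs=25_000_000))
```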
LucLpTfS
- Run ID: LucLpTfS
- Participant: ibm-haifa
- Track: Million Query
- Year: 2008
- Submission: 6/15/2008
- Type: automatic
- MD5: 28ceacdc343b97c88c5efbe6fea92f6a
- Run description: Lucene run with proximity and phrase scoring, doc length normalization by Lucene's sweet-spot similarity, and tf normalization by average term frequency.
mpiimq0801
- Run ID: mpiimq0801
- Participant: mpi-d5
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: 332fb27c57069b84bda953d18b999947
- Run description: standard BM25, no stemming, standard stopword removal
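For reference, a minimal sketch of the standard BM25 per-term score; k1 = 1.2 and b = 0.75 are the usual textbook defaults, not parameters reported for this run:

```python
import math

def bm25_term(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Standard BM25 contribution of one term to one document."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

print(bm25_term(tf=4, doc_len=900, avg_doc_len=1100,
                df=5000, num_docs=25_000_000))
```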
neumsfilt
- Run ID: neumsfilt
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: c1443723f7d45bc5a1aefdc76ac65f2d
- Run description: Baseline + commercial search engine (MS Live). Phase 1: run each query through MS Search and download the top 10 documents and snippets (documents were always downloaded as HTML; for PDFs the cached version was downloaded; in some cases documents were skipped because they were unavailable). Phase 2: for each query, let feedback_text = query + snippets + text of all top 10 documents from MS Search; for each word, compute score = feedback_prob * ( log(feedback_prob) - k*log(collection_prob) ), where feedback_prob = probability of the word in the feedback text (from the search engine). Phase 3: run Okapi for the query terms; for each document that Okapi considers, if it contains more than some threshold of words identified as interesting from the MS Live feedback set, push that document up by giving it an additional score of +1000. Uses the Lucene toolkit.
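A sketch of the Phase 2 term score exactly as written above; the constant k and the floor probability for words unseen in the collection are left free, since the description does not fix them:

```python
import math
from collections import Counter

def feedback_term_scores(feedback_text, collection_prob, k=1.0):
    """score(w) = p_fb(w) * (log p_fb(w) - k * log p_coll(w)),
    where p_fb is the word's probability in the feedback text."""
    words = feedback_text.lower().split()
    counts = Counter(words)
    total = len(words)
    scores = {}
    for w, c in counts.items():
        p_fb = c / total
        p_coll = collection_prob.get(w, 1e-9)  # floor for unseen words
        scores[w] = p_fb * (math.log(p_fb) - k * math.log(p_coll))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy usage with a hypothetical collection language model
coll = {"the": 0.05, "casino": 1e-5, "resort": 2e-5}
print(feedback_term_scores("the casino resort the the casino", coll)[:3])
```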
neuMSRF
- Run ID: neuMSRF
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: 66389c972f777502c348275180e8c1c0
- Run description: Uses MS Live on the WWW to generate a query "context", like pseudo-relevance feedback (50 terms selected). A tf-idf formula is then applied to the feedback terms only (the original query terms are not used) against the GOV2 index. Uses the Lucene toolkit.
neustbl
- Run ID: neustbl
- Participant: neu
- Track: Million Query
- Year: 2008
- Submission: 6/18/2008
- Type: automatic
- MD5: 5a1f6290225c8718c4ef38ecf1e0bdb5
- Run description: BM25 on the Lucene toolkit, with modifications. Query preprocessing: remove stop words; stem; remove numbers unless the query becomes empty; remove one-letter words; remove words with a document frequency of zero. Candidate selection: intersection_list = the documents that contain all query words; if size_of(intersection_list) < threshold, sort the query words by document frequency (ascending) and keep taking words from the sorted list until the sum of their document frequencies exceeds the threshold; union_list = the documents that contain at least one word from that list; doc_list = intersection_list + union_list. Rank the documents in doc_list with Okapi BM25 and output the top 1000. Uses the Lucene toolkit.
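A sketch of the candidate-selection logic, assuming simple in-memory postings (a set of document IDs per term); the threshold value is illustrative:

```python
def candidate_docs(query_terms, postings, threshold=10000):
    """Intersection list first; if too small, widen with a union over the
    rarest terms until their summed document frequencies pass the threshold."""
    sets = [postings[t] for t in query_terms if t in postings]
    if not sets:
        return set()
    inter = set.intersection(*sets)
    if len(inter) >= threshold:
        return inter
    # Rarest terms first: accumulate until summed doc frequency > threshold
    union, df_sum = set(), 0
    for s in sorted(sets, key=len):
        union |= s
        df_sum += len(s)
        if df_sum > threshold:
            break
    return inter | union

# Toy postings; the real run then ranks doc_list with BM25, keeping the top 1000
postings = {"okapi": {1, 2, 3}, "bm25": {2, 3, 4, 5}, "lucene": {3, 9}}
print(sorted(candidate_docs(["okapi", "bm25", "lucene"], postings, threshold=4)))
```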
sabmq08a1
- Run ID: sabmq08a1
- Participant: sabir.buckley
- Track: Million Query
- Year: 2008
- Submission: 6/17/2008
- Type: automatic
- MD5: e58a66919f2e7151a15d310367e64edd
- Run description: very basic SMART lnu-ltu run
sabmq08b1
- Run ID: sabmq08b1
- Participant: sabir.buckley
- Track: Million Query
- Year: 2008
- Submission: 6/19/2008
- Type: automatic
- MD5: eee52d82e1c8ed340d9cb66fcf177ccd
- Run description: SMART blind feedback, top 25 docs, add 20 terms.
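SMART's blind feedback is Rocchio-style expansion; a minimal sketch under that assumption, with vectors as dicts of term weights and illustrative alpha/beta weights:

```python
from collections import Counter, defaultdict

def blind_feedback(query_vec, top_doc_vecs, n_terms=20, alpha=1.0, beta=0.5):
    """Rocchio-style expansion: add the n_terms highest-weighted terms
    from the centroid of the top-ranked documents to the query."""
    centroid = defaultdict(float)
    for d in top_doc_vecs:
        for t, w in d.items():
            centroid[t] += w / len(top_doc_vecs)
    new_terms = sorted(centroid, key=centroid.get, reverse=True)[:n_terms]
    expanded = Counter({t: alpha * w for t, w in query_vec.items()})
    for t in new_terms:
        expanded[t] += beta * centroid[t]
    return dict(expanded)

# The run used the top 25 documents and added 20 terms
q = {"jaguar": 1.0}
docs = [{"jaguar": 0.5, "car": 0.4}, {"jaguar": 0.6, "cat": 0.3, "car": 0.2}]
print(blind_feedback(q, docs, n_terms=2))
```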
txrun
- Run ID: txrun
- Participant: I3S_Group_of_ICT
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: be60dacd3dcf5556056ce7b9e3cfe970
- Run description: We use Wikipedia as a resource to identify entities in a query, then expand each query with ten terms (if there are any) based on Wikipedia. The term selection procedure makes use of the semantic similarity measure proposed by Resnik (1996). Terms are ranked in descending order by the similarity between each term and the whole query. The Indri search engine is used for indexing and retrieval.
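Resnik's measure scores two concepts by the information content of their most informative common subsumer in a taxonomy such as WordNet. A sketch using NLTK's implementation; mapping words to noun synsets and taking the max over sense pairs is a simplification, not the run's actual procedure:

```python
# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information-content counts

def resnik_sim(word1, word2):
    """Max Resnik similarity over noun senses of the two words."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            best = max(best, s1.res_similarity(s2, brown_ic))
    return best

print(resnik_sim("casino", "resort"))
```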
vsmdyn
- Run ID: vsmdyn
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/23/2008
- Type: automatic
- MD5: 08b0d4871ea75d6cdd1fde575882be58
- Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. Each topic was projected into the term vector space. The host collections were ranked relative to each topic-vector using vector space model cosines with the host-by-term matrix. The topic was run against the top 50 ranked hosts with Lucene and the ranked results from each host were merged using a standard result set merge algorithm.
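A minimal sketch of the plain-cosine variant of the host ranking (contrast with the LSI runs above: no dimensionality reduction, hosts are compared to the topic directly in term space):

```python
import numpy as np

def rank_hosts_cosine(host_term, topic_vec, top_n=50):
    """Rank host 'big documents' by cosine similarity to the topic vector."""
    H = host_term / (np.linalg.norm(host_term, axis=1, keepdims=True) + 1e-12)
    q = topic_vec / (np.linalg.norm(topic_vec) + 1e-12)
    return np.argsort(-(H @ q))[:top_n]

rng = np.random.default_rng(1)
A = rng.poisson(0.05, size=(200, 1000)).astype(float)  # toy hosts x terms
q = rng.poisson(0.05, size=1000).astype(float)         # toy topic vector
print(rank_hosts_cosine(A, q, top_n=5))
```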
vsmstat
- Run ID: vsmstat
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/13/2008
- Type: automatic
- MD5: 212b6b5dda93bbd140a29268f36462c8
- Run description: The documents from each unique host name were grouped and indexed as separate collections. Each host collection was represented as a "big document" in a vector space of approximately 400,000 terms. The 10,000 topics were represented as a "big topic" in the term space. The host collections were ranked relative to this static "big topic" using standard vector space model (VSM) cosines with the host-by-term matrix; the 10,000 topics were run against the top 50 most relevant hosts and the results merged into a single ranked list.
vsmstat07
- Run ID: vsmstat07
- Participant: ARSC08
- Track: Million Query
- Year: 2008
- Submission: 6/16/2008
- Type: automatic
- MD5: 90dee6ba36e6db583e49daecc93d00a2
- Run description: This run tests the sensitivity of the static "big topic" vector space ranking to the restriction of the host collections searched. In short, the hosts were chosen by ranking them with respect to a "big topic" built from the 10,000 TREC 2007 MQ topics using the vector space model, as in vsmstat. The TREC 2008 MQ topics were then run against this collection of hosts (which was tuned for the 2007 topics).