Runs - Legal 2009

ADI2009Topic204

Results | Participants | Input | Appendix

  • Run ID: ADI2009Topic204
  • Participant: ADI2009
  • Track: Legal
  • Year: 2009
  • Submission: 9/16/2009
  • Task: interactive
  • MD5: 73ce9033b09a9a402168416717161f11
  • Run description: Topic 204. We used the ORA e-discovery tool with FAST search: key-term and Boolean searches to pull results, and document sampling to test and refine. The results file contains 80,114 document IDs.
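
A minimal sketch of the kind of key-term and Boolean filtering described above, over a toy in-memory collection (the terms, documents, and helper are illustrative; the actual run used the ORA tool with FAST search):

    # Hypothetical Boolean key-term search over a toy in-memory collection.
    def matches(text, all_of=(), any_of=(), none_of=()):
        tokens = set(text.lower().split())
        return (all(t in tokens for t in all_of)
                and (not any_of or any(t in tokens for t in any_of))
                and not any(t in tokens for t in none_of))

    docs = {"doc1": "draft agreement for review", "doc2": "cafeteria menu"}
    hits = [d for d, text in docs.items()
            if matches(text, all_of=("agreement",), none_of=("menu",))]
    print(hits)  # ['doc1']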

buffalo

Results | Participants | Proceedings | Input | Appendix

  • Run ID: buffalo
  • Participant: SUNY_Buffalo
  • Track: Legal
  • Year: 2009
  • Submission: 9/16/2009
  • Task: interactive
  • MD5: abbd9de463be40926670c27439bf5cb5
  • Run description: We combined the results of about 15 very specific queries with the results of one generic query.

CGSHBCK

Results | Participants | Proceedings | Input | Appendix

  • Run ID: CGSHBCK
  • Participant: Cleary_Backstop
  • Track: Legal
  • Year: 2009
  • Submission: 9/16/2009
  • Task: interactive
  • MD5: 3b533d93ce88aaf22cbce0c16be9f5a9
  • Run description: Cleary - Backstop team results for topics 201, 204, 206, and 207. Topic 206 was designed to identify a small pool including key documents, using a very high threshold of responsiveness and minimal manpower. Topics 201, 204, and 207 are the same as for CGSHBCK1 and CGSHBCK2.

CGSHBCK1

Results | Participants | Proceedings | Input | Appendix

  • Run ID: CGSHBCK1
  • Participant: Cleary_Backstop
  • Track: Legal
  • Year: 2009
  • Submission: 9/17/2009
  • Task: interactive
  • MD5: d3d24bc63300b7cf70eec453778dae18
  • Run description: Cleary - Backstop team results for topics 201, 204, 206, and 207. Topic 206 was designed to identify a small pool including key documents, using a high threshold of responsiveness (though not as high as in CGSHBCK) and minimal manpower. Topics 201, 204, and 207 are the same as for CGSHBCK and CGSHBCK2.

CGSHBCK2

Results | Participants | Proceedings | Input | Appendix

  • Run ID: CGSHBCK2
  • Participant: Cleary_Backstop
  • Track: Legal
  • Year: 2009
  • Submission: 9/17/2009
  • Task: interactive
  • MD5: cae4b25bd68b654eb781d05ebe788088
  • Run description: Cleary - Backstop team results for topics 201, 204, 206, and 207. Topic 206 was designed to reduce the pool of potentially responsive documents with minimal manpower. Topics 201, 204, and 207 are the same as for CGSHBCK and CGSHBCK1.

clearwell01

Results | Participants | Proceedings | Input | Appendix

  • Run ID: clearwell01
  • Participant: Clearwell09
  • Track: Legal
  • Year: 2009
  • Submission: 9/30/2009
  • Task: interactive
  • MD5: 8bfba4ccd7adf120e3392eb6926df8ff
  • Run description: The Clearwell E-Discovery Platform v4.5 was used for the topic 205 request for production.

Clearwell09i

Results | Participants | Proceedings | Input | Appendix

  • Run ID: Clearwell09i
  • Participant: Clearwell09
  • Track: Legal
  • Year: 2009
  • Submission: 9/18/2009
  • Task: interactive
  • MD5: 602e60a19b05aec0e05589dd27be8917
  • Run description: The Clearwell E-Discovery Platform v4.5 was used to execute the Legal Track interactive task for topics 201 and 202.

CompCustIT09

Results | Participants | Proceedings | Input | Appendix

  • Run ID: CompCustIT09
  • Participant: ZLTech
  • Track: Legal
  • Year: 2009
  • Submission: 9/17/2009
  • Task: interactive
  • MD5: 1fdb6c40d02aa983c5c759be7bb307c5
  • Run description: In this submission, the emails were reduplicated to approximately 3 million emails and distributed across 104 custodian mailboxes. The reduplicated emails combined the text extraction for the message body with the native files for the attachments to create IETF RFC-2822 MIME emails. The custodians were further associated with title and department information, and the team identified and prioritized the custodians most likely to hold relevant information based on those attributes. Once the custodians had been identified, all of their email was pulled and made available for review. A couple of other custodians were identified through an enterprise subject-based search that did not use title and department, to test whether the custodian-identification method could miss important custodians with substantial volumes of relevant email. A variety of search and analytics techniques were used in conjunction with guidance from the Topic Authority. In this submission, the top 4 custodians prioritized by title and department are included. The emails use an ID we term the "TREC Parent ID", which is the DocId after removing TREC-identified duplicates; this is the broadest ID used in this study. The DocId is termed the "TREC ID". Since the DocId is not unique in a fully reduplicated set, a further ID, the "JWID", was created as a unique identifier to assist in the analysis. Although the top 4 custodians are presented, more custodians were analyzed and additional information can be submitted. 4 out of 104 possible custodians is 3.8% of the user population, while the 104 custodians represent 0.48% of the approximately 22,000 Enron employees.
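
The reduplication step described above combines extracted body text with native attachment files into RFC-2822 MIME messages. A minimal sketch with Python's standard email package (all header values, IDs, and file contents are placeholders, not data from the run):

    # Hypothetical sketch: rebuild an RFC-2822 MIME email from an extracted
    # text body plus a native attachment (all values are placeholders).
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "custodian@example.com"
    msg["To"] = "recipient@example.com"
    msg["Subject"] = "Example reduplicated message"
    msg["X-TREC-Parent-ID"] = "DOC000123"        # the run's "TREC Parent ID" idea
    msg.set_content("Extracted message body text goes here.")

    attachment = b"%PDF-1.4 ..."                 # native file bytes (stub)
    msg.add_attachment(attachment, maintype="application",
                       subtype="pdf", filename="attachment.pdf")

    raw = msg.as_bytes()                         # RFC-2822/5322 wire format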

CompEntrIT09

Results | Participants | Proceedings | Input | Appendix

  • Run ID: CompEntrIT09
  • Participant: ZLTech
  • Track: Legal
  • Year: 2009
  • Submission: 9/17/2009
  • Task: interactive
  • MD5: 992033c26c216d8d0b2ed3c595ec50a4
  • Run description: In this submission, the emails were reduplicated to approximately 3 million emails and distributed across 104 custodian mailboxes. The reduplicated emails combined the text extraction for the message body with the native files for the attachments to create IETF RFC-2822 MIME emails. This created a scenario similar to an eDiscovery scenario before processing. The emails were ingested into the ZL eDiscovery review platform, where a variety of search and analytics capabilities were applied to the data, including full-text search, wildcard search, auto-classification, concept search, faceted search, etc. These techniques were used in conjunction with interactive guidance from the Topic Authority.

EmcRun1

Results | Participants | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: EmcRun1
  • Participant: EMC_CMA_RD
  • Track: Legal
  • Year: 2009
  • Submission: 8/4/2009
  • Type: manual
  • Task: batch
  • MD5: f61f6e6d341f78f79180bd432fda346b
  • Run description: For various reasons, the time available for this experiment was much shorter than desirable, so the experiment had to be kept as simple as possible. Noteworthy points about our run: relevance calculation was based on TF/IDF; a list of ~500 stop-words was filtered out during indexing; indexing and searching were case-insensitive; and the final Boolean queries were expanded manually after consulting the request and the complaint.
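
A minimal sketch of the pipeline outlined above: case-insensitive tokenization, stop-word filtering, and TF/IDF scoring (the stop-word list and documents are tiny stand-ins for the run's ~500-word list and real corpus):

    # Hypothetical sketch: case-insensitive TF/IDF with stop-word filtering.
    import math
    from collections import Counter

    STOPWORDS = {"the", "a", "of", "and", "to", "for"}  # stand-in list

    def tokenize(text):
        return [t for t in text.lower().split() if t not in STOPWORDS]

    docs = {"d1": "The merger agreement and the draft",
            "d2": "Lunch schedule for the week"}
    tfs = {d: Counter(tokenize(t)) for d, t in docs.items()}
    df = Counter(term for tf in tfs.values() for term in tf)
    N = len(docs)

    def score(doc_id, query):
        tf = tfs[doc_id]
        return sum(tf[t] * math.log(N / df[t])
                   for t in tokenize(query) if t in df)

    ranked = sorted(docs, key=lambda d: score(d, "merger agreement"),
                    reverse=True)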

Equivio205R1

Results | Participants | Proceedings | Input | Appendix

  • Run ID: Equivio205R1
  • Participant: Equivio
  • Track: Legal
  • Year: 2009
  • Submission: 9/30/2009
  • Task: interactive
  • MD5: e651fd38adc60ef5ccda35f01d3b6e1f
  • Run description: The Equivio run used Equivio>Relevance, an expert-guided system for assessing document relevance. The system feeds statistically selected samples of documents to an expert (an attorney familiar with the case), who marks each sample as relevant or not. The expert's decisions are used to train the software to estimate document relevance. Using a statistical model to determine when the training process has reached its optimum, the system then calculates graduated relevance scores for each document in the collection.
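
Equivio>Relevance is proprietary, so the loop below is only a generic sketch of the expert-in-the-loop pattern the description outlines: sample, collect expert labels, retrain, stop when a convergence test passes, then score the collection. The classifier hooks and the stopping rule are stand-ins, not Equivio's statistical model:

    # Generic expert-in-the-loop relevance training sketch (NOT Equivio's
    # actual algorithm); train/score/ask_expert are caller-supplied hooks.
    import random

    def train_with_expert(collection, ask_expert, train, score, batch=40):
        labeled, prev = {}, None
        while True:
            pool = [d for d in collection if d not in labeled]
            for d in random.sample(pool, min(batch, len(pool))):
                labeled[d] = ask_expert(d)        # expert marks relevant / not
            model = train(labeled)
            scores = {d: score(model, d) for d in collection}
            # Stand-in stopping rule: scores stop moving between rounds.
            if prev and max(abs(scores[d] - prev[d]) for d in collection) < 0.01:
                return scores                     # graduated relevance scores
            prev = scores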

Equivio207R1

Results | Participants | Proceedings | Input | Appendix

  • Run ID: Equivio207R1
  • Participant: Equivio
  • Track: Legal
  • Year: 2009
  • Submission: 9/16/2009
  • Task: interactive
  • MD5: 2d0b6bc45927e51c38efef1374fdf9e9
  • Run description: The Equivio run used Equivio>Relevance, an expert-guided system for assessing document relevance. The system feeds statistically selected samples of documents to an expert (an attorney familiar with the case), who marks each sample as relevant or not. The expert's decisions are used to train the software to estimate document relevance. Using a statistical model to determine when the training process has reached its optimum, the system then calculates graduated relevance scores for each document in the collection.

H52009

Results | Participants | Input | Appendix

  • Run ID: H52009
  • Participant: H5_2009
  • Track: Legal
  • Year: 2009
  • Submission: 9/16/2009
  • Task: interactive
  • MD5: 2b7f48bbd450819c1ab9b8cd3b44ac2e
  • Run description: H5 submitted an assessment of documents that our system identified as responsive to topic 204 of TREC's Legal Track interactive task; the system identified 2,994 such documents. The H5 system combined human expertise with advanced search and information-retrieval technologies to assess the totality of the corpus under investigation.

IntegreonB

Results | Participants | Input | Appendix

  • Run ID: IntegreonB
  • Participant: IDS_TREC
  • Track: Legal
  • Year: 2009
  • Submission: 9/29/2009
  • Task: interactive
  • MD5: aaa1393eb4fd2b21536021c1ad662c08
  • Run description: We considered the entire message unit responsive if any part of the unit was responsive, and we listed all items from all responsive message units, i.e., all emails and attachments from each "family". We are submitting for topic 205 only.
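
A minimal sketch of the family rule just described: if any member of a message unit is responsive, emit every member of that unit (the data structures are assumptions):

    # Hypothetical sketch: propagate responsiveness to whole message families.
    # family_of maps each item ID to its family (message unit) ID.
    def expand_to_families(responsive_items, family_of):
        hit_families = {family_of[i] for i in responsive_items}
        return sorted(i for i, f in family_of.items() if f in hit_families)

    family_of = {"msg1": "F1", "att1a": "F1", "msg2": "F2"}
    print(expand_to_families({"att1a"}, family_of))  # ['att1a', 'msg1']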

LogikIT09t

Results | Participants | Input | Appendix

  • Run ID: LogikIT09t
  • Participant: Logik
  • Track: Legal
  • Year: 2009
  • Submission: 9/16/2009
  • Task: interactive
  • MD5: 70afb120e099c816ebc3674477cff4bf
  • Run description: Documents were classified with a Naive Bayes classifier trained on a set of internally tagged documents.
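
A minimal sketch of training a Naive Bayes text classifier from tagged documents, here with scikit-learn (the training data are toy stand-ins; the run's feature set is not described):

    # Hypothetical sketch: Naive Bayes trained on internally tagged documents.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    tagged_docs = ["draft side agreement attached", "fantasy football picks"]
    tags = [1, 0]                      # 1 = responsive, 0 = not responsive

    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(tagged_docs), tags)

    new_docs = ["please review the attached agreement"]
    probs = clf.predict_proba(vec.transform(new_docs))[:, 1]  # P(responsive)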

otL09F

Results | Participants | Proceedings | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: otL09F
  • Participant: ot
  • Track: Legal
  • Year: 2009
  • Submission: 8/3/2009
  • Type: manual
  • Task: batch
  • MD5: b56113a2fbd451cf4307896b09245156
  • Run description: A pure relevance-feedback run based on forming a query from a random sample of the known relevant documents of size less than 10,000 bytes; no topic fields were used. The K value was set to the greater of the retrospective optimal K value for F1 and the estRelL09.append K value, plus 10 percent; the Kh values were taken directly from estRelL09.append.
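
The K rule above is simple arithmetic; a sketch (names and example values are mine):

    # K rule from the run description: the larger of the retrospective
    # optimal-F1 K and the estRelL09.append K, plus 10 percent.
    def choose_K(optimal_f1_k, est_rel_k):
        return round(max(optimal_f1_k, est_rel_k) * 1.10)

    print(choose_K(50_000, 62_000))  # 68200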

otL09frwF

Results | Participants | Proceedings | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: otL09frwF
  • Participant: ot
  • Track: Legal
  • Year: 2009
  • Submission: 8/4/2009
  • Type: manual
  • Task: batch
  • MD5: 1f887b974c9b6ea2233b4d102e883432
  • Run description: RRF-based fusion of feedback (otL09F, weight 3), ranked final Boolean (weight 3), request-text vector (otL09rvl, weight 2), and a vector of the final Boolean terms (weight 1). The K value was set to the greater of the retrospective optimal K value for F1 and the estRelL09.append K value, plus 10 percent; the Kh values were taken directly from estRelL09.append.
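
A minimal sketch of weighted reciprocal-rank fusion over the four component rankings (the smoothing constant 60 is the commonly used RRF value, an assumption here; the weights follow the description):

    # Hypothetical sketch: weighted reciprocal rank fusion (RRF).
    # rankings: list of (weight, ranked doc-ID list, best first).
    def weighted_rrf(rankings, k=60):          # k = 60 is an assumed constant
        fused = {}
        for weight, ranking in rankings:
            for rank, doc in enumerate(ranking, start=1):
                fused[doc] = fused.get(doc, 0.0) + weight / (k + rank)
        return sorted(fused, key=fused.get, reverse=True)

    runs = [(3, ["a", "b", "c"]),  # feedback run (otL09F), weight 3
            (3, ["b", "a"]),       # ranked final Boolean, weight 3
            (2, ["c", "a"]),       # request-text vector (otL09rvl), weight 2
            (1, ["a"])]            # vector of final Boolean terms, weight 1
    print(weighted_rrf(runs))      # fused ranking: ['a', 'b', 'c']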

otL09rvl

Results | Participants | Proceedings | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: otL09rvl
  • Participant: ot
  • Track: Legal
  • Year: 2009
  • Submission: 8/3/2009
  • Type: manual
  • Task: batch
  • MD5: 549525f606cf4b68e9d27c7b6fe91f67
  • Run description: Baseline run (no feedback); a vector run based on the request text terms. English inflections were matched, and common instruction words (e.g., "please", "produce", "documents") were manually removed. K includes documents with an rsv of 200 or more; Kh includes those with an rsv of 225 or more.

pittsis09

Results | Participants | Proceedings | Input | Appendix

  • Run ID: pittsis09
  • Participant: pitt_sis
  • Track: Legal
  • Year: 2009
  • Submission: 9/16/2009
  • Task: interactive
  • MD5: 79056aa82bed83758e2edbf4841d5184
  • Run description: We designed an experiment to investigate the information-seeking behavior of users conducting an e-discovery task, focusing on collaboration among searchers. We observed an expert with a legal background and an information-retrieval expert working collaboratively on topic 201, examining how they collaborated to complete the task and what the characteristics of their collaborative information behavior (CIB) were.

ucedlsi

Results | Participants | Proceedings | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: ucedlsi
  • Participant: URSINUS
  • Track: Legal
  • Year: 2009
  • Submission: 7/31/2009
  • Type: manual
  • Task: batch
  • MD5: 70b8f7c7c3f624e5cdec60d470a9c0ab
  • Run description: This run is distributed EDLSI. The indexed dataset is divided into 81 pieces, and Essential Dimensions of Latent Semantic Indexing (EDLSI), a weighted sum of the results from Latent Semantic Indexing (LSI) and vector-space information retrieval (IR), is applied to each. The scores are then compiled and sorted. The K values are chosen to represent approximately the number of documents at which the number of common documents between the three runs is maximized. K_h is chosen at the approximate point where the document scores drop off.
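
EDLSI scores each document by a weighted sum of its LSI score (from a low-rank SVD) and its plain vector-space score. A minimal sketch on a dense toy term-document matrix (the weight and rank are illustrative, not the run's settings):

    # Hypothetical sketch: Essential Dimensions of LSI (EDLSI) scoring.
    import numpy as np

    def edlsi_scores(A, q, k=2, w=0.2):
        """A: terms x docs matrix; q: query term vector;
        k: LSI rank; w: weight on the LSI component."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k LSI approximation
        return w * (q @ A_k) + (1 - w) * (q @ A)     # blend LSI and vector scores

    A = np.random.default_rng(0).random((5, 4))      # toy 5-term, 4-doc matrix
    q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])          # toy query vector
    print(edlsi_scores(A, q))                        # one score per document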

uclsi

Results | Participants | Proceedings | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: uclsi
  • Participant: URSINUS
  • Track: Legal
  • Year: 2009
  • Submission: 7/31/2009
  • Type: manual
  • Task: batch
  • MD5: d4403e59e9490171509bbbe75c044cfd
  • Run description: This run is LSI with folding-in. The indexed dataset is divided into 81 pieces, and Latent Semantic Indexing (LSI) is performed on the first piece. The remaining 80 pieces are folded in, and the scores are computed and sorted. The K values are chosen to represent approximately the number of documents at which the number of common documents between the three runs is maximized. K_h is chosen at the approximate point where the document scores drop off.
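
Folding-in projects new documents into an existing LSI space without recomputing the SVD, via d' = Sigma_k^{-1} U_k^T d. A minimal sketch (rank and matrices are toy values):

    # Hypothetical sketch: LSI folding-in of documents from later pieces.
    import numpy as np

    rng = np.random.default_rng(0)
    A1 = rng.random((6, 5))                 # first piece: 6 terms x 5 docs
    U, s, Vt = np.linalg.svd(A1, full_matrices=False)
    k = 2
    Uk, sk = U[:, :k], s[:k]

    def fold_in(d):
        """Project a new document vector into the existing k-dim LSI space."""
        return (Uk.T @ d) / sk              # Sigma_k^{-1} U_k^T d

    new_doc = rng.random(6)                 # a document from a later piece
    coords = fold_in(new_doc)               # comparable to rows of Vt[:k].T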

ucscra

Results | Participants | Proceedings | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: ucscra
  • Participant: URSINUS
  • Track: Legal
  • Year: 2009
  • Submission: 7/31/2009
  • Type: manual
  • Task: batch
  • MD5: 410c29f24931dee1e3d349ebd6e445e2
  • Run description: This run is SCRA-based distributed LSI. The indexed dataset is divided into 81 pieces, and each piece is further divided into 40 pieces. Latent Semantic Indexing (LSI) is applied to each piece, but using the Sparse Column-Row Approximation (SCRA) instead of the traditionally used Singular Value Decomposition (SVD). The scores are then compiled and sorted. The K values are chosen to represent approximately the number of documents at which the number of common documents between the three runs is maximized. K_h is chosen at the approximate point where the document scores drop off.

watlint

Results | Participants | Proceedings | Input | Appendix

  • Run ID: watlint
  • Participant: Waterloo
  • Track: Legal
  • Year: 2009
  • Submission: 9/15/2009
  • Task: interactive
  • MD5: bb4ffeeb0b23d2626a9db5a67372e563
  • Run description: Interactive search and judging, followed by active machine learning with a human reviewer in the loop. Recall was estimated by fitting a censored normal distribution to the machine-learning scores of responsive documents, factoring in an estimate of 90% inter-assessor agreement on non-relevant documents (derived from past experience). Precision was estimated assuming 70% as an upper bound on human agreement (based on past experience), reduced for topic 203 due to the poor fit of the learning model, coupled with uncertainty in the review. Topic 207 estimates are confounded (upwards) by the fact that about 2/3 of the responsive documents are vacuous ".URL" attachments, which were handled as a special case.
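
A rough reconstruction of the stated recall estimate: fit a normal distribution to the classifier scores of the reviewed responsive documents, treating everything below the review cutoff as censored, and read off the estimated fraction above the cutoff. How the 90% agreement figure was factored in is not specified; it appears here as a plain multiplier, which is an assumption:

    # Hypothetical reconstruction: recall from a censored-normal fit to the
    # ML scores of known-responsive documents (all observed scores >= c).
    import numpy as np
    from scipy import optimize, stats

    def estimate_recall(scores, c, assessor_agreement=0.90):
        x = np.asarray(scores)

        def nll(params):                     # truncated-normal neg. log-lik.
            mu, log_s = params
            s = np.exp(log_s)
            return -(stats.norm.logpdf(x, mu, s)
                     - stats.norm.logsf(c, mu, s)).sum()

        mu, log_s = optimize.minimize(nll, [x.mean(), np.log(x.std())]).x
        recall = stats.norm.sf(c, mu, np.exp(log_s))  # est. mass above cutoff
        return recall * assessor_agreement            # assumed adjustment form

    rng = np.random.default_rng(1)
    sample = rng.normal(1.0, 1.0, 2000)               # toy score population
    print(estimate_recall(sample[sample >= 0.0], c=0.0))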

watlogistic

Results | Participants | Proceedings | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: watlogistic
  • Participant: Waterloo
  • Track: Legal
  • Year: 2009
  • Submission: 7/27/2009
  • Type: automatic
  • Task: batch
  • MD5: 4e92883f7a58ffddaa00304efb189ef8
  • Run description: Logistic regression with the training examples and features described for watrrf. All training examples were used; no cross-validation. K values were estimated from a separate, unsubmitted run using the same method with 2-fold cross-validation.
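
A minimal sketch of the pattern described: train logistic regression on all examples for the submitted ranking, and pick the cutoff K from a 2-fold cross-validated run (toy data; the K criterion shown, F1 on out-of-fold scores, is an assumption):

    # Hypothetical sketch: logistic-regression ranking, with K estimated
    # from a separate 2-fold cross-validated run.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)
    X = rng.random((200, 20))                         # toy features
    y = (X[:, 0] + 0.2 * rng.random(200) > 0.6).astype(int)

    # Submitted run: train on all examples, rank by P(relevant).
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # K estimation: out-of-fold probabilities from 2-fold cross-validation,
    # choosing the K that maximizes F1 against the training labels.
    probs = cross_val_predict(model, X, y, cv=2, method="predict_proba")[:, 1]
    tp, best_k, best_f1 = 0, 0, 0.0
    for i, idx in enumerate(np.argsort(-probs), start=1):
        tp += y[idx]
        f1 = 2 * tp / (i + y.sum())                   # F1 at cutoff i
        if f1 > best_f1:
            best_k, best_f1 = i, f1
    print(best_k)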

watrrf

Results | Participants | Proceedings | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: watrrf
  • Participant: Waterloo
  • Track: Legal
  • Year: 2009
  • Submission: 7/27/2009
  • Type: automatic
  • Task: batch
  • MD5: 1ee6dedb83c95b181bae81d07c7a1355
  • Run description: Reciprocal rank fusion of several ranking methods: BM25 relevance feedback, Naive Bayes, online logistic regression, and batch logistic regression. 2-fold cross-validation (splitting the examples into equal test and validation sets) was used to determine K. The topic text was not used at all. Training examples and features are as per watlogistic.

watstack

Results | Participants | Proceedings | Input | Summary (eval) | Summary (evalH) | Appendix

  • Run ID: watstack
  • Participant: Waterloo
  • Track: Legal
  • Year: 2009
  • Submission: 7/27/2009
  • Type: automatic
  • Task: batch
  • MD5: 64094f4c7dc9b2dd996145adac10e0b7
  • Run description: Same as watrrf, but the classifiers were stacked using logistic regression and 2-fold cross-validation.
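
A minimal sketch of the stacking step: out-of-fold scores from each component classifier become features for a logistic-regression combiner (the components and data are toy stand-ins for the four methods listed under watrrf):

    # Hypothetical sketch: stacking classifiers with logistic regression
    # on 2-fold out-of-fold component scores.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.random((200, 20))                         # toy features
    y = (X[:, 0] > 0.5).astype(int)

    components = [LogisticRegression(max_iter=1000), GaussianNB()]
    # Each component contributes its out-of-fold P(relevant) as a meta-feature.
    meta = np.column_stack([
        cross_val_predict(m, X, y, cv=2, method="predict_proba")[:, 1]
        for m in components])
    stacker = LogisticRegression().fit(meta, y)
    fused = stacker.predict_proba(meta)[:, 1]         # final ranking scores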