February 2004
EDITED BY
Technical Communications
For further information about this site, contact TLR.techwire@thomson.com


R&D, BIN, and Novus Collaborate on Duplicate Document Detection for News
by Jack G. Conrad and Peter Jackson, TLR Research & Development

What's the closest thing to déjà vu one can experience in an online context? Have you ever run a search on Westlaw or in some other online environment and discovered that your results contained numerous duplicate entries? Beyond déjà vu, it can be downright frustrating. Now that Westlaw offers users nearly 60 million news documents, the déjà vu experience has become an increasing concern to West's Business & Information News (BIN) group. Whether from copies of the same article in assorted newspapers or different versions of the same news story, the problem has become pervasive.

This problem is one reason BIN came to TLR Research & Development early last year to ask for technological assistance. Another reason was that in 2002, R&D had delivered a "de-duping" solution to the Novus group that efficiently and effectively identified identical duplicate documents in data sets like Dialog's (including news, science, and patent documents). The challenge this time was to identify similar (i.e., non-identical) duplicates and group them together for BIN customers.

R&D was already investigating an algorithm to solve this problem and needed only a challenge like that presented by BIN to test its solution. Thanks to some additional collaboration from a team of reference attorneys, R&D and BIN were able to construct a test collection (or human-generated "gold standard") of similar but not necessarily identical duplicate documents, using guidelines provided by a set of 25 Thomson West Advisory Board members. The Board then identified what degree of terminology overlap would group truly similar documents, and R&D subsequently conducted a test phase that incorporated this input and measured the algorithm's performance against real search results.

The results were both technically sound and satisfying to customers. Tests conducted on a version of the algorithm developed by the Novus group show that on average it takes under one second to "de-dupe" a result set, including sets of 1,000 or more documents. Meanwhile, the algorithm captures 97 percent of the duplicates identified by human experts. R&D nonetheless recommended that documents found to be duplicates not be completely removed but simply de-emphasized in the display, so that they are still available for researchers who wish to inspect all versions of a story.
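The recommendation to de-emphasize rather than remove duplicates can be illustrated with a minimal Python sketch. It is not the production code: the function name flag_duplicates and the is_duplicate predicate are hypothetical, and the sketch simply assumes that the first document of each group stays prominent while later matches are flagged for a quieter display.

```python
from typing import Callable, List, Tuple, TypeVar

Doc = TypeVar("Doc")


def flag_duplicates(
    results: List[Doc],
    is_duplicate: Callable[[Doc, Doc], bool],
) -> List[Tuple[Doc, bool]]:
    """Return (document, de_emphasized) pairs in the original result order.

    Hypothetical helper: the first document seen in each duplicate group
    stays prominent; later matches are flagged but never dropped.
    """
    leaders: List[Doc] = []            # one prominent document per group
    flagged: List[Tuple[Doc, bool]] = []
    for doc in results:
        duplicate = any(is_duplicate(doc, leader) for leader in leaders)
        if not duplicate:
            leaders.append(doc)
        flagged.append((doc, duplicate))
    return flagged


def same_first_three_words(a: str, b: str) -> bool:
    """Toy stand-in for a real similarity test, for illustration only."""
    return a.split()[:3] == b.split()[:3]


headlines = [
    "Fed raises rates",
    "Fed raises rates (update)",
    "Court rules on merger",
]
flagged_headlines = flag_duplicates(headlines, same_first_three_words)
```

Because every document is returned, a researcher can still expand a group and inspect all versions of a story.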

The key to the de-duping algorithm is the way it checks pairs of news articles for similarity. Each comparison relies on a document's "digital signature": a set of features that includes a vector of characteristic terms together with the article's publication date and length. Digital signatures are both compact (they require minimal storage space) and fast (they can be compared efficiently). The other key to the algorithm's success is its ability to apply rules based on document dates and lengths to screen document pairs, so that more costly comparisons are restricted to plausible candidates.
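The signature comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm as implemented: the Signature class, the threshold values, and the overlap measure are all hypothetical, since the article does not specify how terms are selected or what cutoffs are used.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class Signature:
    """Hypothetical compact signature for one news article."""
    terms: FrozenSet[str]   # characteristic terms (selection method assumed)
    pub_date: date          # publication date
    length: int             # article length, e.g. in words


# Assumed thresholds, chosen only for illustration.
MAX_DAYS_APART = 7
MAX_LENGTH_RATIO = 1.2
MIN_TERM_OVERLAP = 0.8


def passes_screen(a: Signature, b: Signature) -> bool:
    """Cheap date and length rules that rule out unlikely pairs early."""
    if abs((a.pub_date - b.pub_date).days) > MAX_DAYS_APART:
        return False
    shorter, longer = sorted((a.length, b.length))
    return shorter > 0 and longer / shorter <= MAX_LENGTH_RATIO


def term_overlap(a: Signature, b: Signature) -> float:
    """Fraction of the smaller term set that the two signatures share."""
    if not a.terms or not b.terms:
        return 0.0
    return len(a.terms & b.terms) / min(len(a.terms), len(b.terms))


def is_near_duplicate(a: Signature, b: Signature) -> bool:
    """Run the costlier term comparison only when the screen passes."""
    return passes_screen(a, b) and term_overlap(a, b) >= MIN_TERM_OVERLAP


base_terms = frozenset({"merger", "thomson", "acquisition", "shares", "court"})
original = Signature(base_terms, date(2004, 1, 5), 500)
rewrite = Signature(
    frozenset({"merger", "thomson", "acquisition", "shares", "ruling"}),
    date(2004, 1, 6),
    520,
)
old_story = Signature(base_terms, date(2004, 3, 1), 500)
long_story = Signature(base_terms, date(2004, 1, 5), 1500)
```

Here original and rewrite share four of five terms, were published a day apart, and have similar lengths, so they are treated as near duplicates. The pairs involving old_story and long_story are rejected by the screen before any term comparison takes place.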

This de-duping technology for BIN is scheduled to go live on Novus in the first quarter of 2004.

Online Services Application Development Team

Team members, from left:
Back: Jack Conrad (TLR R&D), Bruce Getting (Novus)
Front: Cindy Schriber (FindLaw, formerly BIN), Jane Lund (Novus), Jie Lin (Novus)
Not Pictured: Jeremy Leese (Novus)