What's the closest thing to déjà vu one can experience in an
online context? Have you ever run a search on Westlaw or in some other
online environment and discovered that your results contained numerous
duplicate entries? Beyond déjà vu, it can be downright
frustrating. Now that Westlaw offers users nearly 60 million news
documents, the déjà vu experience has become an increasing
concern to West's Business & Information News (BIN) group. Whether
from copies of the same article in assorted newspapers or different
versions of the same news story, the problem has become pervasive.
This problem is one reason BIN came to TLR Research & Development
early last year to ask for technological assistance. Another reason for
this request was that in 2002, R&D delivered a "de-duping" solution to
the Novus group that efficiently and effectively identified identical
duplicate documents from data sets like Dialog's (including news, science,
and patent documents). The challenge this time was to identify similar
(i.e., non-identical) duplicates and group them together for BIN
customers.
R&D was already investigating an algorithm to solve this problem
and needed only a challenge like that presented by BIN to test its
solution. Thanks to some additional collaboration from a team of reference
attorneys, R&D and BIN were able to construct a test collection (or
human-generated "gold standard") of similar but not necessarily identical
duplicate documents, using guidelines provided by a set of 25 Thomson West
Advisory Board members. The Board then identified what degree of
terminology overlap would group truly similar documents, and R&D
subsequently conducted a test phase that incorporated this input and
measured the algorithm's performance against real search results.
The results were both technically sound and satisfying to customers.
Tests conducted on a version of the algorithm developed by the Novus group
show that on average it takes under one second to "de-dupe" a result set,
even one containing 1,000 or more documents. Meanwhile, the algorithm
captures 97 percent of the duplicates identified by human experts. R&D nonetheless
recommended that documents found to be duplicates not be completely
removed but simply de-emphasized in the display, so that they are still
available for researchers who wish to inspect all versions of a story.
The key to the de-duping algorithm is the way it checks pairs of news
articles for similarity. Each comparison relies on a document's "digital
signature"—a set of features that includes a vector of characteristic
terms together with the article's publication date and length. What is
remarkable about digital signature technology is that it is both compact
(requires minimal storage space) and fast (performs efficient
comparisons). The other key to the algorithm's success is its ability to
apply rules associated with document dates and length to screen document
pairs and restrict the use of more costly comparisons.
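
To make the idea concrete, here is a minimal Python sketch of a signature-based comparison. It is an illustration only, not R&D's actual implementation: the names (Signature, make_signature, is_similar), the use of the most frequent longer words as "characteristic terms," the cosine measure, and every threshold value are assumptions chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
import math
import re


@dataclass
class Signature:
    terms: dict      # term -> count (hypothetical stand-in for the term vector)
    published: date  # publication date
    length: int      # article length in words


def make_signature(text, published, top_k=20):
    """Build a compact signature from an article's text and date."""
    words = re.findall(r"[a-z]+", text.lower())
    # Assumption: approximate "characteristic terms" by the most frequent
    # words of four or more letters.
    counts = Counter(w for w in words if len(w) >= 4)
    return Signature(dict(counts.most_common(top_k)), published, len(words))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_similar(s1, s2, max_days=7, max_length_ratio=1.5, min_cosine=0.8):
    """Apply cheap date and length screens before the costlier term check.

    All three thresholds are illustrative assumptions.
    """
    if abs((s1.published - s2.published).days) > max_days:
        return False  # published too far apart
    shorter, longer = sorted((s1.length, s2.length))
    if shorter == 0 or longer / shorter > max_length_ratio:
        return False  # lengths differ too much
    return cosine(s1.terms, s2.terms) >= min_cosine
```

In this sketch, the date and length rules reject most pairs with a couple of arithmetic operations, so the term-vector comparison runs only on pairs that could plausibly be versions of the same story.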
This de-duping technology for BIN is scheduled to go live on Novus in
the first quarter of 2004.