Indeterministic Handling of Uncertain Decisions in Deduplication


Panse, Fabian and Ritter, Norbert and Keulen, Maurice van (2012) Indeterministic Handling of Uncertain Decisions in Deduplication. Journal of Data and Information Quality. ISSN 1936-1955

Abstract: In current research and practice, deduplication is usually treated as a deterministic process in which tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not clear-cut which tuples represent the same real-world entity. Deterministic approaches may therefore ignore many realistic possibilities, which in turn can lead to false decisions. In this paper, we present an indeterministic approach to deduplication that uses a probabilistic target model and includes techniques for the proper probabilistic interpretation of similarity matching results. Thus, instead of deciding on a single most likely situation, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Furthermore, the deduplication process becomes almost fully automatic, and human effort can be reduced to a large extent. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
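The contrast between a deterministic and an indeterministic decision can be illustrated with a minimal Python sketch. It is not the method from the paper: the names `Alternative` and `decide`, the two thresholds, and the direct reading of a similarity score as a match probability are all assumptions made for illustration only. Clear cases receive a single certain outcome, while a pair in the uncertain band keeps both outcomes, each with a probability, loosely in the spirit of the semi-indeterministic idea described in the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    """One possible outcome for a tuple pair, with its probability."""
    is_duplicate: bool
    probability: float


def decide(similarity: float, lower: float = 0.3, upper: float = 0.8) -> list[Alternative]:
    """Return the possible outcomes for one tuple pair.

    Assumptions (hypothetical, not taken from the paper):
    - `similarity` lies in [0, 1];
    - inside the uncertain band it is read directly as the match probability;
    - `lower` and `upper` are illustrative thresholds.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if similarity >= upper:
        # Clear match: handled deterministically.
        return [Alternative(True, 1.0)]
    if similarity <= lower:
        # Clear non-match: handled deterministically.
        return [Alternative(False, 1.0)]
    # Uncertain case: keep both alternatives instead of forcing a decision.
    return [
        Alternative(True, similarity),
        Alternative(False, 1.0 - similarity),
    ]
```

Calling `decide(0.5)` under these assumptions yields two alternatives whose probabilities sum to one, whereas `decide(0.9)` yields a single certain duplicate decision.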
Item Type: Article
Copyright: © 2012 ACM
Electrical Engineering, Mathematics and Computer Science (EEMCS)