A Genetic Programming Approach to Record Deduplication
A Genetic Programming Approach.pdf (Size: 1.34 MB / Downloads: 143)
Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce
brokers, may be affected by the existence of duplicates, quasi replicas, or near-duplicate entries in their repositories. Because of that,
there have been significant investments from private and government organizations for developing methods for removing replicas from
its data repositories. This is due to the fact that clean and replica-free repositories not only allow the retrieval of higher quality
information but also lead to more concise data and to potential savings in computational time and resources to process this data. In this
paper, we propose a genetic programming approach to record deduplication that combines several different pieces of evidence
extracted from the data content to find a deduplication function that is able to identify whether two entries in a repository are replicas or
not. As shown by our experiments, our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the
suggested functions are computationally less demanding since they use fewer evidence. In addition, our genetic programming
approach is capable of automatically adapting these functions to a given fixed replica identification boundary, freeing the user from the
burden of having to choose and tune this parameter.
THE increasing volume of information available in
digital media has become a challenging problem for
data administrators. Usually built on data gathered from
different sources, data repositories such as those used by
digital libraries and e-commerce brokers may present
records with disparate structure. Also, problems regarding
low-response time, availability, security, and quality
assurance become more difficult to handle as the amount
of data gets larger.
Today, it is possible to say that the capacity of an
organization to provide useful services to its users is
proportional to the quality of the data handled by its systems.
In this environment, the decision of keeping repositories with
“dirty” data (i.e., with replicas, with no standardized
representation, etc.) goes far beyond technical questions
such as the overall speed or performance of data management
systems. The solutions available for addressing this
problem require more than technical efforts, they need
management and cultural changes as well .
Record deduplication is a growing research topic in
database and related fields such as digital libraries. Today,
this problem arises mainly when data are collected from
disparate sources using different information description
styles and metadata standards. Other common place for
replicas is found in data repositories created from OCR
documents. These situations can lead to inconsistencies that
may affect many systems such as those that depend on
searching and mining tasks.
To solve these inconsistencies it is necessary to design a
deduplication function that combines the information
available in the data repositories in order to identify whether
a pair of record entries refers to the same real-world entity.
In the realm of bibliographic citations, for instance, this
problem was extensively discussed by Lawrence et al. ,
. They propose a number of algorithms for matching
citations from different sources based on edit distance, word
matching, phrase matching, and subfield extraction.
Domain Knowledge Approaches
The idea of combining evidence to identify replicas has
pushed the data management research community to look
for methods that could benefit from domain-specific
information found in the actual data as well as for methods
based on general similarity metrics that could be adapted to
specific domains .
As an example of a method that exploits general similarity
functions adapted to a specific domain, we can mention .
There the authors propose a matching algorithm that, given a
record in a file (or repository), looks for another record in a
reference file that matches the first record according to a
given similarity function. The matched reference records are
selected based on a user-defined minimum similarity
threshold. Thus, more than one candidate record may be
returned. In such cases, the user is required to choose one
among them indicating which is the closest one. Records
matching on high-weight tokens (strings) are more similar
than those matching on low-weight tokens. The weights are
calculated by the well-known IDF weighting method .
This weighting method is also used in WHIRL , a
database management system that supports similarity joins
among relations that have free text attribute values. In ,
the authors use the vector space model for computing
similarity among fields from different sources and evaluate
four distinct strategies to assigning weights and combining
the similarity scores of each field. As a result of their
experiment, they found that using evidence extracted from
individual attributes improves the results of the replica
Usually, GP evolves a population of length-free data
structures, also called individuals, each one representing a
single solution to a given problem. In our modeling, the
trees represent arithmetic functions, as illustrated in Fig. 1.
When using this tree representation in a GP-based method,
a set of terminals and functions should be defined .
Terminals are inputs, constants or zero argument3 nodes
that terminate a branch of a tree. They are also called tree
leaves. The function set is the collection of operators,
statements, and basic or user-defined functions that can be
used by the GP evolutionary process to manipulate the
terminal values. These functions are placed in the internal
nodes of the tree, as illustrated in Fig. 1.
During the evolutionary process, the individuals are
handled and modified by genetic operations such as
reproduction, crossover, and mutation , in an iterative way
that is expected to spawn better individuals (solutions to
the proposed problem) in the subsequent generations.
Experimental Data Sets
In our experiments, we used two real data sets commonly
employed for evaluating record deduplication approaches
, , which are based on real data gathered from the
web. In addition, we created three additional data sets
using a synthetic data set generator.
The first real data set, the Cora Bibliographic data set, is a
collection of 1,295 distinct citations to 122 computer science
papers taken from the Cora research paper search engine.
These citations were divided into multiple attributes (author
names, year, title, venue, and pages and other info) by an
information extraction system.
The second real data set, hereafter named the Restaurants
data set, contains 864 entries of restaurant names and
additional information, including 112 duplicates, that were
obtained by integrating records from Fodor and Zagat’s
guidebooks. We used the following attributes from this data
set: (restaurant) name, address, city, and specialty.
CONCLUSIONS AND FUTURE WORK
Identifying and handling replicas is important to guarantee
the quality of the information made available by dataintensive
systems such as digital libraries and e-commerce
brokers. These systems rely on consistent data to offer
high-quality services, and may be affected by the existence
of duplicates, quasi replicas, or near-duplicate entries in
their repositories. Thus, for this reason, there have been
significant investments from private and government
organizations for developing methods for removing
replicas from large data repositories.
In this paper, we presented a GP-based approach to
record deduplication. Our approach is able to automatically
suggest deduplication functions based on evidence present
in the data repositories. The suggested functions properly
combine the best evidence available in order to identify
whether two or more distinct record entries are replicas (i.e.,
represent the same real-world entity) or not.