Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources
Sergio Luján-Mora, Manuel Palomar
Proceedings of the Second International Conference in Advances in Web-Age Information Management (WAIM 2001),
pp. 191-202. Lecture Notes in Computer Science 2118, Xi'an (China), July 9-11, 2001. https://doi.org/10.1007/3-540-47714-4_18
International conference
Abstract
The Web has dramatically increased the need for efficient and flexible mechanisms that provide integrated views over multiple heterogeneous information sources. When multiple sources are integrated, each source may represent data differently. A common problem is inconsistency in the data: the very same term may appear with different values because of misspellings, permuted word order, spelling variants and so on. In this paper, we present an improvement on our previous work for reducing the inconsistency found in existing databases. The objective of our method is to integrate and standardize the different values that refer to the same term. All the values that refer to the same term are clustered by measuring their degree of similarity. Each cluster can then be assigned a common value that replaces the original values. The paper describes and compares five different similarity measures for clustering and evaluates their performance on real-world data. The method we present may work well in practice, but it is time-consuming.
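
The clustering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the five similarity measures or the clustering procedure, so everything below is an assumption made for illustration: the similarity is a normalized Levenshtein (edit) distance, word order is neutralized by sorting tokens, clusters are formed by a simple greedy pass against a hypothetical threshold of 0.8, and the common value is taken to be the most frequent member of each cluster. None of these choices should be read as the authors' method.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(value: str) -> str:
    """Assumed preprocessing: lowercase and sort tokens to absorb permuted word order."""
    return " ".join(sorted(value.lower().split()))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cluster(values, threshold=0.8):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for value in values:
        for members in clusters:
            if similarity(value, members[0]) >= threshold:
                members.append(value)
                break
        else:
            clusters.append([value])
    return clusters


def common_value(members):
    """Assumed standardization rule: the most frequent value in the cluster."""
    return Counter(members).most_common(1)[0][0]


if __name__ == "__main__":
    for members in cluster(["color", "colour", "colr", "banana"], threshold=0.7):
        print(common_value(members), "<-", members)
```

In this sketch every value is compared against the first member of each existing cluster, so the number of comparisons grows with the number of values times the number of clusters. That gives a rough intuition for why a similarity-based approach of this kind can be time-consuming on large databases.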