Modeling Data Cleaning Techniques for Big Data
Diana Martínez-Mosquera, Sergio Luján-Mora, Fidel Parra
Proceedings of the International Conferences 16th WWW/Internet (ICWI 2017) and 14th Applied Computing (AC 2017), p. 310-313, Algarve (Portugal), October 18-20 2017. ISBN: 978-989-8533-69-2.
(ICWI'17a) International conference
Big Data is currently a popular term; it refers to high volumes of data that are processed into relevant information to assist decision making. Few data cleaning techniques have been adapted to Big Data, and we consider filtering irrelevant data an important task that reduces hardware and processing time requirements. Moreover, existing research on data cleaning processes in Big Data is spread across separate studies; our approach therefore proposes to model the techniques used for this purpose. Since logs can be considered Big Data, we have modeled two different approaches: one to clean a firewall log in the vertical dimension and another to clean a web log in the horizontal dimension. An advantage of our proposal is the use of the Unified Modeling Language, an International Organization for Standardization standard widely accepted since 2005. Consequently, the data cleaning process is composed of logical units that the designer can replace or modify. The examples thus demonstrate that several clustering techniques can be integrated, for example Levenshtein Distance, Longitudinal Distance, Transposition Invariant Distance, and Word Position Invariant Distance.
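
As an illustration of one of the clustering techniques named in the abstract, the following is a minimal sketch of grouping similar log lines by Levenshtein distance. It is not the authors' implementation: the function names `levenshtein` and `cluster_lines`, the greedy first-match strategy, the `max_distance` threshold, and the sample log lines are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def cluster_lines(lines, max_distance):
    """Greedy clustering (an assumed strategy): a line joins the first cluster
    whose representative (its first member) is within max_distance; otherwise
    it starts a new cluster."""
    clusters = []
    for line in lines:
        for cluster in clusters:
            if levenshtein(line, cluster[0]) <= max_distance:
                cluster.append(line)
                break
        else:
            clusters.append([line])
    return clusters


# Hypothetical web log fragments, used only to exercise the sketch.
sample = ["GET /index.html", "GET /index.htm", "POST /login"]
groups = cluster_lines(sample, max_distance=2)
```

With a threshold of 2, the two `GET` lines fall into one cluster and the `POST` line forms its own. In a cleaning process modeled as replaceable logical units, a distance function like this one could in principle be swapped for another of the listed measures without changing the clustering step.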