A Suit of Record Normalization Methods, From Naive Ones, Globally Mine a Group of Duplicate Records

Mummidi Siva Sankar; Nadella Sunil

Abstract

The promise of Big Data hinges on addressing several big data integration challenges, such as record linkage at scale, real-time data fusion, and integration of the Deep Web. Although much work has been done on these problems, there is limited work on creating a uniform, standard record from a group of records that correspond to the same real-world entity. We refer to this task as record normalization. Such a record representation, termed a normalized record, is important for both front-end and back-end applications. In this paper, we formalize the record normalization problem and present an in-depth analysis of normalization granularity levels (e.g., record, field, and value component) and of normalization forms (e.g., typical versus complete). We propose a comprehensive framework for computing the normalized record. The proposed framework includes a suite of record normalization methods, from naive ones, which use only the information gathered from the records themselves, to complex strategies, which globally mine a group of duplicate records before selecting a value for an attribute of the normalized record.
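To make the idea of a naive, field-level method concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes that each duplicate record is a flat dictionary of strings and that the normalized value of a field is simply its most frequent non-empty value within the group. The function name `normalize_field_level`, the selection rule, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def normalize_field_level(records):
    """Build one normalized record from a group of duplicate records.

    Assumption (not from the paper): for each field, keep the most
    frequent non-empty value observed across the group.
    """
    # Collect field names in first-seen order so the output is stable.
    fields = []
    for record in records:
        for name in record:
            if name not in fields:
                fields.append(name)

    normalized = {}
    for name in fields:
        # Ignore missing and empty values when counting.
        values = [r[name] for r in records if r.get(name)]
        if values:
            normalized[name] = Counter(values).most_common(1)[0][0]
    return normalized


# Hypothetical group of records describing the same real-world entity.
duplicates = [
    {"author": "J. Smith", "venue": "VLDB", "year": "2010"},
    {"author": "J. Smith", "venue": "Very Large Data Bases", "year": "2010"},
    {"author": "John Smith", "venue": "VLDB", "year": ""},
]

normalized = normalize_field_level(duplicates)
print(normalized)
```

This sketch uses only information contained in the records themselves, which is how the abstract characterizes the naive end of the spectrum. A strategy that globally mines the group of duplicates, or that works at the value-component level, would need more than a per-field frequency count.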



International Journal of Science Engineering and Advance Technology is licensed under a Creative Commons Attribution 3.0 Unported License. Based on a work at IJSEat. Permissions beyond the scope of this license may be available at http://creativecommons.org/licenses/by/3.0/deed.en_GB.


Copyright © 2013, All rights reserved. | ijseat.com