Estimating the Parameters for Linking Unstandardized References with the Matrix Comparator

Document Type: Research Paper

Authors

1 University of Arkansas at Little Rock, USA.

2 Associate Professor, University of Arkansas at Little Rock, USA.

Abstract

This paper discusses recent research on methods for estimating the configuration parameters of the Matrix Comparator, which is used to link unstandardized or heterogeneously standardized references. The Matrix Comparator computes the aggregate similarity between the tokens (words) in a pair of references. The two parameters most critical to obtaining the best linking results are the similarity threshold and the list of stop words to exclude from the comparison. Earlier research has shown that the standard deviation of the token frequency distribution is strongly predictive of how useful stop words will be in improving linking performance. The results presented here demonstrate a method that uses statistics from the token frequency distribution to estimate the threshold value and the stop word selection likely to give the best linking results. The estimation model was built using linear regression and validated on independent datasets.
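
To make the two parameters concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes whitespace tokenization, a hypothetical frequency cutoff (`min_count`) for choosing stop words, and a simple token-overlap score standing in for the Matrix Comparator's actual similarity calculation, which the abstract does not specify. All function names and sample references are illustrative.

```python
from collections import Counter
from statistics import mean, pstdev


def tokenize(reference):
    """Split a reference into upper-cased, whitespace-delimited tokens."""
    return reference.upper().split()


def token_frequency_stats(references):
    """Return token counts plus the mean and standard deviation of the counts."""
    counts = Counter(tok for ref in references for tok in tokenize(ref))
    freqs = list(counts.values())
    return counts, mean(freqs), pstdev(freqs)


def pick_stop_words(counts, min_count):
    """Hypothetical rule: tokens seen at least min_count times become stop words."""
    return {tok for tok, n in counts.items() if n >= min_count}


def overlap_similarity(ref_a, ref_b, stop_words=frozenset()):
    """Simplified stand-in score: shared tokens / size of the smaller token set."""
    a = set(tokenize(ref_a)) - stop_words
    b = set(tokenize(ref_b)) - stop_words
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def is_link(ref_a, ref_b, threshold, stop_words=frozenset()):
    """Link two references when their similarity reaches the threshold."""
    return overlap_similarity(ref_a, ref_b, stop_words) >= threshold


# Made-up sample references, for illustration only.
refs = [
    "John Smith 123 Oak St Little Rock AR",
    "Smith John 123 Oak Street Little Rock AR",
    "Mary Jones 45 Elm St Conway AR",
]
counts, freq_mean, freq_std = token_frequency_stats(refs)
stop_words = pick_stop_words(counts, min_count=3)
print(stop_words, round(freq_mean, 2), round(freq_std, 2))
print(is_link(refs[0], refs[1], 0.8, stop_words))
print(is_link(refs[0], refs[2], 0.8, stop_words))
```

In this sketch, `freq_std` is the kind of statistic the abstract describes as predictive of stop word usefulness. The regression model that maps such statistics to a threshold and a stop word cutoff is not reproduced here.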
