The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
Yumeng
Ye
MSc., Department of Information Quality Program, University of Arkansas at Little Rock, Arkansas, USA.
author
John
Talburt
Prof., Department of Information Science, University of Arkansas at Little Rock, Arkansas, USA.
author
text
article
2018
eng
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked, the evaluation of the model’s performance should take into account the transitive closure of its pairwise linking decisions, not just the pairwise classifications alone. Part of the problem is that the measures of precision and recall as calculated for data mining classification algorithms such as logistic regression are different from the same measures applied to entity resolution (ER) results. As a classifier, logistic regression’s precision and recall measure the algorithm’s pairwise decision performance. When applied to ER, precision and recall measure how accurately the set of input references was partitioned into subsets (clusters) referencing the same entity. When applied to datasets containing more than two references, ER is a two-step process. Step One classifies pairs of records as linked or not linked. Step Two applies transitive closure to these linked pairs to find the maximally connected subsets (clusters) of equivalent references. The precision and recall of the final ER result will generally differ from the precision and recall of the pairwise classifier used to power the ER process. The experiments described in the paper were performed using a well-tested set of synthetic customer data for which the correct linking is known. The best F-measure of precision and recall for the final ER result was obtained by substantially increasing the threshold of the logistic regression pairwise classifier.
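The two-step ER process summarized in this abstract can be sketched in a few lines. The following is a minimal illustration of Step Two (transitive closure via union-find), not the authors' implementation; the record identifiers and linked pairs are hypothetical.

```python
# Step Two of the ER process described above: transitive closure of
# pairwise "linked" decisions, producing maximally connected clusters
# of equivalent references. Implemented here with union-find.

def transitive_closure(records, linked_pairs):
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in linked_pairs:
        union(a, b)

    # Group records by their root to form the clusters.
    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())

# Two pairwise links (A,B) and (B,C) imply the cluster {A, B, C},
# even though the classifier never compared A with C directly --
# which is why pairwise precision/recall can differ from the
# precision/recall of the final clustering.
clusters = transitive_closure(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```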
Journal of Information Technology Management
Faculty of Management, University of Tehran
2980-7972
10
v.
4
no.
2018
1
11
https://jitm.ut.ac.ir/article_72757_0af205bf0cf29741afee7e3f17b8062e.pdf
dx.doi.org/10.22059/jitm.2019.270013.2324
Estimating the Parameters for Linking Unstandardized References with the Matrix Comparator
Awaad
Al-Sarkhi
University of Arkansas at Little Rock, USA.
author
John
R. Talburt
Associate Professor, University of Arkansas at Little Rock, USA.
author
text
article
2018
eng
This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator used for linking unstandardized or heterogeneously standardized references. The matrix comparator computes the aggregate similarity between the tokens (words) in a pair of references. The two most critical parameters of the matrix comparator for obtaining the best linking results are the value of the similarity threshold and the list of stop words to exclude from the comparison. Earlier research has shown that the standard deviation of the token frequency distribution is strongly predictive of how useful stop words will be in improving linking performance. The research results presented here demonstrate a method for using statistics from the token frequency distribution to estimate the threshold value and stop word selection likely to give the best linking results. The model was built using linear regression and validated with independent datasets.
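The matrix comparator idea described in this abstract can be illustrated as follows. This is a sketch under assumptions: the scoring rule (best-match averaging over normalized edit similarity) and the sample references are hypothetical, not the paper's exact formulation.

```python
# Illustrative sketch of a matrix comparator: compute an aggregate
# similarity between the token sets of two unstandardized references,
# after removing stop words, and link the pair when the score meets a
# threshold. Scoring details here are an assumption for illustration.
import difflib

def token_similarity(t1, t2):
    # Normalized similarity in [0, 1] between two tokens.
    return difflib.SequenceMatcher(None, t1, t2).ratio()

def matrix_compare(ref_a, ref_b, stop_words=frozenset(), threshold=0.75):
    tokens_a = [t for t in ref_a.lower().split() if t not in stop_words]
    tokens_b = [t for t in ref_b.lower().split() if t not in stop_words]
    if not tokens_a or not tokens_b:
        return 0.0, False
    # For each token in the shorter list, take its best match in the
    # other list, then average; this aggregates the token-by-token
    # similarity matrix into a single score.
    short, long_ = sorted([tokens_a, tokens_b], key=len)
    score = sum(max(token_similarity(t, u) for u in long_)
                for t in short) / len(short)
    return score, score >= threshold

# Hypothetical heterogeneously standardized references to one entity.
score, linked = matrix_compare(
    "john r talburt little rock ar",
    "talburt john little rock arkansas",
    stop_words={"inc", "llc"},
)
```

Both the threshold and the stop-word list visibly change the score here, which is why the paper treats them as the two critical parameters to estimate.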
Journal of Information Technology Management
Faculty of Management, University of Tehran
2980-7972
10
v.
4
no.
2018
12
26
https://jitm.ut.ac.ir/article_72758_72b897868e02c41658fb6caf9ff2f3a8.pdf
dx.doi.org/10.22059/jitm.2019.274871.2332
Framework for Prioritizing Solutions in Overcoming Data Quality Problems Using Analytic Hierarchy Process (AHP)
Imelda
Aritonang
Department of Information System, Faculty of Computer Science, Universitas Indonesia, Depok, West Java.
author
Achmad
Nizar Hidayanto
Prof., Department of Information System, Faculty of Computer Science, Universitas Indonesia, Depok, West Java.
author
Nur Fitriah
Ayuning Budi
MSc., Department of Information System, Faculty of Computer Science, Universitas Indonesia, Depok, West Java.
author
Rahmat M.
Samik Ibrahim
MSc., Department of Information System, Faculty of Computer Science, Universitas Indonesia, Depok, West Java.
author
Solikin
Solikin
MSc., STMIK Bina Insani, Bekasi, Jawa Barat.
author
text
article
2018
eng
The Central Statistics Agency (BPS) is a government institution with the authority to carry out statistical activities, in the form of censuses and surveys, to produce the statistical data needed by the government, the private sector, and the general public as a reference for planning, monitoring, and evaluating development results. Providing quality statistical data is therefore critical, because it directly affects the effectiveness of decision making. This paper aims to develop a framework for prioritizing solutions to data quality problems using the Analytic Hierarchy Process (AHP). The framework is built by conducting interviews and Focus Group Discussions (FGD) with experts to establish the interrelationships between problems and solutions. The resulting model is then tested in a case study at the Central Jakarta office of BPS. The results of the study indicate that the proposed model can be used to formulate solutions to data quality problems at BPS.
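The AHP priority calculation that underlies a framework like the one described here can be sketched briefly. This is a minimal illustration of the standard column-normalization method; the comparison matrix is hypothetical and not taken from the BPS case study.

```python
# Minimal AHP priority sketch: given a pairwise-comparison matrix of
# candidate solutions (Saaty's 1-9 scale, with reciprocal entries),
# derive priority weights by normalizing each column and averaging
# across rows. The example matrix is hypothetical.

def ahp_priorities(matrix):
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    normalized = [[matrix[i][j] / col_sums[j] for j in range(n)]
                  for i in range(n)]
    # Each row's average is that alternative's priority weight.
    return [sum(row) / n for row in normalized]

# Three hypothetical solutions compared pairwise: solution 1 is
# moderately preferred (3) over solution 2 and strongly preferred (5)
# over solution 3; the lower triangle holds the reciprocals.
comparisons = [
    [1.0,   3.0,   5.0],
    [1 / 3, 1.0,   3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights = ahp_priorities(comparisons)  # highest weight = top priority
```

The weights sum to 1, so the ranking can be read directly from the vector; a full AHP application would also check the consistency ratio of the matrix.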
Journal of Information Technology Management
Faculty of Management, University of Tehran
2980-7972
10
v.
4
no.
2018
27
40
https://jitm.ut.ac.ir/article_72759_38e5f7689bf9a665b2dc54f471bc159f.pdf
dx.doi.org/10.22059/jitm.2019.274888.2333
Investigating the Role of Code Smells in Preventive Maintenance
Junaid
Ali Reshi
M.Tech Student, Department of Computer Science and Technology, Central University of Punjab, Bhatinda, Punjab, India.
author
Satwinder
Singh
Assistant Prof., Department of Computer Science and Technology, Central University of Punjab, Bhatinda, Punjab, India.
author
text
article
2018
eng
The quest for improving software quality has given rise to various studies focused on enhancing the quality of software through various processes. Code smells, which are indicators of software quality, have not yet been studied extensively to determine their role in the prediction of defects in software. This study aims to investigate the role of code smells in the prediction of non-faulty classes. We examine four versions of the Eclipse software (3.2, 3.3, 3.6, and 3.7) for metrics and smells. Different code smells, derived subjectively through iPlasma, are then taken in conjunction, and three efficient, but subjective, models are developed to detect code smells using the Random Forest, J48, and SVM machine learning algorithms. Each model is then used to detect the absence of defects in the four Eclipse versions. The effect of balanced and unbalanced datasets is also examined for these four versions. The results suggest that code smells can be a valuable feature in discriminating the absence of defects in software.
Journal of Information Technology Management
Faculty of Management, University of Tehran
2980-7972
10
v.
4
no.
2018
41
63
https://jitm.ut.ac.ir/article_72760_08e4a5599636c1e62e341c1d67adff80.pdf
dx.doi.org/10.22059/jitm.2019.274968.2335
Big Data Quality: From Content to Context
Ahmad
Khalilijafarabad
PhD, Department of Information Technology Management, Faculty of Management, University of Tehran, Tehran, Iran.
author
text
article
2018
eng
Over the last 20 years, and particularly with the advent of Big Data and analytics, Data and Information Quality (DIQ) has remained a fast-growing research area. There are many views and streams in DIQ research, generally aiming at improving the effectiveness of decision making in organizations. Although many studies have aimed to clarify the role of Big Data quality for organizations, there is no comprehensive literature review that shows the main differences between traditional data quality research and Big Data quality research. This paper analyzed the papers published on Big Data quality and found that there is almost no new mainstream in Big Data quality research. It is shown in this paper that the main concepts of data quality do not change in the Big Data context and that only some new issues have been added to this area.
Journal of Information Technology Management
Faculty of Management, University of Tehran
2980-7972
10
v.
4
no.
2018
64
71
https://jitm.ut.ac.ir/article_72762_ade9cf8ca3448807a0edb217c756179d.pdf
dx.doi.org/10.22059/jitm.2019.72762
Perspectives of Big Data Quality in Smart Service Ecosystems (Quality of Design and Quality of Conformance)
Markus
Helfert
Ph.D., Head of Business Informatics Group, Department of Computing, Dublin City University, Dublin, Ireland.
author
Mouzhi
Ge
Associate Professor, Department of Computer Systems and Communications, Faculty of Informatics, Masaryk University, Brno, Czech Republic.
author
text
article
2018
eng
Despite the increasing importance of data and information quality, current research on Big Data quality is still limited. In particular, it is unknown how previous data quality models apply to Big Data. In this paper we review Big Data quality research from several perspectives and apply a known quality model, with its elements of conformance to specification and conformance to design, in the context of Big Data. Furthermore, we extend this model and demonstrate its utility by analyzing the impact of three Big Data characteristics (volume, velocity, and variety) in the context of smart cities. This paper intends to build a foundation for further empirical research to understand Big Data quality and its implications for the design and execution of smart service ecosystems.
Journal of Information Technology Management
Faculty of Management, University of Tehran
2980-7972
10
v.
4
no.
2018
72
83
https://jitm.ut.ac.ir/article_72763_804dac558197e9e9dc2997c751d2eff9.pdf
dx.doi.org/10.22059/jitm.2019.72763