Faculty of Management, University of Tehran Journal of Information Technology Management 2980-7972 10 4 2018 12 01 The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution 1 11 72757 10.22059/jitm.2019.270013.2324 EN Yumeng Ye MSC, Department of Information Quality Program, University of Arkansas at Little Rock, Arkansas, USA. John Talburt Prof., Department of Information Science, University of Arkansas at Little Rock, Arkansas, USA. Journal Article 2018 11 21 This paper describes a series of experiments using logistic regression as a machine learning method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked, the evaluation of the model’s performance should take into account the transitive closure of its pairwise linking decisions, not just the pairwise classifications alone. Part of the problem is that precision and recall as calculated for data mining classification algorithms such as logistic regression differ from the same measures applied to entity resolution (ER) results. For logistic regression as a classifier, precision and recall measure the algorithm’s pairwise decision performance. When applied to ER, precision and recall measure how accurately the set of input references was partitioned into subsets (clusters) referencing the same entity. When applied to datasets containing more than two references, ER is a two-step process. Step One is to classify pairs of records as linked or not linked. Step Two applies transitive closure to these linked pairs to find the maximally connected subsets (clusters) of equivalent references. The precision and recall of the final ER result will generally be different from the precision and recall measures of the pairwise classifier used to power the ER process.
The experiments described in the paper were performed using a well-tested set of synthetic customer data for which the correct linking is known. The best F-measure of precision and recall for the final ER result was obtained by substantially increasing the threshold of the logistic regression pairwise classifier.

https://jitm.ut.ac.ir/article_72757_0af205bf0cf29741afee7e3f17b8062e.pdf
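
The two-step process in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy references, the classifier links, and the function names are all hypothetical, and pair-level precision and recall over clusters is only one common way to score an ER partition. The sketch shows how a single false-positive link changes the scores once transitive closure is applied.

```python
from itertools import combinations


def transitive_closure(n, linked_pairs):
    """Step Two: merge linked pairs into clusters with a union-find structure."""
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())


def cluster_pairs(clusters):
    """All unordered pairs of references that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def precision_recall(predicted_pairs, true_pairs):
    """Pair-level precision and recall against the known correct linking."""
    true_positives = len(predicted_pairs & true_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 1.0
    recall = true_positives / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Hypothetical toy data: five references, true entities {0, 1, 2} and {3, 4}.
true_pairs = cluster_pairs([{0, 1, 2}, {3, 4}])

# Step One output (assumed): the pairwise classifier links these pairs.
# The link (2, 3) is a false positive.
classifier_links = [(0, 1), (1, 2), (2, 3)]

pairwise_scores = precision_recall(
    {frozenset(p) for p in classifier_links}, true_pairs
)
er_clusters = transitive_closure(5, classifier_links)
er_scores = precision_recall(cluster_pairs(er_clusters), true_pairs)

print("pairwise classifier (precision, recall):", pairwise_scores)
print("after transitive closure (precision, recall):", er_scores)
```

In this toy case the single false link merges references 0 through 3 into one cluster, so precision drops while recall rises. That is consistent with the abstract's observation that the final ER scores generally differ from the classifier's pairwise scores.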

Faculty of Management, University of Tehran Journal of Information Technology Management 2980-7972 10 4 2018 12 01 Estimating the Parameters for Linking Unstandardized References with the Matrix Comparator 12 26 72758 10.22059/jitm.2019.274871.2332 EN Awaad Al-Sarkhi University of Arkansas at Little Rock, USA. John R. Talburt Associate Professor, University of Arkansas at Little Rock, USA. Journal Article 2019 01 27 This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator, which is used for linking unstandardized or heterogeneously standardized references. The Matrix Comparator computes the aggregate similarity between the tokens (words) in a pair of references. Its two most critical parameters for obtaining the best linking results are the value of the similarity threshold and the list of stop words to exclude from the comparison. Earlier research has shown that the standard deviation of the token frequency distribution is strongly predictive of how useful stop words will be in improving linking performance. The research results presented here demonstrate a method for using statistics from the token frequency distribution to estimate the threshold value and stop word selection likely to give the best linking results. The model was built using linear regression and validated with independent datasets.

https://jitm.ut.ac.ir/article_72758_72b897868e02c41658fb6caf9ff2f3a8.pdf
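
The abstract above names the two parameters (similarity threshold and stop-word list) but does not say how the Matrix Comparator aggregates token similarities. The Python sketch below is therefore only a hypothetical stand-in: it scores every token pair with `difflib.SequenceMatcher`, pairs tokens greedily one-to-one, and averages the result. The function names, the example references, and the threshold value are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def tokenize(reference, stop_words):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in reference.lower().split() if t not in stop_words]


def aggregate_similarity(ref_a, ref_b, stop_words=frozenset()):
    """Hypothetical aggregate: greedy one-to-one token matching, averaged."""
    tokens_a = tokenize(ref_a, stop_words)
    tokens_b = tokenize(ref_b, stop_words)
    if not tokens_a or not tokens_b:
        return 0.0

    # Similarity "matrix" flattened into (score, row, column) cells.
    cells = [
        (SequenceMatcher(None, a, b).ratio(), i, j)
        for i, a in enumerate(tokens_a)
        for j, b in enumerate(tokens_b)
    ]
    cells.sort(reverse=True)

    used_rows, used_cols, total = set(), set(), 0.0
    for score, i, j in cells:
        # Take the best remaining cell whose row and column are both unused.
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            total += score

    # Normalize by the longer token list so unmatched tokens lower the score.
    return total / max(len(tokens_a), len(tokens_b))


def is_link(ref_a, ref_b, threshold, stop_words=frozenset()):
    """Link the pair when the aggregate similarity reaches the threshold."""
    return aggregate_similarity(ref_a, ref_b, stop_words) >= threshold


# Hypothetical unstandardized references with tokens in different orders.
ref_1 = "John Smith 123 Oak Street"
ref_2 = "Smith John 123 Oak St"

score_plain = aggregate_similarity(ref_1, ref_2)
score_stopped = aggregate_similarity(ref_1, ref_2, frozenset({"street", "st"}))

print("without stop words:", score_plain)
print("with stop words:", score_stopped)
print("linked at threshold 0.95:", is_link(ref_1, ref_2, 0.95))
```

In this example, removing the two street-type tokens raises the score from 0.9 to 1.0. That illustrates why the threshold and the stop-word list have to be tuned together.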

Faculty of Management, University of Tehran Journal of Information Technology Management 2980-7972 10 4 2018 12 01 Framework for Prioritizing Solutions in Overcoming Data Quality Problems Using Analytic Hierarchy Process (AHP) 27 40 72759 10.22059/jitm.2019.274888.2333 EN Imelda Doharta Aritonang Department of Information System, Faculty of Computer Science, Universitas Indonesia, Depok, West Java. Achmad Nizar Hidayanto Prof., Department of Information System, Faculty of Computer Science, Universitas Indonesia, Depok, West Java. Nur Fitriah Ayuning Budi MSc., Department of Information System, Faculty of Computer Science, Universitas Indonesia, Depok, West Java. Rahmat M. Samik Ibrahim MSc., Department of Information System, Faculty of Computer Science, Universitas Indonesia, Depok, West Java. Solikin Solikin MSc., STMIK Bina Insani, Bekasi, Jawa Barat. Journal Article 2019 01 28 The Central Statistics Agency (BPS) is a government institution that has the authority to carry out statistical activities in the form of censuses and surveys. It produces the statistical data needed by the government, the private sector, and the general public as a reference in planning, monitoring, and evaluating development results. Providing quality statistical data is therefore critical because it affects the effectiveness of decision making. This paper aims to develop a framework for prioritizing solutions to data quality problems using the Analytic Hierarchy Process (AHP). The framework is built by conducting interviews and a Focus Group Discussion (FGD) with experts to identify the interrelationships between problems and solutions. The resulting model is then tested in a case study at the Central Jakarta office of BPS. The results of the study indicate that the proposed model can be used to formulate solutions to data problems in BPS.

https://jitm.ut.ac.ir/article_72759_38e5f7689bf9a665b2dc54f471bc159f.pdf
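
The abstract above does not give the comparison matrices or say how the priorities were derived. As a rough illustration of how AHP turns expert pairwise judgments into a ranking, here is a minimal Python sketch. It uses the row geometric-mean approximation of the priority vector, a common shortcut rather than necessarily the paper's method, together with Saaty's consistency ratio. The three candidate solutions and the judgment values are hypothetical.

```python
import math

# Saaty's random consistency index for matrix sizes 1 through 10.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}


def ahp_priorities(matrix):
    """Approximate the priority vector with normalized row geometric means."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Consistency ratio CR = CI / RI; values below about 0.1 are usually accepted."""
    n = len(matrix)
    if n <= 2:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    weighted = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(weighted[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical candidate solutions to a data quality problem.
solutions = ["staff training", "validation rules", "process audit"]

# Hypothetical expert judgments: entry [i][j] says how strongly
# solution i is preferred over solution j (reciprocal matrix).
judgments = [
    [1.0, 2.0, 4.0],
    [1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 2, 1.0],
]

weights = ahp_priorities(judgments)
ratio = consistency_ratio(judgments, weights)

for name, weight in sorted(zip(solutions, weights), key=lambda p: -p[1]):
    print(f"{name}: {weight:.3f}")
print(f"consistency ratio: {ratio:.3f}")
```

Because the example judgments are perfectly consistent, the weights come out as 4/7, 2/7, and 1/7 with a consistency ratio of zero. Real expert judgments are rarely this tidy, which is why the consistency check is part of the procedure.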

Faculty of Management, University of Tehran Journal of Information Technology Management 2980-7972 10 4 2018 12 01 Investigating the Role of Code Smells in Preventive Maintenance 41 63 72760 10.22059/jitm.2019.274968.2335 EN Junaid Ali Reshi, M.Tech Student, Department of Computer Science and Technology, Central University of Punjab, Bhatinda, Punjab, India. 0000-0002-9475-5943 Satwinder Singh Assistant Prof., Department of Computer Science and Technology, Central University of Punjab, Bhatinda, Punjab, India. 0000-0001-8689-9878 Journal Article 2019 01 31 The quest to improve software quality has given rise to various studies that focus on enhancing quality through various processes. Code smells, which are indicators of software quality, have not been studied extensively to determine their role in predicting software defects. This study aims to investigate the role of code smells in the prediction of non-faulty classes. We examine four versions of the Eclipse software (3.2, 3.3, 3.6, and 3.7) for metrics and smells. Further, different code smells, derived subjectively through iPlasma, are taken in conjunction, and three efficient but subjective models are developed to detect code smells, one each with the Random Forest, J48, and SVM machine learning algorithms. These models are then used to detect the absence of defects in the four Eclipse versions. The effect of balanced and unbalanced datasets is also examined for these four versions. The results suggest that code smells can be a valuable feature in discriminating the absence of defects in software.

https://jitm.ut.ac.ir/article_72760_08e4a5599636c1e62e341c1d67adff80.pdf

Faculty of Management, University of Tehran Journal of Information Technology Management 2980-7972 10 4 2018 12 01 Big Data Quality: From Content to Context 64 71 72762 10.22059/jitm.2019.72762 EN Ahmad Khalilijafarabad PhD, Department of Information Technology Management, Faculty of Management, University of Tehran, Tehran, Iran. Journal Article 2019 09 21 Over the last 20 years, and particularly with the advent of Big Data and analytics, Data and Information Quality (DIQ) has remained a fast-growing research area. There are many views and streams in DIQ research, generally aiming at improving the effectiveness of decision making in organizations. Although many studies aim to clarify the role of Big Data quality for organizations, there is no comprehensive literature review that shows the main differences between traditional data quality research and Big Data quality research. This paper analyzes the papers published on Big Data quality and finds that there is almost no new mainstream in Big Data quality research. It shows that the main concepts of data quality do not change in the Big Data context and that only some new issues have been added to this area.

https://jitm.ut.ac.ir/article_72762_ade9cf8ca3448807a0edb217c756179d.pdf

Faculty of Management, University of Tehran Journal of Information Technology Management 2980-7972 10 4 2018 12 01 Perspectives of Big Data Quality in Smart Service Ecosystems (Quality of Design and Quality of Conformance) 72 83 72763 10.22059/jitm.2019.72763 EN Markus Helfert Ph.D., Head of Business Informatics Group, Department of Computing, Dublin City University, Dublin, Ireland. Mouzhi Ge Associate Professor, Department of Computer Systems and Communications, Faculty of Informatics, Masaryk University, Brno, Czech Republic. Journal Article 2019 09 21 Despite the increasing importance of data and information quality, current research related to Big Data quality is still limited. In particular, it is unclear how to apply previous data quality models to Big Data. In this paper we review Big Data quality research from several perspectives and apply a known quality model, with its elements of conformance to specification and design, in the context of Big Data. Furthermore, we extend this model and demonstrate its utility by analyzing the impact of three Big Data characteristics, namely volume, velocity, and variety, in the context of smart cities. This paper intends to build a foundation for further empirical research to understand Big Data quality and its implications in the design and execution of smart service ecosystems.

https://jitm.ut.ac.ir/article_72763_804dac558197e9e9dc2997c751d2eff9.pdf