Document Type : Special Issue on Pragmatic Approaches of Software Engineering for Big Data Analytics, Applications and Development
Authors
1 Ph.D., Department of Computer Science and Engineering, NSUT East Campus, Ambedkar Institute of Advanced Communication Technologies and Research, GGSIPU, India.
2 M.Tech., Department of Computer Science and Engineering, NSUT East Campus, Ambedkar Institute of Advanced Communication Technologies and Research, GGSIPU, India.
3 Professor, Department of Computer Science and Engineering, NSUT East Campus, Ambedkar Institute of Advanced Communication Technologies and Research, India.
Abstract
Keywords
Introduction
Intrusion detection systems (IDS) play a vital role in securing networks against varied intrusions. An IDS can detect unauthorized access to systems arising from network attacks and take effective preventive measures according to the required level of security. Security operations centres manage IDS and computer hosts to apply attack countermeasures. (Vinayakumar et al., 2019) found that although many commercial IDS are available, most of these commercial implementations are ineffective and insufficient, which motivates further research on dynamic IDS. Several Network Intrusion Detection Systems (NIDS) exist for precisely observing data flows to recognize intrusions in a network. The two main approaches to IDS are misuse-based IDS and anomaly-based IDS.
(Singh, Kalra & Solanki, 2019) found that in misuse-based IDS each known attack holds a particular signature, based on network packet patterns and data flow characteristics. The signatures of known attacks are compared with the observed data flow patterns to detect intrusions. Misuse-based IDS offer high precision and accuracy with few false positives, but they are unable to detect new, previously unseen intrusions. The other form is anomaly-based IDS. (Xu et al., 2018) found that in this approach the model learns normal activity, and any suspicious behaviour that deviates from the normal behaviour is treated as an attack. Anomaly-based detection is preferred over signature-based IDS because it is able to detect unknown attacks. Anomaly-based IDS widely use techniques such as neural networks, deep neural networks, clustering and k-means. Unsupervised machine learning techniques build their learned and tested models on factors such as reliability, consistency and confidentiality across several activities.
Figure 1. IDS Framework (Xu et al.,2018)
Figure 1 shows the IDS framework used to evaluate optimal performance with various classifiers for improving the precision and accuracy of IDS. The classification machine learning algorithms process the data in three phases. The first is the pre-processing phase, in which the data is sensed, as not every dataset is directly readable. (Koli and Chavan, 2017) found that the first and most crucial step is to sense the information correctly and, if the information is text-categorical, to convert this categorical data into binary categorical data for classification. The model executes faster with binary data and can be trained more precisely.
(Al-Jallad, Aljnidi & Desouki, 2019) found that the next phase is to determine the categories for classification: if the number of categories is limited and suits the classification mode, the process can proceed, but if there are too many categories the performance decreases and the results become inaccurate.
To overcome this issue, the best methodology is to merge sample classes into a limited number of categories on the basis of their features. This reduces the dimensionality and improves performance. It is also necessary to determine whether the classes are balanced or unbalanced and to perform the basic pre-processing operations.
In the pre-processing phase, the authors use the KDD Cup99 dataset for all models. The dataset is pre-processed by splitting it into training and testing sets and transforming the training data into a standard form; the training phase then applies the training data to the model so the outcomes can be analysed.
Machine learning algorithms are divided into supervised learning and unsupervised learning. (Xu et al., 2018) found that IDS models built with supervised learning algorithms follow a signature-based approach, where the attacks are already known and prerequisite signatures exist for the training datasets; typical algorithms include support vector machines (SVM), linear regression, Naive Bayes, logistic regression, random forest, linear discriminant analysis, decision trees, and neural networks. These algorithms are the ones most commonly used to model signature-based IDS. The training datasets and parameters determine the accuracy of signature-based machine learning IDS.
Over the last two decades, information technology has grown rapidly and computer security has become essential for industry, business and various other fields.
Problem Statement
The available IDS are largely misuse-based, which means they detect known intrusions but are unable to detect unknown ones. To solve the issues related to misuse-based IDS, the concept of anomaly-based IDS was introduced for the detection of newly emerging intrusions. The problem with existing security systems is that the behaviour of intrusions changes periodically, requiring re-training of the systems, and the system may predict abnormal behaviour as normal if the training set contains intrusions. To overcome these issues, the following objectives have been set.
Objectives
The remainder of the paper is organized as follows. Section 1 includes the introduction to IDS. Section 2 presents the literature review. Section 3 describes the process framework. Section 4 covers the framework deployment. Section 5 presents the learning algorithm computations. Section 6 includes the performance analysis. Section 7 describes the research limitations, and Section 8 includes the conclusion and future scope.
Literature Review
Information security in every sector makes it necessary to build systems that provide highly secure, safe and reliable communication among organizations, assets and networks operating over the internet under the threat of intrusion and misuse. (Xu et al., 2018) used two different datasets, one of Modbus-based gas pipeline control traffic and another of OPCUA-based batch processing traffic, to detect attacks using SVM and Random Forest machine learning algorithms. The authors found that, on the first dataset, SVM achieved an accuracy of 92.53 percent with an execution time of 11712 seconds and Random Forest achieved 99.84 percent with an execution time of 281 seconds; on the second dataset, SVM achieved 90.81 percent accuracy with an execution time of 0.019 seconds and Random Forest achieved 99.98 percent with an execution time of 52.31 seconds.
(Almseidin et al., 2017) conducted various experiments to evaluate the performance and efficiency of machine learning algorithms such as Naive Bayes, Random Forest, J48, Random Tree and Decision Tree. These classifiers were trained on the KDD dataset, and Random Forest showed the maximum accuracy rate of 93.77 percent. (Singh, Kalra & Solanki, 2019) reviewed IDS based on a hybrid approach using machine learning algorithms such as SVM and KNN for feature extraction and data classification.
(Mohammadi & Namadchian, 2017) performed several experiments to determine the performance of machine learning algorithms on the KDD dataset. The results showed that the minimum false negative value was obtained by the decision table classifier, while the highest accuracy and precision, together with the minimum false positive rate, were obtained by the random forest classifier with an accuracy of 93.77 percent.
(Abubakar & Pranggono, 2017) presented a machine learning IDS for SDN. The authors used signature-based and flow-based IDS to detect intrusions in the software-defined network, and the pattern recognition accuracy was compared with other neural network models. The drawbacks of signature-based IDS were overcome by combining the flow-based anomaly technique with machine learning, and the results show 97 percent accuracy for the trained model. (Yin et al., 2017) used the KDD Cup99 dataset to determine the best algorithm; the implemented experiments show that SVM can be a useful tool for IDS, achieving an average accuracy rate of 94.43 percent.
(Vinayakumar et al., 2019) presented a graph-based semi-supervised clustering technique together with a precise outlier detection approach for the intrusion problem in a hybrid framework. The authors proposed a multistage system using machine learning algorithms to construct an IDS, which is of foremost importance in information security, and the experiments performed on the fetched datasets show the improvement of the introduced methods. (Aljawarneh et al., 2018) proposed a hybrid model to determine the intrusion threshold value using optimal features and network transactions for training, and found an accuracy of 99.81 percent for binary classification and 98.56 percent for the multi-class NSL-KDD dataset.
Existing Issues
Process Framework
The process framework proposed by the authors defines the procedure followed when using machine learning algorithms for IDS. The authors used the KDD Cup99 dataset for the analysis. The dataset is pre-processed using the Python programming language, where data ambiguity is removed, the dataset is divided into independent and dependent variables, and duplicate values are dropped.
In the pre-processing phase, the authors take the dataset in CSV file format and import it for data transformation by loading it into a data frame, making it easy to read. The redundancy in the dataset is then removed, after which a binary encoder converts all text categorical data into binary data, since the model accepts only numeric and binary data; binary encoding makes execution faster and also resolves the problem of columns that contain a large number of categories. The dataset is then split into independent and dependent data: of the fifty-five columns in total, the independent data contains fifty-four columns and the target (dependent) variable is the last column. After extracting the data, the next step is to find missing values, if any exist, and replace them with the mean. The data is then split into training and test sets in a 75:25 ratio for cross validation. Finally, the training data is converted into a standard form, although the dimensionality remained large, with forty-one columns.
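A minimal sketch of this pre-processing stage, in Python with scikit-learn and the category_encoders package, is given below; the file name kddcup99.csv and the assumption that the class label is the last column are illustrative, not taken from the paper.

import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the records into a data frame and drop duplicate rows
df = pd.read_csv("kddcup99.csv").drop_duplicates()

# Independent variables (all but the last column) and the dependent target (last column)
X = df.iloc[:, :-1]
Y = df.iloc[:, -1]

# Binary-encode the text categorical columns (e.g. protocol_type, service, flag)
text_cols = X.select_dtypes(include="object").columns.tolist()
X = ce.BinaryEncoder(cols=text_cols).fit_transform(X)

# Replace any missing values with the column mean
X = pd.DataFrame(SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(X), columns=X.columns)

# 75:25 train/test split followed by standardization of the features
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)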
In the next phase, the principal component analysis algorithm is used to obtain principal components by reducing the dimensionality. In the training phase, the authors use the result of the principal component analysis to train the model. In the prediction phase, the trained model is applied to the test data to predict or classify attacks, after which a confusion matrix is computed to evaluate the outcomes of the proposed methodology.
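A sketch of the dimensionality reduction step is shown below, reusing the X_train and X_test arrays from the pre-processing sketch above; the number of components is an assumption, not the paper's setting. The classifier-specific sketches in the later sections then consume the reduced arrays, and the confusion matrix of their predictions can be obtained with sklearn.metrics.confusion_matrix.

from sklearn.decomposition import PCA

# Project the standardized features onto a smaller set of principal components
pca = PCA(n_components=10)                    # illustrative component count
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)            # apply the same projection to the test data
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained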
Figure 2. Proposed IDS Framework (Yin et al., 2017)
Figure 2 shows the proposed IDS framework for operating machine learning algorithms for intrusion detection, where the KDD Cup99 dataset is used to generate classifiers through dataset training and testing. These classifiers generate the sensor result as false or normal values.
Framework Deployment
The proposed framework is implemented using the proposed algorithm, and to evaluate the performance of the system the authors used the standard KDD Cup99 dataset for intrusion detection. The general steps of the proposed system's algorithm can be divided into three phases: the pre-processing phase, the training phase and the prediction phase.
Pre-processing phase:
Import the dataset.
dataset <- import(csv file)
If the dataset is not in a suitable form, transform it into a data frame.
data frame <- dataset
Split the dataset into independent and dependent variables.
X <- dataset[feature columns], Y <- dataset[target column]
If duplicate records exist, drop them.
If text categories exist, transform the categorical data from text to binary form.
X[text columns] <- binary_encoder(X[text columns])
If missing values exist, replace them with the column mean.
imputer <- SimpleImputer(missing_values = np.nan, strategy = 'mean')
X <- imputer.fit_transform(X)
Divide the dataset into a training set and a testing set with a test size of 0.25.
X_train, X_test, Y_train, Y_test <- split(X, Y, test_size = 0.25)
If the dimensionality is large, perform dimensionality reduction.
pca <- PCA(number of components); pca.fit(X_train)
Training phase:
Use the pre-processed data.
Fit the independent and dependent training sets to the model.
classifier.fit(X_train, Y_train)
Execute and obtain the result.
Prediction phase:
Use the pre-processed test data.
Predict the outcomes by passing the independent testing set to the classifier.
Y_pred <- classifier.predict(X_test)
Data Analytics Tools
The authors used Python with Pandas, including pandas data frames and related libraries, to develop the IDS using machine learning techniques. Python and Pandas are widely used in various areas including data analytics, economics and statistics. Pandas is a BSD-licensed open-source Python library that provides high-performance, user-friendly data analysis tools for the Python programming language. The Pandas library builds on NumPy functionality; NumPy and Pandas are among the most important libraries in Python.
Data Description
The authors used the KDD Cup99 dataset, which is widely used for evaluating anomaly-based IDS. This 1999 dataset includes a wide variety of intrusions captured in a network environment, based on data from the DARPA'98 IDS program. The KDD Cup99 training dataset contains about 4,900,000 single connection vectors, each described by 41 features and labelled as either normal or a specific intrusion type. The intrusions fall into one of the following categories: denial of service (DoS), remote to local (R2L), user to root (U2R) and probing attacks.
Most of the attacks known to the system are assigned unique signatures, which makes it feasible to detect them. The KDD Cup99 dataset contains 24 attack types in the training data and an additional 14 attack types in the test data.
Figure 3 shows the main features of the KDD Cup99 dataset. There are ten main features on which the outcome of the classifier depends, and by using these features the detection rate, training time and accuracy can be improved. These features are referenced by the column indexing of the dataset.
The graphs represent the relationship between independent and dependent data, as well as the relationships among categorical features, to identify the best features for training the model or classifier.
Figure 4. Categorical Data Relationship
Figure 4 shows the categorical data relationships among features, describing how related features vary with one another. The graphs show the relations between Label and Service, Label and Protocol_type, Label and Flag, and Protocol_type and Service.
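A hedged way to reproduce this kind of categorical comparison is to cross-tabulate the class label against each categorical feature; the column names below follow the standard KDD Cup99 feature set and the file name is illustrative.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("kddcup99.csv")   # illustrative file name

# Stacked bar chart of the label distribution within each categorical feature
for col in ["protocol_type", "service", "flag"]:
    pd.crosstab(df[col], df["label"]).plot(kind="bar", stacked=True, title="label vs " + col)
plt.show()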
Formulation of Prediction Metrics
The authors specify that prediction metrics are used to determine the results produced by the algorithmic computations.
Here TP denotes true positives, TN true negatives, FN false negatives and FP false positives.
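From these four counts, the standard prediction metrics reported in the classification tables can be written as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$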
Learning Algorithms Computations
Machine learning algorithms are divided into supervised learning and unsupervised learning. (Xu et al., 2018) found that IDS models built with supervised learning algorithms follow a signature-based approach, where the attacks are already known and prerequisite signatures exist for the training datasets; the algorithms considered here are support vector machines (SVM), Naive Bayes and random forest. The selection of classification algorithms is made by considering key factors such as interpretability, accuracy, nature of the data, model assumptions and rate of convergence.
Naive Bayes
Naive Bayes classifiers are a family of classification algorithms based on Bayes' theorem. (Aburomman and Reaz, 2017) found that Naive Bayes relies on the primary principle that every feature being classified is independent of every other feature. The Naive Bayes algorithm is defined entirely in terms of conditional probability and maximum likelihood estimation. Naive Bayes classifiers are thus probabilistic classifiers based on Bayes' theorem with naive independence assumptions among the features.
$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)} \qquad (9)$$
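A minimal sketch of fitting a Gaussian Naive Bayes model is given below, assuming the PCA-reduced arrays X_train_pca and X_test_pca and the labels Y_train and Y_test from the earlier sketches; the paper's exact model settings are not specified, so scikit-learn defaults are used.

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

nb = GaussianNB()
nb.fit(X_train_pca, Y_train)
print(classification_report(Y_test, nb.predict(X_test_pca)))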
Table 1. Performance of Naïve Bayes classifier
Class | Precision | Recall | F1-Score | Support
0 | 0.99 | 0.77 | 0.87 | 10894
1 | 0.07 | 1.00 | 0.12 | 405
2 | 0.24 | 0.50 | 0.32 | 197
3 | 0.34 | 0.88 | 0.01 | 8
4 | 0.99 | 0.69 | 0.81 | 17613
Accuracy | | | 0.73 | 29117
Macro Avg | 0.46 | 0.77 | 0.43 | 29117
Weighted Avg | 0.97 | 0.73 | 0.82 | 29117
Table 1 presents the classification report of the model, listing the five classes and reporting the accuracy, precision, recall and F1-score for the Naive Bayes classification algorithm.
Figure 6 compares the trained and test models graphically, showing that both results are close to each other in terms of classification performance.
Support Vector Machine
SVM is a supervised machine learning algorithm comprising a family of learning methods for classification, regression and data analysis. (Yin et al., 2017) found that SVM is a discriminative classifier defined by separating hyperplanes. The SVM algorithm generates the optimal hyperplane that classifies the given labelled training dataset. SVMs are support vector networks in which the model is a representation of the data points; the points in the vector space are mapped into different categories divided by a clear gap. SVM can effectively perform linear and non-linear classification by mapping inputs into high-dimensional feature spaces.
The SVM learning method is a supervised learning method used for classification. It is called a margin classifier because SVM simultaneously decreases the empirical classification error and maximizes the geometric margin. (Yin et al., 2017) found that in SVM two parallel hyperplanes are constructed to divide the data inputs, creating maximally separated hyperplanes after transforming the vectors into a high-dimensional space. The generalization error depends on the margin between these hyperplanes.
Consider the points <(r1, p1), (r2, p2), ..., (rn, pn)>. For a two-dimensional vector the separating hyperplane is w · r + b = 0, where w is the direction (normal) vector and b is a scalar constant. Once the hyperplane is created, it is used to make predictions through the hypothesis function H(r).
$$H(r) = \begin{cases} +1 & \text{if } w \cdot r + b \geq 0 \\ -1 & \text{if } w \cdot r + b < 0 \end{cases} \qquad (10)$$
Points above the plane are assigned to class +1 and points below the plane to class -1. The goal of the hyperplane is to separate the data accurately.
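Equation (10) can be read directly as a small decision function; the sketch below is illustrative and assumes a weight vector w and bias b that have already been learned.

import numpy as np

def hypothesis(w, b, r):
    # Class +1 for points on or above the hyperplane, class -1 for points below it
    return 1 if np.dot(w, r) + b >= 0 else -1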
The following snippet creates a synthetic dataset in which P holds the sample points and Q holds their two class labels (the garbled call is reconstructed as scikit-learn's make_blobs):

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

P, Q = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.40)
plt.scatter(P[:, 0], P[:, 1], c=Q, s=30)   # visualize the two classes
plt.show()
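Continuing the synthetic example, a linear SVM can be fitted to the generated points to recover the separating hyperplane; this is a sketch of the idea, not the configuration used for the KDD Cup99 experiments.

from sklearn.svm import SVC

svm = SVC(kernel="linear", C=1.0)
svm.fit(P, Q)
print(svm.coef_, svm.intercept_)   # the learned w and b of the separating hyperplane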
Table 2. Performance of SVM classifier
Class | Precision | Recall | F1-Score | Support
0 | 1.00 | 1.00 | 1.00 | 10894
1 | 0.99 | 0.97 | 0.98 | 405
2 | 0.92 | 0.93 | 0.93 | 197
3 | 0.80 | 0.50 | 0.50 | 8
4 | 1.00 | 1.00 | 1.00 | 17613
Accuracy | | | 1.00 | 29117
Macro Avg | 0.94 | 0.88 | 0.90 | 29117
Weighted Avg | 1.00 | 1.00 | 1.00 | 29117
Table 2 shows the accuracy, precision, recall and F1-score of the linear SVM classification algorithm.
Figure 8 compares the trained and test models graphically, showing that both results are close to each other in terms of classification performance.
Random Forest
Random forest is another classification and regression model and a supervised learning algorithm. (Xu et al., 2018) found that the random forest algorithm generates decision trees from randomly drawn data samples, obtains a prediction from each tree, and selects the best solution. The random forest method is preferred over a single decision tree because it averages the results, thereby reducing overfitting. The algorithm involves randomly selecting samples from the dataset, constructing a decision tree for each sample, and obtaining a prediction from every tree; the most precise (majority) prediction is selected as the final result.
Random forests are a combination of tree classifiers in which each individual decision tree depends on an independently sampled random vector, with the same distribution for all trees in the forest. The random forest algorithm combines the outcomes of the generated decision trees, which act as base classifiers. (Yin et al., 2017) found that the key issues for the generalization of such classifiers are the correlation between base trees and the strength of each individual decision tree.
$$\mathrm{normfi}_i = \frac{fi_i}{\sum_{j \in \text{all features}} fi_j} \qquad (11)$$

where normfi_i is the normalized importance of feature i and fi_i is the importance of feature i.

$$\mathrm{RFfi}_i = \frac{\sum_{j \in \text{all trees}} \mathrm{normfi}_{ij}}{T} \qquad (12)$$

where RFfi_i is the importance of feature i calculated from all trees, normfi_ij is the normalized importance of feature i in tree j, and T is the total number of trees.
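In scikit-learn the averaged, normalized importances of equation (12) are exposed through the feature_importances_ attribute; a brief sketch, assuming the pre-processed X_train and Y_train from the earlier sketches (before PCA, so the importances refer to the encoded input features):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, Y_train)

# feature_importances_ sums to 1 across features, matching the normalization above
print(rf.feature_importances_)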
Table 3. Performance of Random Forest classifier
Class | Precision | Recall | F1-Score | Support
0 | 1.00 | 1.00 | 1.00 | 10894
1 | 1.00 | 1.00 | 1.00 | 405
2 | 0.98 | 0.97 | 0.98 | 197
3 | 0.71 | 0.62 | 0.67 | 8
4 | 1.00 | 1.00 | 1.00 | 17613
Accuracy | | | 1.00 | 29117
Macro Avg | 0.94 | 0.92 | 0.93 | 29117
Weighted Avg | 1.00 | 1.00 | 1.00 | 29117
Table 3 presents the accuracy, precision, recall and F1-score for the Random Forest classification algorithm.
Figure 10 compares the trained and test models graphically, showing that both results are close to each other in terms of classification performance.
Performance Analysis
The performance of the SVM, Random Forest and Naive Bayes machine learning algorithms is analysed through the classification reports of all the models, which identify the optimal classifier for this classification task.
Table 4. Performance of Classifiers
Method | Accuracy | Precision | Recall | F1-Score
Naïve Bayes | 0.73 | 0.5 | 0.77 | 0.73
Linear SVM | 1.00 | 1.00 | 0.88 | 0.91
Random Forest | 1.00 | 0.99 | 0.89 | 0.91
Table 4 summarises the accuracy, precision, recall and F1-score of the Naive Bayes, SVM and Random Forest classification algorithms.
Figure 11. Classification Result Analysis
Figure 11 shows the classification result analysis as a graph of the accuracy, precision and recall values for Naive Bayes, SVM and Random Forest.
Detection Rate
The detection rate is computed as the ratio between the number of correctly detected intrusions and the total number of intrusions.
$$\text{Detection Rate} = \frac{\text{Number of correctly detected intrusions}}{\text{Total number of intrusions}} \qquad (13)$$
According to the detection rate, Random Forest turns out to be the best-suited classifier among Naive Bayes, SVM and Random Forest. The classification reports give a detection rate of 0.971 for Naive Bayes, 0.994 for linear SVM and 0.999 for Random Forest.
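Equation (13) can be evaluated per class from a confusion matrix, as in the short sketch below; Y_test and the predicted labels Y_pred are assumed to come from one of the classifier sketches above.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, Y_pred)   # Y_pred: predictions of any of the fitted classifiers
# Diagonal entries are correctly detected instances of each class;
# dividing by the row totals gives the detection rate of each class.
per_class_detection_rate = cm.diagonal() / cm.sum(axis=1)
print(per_class_detection_rate)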
Research Limitations
This section presents the limitations encountered by the authors in deploying the IDS model and the constraints of the research approach they followed.
Conclusion and Future Scope
Information security is an important aspect of building secure and reliable networks in the face of the rapid development of information technology. In this paper the authors deployed an IDS model using the machine learning algorithms SVM, Naive Bayes and Random Forest to detect intrusions and compared the performance of these algorithms. The authors used the KDD Cup99 dataset to compare the accuracy, precision and detection rate of all the algorithms, and used a binary encoder for handling categorical data. The performance analysis shows that Random Forest is the best classifier among Naive Bayes, SVM and Random Forest, with the classification reports giving detection rates of 0.971 for Naive Bayes, 0.994 for SVM and 0.999 for Random Forest. The paper provides reasonable insight into the research objectives. Future work in the area of IDS includes: