Author: Josh Amata
Last Updated: Wed, Oct 26, 2022
Machine learning (ML) is a subset of Artificial Intelligence (AI) focused on building systems that learn by leveraging data to improve task performance. These systems run on machine learning algorithms that grasp patterns and make predictions from data. Examples of machine learning systems include speech recognition, email filtering, and computer vision.
Machine learning has two major approaches:
Supervised Learning: uses labeled datasets to train algorithms that classify data or predict outcomes. As input data is fed into the model, its weights are adjusted until the model fits the data appropriately. Use cases include spam classification in emails.
Unsupervised Learning: utilizes unlabeled datasets to discover patterns for association or clustering problems.
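The contrast between the two approaches can be sketched with Scikit-Learn on a small synthetic dataset. The data, the variable names, and the choice of LogisticRegression and KMeans are illustrative assumptions and are not part of this guide's banknote example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two well-separated groups of 2-D points (synthetic, for illustration only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(20, 2))
X = np.vstack([group_a, group_b])

# Supervised: the labels are known and passed to fit() together with the data.
y = np.array([0] * 20 + [1] * 20)
clf = LogisticRegression().fit(X, y)
supervised_preds = clf.predict(np.array([[0.2, -0.1], [4.8, 5.3]]))

# Unsupervised: only X is passed; the algorithm groups the points itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print("Supervised predictions:", supervised_preds)
print("Cluster sizes:", np.bincount(cluster_ids))
```

The cluster numbers assigned by KMeans are arbitrary identifiers, so they do not necessarily match the 0 and 1 used as labels in the supervised case.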
Scikit-Learn is a Python module for machine learning built on top of SciPy, NumPy, and matplotlib. It provides efficient tools for predictive data analysis, including classification algorithms for applications like spam detection and image recognition, regression algorithms for applications like stock price analysis, and clustering algorithms for grouping applications.
This guide explains how to build a machine learning classifier in Python using Scikit-Learn.
Working knowledge of Python.
Properly installed and configured Python toolchain, including pip (Python version >= 3.3).
To create an isolated virtual environment for your application:
Install the virtualenv Python package:
$ pip install virtualenv
Create the project directory:
$ mkdir ml_classifier
Navigate into the new directory:
$ cd ml_classifier
Create the virtual environment:
$ python3 -m venv env
This creates a new folder named env containing scripts to control the virtual environment, including program libraries.
Activate the virtual environment:
$ source env/bin/activate
To install Scikit Learn, enter the following command:
$ pip install -U scikit-learn
Pandas is a fast, flexible and powerful tool used for data analysis and manipulation that makes working with labeled data straightforward and intuitive through expressive data structures (including Series and DataFrame). Series is a one-dimensional labeled array capable of holding data of any type, while DataFrame is a two-dimensional mutable tabular data structure with labeled axes (rows and columns). Pandas features include:
Size mutability: DataFrames support insertion or deletion of columns.
Effortless handling of missing data.
Intuitive merging and joining of datasets
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
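The features listed above can be sketched on a few made-up rows. The table contents and variable names below are hypothetical and only serve to show each operation.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([1.5, 2.0, np.nan], index=["a", "b", "c"], name="price")

# A DataFrame is a two-dimensional table with labeled rows and columns.
items = pd.DataFrame(
    {"item": ["pen", "ink", "pad"], "qty": [3, 1, 2]},
    index=["a", "b", "c"],
)

# Size mutability: insert columns, then delete one.
items["price"] = prices                          # aligned on the index labels
items["total"] = items["qty"] * items["price"]
items = items.drop(columns=["total"])

# Missing data: replace the NaN price with a default value.
items["price"] = items["price"].fillna(0.0)

# Merging: join with a second table on a shared column.
colors = pd.DataFrame({"item": ["pen", "pad"], "color": ["blue", "white"]})
merged = items.merge(colors, on="item", how="left")

# Label-based slicing: rows "a" through "b" (inclusive) and selected columns.
subset = items.loc["a":"b", ["item", "price"]]

print(merged)
print(subset)
```

Note that label-based slices with .loc include the end label, unlike positional Python slices.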
To install pandas, enter the following command:
$ pip install pandas
This guide uses pandas to load and manipulate the dataset.
The dataset used in this guide is the Banknote Authentication Dataset. This dataset contains 1372 items representing images of bank notes with four predictor variables (image variance, skewness, kurtosis, entropy), and an encoded variable to predict the note's authenticity (0 - authentic, 1 - forged).
3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
…
-1.3971,3.3191,-1.3927,-1.9948,1
0.39012,-0.14279,-0.031994,0.35084,1
…
This guide uses this dataset to build a machine-learning classification model to predict the authenticity of a note.
Download the dataset specified above into the project directory. The directory should look like this:
.
├── env
└── data_banknote_authentication.txt
data_banknote_authentication.txt is a CSV (Comma-Separated Values) file containing the dataset.
Create a main.py file inside the project directory:
$ touch main.py
Open main.py, and add the following line:
import pandas as pd
This imports the pandas library into scope to parse the downloaded data set. Load the dataset:
dataset = pd.read_csv(
    "data_banknote_authentication.txt",
    header=0,
    names=['image_variance', 'skewness', 'kurtosis', 'entropy', 'forged'])
This uses the read_csv function from the pandas library to load the downloaded dataset. read_csv takes the path to the dataset as its first argument. Because the dataset is in the same directory as the main.py file, the filename alone is enough.
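header=0 is an optional argument giving the row number to treat as the header row. When combined with names, pandas replaces that row with the supplied names. Because this file has no header row, its first data record is consumed as the header, which is why the loaded DataFrame contains 1371 rows rather than 1372. To keep every record, pass header=None instead.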
names=[...] is another optional argument specifying the user-defined column names to use. Together with header=0, it replaces the header row with the provided names, so the five columns are named 'image_variance', 'skewness', 'kurtosis', 'entropy', and 'forged', in that order.
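The effect of the header argument on a file without a header row can be checked on a few rows. The sketch below reads four rows copied from the dataset sample through io.StringIO (an assumption made so the example runs without the downloaded file) and compares header=0 with header=None.

```python
from io import StringIO

import pandas as pd

# Four rows copied from the dataset sample; the file has no header row.
sample = """3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
-1.3971,3.3191,-1.3927,-1.9948,1
0.39012,-0.14279,-0.031994,0.35084,1
"""
columns = ["image_variance", "skewness", "kurtosis", "entropy", "forged"]

# header=0: the first line is treated as a header and replaced by `names`.
with_header_0 = pd.read_csv(StringIO(sample), header=0, names=columns)

# header=None: every line is treated as data.
with_header_none = pd.read_csv(StringIO(sample), header=None, names=columns)

print(len(with_header_0), len(with_header_none))
```

With header=0 the first line is consumed as the header, so one record fewer is loaded. With header=None every line is kept as data.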
Printing the dataset displays the loaded DataFrame with the specified column names as follows:
print(dataset)
Output:
      image_variance  skewness  kurtosis  entropy  forged
0            4.54590   8.16740   -2.4586 -1.46210       0
1            3.86600  -2.63830    1.9242  0.10645       0
2            3.45660   9.52280   -4.0112 -3.59440       0
3            0.32924  -4.45520    4.5718 -0.98880       0
4            4.36840   9.67180   -3.9606 -3.16250       0
...              ...       ...       ...      ...     ...
1366         0.40614   1.34920   -1.4501 -0.55949       1
1367        -1.38870  -4.87730    6.4774  0.34179       1
1368        -3.75030 -13.45860   17.5932 -2.77710       1
1369        -3.56370  -8.38270   12.3930 -1.28230       1
1370        -2.54190  -0.65804    2.6842  1.19520       1

[1371 rows x 5 columns]
Next, add the following line to split the loaded dataset into its features and binary result set:
features = dataset[dataset.columns[0:4]]
forged = dataset['forged']
Pandas DataFrames let you index columns like you would a dictionary. The first line:
features = dataset[dataset.columns[0:4]]
Creates a new DataFrame variable called features containing the first four columns (image_variance, skewness, kurtosis, entropy) of the loaded dataset. These four columns are the features from which the model derives a binary classification: 0 for an authentic note and 1 for a forged note.
Next, create a new variable called forged containing the last column specifying the classification result:
forged = dataset['forged']
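The column selection can be tried on a tiny stand-in for the loaded dataset. The two rows below are made up, and features_by_name is a hypothetical variable showing an equivalent selection by column name.

```python
import pandas as pd

# A tiny stand-in for the loaded dataset (values are made up for illustration).
dataset = pd.DataFrame({
    "image_variance": [3.6, -1.4],
    "skewness": [8.7, 3.3],
    "kurtosis": [-2.8, -1.4],
    "entropy": [-0.4, -2.0],
    "forged": [0, 1],
})

# Positional selection, as used in this guide.
features = dataset[dataset.columns[0:4]]
forged = dataset["forged"]

# Equivalent selection by name: drop the label column.
features_by_name = dataset.drop(columns=["forged"])

print(type(features).__name__, features.shape)   # two-dimensional DataFrame
print(type(forged).__name__, forged.shape)       # one-dimensional Series
print(features.equals(features_by_name))
```

Selecting by name can be easier to read when the label column is not the last one, while the positional form matches the code used in this guide.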
Supervised learning is the machine learning approach used in this example.
When evaluating a classifier model, always test the model on data it has not seen during training. A split of 70% for training data and 30% for test data is a common choice. Train the model using the training data, and evaluate it using the test data. This approach measures the model's performance and robustness.
Scikit-Learn provides a train_test_split function which splits a given dataset into different sets. To use this function to split the data, add the following lines:
…
from sklearn.model_selection import train_test_split
…
# Split the data
train, test, train_labels, test_labels = train_test_split(features, forged, test_size=0.30, random_state=42)
The first line imports the train_test_split function. This function takes a variable number of array arguments (a sequence of indexables with the same length or shape) and five optional arguments for different tuning effects. Pass the features set containing the dataset attributes and the forged set containing the classification as arguments.
The test_size optional argument takes a float between 0.0 and 1.0 representing the proportion of the dataset to include in the test split. A value of 0.30 uses 30% of the dataset as test data. If neither test_size nor train_size is given, test_size defaults to 0.25.
The random_state optional argument controls the shuffling applied to the data before the split. It takes an integer in the range [0, 2^32 - 1] or a numpy.random.RandomState instance. Passing a fixed integer, such as 42, makes the split reproducible across runs; the particular value is arbitrary.
Randomization of data split between training and test data is important to remove selection and accidental bias.
The train_test_split function returns a list of 2 * (number of arrays passed) elements containing the train-test split of each input. Passing two arguments yields 2 * 2 = 4 elements:
train: training part of the features set.
test: test part of the features set.
train_labels: training part of the forged set containing the classification result.
test_labels: test part of the forged set containing the classification result.
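The number and size of the returned parts can be checked on a small synthetic input. The arrays X and y below are assumptions standing in for the features and forged sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 synthetic samples with 4 features each, plus one label per sample.
X = np.arange(80).reshape(20, 4)
y = np.array([0, 1] * 10)

# Two input arrays produce four output arrays (two per input).
parts = train_test_split(X, y, test_size=0.30, random_state=42)
X_train, X_test, y_train, y_test = parts

print(len(parts))                    # number of returned arrays
print(X_train.shape, X_test.shape)   # rows are divided between train and test
print(y_train.shape, y_test.shape)

# The same random_state reproduces the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(np.array_equal(X_test, X_test_again))
```

With 20 samples and test_size=0.30, the test arrays hold 6 rows and the training arrays hold the remaining 14.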
There are many machine learning models, each with its strengths and weaknesses. A Naive Bayes model combines Bayes' theorem with the assumption of class-conditional independence. This means that, given the class, the value of one feature is assumed not to affect the value of another, so each predictor contributes independently to the probability of a given outcome.
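The independence assumption can be illustrated with a simplified Gaussian Naive Bayes written in plain Python. This is a minimal sketch under stated assumptions, not Scikit-Learn's implementation: the function names fit_naive_bayes and predict_naive_bayes and the toy data are hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - ((x - mean) ** 2) / (2 * var)

def fit_naive_bayes(rows, labels):
    """Estimate a prior and a per-feature (mean, variance) for each class."""
    model = {}
    for cls in sorted(set(labels)):
        members = [row for row, label in zip(rows, labels) if label == cls]
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small constant avoids zero variance
        model[cls] = (len(members) / len(rows), stats)
    return model

def predict_naive_bayes(model, row):
    """Return the class with the highest log-prior plus summed log-likelihoods."""
    scores = {}
    for cls, (prior, stats) in model.items():
        score = math.log(prior)
        # Independence assumption: each feature adds its own term,
        # with no interaction between features.
        for x, (mean, var) in zip(row, stats):
            score += log_gaussian(x, mean, var)
        scores[cls] = score
    return max(scores, key=scores.get)

# Toy data (made up): class 0 clusters near (1, 1), class 1 near (5, 5).
rows = [(1.0, 1.2), (0.8, 0.9), (1.3, 1.1), (5.1, 4.9), (4.8, 5.2), (5.3, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_naive_bayes(rows, labels)
print(predict_naive_bayes(model, (1.1, 1.0)), predict_naive_bayes(model, (5.0, 5.1)))
```

Summing log-likelihoods is equivalent to multiplying the per-feature probabilities, which is valid only because the features are treated as independent given the class.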
Naive Bayes usually works well for binary classification tasks like this. To initialize the model, add the following lines:
…
from sklearn.naive_bayes import GaussianNB
…
# Initialize the classifier
gnb = GaussianNB()

# Train the classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)
First, import the GaussianNB class. Then, initialize the classifier:
# Initialize the classifier
gnb = GaussianNB()
After initializing the classifier, the next step is to train the model by fitting it to the data using the fit method:
# Train the classifier
model = gnb.fit(train, train_labels)
The fit method takes two arguments - a two-dimensional array-like object with n number of samples and m number of features and a one-dimensional array-like object with n target values for the classification.
This trains the model using the training split part of the dataset with its targeted classification.
After training the model, use the trained model to make predictions on the split test dataset by adding the following lines:
…
# Make predictions
preds = gnb.predict(test)
This uses the predict method to return a one-dimensional array composed of 0's and 1's representing the predicted values for the authenticity class (0 - authentic, 1 - forged).
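The shapes that fit and predict work with can be seen on a handful of synthetic samples. The arrays below are made up for illustration. predict_proba is an additional GaussianNB method that this guide does not otherwise use.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic training data: n = 6 samples, m = 2 features (illustrative values).
X_train = np.array([[1.0, 1.2], [0.8, 0.9], [1.3, 1.1],
                    [5.1, 4.9], [4.8, 5.2], [5.3, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])  # one target value per sample

gnb = GaussianNB()
fitted = gnb.fit(X_train, y_train)  # fit() returns the estimator itself

X_new = np.array([[1.1, 1.0], [5.0, 5.1]])
preds = gnb.predict(X_new)          # one predicted class per row
probs = gnb.predict_proba(X_new)    # one probability per class, per row

print(fitted is gnb)
print(preds)
print(probs.shape)
```

Because fit returns the same estimator object, the value it returns and the variable gnb refer to one trained classifier.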
To evaluate the model's accuracy, compare the test_labels and preds arrays using Scikit-Learn's accuracy_score function. Add the following lines:
…
from sklearn.metrics import accuracy_score
…
# Evaluate accuracy
accuracy = accuracy_score(test_labels, preds)
print("Naive Bayes accuracy -> ", accuracy)
First, import the accuracy_score function from sklearn. The function takes two one-dimensional arrays of the same size as arguments and computes the accuracy score using both arrays. The highest value is 1.0, which signifies the best performance.
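With its default settings, accuracy_score reports the fraction of predictions that match the true labels. The plain-Python sketch below computes the same quantity by hand; the accuracy helper and the label values are hypothetical and only illustrate the calculation.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Made-up labels: 4 of the 5 predictions match the true labels.
test_labels = [0, 0, 1, 1, 1]
preds = [0, 1, 1, 1, 1]
print(accuracy(test_labels, preds))  # 4 / 5
```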
While Naive Bayes is a good model for this banknote problem, real-world problems often require tests across different classification models. Another classification model is the Support Vector Machine (SVM), which maps data to a high-dimensional feature space for categorization even when the data is not linearly separable.
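The point about data that is not linearly separable can be illustrated with a synthetic dataset of two concentric rings. The sketch below assumes Scikit-Learn's make_circles generator and compares a linear kernel with the RBF kernel, which is the default kernel of SVC.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# score() returns the mean accuracy on the given test data.
linear_score = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_score = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", linear_score)
print("rbf kernel accuracy:   ", rbf_score)
```

On data like this, the RBF kernel should score noticeably higher than the linear kernel, because it can separate the rings in the implicit higher-dimensional feature space.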
To add this classification model, add the following lines:
…
from sklearn.svm import SVC
…
# Initialize the SVC classifier
sv = SVC()

# Train the classifier
sv_model = sv.fit(train, train_labels)

# Make predictions
sv_preds = sv.predict(test)

# Evaluate accuracy
sv_accuracy = accuracy_score(test_labels, sv_preds)
print("SVM accuracy -> ", sv_accuracy)
Adding multiple models lets you compare the accuracy across the different classifiers to find the best-performing one.
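One way to organize such a comparison is to loop over a dictionary of classifiers. The sketch below is an assumption about how this could be structured. It uses synthetic data from make_classification as a stand-in for the banknote file, so its scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the banknote data: 4 features, 2 classes.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
train, test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Train and score every classifier on the same split.
classifiers = {"Naive Bayes": GaussianNB(), "SVM": SVC()}
scores = {}
for name, clf in classifiers.items():
    clf.fit(train, train_labels)
    scores[name] = accuracy_score(test_labels, clf.predict(test))
    print(name, "accuracy ->", scores[name])

best = max(scores, key=scores.get)
print("Best on this split:", best)
```

Using the same train-test split for every model keeps the comparison fair, since each classifier is scored on identical unseen samples.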
For reference, the final code in the main.py file:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

dataset = pd.read_csv(
    "data_banknote_authentication.txt",
    header=0,
    names=['image_variance', 'skewness', 'kurtosis', 'entropy', 'forged'])

features = dataset[dataset.columns[0:4]]
forged = dataset['forged']

# Split the data
train, test, train_labels, test_labels = train_test_split(features, forged, test_size=0.30, random_state=42)

# Initialize the classifier
gnb = GaussianNB()

# Train the classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)

# Evaluate accuracy
accuracy = accuracy_score(test_labels, preds)
print("Naive Bayes accuracy -> ", accuracy)

# Initialize the SVC classifier
sv = SVC()

# Train the classifier
sv_model = sv.fit(train, train_labels)

# Make predictions
sv_preds = sv.predict(test)

# Evaluate accuracy
sv_accuracy = accuracy_score(test_labels, sv_preds)
print("SVM accuracy -> ", sv_accuracy)
To run the code, open a terminal inside the virtual environment and enter:
$ python3 main.py
Naive Bayes accuracy ->  0.8422330097087378
SVM accuracy ->  0.9951456310679612
This shows that the Naive Bayes classifier has an accuracy score of about 0.842, meaning it classified 84.2% of the test samples correctly, while the SVM classifier has a higher accuracy score of about 0.995, or 99.5% of the test samples.
Here the SVM classifier outperforms the Naive Bayes for this banknote authentication problem.
This guide covered how to build a machine learning classifier model in Python using Scikit Learn and evaluate model accuracy. For more information, check out the Scikit Learn website.