Using Python’s Scikit-Learn (Sklearn) for Data Science

Mark Kirby
7 min read · Jan 25, 2022
Photo by National Cancer Institute on Unsplash

Python’s Scikit-Learn library is an excellent tool for data science and statistical learning. In this article I will explore some of the cool datasets included in the Scikit-Learn library, along with some of its other features that can give us insights into the information hidden in large, complex datasets. To use the library, we begin with the import sklearn command in Python. If it is not installed on your system, you may need to run !pip install scikit-learn first.
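For example, a quick way to confirm the installation (a minimal sketch; the printed version will depend on your environment) is:

import sklearn

# Print the installed Scikit-Learn version to confirm the library is available.
print(sklearn.__version__)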

sklearn.datasets

Sklearn.datasets makes a lot of wonderful datasets publicly available to you so that you can use them for statistical learning. These datasets contain data about everything from markers that indicate when cancer is present to pictures of digits for classification. They are also good for practicing many of Sklearn’s other features, so that you can later apply them to your own data or use them to supplement the data you have.

To use this feature, I first import the datasets module from the Sklearn library like this: from sklearn import datasets. This lets you use any of the many datasets in the package. If you want a list of everything in sklearn.datasets, you can run dir(sklearn.datasets) to see all of the functions, which will show you the datasets you can load into your Python environment. For this article we will use the data on breast cancer.
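As a rough sketch (the exact list will depend on your Scikit-Learn version), you could pull out just the dataset-loading functions like this:

from sklearn import datasets

# Every bundled dataset has a loader whose name starts with "load_".
loaders = [name for name in dir(datasets) if name.startswith("load_")]
print(loaders)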

Loading a dataset is as easy as this: bc_data = datasets.load_breast_cancer(). According to the Scikit-Learn documentation, this is the Wisconsin (Diagnostic) Breast Cancer dataset, originally from the UCI Machine Learning Repository. It has data from 569 breast cancer biopsies. The data is broken into two parts. One part includes all of the measurements from the biopsies: the radius, texture, smoothness, compactness, area, concavity and 24 other factors, 30 in all. The other is a list of ones and zeros indicating whether or not each patient had cancer. The model uses all of those factors to determine whether a patient has cancer. To learn more about the datasets in sklearn.datasets, please see the Scikit-Learn documentation.
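Here is a minimal sketch of loading the data and peeking at its two parts, the measurements and the 0/1 target vector:

from sklearn import datasets

bc_data = datasets.load_breast_cancer()

# 569 biopsies, each with 30 measurements.
print(bc_data.data.shape)

# The first few measurement names, e.g. mean radius, mean texture, ...
print(bc_data.feature_names[:6])

# The two classes and the 0/1 label for each patient.
print(bc_data.target_names)
print(bc_data.target[:10])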

Standard Scaler

To achieve more accurate results, scaling the data helps. Scaling puts all of the data within the same range, so that you do not have one variable that only varies from 0 to 1 while another ranges from 1,000 to 1,000,000. Differences in scale like this can cause the model to learn too much from large changes in the larger variable and not enough from small changes in the smaller variable, or vice versa. To implement this I use the StandardScaler, available by importing it like this: from sklearn.preprocessing import StandardScaler. I then create an object instance of it called “scaler.” As you can see in the code below, I used the scaler’s fit_transform method to transform the data into scaled data and store it in a variable called “scaled_X.”

StandardScaler and fit_transform
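The original code appeared as an image; a minimal sketch of the same step, reusing the variable names from the article, might look like this:

from sklearn.preprocessing import StandardScaler
from sklearn import datasets

bc_data = datasets.load_breast_cancer()

# Rescale every column to zero mean and unit variance.
scaler = StandardScaler()
scaled_X = scaler.fit_transform(bc_data.data)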

model_selection.train_test_split

When you are performing statistical learning, it is important to have both a training set, which allows your model to learn the patterns in the data, and a testing set, which lets you check how well the model performs on data it has not seen. The train_test_split function does this. It is part of the model_selection module, which I can import with the from sklearn import model_selection command. I then set four variables that will contain the training and testing data: X_train, X_test, y_train and y_test. Note that capitalizing the X and using a lower-case y is a common practice in data science. I also need to pass in the scaled_X variable, which contains all the values from X that we scaled above, and the target or y variable. Finally, I can tell train_test_split how to split the data by giving it a test_size value between 0 and 1, which indicates the portion of the data I want to use to test the model. The best split depends on the data, but in the code below I used 0.3, which keeps 70% of the data for training and the other 30% for testing. Generally it is better to have more training data than testing data so that the model can learn from more examples, but if the dataset is very small, a larger test set may be better. The default test_size is 0.25, or 25%.

train_test_split
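This code was also shown as an image; a sketch of the split described above (30% of the rows held out for testing), assuming the scaled_X and bc_data variables from the earlier snippet, might look like this:

from sklearn import model_selection

# Hold out 30% of the data for testing; the remaining 70% is used for training.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    scaled_X, bc_data.target, test_size=0.3)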

For more information on the train_test_split function, see the Scikit-Learn documentation.

linear_model

Next we have to define the model we want to use. In Sklearn’s linear_model module, we can select from a number of different regression techniques to learn from our data.

Linear Regression

One of the choices is LinearRegression, which uses ordinary least squares to calculate the line that best fits the data. Conceptually, least squares works like this: draw a line, then calculate the sum of the squares of all the vertical distances from the line to each of the points in the data. Repeat this with different lines until you find the one that minimizes that sum of squared distances; that line is the best fit to the data. Linear regression can be implemented in Python using the LinearRegression class in the linear_model module. Here is the code that I used:

from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the training data, then score the fit on the held-out test data.
reg = LinearRegression().fit(X_train, y_train)
reg.score(X_test, y_test)

Just for fun, I tested this code on our breast cancer data, and it produced a score of 0.764 (for LinearRegression, the score method reports R², the coefficient of determination). This is not bad, but it could be much better. The computer was trying to fit a straight line to the data, but this is really a classification problem, since we are trying to separate patients with breast cancer from those without it. In the next section we will look at logistic regression and how to apply it to this data. For more on linear regression, see the Scikit-Learn documentation.

Logistic Regression

Logistic regression is also available in the linear_model module, and it is an important tool for analyzing this data. It helps us identify where the split between two groups should be and which factors are most likely to determine that split. For example, in our dataset we have a group of women who were tested for breast cancer. As detailed above in the section on datasets, 30 pieces of information were gathered from each person. This information can then be used in logistic regression to understand which measurements act as markers for breast cancer, which can help doctors more accurately recognize when people are more likely to develop it. In turn, this can be used to help prevent the disease. You can read more about logistic regression in the Scikit-Learn documentation.

Because our data has only two possible outcomes (namely that the patient either has cancer or does not have cancer), it is best to use the logistic regression model. Here is the code that I used:

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

bc_data = datasets.load_breast_cancer()

#bc_data.data contains all of the measurements for breast cancer patients.
scaler = StandardScaler()
bc_scaled_X = scaler.fit_transform(bc_data.data)

#bc_data.target contains the vector of patients with and without breast cancer.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(bc_scaled_X, bc_data.target)

logistic = LogisticRegression()
logistic.fit(X_train, y_train)
logistic.score(X_test, y_test)

Every time I run this block of code, I get slightly different results, because the train_test_split function produces a different random split of training and testing data on each run. The scores range from about 0.95 to about 0.98, which is much higher than the score I got from the LinearRegression model above.
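If you want the same split, and therefore the same score, on every run, you can pass a fixed random_state to train_test_split. A small sketch, reusing the variables from the block above (the value 42 is just an arbitrary seed):

# Fixing random_state makes the train/test split, and therefore the score, reproducible.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    bc_scaled_X, bc_data.target, random_state=42)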

Conclusion

Scikit-Learn is an excellent library for analyzing large, complex datasets. As we saw above, it has many tools that are useful for developing predictive models. It also has many good datasets that can be used to supplement data you already have. I hope this article helps you use the library to dig deeper into your datasets and uncover new and exciting insights in your data.

Thank you for reading this article.


Mark Kirby

I am a recent graduate of the Complex Systems and Data Science program at the University of Vermont. I enjoy programming and data visualization in Python.