Using Python’s Scikit-Learn (Sklearn) for Data Science

Mark Kirby
7 min read · Jan 25, 2022
Photo by National Cancer Institute on Unsplash

Python’s Scikit-Learn library is an excellent tool for data science and statistical learning. In this article I will explore some of the cool datasets included in the Scikit-Learn library, along with some of its other features that can give us insights into the information hidden in large, complex datasets. To use the library, we begin with the import sklearn command in Python. If it is not installed on your system, you may need to run !pip install scikit-learn first.
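For example, a quick way to confirm the installation (a minimal sketch; the printed version will depend on your environment) is:

import sklearn

# Print the installed Scikit-Learn version to confirm the library is available.
print(sklearn.__version__)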

sklearn.datasets

Sklearn.datasets makes a lot of wonderful datasets publicly available to you so that you can use them for statistical learning. These datasets contain data about everything from markers that indicate when cancer is present to pictures of digits for classification. They are also good for practicing many of Sklearn’s other features, so that you can later apply them to your own data or use them to supplement the data you have.

To use this feature, I first import the datasets module from the Sklearn library like this: from sklearn import datasets. This lets you use any of the many datasets in the package. If you want a list of everything in sklearn.datasets, you can run dir(sklearn.datasets) to see all of the functions, which will show you the datasets you can load into your Python environment. For this article we will use the data on breast cancer.
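As a rough sketch (the exact list will depend on your Scikit-Learn version), you could pull out just the dataset-loading functions like this:

from sklearn import datasets

# Every bundled dataset has a loader whose name starts with "load_".
loaders = [name for name in dir(datasets) if name.startswith("load_")]
print(loaders)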

Loading a dataset is as easy as this: bc_data = datasets.load_breast_cancer(). According to the Scikit-Learn documentation, this is the Wisconsin (Diagnostic) Breast Cancer dataset, originally from the UCI Machine Learning Repository. It has data from 569 breast cancer biopsies. The data is broken into two parts. One part includes all of the measurements from the biopsies: the radius, texture, smoothness, compactness, area, concavity and 24 other factors, 30 in all. The other is a list of ones and zeros indicating whether or not each patient had cancer. The model uses all of those factors to determine whether a patient has cancer. To learn more about the datasets in sklearn.datasets, please see the Scikit-Learn documentation.
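Here is a minimal sketch of loading the data and peeking at its two parts, the measurements and the 0/1 target vector:

from sklearn import datasets

bc_data = datasets.load_breast_cancer()

# 569 biopsies, each with 30 measurements.
print(bc_data.data.shape)

# The first few measurement names, e.g. mean radius, mean texture, ...
print(bc_data.feature_names[:6])

# The two classes and the 0/1 label for each patient.
print(bc_data.target_names)
print(bc_data.target[:10])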

Standard Scaler

To achieve more accurate results, scaling the data helps. Scaling puts all of the data within the same range, so that you do not have one variable that only varies from 0 to 1 while another ranges from 1,000 to 1,000,000. Differences in scale like this can cause the model to learn too much from large changes in the larger variable and not enough from small changes in the smaller variable, or vice versa. To implement this I use the StandardScaler, available by importing it like this: from sklearn.preprocessing import StandardScaler. I then create an object instance of it called “scaler.” As you can see in the code below, I used the scaler’s fit_transform method to transform the data into scaled data and store it in a variable called “scaled_X.”

StandardScaler and fit_transform
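The original code appeared as an image; a minimal sketch of the same step, reusing the variable names from the article, might look like this:

from sklearn.preprocessing import StandardScaler
from sklearn import datasets

bc_data = datasets.load_breast_cancer()

# Rescale every column to zero mean and unit variance.
scaler = StandardScaler()
scaled_X = scaler.fit_transform(bc_data.data)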

model_selection.train_test_split

When you are performing statistical learning, it is important to have both a training set, which allows your model to learn the patterns in the data, and a testing set, which lets you check how well the model performs on data it has not seen. The train_test_split function does this. It is part of the model_selection module, which I can import with the from sklearn import model_selection command. I then set four variables that will contain the training and testing data: X_train, X_test, y_train and y_test. Note that capitalizing the X and using a lower-case y is a common practice in data science. I also need to pass in the scaled_X variable, which contains all the values from X that we scaled above, and the target or y variable. Finally, I can tell train_test_split how to split the data by giving it a test_size value between 0 and 1, which indicates the portion of the data I want to use to test the model. The best split depends on the data, but in the code below I used 0.3, which keeps 70% of the data for training and the other 30% for testing. Generally it is better to have more training data than testing data so that the model can learn from more examples, but if the dataset is very small, a larger test set may be better. The default test_size is 0.25, or 25%.

train_test_split
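This code was also shown as an image; a sketch of the split described above (30% of the rows held out for testing), assuming the scaled_X and bc_data variables from the earlier snippet, might look like this:

from sklearn import model_selection

# Hold out 30% of the data for testing; the remaining 70% is used for training.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    scaled_X, bc_data.target, test_size=0.3)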

For more information on the train_test_split function, see the Scikit-Learn documentation.

linear_model

Next we have to define the model we want to use. In Sklearn’s linear_model module, we can select from a number of different regression techniques to learn from our data.

Linear Regression

One of the choices is LinearRegression, which uses ordinary least squares to calculate the line that best fits the data. Conceptually, least squares works like this: draw a line, then calculate the sum of the squares of all the vertical distances from the line to each of the points in the data. Repeat this with different lines until you find the one that minimizes that sum of squared distances; that line is the best fit to the data. Linear regression can be implemented in Python using the LinearRegression class in the linear_model module. Here is the code that I used:

from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the training data, then score the fit on the held-out test data.
reg = LinearRegression().fit(X_train, y_train)
reg.score(X_test, y_test)

Just for fun, I tested this code on our breast cancer data, and it produced a score of 0.764 (for LinearRegression, the score method reports R², the coefficient of determination). This is not bad, but it could be much better. The computer was trying to fit a straight line to the data, but this is really a classification problem, since we are trying to separate patients with breast cancer from those without it. In the next section we will look at logistic regression and how to apply it to this data. For more on linear regression, see the Scikit-Learn documentation.

Logistic Regression

Logistic regression is also available in the linear_model module, and it is an important tool for analyzing this data. It helps us identify where the split between two groups should be and which factors are most likely to determine that split. For example, in our dataset we have a group of women who were tested for breast cancer. As detailed above in the section on datasets, 30 pieces of information were gathered from each person. This information can then be used in logistic regression to understand which measurements act as markers for breast cancer, which can help doctors more accurately recognize when people are more likely to develop it. In turn, this can be used to help prevent the disease. You can read more about logistic regression in the Scikit-Learn documentation.

Because our data has only two possible outcomes (namely that the patient either has cancer or does not have cancer), it is best to use the logistic regression model. Here is the code that I used:

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

bc_data = datasets.load_breast_cancer()

#bc_data.data contains all of the measurements for breast cancer patients.
scaler = StandardScaler()
bc_scaled_X = scaler.fit_transform(bc_data.data)

#bc_data.target contains the vector of patients with and without breast cancer.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(bc_scaled_X, bc_data.target)

logistic = LogisticRegression()
logistic.fit(X_train, y_train)
logistic.score(X_test, y_test)

Every time I run this block of code, I get slightly different results, because the train_test_split function produces a different random split of training and testing data on each run. The scores range from about 0.95 to about 0.98, which is much higher than the score I got from the LinearRegression model above.
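If you want the same split, and therefore the same score, on every run, you can pass a fixed random_state to train_test_split. A small sketch, reusing the variables from the block above (the value 42 is just an arbitrary seed):

# Fixing random_state makes the train/test split, and therefore the score, reproducible.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    bc_scaled_X, bc_data.target, random_state=42)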

Conclusion

Scikit-Learn is an excellent library for analyzing large, complex datasets. As we saw above, it has many tools that are useful for developing predictive models. It also has many good datasets that can be used to supplement data you already have. I hope this article helps you use the library to dig deeper into your datasets and uncover new and exciting insights in your data.

Thank you for reading this article.


Mark Kirby

I am a recent graduate of the Complex Systems and Data Science program at the University of Vermont. I enjoy programming and data visualization in Python.