Machine Learning with Scikit-learn. Supervised learning. Linear regression.

Volodymyr Kirichinets
8 min read · Apr 21, 2018


Introduction

Linear regression is described in this article as a supervised learning approach: the case when incoming data comes with additional attributes that we want to predict, for example the wages of a person (a salary that grows or falls over a period of time). Supervised learning is one of the main categories of Machine Learning, and can typically be described as the case where data samples come with features associated with them. The linear approach has many practical applications, for example in finance (modeling future prices), physics, medicine, electronics and, of course, Machine Learning. Several areas of linear regression from Scikit-learn, the Python toolkit for data mining and data analysis, will be demonstrated here. This article is not fully rigorous in the mathematical sense: some varieties of linear regression are not covered in deep detail. The main goal is to describe, as accurately and clearly as possible, the processes that arise when we use the LinearRegression class of Scikit-learn.

Linear regression

Definition

Linear regression is a linear approach for modeling the relationship between a scalar dependent variable (the predictable value, for example the price of currency #1 in the future), usually denoted as y, and one or more explanatory or independent variables (the known values or observed data, for example the prices of currency #1 in the past), usually denoted as X.

Mathematical formulation

In mathematical representation, the linear regression model can be written in the form:

y = Xβ + ε

where y is the regressand, X is the regressor, β is the parameter vector (the coefficient of the slope of the regression line) and ε is an error or noise term. In simple terms: our predictable value y is our observed value X multiplied by the slope coefficient β, plus some error ε. The objective of the linear regression method is to estimate the slope coefficient β and construct the linear regression line, which completely characterizes the regression. When the number of values of X is very large, we can neglect the error ε and construct the regression line that has the shortest aggregated distance to all the values of X.

Least squares method

We can write the model in equation form and solve it:

y = f (x) = ax + b

which assumes that changing the value of x will change the value of y, depending on the coefficients a and b. Let's consider an example:

x = [1, 2, 3, 4, 5] are the known variables and y = [1, 2.5, 2, 3.5, 5] are the variables for which we want to predict the next value of y. The ordinary least squares method gives the following system of normal equations:

a∑x² + b∑x = ∑xy

a∑x + bN = ∑y

where N is the number of values. So we have:

∑x = 1 + 2 + 3 + 4 + 5 = 15

∑y = 1 + 2.5 + 2 + 3.5 + 5 = 14

∑x² = 1² + 2² + 3² + 4² + 5² = 55

∑xy = 1*1 + 2*2.5 + 3*2 + 4*3.5 + 5*5 = 51

We need to solve this system of equations:

a*55 + b*15 = 51

a*15 + b*5 = 14

We can solve it with Cramer's rule. The main determinant of the system is:

D = (55 * 5) - (15 * 15) = 275 - 225 = 50

solving a:

a = ((51 * 5) - (14 * 15)) / D = (255 - 210) / 50 = 45 / 50 = 0.9

and solving b:

b = ((55 * 14) - (15 * 51)) / D = (770 - 765) / 50 = 5 / 50 = 0.1

we have:

y = ax + b = 0.9x + 0.1
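We can quickly verify these coefficients by solving the system of normal equations with NumPy:

import numpy as np

# the normal equations: 55a + 15b = 51 and 15a + 5b = 14
A = np.array([[55, 15],
              [15, 5]])
rhs = np.array([51, 14])
a, b = np.linalg.solve(A, rhs)
print(a, b)  # 0.9 0.1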

so, the coordinates of the straight line from 0 to 6:

f(0) = 0.9 * 0 + 0.1 = 0.1

f(6) = 0.9 * 6 + 0.1 = 5.5

and we can make a prediction that the next value of y will be approximately 5.5. We can build this straight line:
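The original figure is not preserved in this text version; a minimal matplotlib sketch of the same scatter-plus-line plot could look like this:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])      # observed values
y = np.array([1, 2.5, 2, 3.5, 5])  # values we want to predict

line_x = np.array([0, 6])          # the straight line from 0 to 6
line_y = 0.9 * line_x + 0.1        # y = ax + b estimated above
plt.scatter(x, y, label='observed data')
plt.plot(line_x, line_y, color='red', label='y = 0.9x + 0.1')
plt.legend()
plt.show()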

Now let's try to find the covariance between x and y:

Cov(x, y) = ( Σ(X-μ)(Y-υ) ) / (n - 1)

where:

μ is the mean of the x values: 3

υ is the mean of the y values: 2.8

n is the number of items in x (or y): 5

let's count:

((1 - 3)*(1 - 2.8) + (2 - 3)*(2.5 - 2.8) + (3 - 3)*(2 - 2.8) + (4 - 3)*(3.5 - 2.8) + (5 - 3)*(5 - 2.8)) / (5 - 1) = (3.6 + 0.3 + 0 + 0.7 + 4.4) / 4 = 2.25

Cov(x, y) = 2.25

let's also find the variance for x and for y by this formula:

σ² for x: σᵥ² = ( Σ(X-μ)² ) / (n - 1)

σᵥ² = ((1 - 3)² + (2 - 3)² + (3 - 3)² + (4 - 3)² + (5 - 3)²) / 4 = (4 + 1 + 0 + 1 + 4) / 4 = 2.5

σ² for y: σᵤ² = ( Σ(Y-υ)² ) / (n - 1)

σᵤ² = ((1 - 2.8)² + (2.5 - 2.8)² + (2 - 2.8)² + (3.5 - 2.8)² + (5 - 2.8)²) / 4 = (3.24 + 0.09 + 0.64 + 0.49 + 4.84) / 4 = 2.325

and the standard deviations:

σᵥ = √σᵥ² = √2.5 ≈ 1.58

σᵤ = √σᵤ² = √2.325 ≈ 1.52

The correlation coefficient r can be obtained from the covariance and the standard deviations as r = Cov(x, y) / (σᵥσᵤ) = 2.25 / (1.58 * 1.52) ≈ 0.93, or directly by the computational formula:

r = ( nΣxy - ΣxΣy ) / √( (nΣx² - (Σx)²) * (nΣy² - (Σy)²) )

r = ( (5*51) - (15*14) ) / √( (5*55 - 225) * (5*48.5 - 196) ) = (255 - 210) / √(50 * 46.5) = 45 / √2325 = 45 / 48.22 ≈ 0.9333

and, of course, the coefficient of determination:

R² = r² = 0.9333² ≈ 0.87
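All of these hand calculations can be checked with NumPy; ddof=1 selects the sample versions with the n - 1 denominator:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2.5, 2, 3.5, 5])

print(np.cov(x, y, ddof=1)[0, 1])  # sample covariance: 2.25
print(np.var(x, ddof=1))           # sample variance of x: 2.5
print(np.var(y, ddof=1))           # sample variance of y: 2.325
print(np.std(x, ddof=1))           # standard deviation of x: ~1.58
print(np.std(y, ddof=1))           # standard deviation of y: ~1.52
print(np.corrcoef(x, y)[0, 1])     # correlation coefficient r: ~0.9333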

Linear regression with Scikit-learn

Interpretation in machine learning

The interpretation of linear regression in machine learning with the Scikit-learn toolset solves this problem by the method of Ordinary Least Squares: it fits a linear model with coefficients that minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. Its fit() method takes the arrays X and y and stores the coefficients of the linear model in the coef_ member. Let's construct a simple example based on the estimates above:
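The code embed from the original article is not preserved in this text version. The printed output corresponds to a fit in which the observed series [1, 2.5, 2, 3.5, 5] is used as the feature column X and [1, 2, 3, 4, 5] as the target y, so a minimal sketch that reproduces it could look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# observed data from the theoretical part; the roles of the two series
# are inferred from the printed coefficients below
X = np.array([1, 2.5, 2, 3.5, 5]).reshape(-1, 1)
y = np.array([1, 2, 3, 4, 5])

model = LinearRegression()
model.fit(X, y)  # estimates coef_ and intercept_

print('coef_ value = ', model.coef_)
print('intercept_ value =', model.intercept_)
print('Coefficient of Determination R² =', round(model.score(X, y), 2))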

This will show the following output:

OUTPUT
coef_ value =  [ 0.96774194]
intercept_ value = 0.290322580645
Coefficient of Determination R² = 0.87

As we can see, this result is very similar to that in the assessments before; it can be said to be equivalent. We obtained it using the coef_ and intercept_ attributes and the score() method of the LinearRegression class. In scikit-learn's LinearRegression, the coef_ attribute returns the estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y is 2D), this is a 2D array of shape (n_targets, n_features); if only one target is passed, this is a 1D array of length n_features. The intercept_ attribute returns the independent term of the linear model. The score() method returns the coefficient of determination R², which is the same as estimated with the correlation formula above: R² = 0.87.

Example with numerical data

Let's look at a more illustrative example using predictions. In the case of integer values this does not require large changes to the previous code: we just change our data to create larger arrays and add a call to the predict() method.
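The original code embed is likewise not preserved; a sketch in the same spirit, with hypothetical data (X_prices1 as an integer range and three noisy y_prices series around it), could look like this. The exact numbers differ between runs because the noise is random:

import numpy as np
from sklearn.linear_model import LinearRegression

X_prices1 = np.arange(1, 16).reshape(-1, 1)                # known values
X_new = np.array([1, 3, 4, 7, 10, 12, 14]).reshape(-1, 1)  # values to predict for

for i in range(1, 4):
    # a hypothetical noisy "price" series scattered around the line y = x
    y_prices = X_prices1.ravel() + np.random.uniform(-0.5, 0.5, len(X_prices1))
    model = LinearRegression().fit(X_prices1, y_prices)
    print('coef_ value =', model.coef_)
    print('intercept_ value =', model.intercept_)
    print('Coefficient of Determination R² =', round(model.score(X_prices1, y_prices), 4))
    print('--- Prediction #', i)
    for x_i, y_i in zip(X_new.ravel(), model.predict(X_new)):
        print(x_i, '--->', round(y_i, 4))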

OUTPUT
coef_ value = [ 0.98188163]
intercept_ value = -0.159683743312
Coefficient of Determination R² = 0.9932
--- Prediction # 1
1 ---> 1.5796
3 ---> 2.9303
4 ---> 3.7678
7 ---> 6.8585
10 ---> 10.2398
12 ---> 12.2036
14 ---> 14.1673
coef_ value = [ 0.97742494]
intercept_ value = -0.232121725979
Coefficient of Determination R² = 0.9968
--- Prediction # 2
1 ---> 1.079
3 ---> 3.3955
4 ---> 4.2205
7 ---> 6.9497
10 ---> 9.882
12 ---> 11.7489
14 ---> 14.212
coef_ value = [ 0.99644655]
intercept_ value = -0.449429822594
Coefficient of Determination R² = 0.9983
--- Prediction # 3
1 ---> 1.495
3 ---> 2.8801
4 ---> 3.6813
7 ---> 7.3006
10 ---> 10.29
12 ---> 11.6551
14 ---> 14.4488
coef_ value = [ 0.98743816]
intercept_ value = -0.485819008419
Coefficient of Determination R² = 0.9947

We have X_prices1, which is similar to the variable X in the theoretical part, and three y_prices series whose values we try to predict, three times in a row.

Example with text data

For working with text data we need additional classes from the text module of Scikit-learn's feature_extraction package: TfidfVectorizer, or CountVectorizer combined with TfidfTransformer. TF-IDF, or term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in the corpus (the represented collection). We use the features, or values of y, in the form of a dictionary; this is not necessary, it is just for variety in the use of these features. The main and first thing we need to understand is the equality of values between the features, the training data and the predicted data.
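The original snippet is not preserved here either. The printed coefficients and intercept are consistent with vectorizing seven dictionary labels with TfidfVectorizer and regressing them onto the numeric keys, so a sketch of that idea might look as follows; the dictionary contents and the rule that maps a numeric prediction back to a label are assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# hypothetical dictionary: numeric keys are the targets, text values the features
tasks = {1: 'True', 2: 'False', 3: 'Neutral', 4: 'Hint',
         5: 'Philosophy', 6: 'Question', 7: 'Hypothesis'}

vectorizer = TfidfVectorizer()
X_tasks1 = vectorizer.fit_transform(tasks.values())  # one TF-IDF row per label
y_keys = list(tasks.keys())

model = LinearRegression()
model.fit(X_tasks1, y_keys)
print('coef_ values =', model.coef_)
print('intercept_ value =', model.intercept_)

sentences = ['We are students, it is True!',
             'We think this is a Philosophy.',
             'Forward.',
             'Good Question.']
print('--- Prediction:')
for sentence, pred in zip(sentences, model.predict(vectorizer.transform(sentences))):
    nearest = min(tasks, key=lambda k: abs(k - pred))  # nearest key (an assumption)
    print(sentence, '--->', tasks[nearest])
    print('(%s)' % pred)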

OUTPUT
coef_ values = [-2.  0.  3. -1.  1.  2. -3.]
intercept_ value = 4.0

--- Prediction:
We are students, it is True! ---> True
(0.9999999999999991)
We think this is a Philosophy. ---> Philosophy
(5.0)
Forward. ---> Hint
(4.0)
It is absolutlely False. ---> False
(1.9999999999999991)
Good Question. ---> Question
(6.000000000000001)
This Hypothesis is False. ---> Hint
(4.707106781186548)
Interested Hypothesis. ---> Hypothesis
(7.000000000000001)
Students, it is True or False! ---> True
(0.4644660940672609)
Philosophy have a lot of Questions. ---> Philosophy
(5.0)
This Hint may be True or False. ---> False
(1.1132486540518705)
False Hypothesis is not a Question. ---> Philosophy
(5.732050807568878)
Your Question was good, but False. ---> Hint
(4.0)
This Hint is Neutral. ---> Neutral
(3.292893218813452)
False Philosophy it is a True. ---> False
(1.6905989232414962)

First, we fit the model on X_tasks1 with the features from the dictionary, which are represented as dict.keys() and dict.values(). The dictionary keys are numeric and the values are text strings. We have the y_tasks1 and y_tasks2 lists with predictable values and we try to predict them in rotation.

Conclusion

Regarding numerical data versus text data in Scikit-learn linear regression: using text data is not a good idea. Nevertheless, using linear regression with numerical data is an old, reliable and very powerful approach to data research. Most important, particularly for Machine Learning with Scikit-learn, is a good understanding of the equality between the features, the learned values and the predicted values. Surely there are no questions about how to use larger data sources such as *.txt files, *.csv files, databases or others; this will become clear with a detailed study of the attached code and an understanding of the principles of the algorithms. Linear regression remains, even so, the most common and popular approach to data analysis and data prediction in various fields of human activity.
