Logistic regression

Monday. January 27, 2020 - 9 mins

+ regression + logistic regression + binary classification + machine learning

Intro

I though I would give some introduction on basic machine learning techniques and include some basic implementation with the explanations. I started with the topic below. Hope you will enjoy reading.

Logistic regression

The main goal is to maximize the joint probability of the output label given the input data - in layman’s terms - make the predictions correct most of the time.

\[p(y | x) = \hat{y}^{y}(1-\hat{y})^{(1-y)}\]

Since this equation gives:

\[p(y=1 | x) = \hat{y}, \quad p(y=0 | x) = (1- \hat{y})\]

and we want the predictions to represent the probability of a binary classification problem. In logistic regression we apply a sigmoid function to the regression term in order to achieve [0, 1] range and make it a normalized probability:

\[\hat{y} = sigmoid(a) = sigmoid(w^{T}x + b)\]

Where x is a data point and w, b are the parameters of our regression model. In order to maximize the joint probability we take its logarithmic mapping since the logarithm is a continuous and monotonic function therefore for a data and label pair (x, y) our loss function is:

\[- L(y, \hat{y}) = y\ln\hat{y} + (1 - y)\ln(1 - \hat{y}) = \ln p(y|x)\]

We want to optimize this by stochastic gradient descent so we will need the gradients. For a dataset of

\[\{\vec{x}^{(i)}, y^{(i)}\}_{i = 1}^{N}\]

the computations are the following:

The i-th activation is:

\[a^{(i)} = \vec{w}^{T}\vec{x}^{(i)} + b\]

And the i-th prediction:

\[\hat{y}^{(i)} = sigmoid(a^{(i)}) = \frac{1}{1 + e^{-a^{(i)}}}\]

The i-th loss therefore:

\[L(y^{(i)}, \hat{y}^{(i)}) = - \big(y\ln\hat{y}^{(i)} + (1 - y^{(i)})\ln(1 - \hat{y}^{(i)})\big)\]

We still have to define the cost function which is the sum of the losses. If we do stochastic gradient descent we sample a batch of the dataset randomly and calculate the loss for

\[1 \leq m < N\]

samples.

\[J(\vec{w}, b) = \frac{1}{m}\sum_{j = 1}^{m}L(y^{(i)}, \hat{y}^{(i)})\]

The derivatives are the following:

\[\frac{\partial L}{\partial \hat{y}^{(i)}} = \frac{1 - y^{(i)}}{1- \hat{y}^{(i)}} - \frac{y^{(i)}}{\hat{y}^{(i)}} \equiv d\hat{y}^{(i)}\]

Since we know the chain rule we know that in order to get

\[\frac{\partial L}{\partial a^{(i)}}\]

we need:

\[\frac{\partial \hat{y}^{(j)}}{\partial a^{(i)}} = sigmoid(-a^{(j)})sigmoid(a^{(j)})\delta_{ji} = (1 - \hat{y}^{(j)})\hat{y}^{(j)}\delta_{ij}\]

Therefore:

\[\frac{\partial L}{\partial a^{(i)}} = \sum_{j}\frac{\partial L}{\partial \hat{y}^{(j)}}\frac{\partial \hat{y}^{(j)}}{\partial a^{(i)}} = \hat{y}^{(i)} - y^{(i)} \equiv da^{(i)}\]

Moving on the actual learnable parameters

\[\frac{\partial L}{\partial w_{p}} = \sum_{q}\frac{\partial L}{\partial a^{(q)}}\frac{\partial a^{(q)}}{\partial w_{p}} = \sum_{q} ( \hat{y}^{(q)} - y^{(q)} ) x^{(q)}_{p} \equiv dw_{p}\] \[\frac{\partial L}{\partial b} = \sum_{q}\frac{\partial L}{\partial a^{(q)}}\frac{\partial a^{(q)}}{\partial b} = \sum_{q} ( \hat{y}^{(q)} - y^{(q)} ) \cdot 1 \equiv db\]

Logistic regression in code with L2 weight regularization

Now the same in code:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

Using sklearn a default binary classification dataset can be loaded and a metric can be used to test accuracy.

X, y = load_breast_cancer(return_X_y=True)

y.reshape(y.shape[0], 1)

print('Data shape : ', X.shape, '\t labels shape : ', y.shape)

Data shape :  (569, 30) 	 labels shape :  (569,)

n_samples = X.shape[0]

test_split = np.random.choice(n_samples, size=n_samples // 10, replace=False)
assert np.unique(test_split).size == n_samples // 10, 'Test split size constraint failed'

train_split = list(set(range(0, n_samples)) - set(test_split))
assert np.unique(train_split).size == n_samples - n_samples // 10, 'Train split size constraint failed'

X_train, y_train, X_test, y_test = X[train_split], y[train_split], X[test_split], y[test_split]
y_train = y_train.reshape(n_samples - n_samples // 10, 1)
y_test = y_test.reshape(n_samples // 10, 1)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((513, 30), (513, 1), (56, 30), (56, 1))

Here I implemented some custom splitting technique with numpy only but the default train_test_split method can be used from sklearn as well. Normalizing the dataset is also important to ensure numerical stability.

train_std, train_mean = np.std(X_train, axis=0), np.mean(X_train, axis=0)
test_std, test_mean = np.std(X_test, axis=0), np.mean(X_test, axis=0)

X_train -= train_mean
X_test -= test_mean
X_train /= train_std
X_test /= test_std

X_train.shape, X_test.shape

((513, 30), (56, 30))

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

sigmoid = np.vectorize(sigmoid)

learning_rate = 5e-4
m = 750
max_iter = 5_000

# regularization
alpha = 10e-3

w = np.random.randn(1, X_train.shape[1])
b = np.random.randn()

for _iter in range(max_iter):
    batch_indices = np.random.choice(X_train.shape[0], size = m, replace = True)
    batch = X_train[batch_indices].T
    _y = y_train[batch_indices].reshape(1, m)
    assert batch.shape == (X_train.shape[1], m), 'Batch shape constraint failed'
    activation = np.dot(w, batch) + b
    assert activation.shape == (1, m), 'Activation shape constraint failed'
    y_hat = sigmoid(activation)
    assert y_hat.shape == activation.shape, 'Prediction shape constraint failed'
    
    if _iter % 500 == 0:
        # Calculate current accuracy
        y_pred_test = np.dot(w, X_test.T) + b
        y_pred_test[y_pred_test >= 0.5] = 1
        y_pred_test[y_pred_test < 0.5] = 0
        print('Iter : %d | Current accuracy on test set %.2f%%: ' % (_iter, 
                                                                     100 * accuracy_score(y_test, y_pred_test.T)))
    
    da = y_hat - _y
    assert da.shape == activation.shape, 'Activation derivative shape constraint failed'
    dw = np.matmul(batch, da.T).T
    assert dw.shape == w.shape, 'Weights derivative shape constraint failed'
    db = np.sum(da)
    # Update parameters
    w -= learning_rate * ( dw + 2. * alpha * np.linalg.norm(dw) )
    b -= learning_rate * ( db + 2. * alpha * np.abs(db) ) 
    
y_pred_test = np.dot(w, X_test.T) + b
y_pred_test[y_pred_test >= 0.5] = 1
y_pred_test[y_pred_test < 0.5] = 0
print('Final accuracy : %.2f%%' % (100 * accuracy_score(y_test, y_pred_test.T)))

Iter : 0 | Current accuracy on test set 16.07%: 
Iter : 500 | Current accuracy on test set 98.21%: 
Iter : 1000 | Current accuracy on test set 96.43%: 
Iter : 1500 | Current accuracy on test set 96.43%: 
Iter : 2000 | Current accuracy on test set 96.43%: 
Iter : 2500 | Current accuracy on test set 98.21%: 
Iter : 3000 | Current accuracy on test set 98.21%: 
Iter : 3500 | Current accuracy on test set 98.21%: 
Iter : 4000 | Current accuracy on test set 98.21%: 
Iter : 4500 | Current accuracy on test set 98.21%: 
Final accuracy : 98.21%

Here the final accuracy is 98.21%. Which seems to be pretty amazing for such a simple algorithm.

@Regards, Alex

Alex Olar

Christian, foodie, physicist, tech enthusiast