Stacked boosting for photo-z estimation - a university Kaggle challenge
On Kaggle there are many challenges to choose from, and it is very easy to host an InClass one. That’s what my teachers (now colleagues) did. Recently I was absorbed by university duties and did a fair amount of work, collaboration and study, so I couldn’t write a post for a month. :’(
The problem here is very familiar, since I’ve worked with it before, and I was surprised by how much the results can be improved. Ensembling is the way to go. Here is my solution to the challenge, and here’s the link to it.
The best solution used several stacked MLPs, but according to the teachers, stacked random forests were the way to go. My solution is presented below and was run on Kaggle.
!nvidia-smi
Sat Mar 30 09:12:08 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:00:04.0 Off | 0 |
| N/A 49C P0 30W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
import numpy as np
import pandas as pd
import os
print(os.listdir("../input"))
['test.csv', 'train.csv', 'sample.csv']
Importing all the necessary sklearn functions and classes. The dataframe is presented below. I also created the ug, gr, ri and iz colors, since that is a common way of describing these spectra (feature engineering).
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize, RobustScaler, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import AdaBoostRegressor
import lightgbm as lgb
import pandas as pd
from scipy.stats import uniform, randint
df = pd.read_csv('../input/train.csv')
df.head(n=10)
|   | ra | dec | u | g | r | i | z | size | ellipticity | redshift | id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 216.0587 | 62.2653 | 19.97 | 18.76 | 17.78 | 17.30 | 16.95 | 11.25 | 0.23 | 0.183 | 0 |
| 1 | 43.9829 | 20.4813 | 19.85 | 17.92 | 16.69 | 16.21 | 15.84 | 9.08 | 0.15 | 0.178 | 1 |
| 2 | 355.8473 | 359.0946 | 19.61 | 18.99 | 18.20 | 17.76 | 17.46 | 2.04 | 0.13 | 0.342 | 2 |
| 3 | 223.1833 | 36.0321 | 22.92 | 20.45 | 18.90 | 18.23 | 17.88 | 7.47 | 0.43 | 0.400 | 3 |
| 4 | 41.6911 | 39.4769 | 22.17 | 22.07 | 20.47 | 19.30 | 18.99 | 12.54 | 0.59 | 0.605 | 4 |
| 5 | 193.0652 | 32.2492 | 19.58 | 18.39 | 17.73 | 17.35 | 17.13 | 4.27 | 0.15 | 0.115 | 5 |
| 6 | 51.6667 | 13.4588 | 20.34 | 18.71 | 17.44 | 16.95 | 16.61 | 11.22 | 0.49 | 0.271 | 6 |
| 7 | 52.0553 | 16.0821 | 23.38 | 20.67 | 19.04 | 18.40 | 18.08 | 2.31 | 0.06 | 0.319 | 7 |
| 8 | 88.5624 | 356.0842 | 19.20 | 17.14 | 16.13 | 15.70 | 15.36 | 8.14 | 0.05 | 0.121 | 8 |
| 9 | 183.3186 | 49.9559 | 22.94 | 21.78 | 20.49 | 19.57 | 19.20 | 7.27 | 0.15 | 0.543 | 9 |
df['ug'] = df['u'] - df['g']
df['gr'] = df['g'] - df['r']
df['ri'] = df['r'] - df['i']
df['iz'] = df['i'] - df['z']
relevant_variables = ['ug', 'gr', 'ri', 'iz', 'ellipticity', 'size', 'r']
I chose these afterwards as the relevant_variables, since they encode all the color filters.
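As a quick, hypothetical sanity check of that choice (not part of the original notebook), one can fit a small random forest on a subsample and see which of these features it actually relies on; the model, sample size and random seeds below are arbitrary assumptions.
# Hypothetical sanity check (not in the original solution): feature importances
# of a small random forest fitted on a subsample of the training data.
from sklearn.ensemble import RandomForestRegressor

check_sample = df.sample(50_000, random_state=0)
rf_check = RandomForestRegressor(n_estimators=50, max_depth=12,
                                 n_jobs=-1, random_state=0)
rf_check.fit(check_sample[relevant_variables], check_sample['redshift'])
for name, imp in sorted(zip(relevant_variables, rf_check.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f'{name:12s} {imp:.3f}')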
The main function trains a stack of different boosted models, fitted on a different random sample of the dataset in each iteration.
def bagged_models(N=1):
    """Train N rounds of four boosted models, each round on a fresh random sample."""
    models = []
    rmses = []
    for i in range(N):
        # Draw a different random subset of the training data in each iteration.
        X = df.sample(350_000, random_state=137 * i + 4134)
        X_train, X_test, y_train, y_test = train_test_split(
            X[relevant_variables],
            X['redshift'],
            test_size=0.5,
            random_state=42 * i)
        # XGBoost model
        xgb_model = XGBRegressor(tree_method="gpu_exact", max_depth=11,
                                 reg_lambda=0.9, reg_alpha=0.3, n_estimators=80)
        #xgb_model = make_pipeline(RobustScaler(), xgb_model)
        xgb_model.fit(X_train, y_train)
        pred = xgb_model.predict(X_test)
        mse = mean_squared_error(pred, y_test)
        rmses.append(np.sqrt(mse))
        models.append(xgb_model)
        # LightGBM model
        lgbm = lgb.LGBMRegressor(n_estimators=80, max_depth=15,
                                 num_leaves=19, reg_alpha=0.4,
                                 subsample=0.65, reg_lambda=0.5, silent=True)
        #lgbm = make_pipeline(RobustScaler(), lgbm)
        lgbm.fit(X_train, y_train)
        pred = lgbm.predict(X_test)
        mse = mean_squared_error(pred, y_test)
        rmses.append(np.sqrt(mse))
        models.append(lgbm)
        # Gradient boosting model (sklearn)
        gb = GradientBoostingRegressor(n_estimators=137, subsample=0.25,
                                       validation_fraction=0.5, max_depth=7,
                                       min_samples_leaf=3)
        #gb = make_pipeline(RobustScaler(), gb)
        gb.fit(X_train, y_train)
        pred = gb.predict(X_test)
        mse = mean_squared_error(pred, y_test)
        rmses.append(np.sqrt(mse))
        models.append(gb)
        # CatBoost model
        cb = CatBoostRegressor(depth=15, early_stopping_rounds=3,
                               loss_function='RMSE', learning_rate=0.1,
                               task_type="GPU", n_estimators=80, silent=True)
        #cb = make_pipeline(RobustScaler(), cb)
        cb.fit(X_train, y_train)
        pred = cb.predict(X_test)
        mse = mean_squared_error(pred, y_test)
        rmses.append(np.sqrt(mse))
        models.append(cb)
    return models, rmses
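The hyperparameters above are hardcoded; RandomizedSearchCV and the scipy.stats distributions were imported earlier but never used. A minimal sketch of how values like these could be tuned for the XGBoost member is shown below (the parameter ranges, subsample size and n_iter are my own assumptions, not the search that was actually run).
# Hypothetical tuning sketch for the XGBoost member (assumed ranges, not the
# original search).
param_dist = {
    'max_depth': randint(6, 14),
    'n_estimators': randint(50, 150),
    'reg_alpha': uniform(0.0, 1.0),
    'reg_lambda': uniform(0.0, 1.0),
}
search = RandomizedSearchCV(XGBRegressor(tree_method="gpu_exact"),
                            param_distributions=param_dist,
                            n_iter=20, cv=3,
                            scoring='neg_mean_squared_error',
                            random_state=0)
tune_sample = df.sample(50_000, random_state=0)
search.fit(tune_sample[relevant_variables], tune_sample['redshift'])
print(search.best_params_, np.sqrt(-search.best_score_))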
import warnings
from sklearn.metrics import mean_squared_error
warnings.filterwarnings('ignore')
models, rmses = bagged_models(N=75)
X_train, X_test, y_train, y_test = train_test_split(
    df[relevant_variables],
    df['redshift'],
    test_size=0.33,
    random_state=42)
Prediction is done with all the models (75 × the number of different model types, i.e. 300 models) on the whole held-out test set.
preds_ = np.array([model.predict(X_test) for model in models])
preds = np.mean(preds_, axis=0)
A quick sanity check to see whether or not the stacking worked at all: the ensemble RMSE next to the mean, the minimum, and the index of the best individual model RMSE.
np.sqrt(mean_squared_error(preds, y_test)), np.mean(rmses), np.min(rmses), np.argmin(rmses)
(0.045304009089382945, 0.04656930089201668, 0.04613510069939869, 0)
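The simple mean already beats the best individual model. A natural extension, sketched here purely as an assumption rather than something that was submitted, is to replace the mean with a small meta-learner fitted on the per-model predictions (proper stacking); cross-validation is used so the meta-model is not scored on the rows it was fitted on.
# Hypothetical stacking sketch (not used for the submission): a linear
# meta-model combining the individual model predictions.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

meta_X = preds_.T  # shape: (n_samples, n_models)
meta_scores = cross_val_score(LinearRegression(), meta_X, y_test,
                              cv=5, scoring='neg_mean_squared_error')
print(np.sqrt(-meta_scores).mean())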
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 9))
plt.hist(y_test, bins=100, color='b')
plt.hist(preds, bins=100, color='r', alpha=0.55)
plt.show()
plt.figure(figsize=(9, 9))
plt.plot(y_test[::40], preds[::40], 'r.')
plt.show()
It can be seen that the model does not fit the actual distribution very well and is error-prone for outliers; the right tail is completely missed.
Not all data points are shown in the scatter plot, since there are so many that the plot becomes unreadable if all of them are present.
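To quantify how badly the right tail is missed, the RMSE can be computed per true-redshift bin (a quick check that was not in the original notebook; the number of bins is an arbitrary choice).
# Per-bin error check: RMSE as a function of the true redshift.
bins = np.linspace(y_test.min(), y_test.max(), 11)
bin_idx = np.clip(np.digitize(y_test, bins), 1, len(bins) - 1)
for b in range(1, len(bins)):
    mask = bin_idx == b
    if mask.sum() > 0:
        rmse_b = np.sqrt(mean_squared_error(y_test[mask], preds[mask]))
        print(f'{bins[b - 1]:.2f}-{bins[b]:.2f}: {rmse_b:.3f} ({mask.sum()} objects)')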
test_df = pd.read_csv('../input/test.csv')
test_df['ug'] = test_df['u'] - test_df['g']
test_df['gr'] = test_df['g'] - test_df['r']
test_df['ri'] = test_df['r'] - test_df['i']
test_df['iz'] = test_df['i'] - test_df['z']
test_data = test_df[relevant_variables]
Running the prediction on the actual test set, then taking the mean of all model predictions.
test_preds = [model.predict(test_data) for model in models]
test_preds = np.mean(test_preds, axis=0)
test_pred = pd.DataFrame(test_preds, columns=['redshift'])
test_pred.to_csv('predictions.csv', index_label='id', columns=['redshift'])
pd.read_csv('predictions.csv').head(n=10)
|   | id | redshift |
|---|---|---|
| 0 | 0 | 0.589428 |
| 1 | 1 | 0.252857 |
| 2 | 2 | 0.603094 |
| 3 | 3 | 0.623349 |
| 4 | 4 | 0.141112 |
| 5 | 5 | 0.527417 |
| 6 | 6 | 0.436981 |
| 7 | 7 | 0.343215 |
| 8 | 8 | 0.342766 |
| 9 | 9 | 0.539182 |
Some EDA from another notebook:
import seaborn as sns
sns.pairplot(data=df.sample(30_000), vars=['redshift', 'ra', 'dec'], height=5.5)
<seaborn.axisgrid.PairGrid at 0x7f693cebef28>
sns.pairplot(data=df.sample(30_000), vars=['redshift', 'ellipticity', 'size'], height=5.5)
<seaborn.axisgrid.PairGrid at 0x7f6939b9db38>
sns.heatmap(df.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x7f6937e67be0>
test_df = pd.read_csv('../input/test.csv')
test_df['ug'] = test_df['u'] - test_df['g']
test_df['gr'] = test_df['g'] - test_df['r']
test_df['ri'] = test_df['r'] - test_df['i']
test_df['iz'] = test_df['i'] - test_df['z']
sns.heatmap(np.abs(test_df.corr() - df.corr()))
<matplotlib.axes._subplots.AxesSubplot at 0x7f6937daf198>
relevant_variables = ['u', 'g', 'r', 'i', 'z', 'gr', 'ri', 'iz'] # and maybe ellipticity
from sklearn.preprocessing import normalize
X_under = df[df['ra'] < 130].sample(150_000)
X_over = df[df['ra'] >= 130].sample(150_000)
X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(
    normalize(X_over.drop(labels=['redshift', 'id'], axis=1)[relevant_variables]),
    X_over['redshift'], test_size=0.5, random_state=137)
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(
    normalize(X_under.drop(labels=['redshift', 'id'], axis=1)[relevant_variables]),
    X_under['redshift'], test_size=0.5, random_state=137)
sns.heatmap(X_under.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x7f6937db2160>
sns.heatmap(df.sample(500_000).corr() - df.sample(500_000).corr()) # they are pretty much the same
<matplotlib.axes._subplots.AxesSubplot at 0x7f6936c25cf8>
sns.heatmap(X_under.corr() - X_over.corr(), vmin=-0.1, vmax=0.1) # they are NOT
<matplotlib.axes._subplots.AxesSubplot at 0x7f6936b59dd8>
On the correlation plots above it can be seen that it is mainly the previously chosen relevant_variables
that are linearly correlated with the redshift variable.
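The same observation can be checked numerically on the training dataframe loaded above (a quick check, not in the original notebook).
# Linear correlation of every numeric column with the redshift target,
# strongest first.
corr_with_z = df.corr()['redshift'].drop(['redshift', 'id'])
print(corr_with_z.abs().sort_values(ascending=False))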
That’s it for today. I am currently working on a Python-based Keras backend API for an image predictor frontend application, but Heroku doesn’t seem to like me loading a pre-trained ResNet50. I’ll also bring some advanced usage of pre-trained models on histopathologic image samples.