Deep learning has become something of a hype, and everyone wants to give neural networks a try in the hope of better model performance. The embarrassing thing is that a neural network is sometimes (maybe a lot of the time) not even better than a primitive linear baseline. I ran into this problem when benchmarking a neural network against a simple linear model. One of the reasons is that neural networks are sensitive to the distributions and scales of their inputs, which usually don't need to be addressed when fitting linear or tree-based models.

Normalizing the input features (centering them and scaling to unit variance) should fix part of the problem, but it can't help a deep neural network, where the distribution of each layer's inputs keeps shifting during training ("internal covariate shift"). Batch normalization layers were introduced to normalize the outputs of network layers and thereby achieve faster and more stable convergence.
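For intuition, this is roughly what a batch normalization layer computes at training time: each feature is standardized using the mini-batch mean and variance, then rescaled and shifted by learned parameters. A minimal numpy sketch (gamma, beta and eps are placeholder values here; a real layer learns gamma and beta and also keeps running statistics for inference):

import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-3):
    # per-feature statistics of the mini-batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # learned scale and shift

batch = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
print(batch_norm_forward(batch))  # each column now has mean ~0 and variance ~1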

In this post, I compare a neural network with a linear regression baseline and show how batch normalization layers speed up training.

Data

The Boston housing data (available at http://lib.stat.cmu.edu/datasets/boston) is used for this regression task. I didn't find a clean way to load the dataset directly, so a small hack is used to parse the raw file.

The columns are defined in the data dictionary:

  1. CRIM – per capita crime rate by town
  2. ZN – proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS – proportion of non-retail business acres per town.
  4. CHAS – Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX – nitric oxides concentration (parts per 10 million)
  6. RM – average number of rooms per dwelling
  7. AGE – proportion of owner-occupied units built prior to 1940
  8. DIS – weighted distances to five Boston employment centres
  9. RAD – index of accessibility to radial highways
  10. TAX – full-value property-tax rate per $10,000
  11. PTRATIO – pupil-teacher ratio by town
  12. B – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT – % lower status of the population
  14. MEDV – Median value of owner-occupied homes in $1000's

MEDV is the target variable we are going to predict.

Parse data

# load the data
# each record in the raw file spans two physical lines, so read it as
# fixed-width text and stitch the two halves of each record back together
import numpy as np
import pandas as pd

res = pd.read_fwf('./boston.txt', skiprows=22, header=None)
rows_odd = res.loc[np.arange(0, len(res), 2)].reset_index(drop=True)
rows_even = res.loc[np.arange(1, len(res), 2)].reset_index(drop=True)
res = pd.concat([rows_odd, rows_even], axis=1).iloc[:, :14]
res.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
               'RAD', 'TAX', 'PTRATIO', 'B1000', 'LSTAT', 'MEDV']
res.head()

[Figure: first rows of the parsed Boston housing data (output of res.head())]

Split train & valid data

from sklearn.model_selection import train_test_split

train, valid = train_test_split(res, train_size=0.8)
print('train set: ' + str(train.shape))
print('valid set: ' + str(valid.shape))
train set: (404, 14)
valid set: (102, 14)

# split each set into a (features, target) tuple
train = train.iloc[:, :13].values, train.iloc[:, 13].values
valid = valid.iloc[:, :13].values, valid.iloc[:, 13].values

Linear regression model – Baseline

Let's define a scoring function that reports mean absolute error (MAE), mean squared error (MSE) and Pearson correlation; it can be reused for all subsequent models and save us some typing.

from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error

def score_model(model, valid):
    x, y = valid
    preds = model.predict(x).reshape(-1)  # flatten predictions to a 1-D array
    mae = mean_absolute_error(y, preds)
    mse = mean_squared_error(y, preds)
    cor = pearsonr(y, preds)[0]
    return mae, mse, cor

Then we simply train and evaluate a vanilla linear model from sklearn.

from sklearn.linear_model import LinearRegression

model_lm = LinearRegression()
model_lm.fit(X=train[0], y=train[1])
print('linear model => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_lm, valid)))
linear model => mae: 3.79, mse: 30.17, cor :0.84

This is a simple linear regression model without regularization. The validation MAE is 3.79 in this run, which serves as our baseline.

Single layer neural network

Next, a single perceptron with a linear activation (14 parameters: 13 weights plus a bias, exactly as many as the linear regression) is used for a fair comparison with the simple linear model.

from keras.models import Sequential
from keras.layers import Dense

model_mlp = Sequential()
model_mlp.add(Dense(1, kernel_initializer='normal', activation='linear', input_dim=13))
model_mlp.compile('adam', 'mean_squared_error')
model_mlp.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_8 (Dense)              (None, 1)                 14
=================================================================
Total params: 14
Trainable params: 14
Non-trainable params: 0
_________________________________________________________________

After the first 200 epochs of training, the model's performance is disappointingly poor (MAE of 5.08):

model_mlp.fit(x=train[0], y=train[1],
              epochs=200, batch_size=32,
              verbose=0)
print('mlp model => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp, valid)))
mlp model => mae: 5.08, mse: 51.39, cor :0.73

After another 2000 epochs of fitting, the score gets close to the linear model baseline, which implies that the model is just as capable as a linear model; the convergence is simply extremely slow.

model_mlp.fit(x=train[0], y=train[1],
              epochs=2000, batch_size=32,
              verbose=0)
print('mlp model => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp, valid)))
mlp model => mae: 3.77, mse: 32.89, cor :0.83
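Alternatively, standardizing the input features beforehand (as noted in the introduction) should tackle the same scale problem for this shallow model. A minimal sketch using sklearn's StandardScaler (not run here, so no scores are reported; the scaler must be fit on the training features only):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_std = scaler.fit_transform(train[0])  # learn mean/variance on the training set
x_valid_std = scaler.transform(valid[0])      # reuse them for the validation set

model_mlp_std = Sequential()
model_mlp_std.add(Dense(1, kernel_initializer='normal', activation='linear', input_dim=13))
model_mlp_std.compile('adam', 'mean_squared_error')
model_mlp_std.fit(x=x_train_std, y=train[1], epochs=200, batch_size=32, verbose=0)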

Neural networks with batch normalization

Now we define two helper functions to test different scenarios: one builds a model with an optional hidden layer and optional batch normalization layers (after the input and after the hidden layer), and the other trains a model with early stopping and TensorBoard logging.

from keras.models import Model
from keras.layers import Input, Dense, BatchNormalization

def build_mlp(hidden_dim=None, bn=False):

    input_layer = Input((13,), name='input')

    if bn:  # normalize the raw inputs with a batch normalization layer
        out = BatchNormalization()(input_layer)
    else:
        out = input_layer

    if hidden_dim is not None:  # optional hidden layer
        out = Dense(hidden_dim,
                    activation='relu')(out)
    if bn:  # and normalize its outputs as well
        out = BatchNormalization()(out)

    out = Dense(1, kernel_initializer='normal',
                activation='linear')(out)
    model = Model(input_layer, out)
    model.compile('adam', 'mean_squared_error', ['mae'])
    return model

from keras.callbacks import EarlyStopping, TensorBoard

def train_model(model, train, valid, epochs, batch_size, model_name):

    # stop early when validation loss stops improving, and log metrics for TensorBoard
    es = EarlyStopping(patience=5)
    tb = TensorBoard(log_dir='./tensorboard/' + model_name)
    callbacks = [es, tb]

    model.fit(x=train[0], y=train[1],
              batch_size=batch_size,
              verbose=0,
              epochs=epochs,
              callbacks=callbacks,
              shuffle=True,
              validation_data=valid)
    return model

Let's run a few scenarios to examine model performance with a fixed number of epochs.

# define models for different scenarios
model_mlp_baseline = build_mlp()    # baseline neural network
model_mlp_bn = build_mlp(bn = True)   # neural network + batch normalization
model_mlp_bn_h16 = build_mlp(hidden_dim=16, bn = True)    # extra hidden layer
model_mlp_bn_h64 = build_mlp(hidden_dim=64, bn = True)    # extra wider hidden layer

epochs = 200
batch_size = 16

# train all the models
model_mlp_baseline = train_model(model_mlp_baseline, train, valid, epochs, batch_size, 'mlp_baseline')
model_mlp_bn = train_model(model_mlp_bn, train, valid, epochs, batch_size, 'mlp_bn')
model_mlp_bn_h16 = train_model(model_mlp_bn_h16, train, valid, epochs, batch_size, 'mlp_bn_h16')
model_mlp_bn_h64 = train_model(model_mlp_bn_h64, train, valid, epochs, batch_size, 'mlp_bn_h64')

# print model performance
print('mlp baseline model => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp_baseline, valid)))
print('mlp model with batch normalization => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp_bn, valid)))
print('mlp model with batch normalization + extra hidden layer => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp_bn_h16, valid)))
print('mlp model with batch normalization + wider hidden layer => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp_bn_h64, valid)))
mlp baseline model => mae: 5.79, mse: 69.34, cor :0.61
mlp model with batch normalization => mae: 3.46, mse: 30.12, cor :0.84
mlp model with batch normalization + extra hidden layer => mae: 3.11, mse: 20.40, cor :0.91
mlp model with batch normalization + wider hidden layer => mae: 2.86, mse: 18.56, cor :0.92

[Figure: TensorBoard plot of validation MAE over training for the four models]
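If you prefer plotting the learning curves inside the notebook rather than in TensorBoard, the History object returned by model.fit holds the per-epoch metrics (train_model above discards it). A minimal sketch, assuming matplotlib is installed; note that the exact metric key name (e.g. 'val_mean_absolute_error' vs 'val_mae') depends on the Keras version:

import matplotlib.pyplot as plt

# re-fit one of the models and keep the History object this time
history = build_mlp(hidden_dim=16, bn=True).fit(
    x=train[0], y=train[1],
    batch_size=16, epochs=200, verbose=0,
    validation_data=valid)

hist = history.history
val_mae = hist.get('val_mean_absolute_error', hist.get('val_mae'))
plt.plot(val_mae, label='validation MAE')
plt.xlabel('epoch')
plt.ylabel('MAE')
plt.legend()
plt.show()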

Compared with the linear regression baseline (MAE 3.79):

  • NN baseline: stopped prematurely (due to the early-stopping callback) at MAE 5.79
  • NN + batch normalization: converged stably to MAE 3.46

Increasing the network's complexity on top of batch normalization:

  • NN + BN + 16-unit hidden layer: MAE 3.11
  • NN + BN + 64-unit hidden layer: MAE 2.86

With the help of batch normalization layers, just one additional hidden dense layer is enough for the neural network to outperform the linear model.

Conclusion

Neural networks without normalization generally perform poorly and take many epochs just to reach performance comparable to a simple linear model. Batch normalization clearly makes training more stable and faster, and it is essential for speeding up convergence in deep neural networks.

The Jupyter notebook is available here for reference.
