Deep learning has become a hype, and everyone wants to try neural networks for better model performance. The embarrassing thing, however, is that a neural network is sometimes (maybe often) no better than a primitive linear model baseline. I ran into this problem when benchmarking a neural network against a simple linear model. One of the reasons is that neural networks are sensitive to the distributions and scales of their inputs, which usually need no special handling when fitting linear models or tree-based models.
Normalizing the input features (centring them and scaling to unit variance) should fix part of the problem, but it can’t help a deep neural network in which “internal covariate shift” arises during training: the distribution of each layer's inputs keeps changing as the preceding layers update. The batch normalization layer was introduced to normalize the outputs of network layers and so achieve faster and more stable convergence.
In this post, I compare a neural network with a linear regression baseline and show how a batch normalization layer speeds up training.
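To make the idea concrete, here is a minimal NumPy sketch of what a batch normalization layer computes on one mini-batch during training. It is an illustration under assumptions, not the Keras implementation: the function name `batch_norm_forward`, the `eps` value and the synthetic two-feature batch are all made up for this example, and the moving averages that a real layer keeps for inference are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for a (batch, features) array."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # centre and scale to ~unit variance
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
# hypothetical batch: two features on very different scales
x = rng.normal(loc=[0.5, 300.0], scale=[0.1, 150.0], size=(32, 2))
out = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```

With `gamma` fixed at 1 and `beta` at 0 this reduces to plain standardization of the batch; in a trained layer both are learned, so the network can undo the normalization where that helps.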
Data
The Boston housing data (downloadable from http://lib.stat.cmu.edu/datasets/boston) is used for this regression demonstration. I didn’t find a good way to load this dataset directly, so some hacks were used to parse it.
The columns are defined in the data dictionary:
- CRIM – per capita crime rate by town
- ZN – proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS – proportion of non-retail business acres per town.
- CHAS – Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX – nitric oxides concentration (parts per 10 million)
- RM – average number of rooms per dwelling
- AGE – proportion of owner-occupied units built prior to 1940
- DIS – weighted distances to five Boston employment centres
- RAD – index of accessibility to radial highways
- TAX – full-value property-tax rate per $10,000
- PTRATIO – pupil-teacher ratio by town
- B – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT – % lower status of the population
- MEDV – Median value of owner-occupied homes in $1000's
MEDV is the target variable we are going to predict.
Parse data
import numpy as np
import pandas as pd

# load the data; the odd/even split below assumes each record spans two physical lines
res = pd.read_fwf('./boston.txt', skiprows=22, header=None)
rows_odd = res.loc[np.arange(0, len(res), 2)].reset_index(drop=True)
rows_even = res.loc[np.arange(1, len(res), 2)].reset_index(drop=True)
# put the two halves side by side and keep the 14 real columns
res = pd.concat([rows_odd, rows_even], axis=1).iloc[:, :14]
res.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
               'RAD', 'TAX', 'PTRATIO', 'B1000', 'LSTAT', 'MEDV']
res.head()
Split train & valid data
from sklearn.model_selection import train_test_split

train, valid = train_test_split(res, train_size=0.8)
print('train set: ' + str(train.shape))
print('valid set: ' + str(valid.shape))
train set: (404, 14)
valid set: (102, 14)
# split each frame into (features, target) tuples
train = train.iloc[:, :13].values, train.iloc[:, 13].values
valid = valid.iloc[:, :13].values, valid.iloc[:, 13].values
Linear regression model – Baseline
Let’s define a function that scores models using mean absolute error (MAE), mean squared error (MSE) and Pearson correlation; it can be reused for subsequent models and saves us some typing.
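from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error

def score_model(model, valid):
    x, y = valid
    # flatten predictions so Keras (n, 1) outputs match the 1-D labels
    preds = model.predict(x).reshape(-1)
    label = y
    mae = mean_absolute_error(label, preds)
    mse = mean_squared_error(label, preds)
    cor = pearsonr(label, preds)[0]
    return mae, mse, cor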
Then we simply train and evaluate a vanilla linear model from sklearn.
from sklearn.linear_model import LinearRegression

model_lm = LinearRegression()
model_lm.fit(X=train[0], y=train[1])
print('linear model => mae: %3.2f, mse: %3.2f, cor :%3.2f' % (
    score_model(model_lm, valid)))
This is a simple linear regression model without regularization. Its validation mean absolute error is 3.45, which serves as the baseline for the models that follow.
Single layer neural network
Next, a single perceptron (one dense unit with a linear activation) is used for a fair comparison with the simple linear model.
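# imports assume the standalone keras package; use tensorflow.keras if that is what you have
from keras.layers import Dense
from keras.models import Sequential

model_mlp = Sequential()
model_mlp.add(Dense(1, kernel_initializer='normal', activation='linear', input_dim=13))
model_mlp.compile(optimizer='adam', loss='mean_squared_error')
model_mlp.summary()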
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_8 (Dense)              (None, 1)                 14
=================================================================
Total params: 14
Trainable params: 14
Non-trainable params: 0
_________________________________________________________________
After the first 200 epochs of training, the model performance is disappointingly poor (MAE @ 5.32).
model_mlp.fit(x=train[0], y=train[1], epochs=200, batch_size=32, verbose=0)
print('mlp model => mae: %3.2f, mse: %3.2f, cor :%3.2f' % (
    score_model(model_mlp, valid)))
After fitting for another 2000 epochs, the score gets closer to the linear model baseline, which implies that the network is as capable as a linear model but converges extremely slowly.
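One way to see why unscaled inputs slow convergence is to run plain gradient descent on a least-squares problem twice, once on raw features and once on standardized ones. The sketch below is only an illustration under assumptions: the data are synthetic (not the Boston set), the helper `gd_final_mse` is made up for this example, and it uses vanilla gradient descent rather than the Adam optimizer used in this post, so the numbers are not comparable with the results here.

```python
import numpy as np

def gd_final_mse(X, y, steps=200):
    """Run plain gradient descent on least squares and return the final MSE."""
    A = np.hstack([X, np.ones((len(X), 1))])       # append a bias column
    hessian = 2.0 * A.T @ A / len(A)
    # step size 1 / largest eigenvalue keeps gradient descent stable on this quadratic loss
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ w - y) / len(A)
        w -= lr * grad
    return float(np.mean((A @ w - y) ** 2))

rng = np.random.default_rng(0)
# synthetic data: one feature near unit scale, one in the hundreds
X = np.column_stack([rng.normal(0, 1, 400), rng.normal(300, 100, 400)])
y = X @ np.array([2.0, 0.05]) + 1.0                # noise-free linear target

X_std = (X - X.mean(axis=0)) / X.std(axis=0)       # centre and scale each feature
mse_raw = gd_final_mse(X, y)
mse_std = gd_final_mse(X_std, y)
print(f"raw: {mse_raw:.4f}, standardized: {mse_std:.2e}")
```

The large-scale feature forces a tiny step size on the raw data, so the weights of the other directions barely move in 200 steps, while the standardized problem is well conditioned and converges almost immediately.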
model_mlp.fit(x=train[0], y=train[1], epochs=2000, batch_size=32, verbose=0)
print('mlp model => mae: %3.2f, mse: %3.2f, cor :%3.2f' % (
    score_model(model_mlp, valid)))
Neural networks with batch normalization
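Now we define two more functions: one builds a network with optional batch normalization layers (on the inputs and after the hidden dense layer), and the other trains it with early stopping and TensorBoard logging, so that we can test different model scenarios.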
# imports assume the standalone keras package; use tensorflow.keras if that is what you have
from keras.callbacks import EarlyStopping, TensorBoard
from keras.layers import BatchNormalization, Dense, Input
from keras.models import Model

def build_mlp(hidden_dim=None, bn=False):
    input_layer = Input(shape=(13,), name='input')
    if bn:
        # add batch normalization layer on the raw inputs
        out = BatchNormalization()(input_layer)
    else:
        out = input_layer
    if hidden_dim is not None:
        out = Dense(hidden_dim, activation='relu')(out)
        if bn:
            out = BatchNormalization()(out)
    out = Dense(1, kernel_initializer='normal', activation='linear')(out)
    model = Model(input_layer, out)
    # keyword arguments avoid depending on the positional order of compile()
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
    return model

def train_model(model, train, valid, epochs, batch_size, model_name):
    es = EarlyStopping(patience=5)  # stop after 5 epochs without improvement
    tb = TensorBoard(log_dir='./tensorboard/' + model_name)
    callbacks = [es, tb]
    model.fit(x=train[0], y=train[1], batch_size=batch_size, verbose=0,
              epochs=epochs, callbacks=callbacks, shuffle=True,
              validation_data=valid)
    return model
Let’s run some scenarios to examine the model performance with the same maximum number of epochs.
# define models for different scenarios
model_mlp_baseline = build_mlp()                       # baseline neural network
model_mlp_bn = build_mlp(bn=True)                      # neural network + batch normalization
model_mlp_bn_h16 = build_mlp(hidden_dim=16, bn=True)   # extra hidden layer
model_mlp_bn_h64 = build_mlp(hidden_dim=64, bn=True)   # extra wider hidden layer

epochs = 200
batch_size = 16

# train all the models
model_mlp_baseline = train_model(model_mlp_baseline, train, valid, epochs, batch_size, 'mlp_baseline')
model_mlp_bn = train_model(model_mlp_bn, train, valid, epochs, batch_size, 'mlp_bn')
model_mlp_bn_h16 = train_model(model_mlp_bn_h16, train, valid, epochs, batch_size, 'mlp_bn_h16')
model_mlp_bn_h64 = train_model(model_mlp_bn_h64, train, valid, epochs, batch_size, 'mlp_bn_h64')

# print model performance
print('mlp baseline model => mae: %3.2f, mse: %3.2f, cor :%3.2f' % (
    score_model(model_mlp_baseline, valid)))
print('mlp model with batch normalization => mae: %3.2f, mse: %3.2f, cor :%3.2f' % (
    score_model(model_mlp_bn, valid)))
print('mlp model with batch normalization + extra hidden layer => mae: %3.2f, mse: %3.2f, cor :%3.2f' % (
    score_model(model_mlp_bn_h16, valid)))
print('mlp model with batch normalization + wider hidden layer => mae: %3.2f, mse: %3.2f, cor :%3.2f' % (
    score_model(model_mlp_bn_h64, valid)))
mlp baseline model => mae: 5.79, mse: 69.34, cor :0.61
mlp model with batch normalization => mae: 3.46, mse: 30.12, cor :0.84
mlp model with batch normalization + extra hidden layer => mae: 3.11, mse: 20.40, cor :0.91
mlp model with batch normalization + wider hidden layer => mae: 2.86, mse: 18.56, cor :0.92
Against the linear model baseline (MAE @ 3.45):
- NN baseline: stopped prematurely (because of the early-stopping callback) with MAE @ 5.79
- NN + batch normalization: converged stably with MAE @ 3.46
Increasing the complexity of the NN on top of batch normalization:
- NN + BN + 16-dimensional hidden layer: MAE @ 3.11
- NN + BN + 64-dimensional hidden layer: MAE @ 2.86
With the help of batch normalization layers, just one additional hidden dense layer makes the neural network outperform the linear model.
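The premature stop of the baseline network comes from the patience rule of the early-stopping callback. The pure-Python sketch below mimics that rule in simplified form to show how a short plateau in validation loss can end training before a later improvement. It is not the Keras source: the function `early_stop_epoch` and both loss sequences are hypothetical and made up for illustration.

```python
def early_stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss          # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1            # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# hypothetical validation losses: a five-epoch plateau hides a later improvement
noisy = [10.0, 9.0, 8.0, 8.5, 8.2, 8.1, 8.3, 8.05, 7.0]
# hypothetical validation losses that improve every epoch
steady = [10.0, 9.0, 8.0, 7.5, 7.2, 7.1, 7.0]
print(early_stop_epoch(noisy), early_stop_epoch(steady))
```

A model whose validation loss improves slowly and noisily, as the unnormalized baseline seems to, is more likely to hit such a plateau than one that converges smoothly.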
Conclusion
In this experiment, the neural network without normalization performed badly and took many epochs to reach performance comparable to a simple linear model. Batch normalization clearly made training more stable and faster, and it is essential for speeding up the convergence of deep neural networks.
The Jupyter notebook is available here for reference.