The evaluation section of the Kaggle competition defines the metric as the Root Mean Squared Logarithmic Error (RMSLE). The objective is to minimize this metric on the test data. An error is simply the difference between the predicted value and the actual value:
error = predicted value - actual value
The Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors across all observations.
However, the metric in the Kaggle competition is built on a log error instead:
log_error = log(predicted value + 1) - log(actual value + 1)
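To make the definition concrete, here is a minimal sketch of RMSLE computed directly on untransformed values. The helper name rmsle_direct and the sample durations are illustrative assumptions, not part of the competition code; np.log1p(v) is NumPy's way of writing log(v + 1).

```python
import numpy as np

def rmsle_direct(predicted, actual):
    """RMSLE on raw (untransformed) values; hypothetical helper for illustration."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # log1p(v) equals log(v + 1), matching the log error defined above
    log_error = np.log1p(predicted) - np.log1p(actual)
    # Square the log errors, average them, then take the square root
    return float(np.sqrt(np.mean(log_error ** 2)))

# Made-up trip durations in seconds, for illustration only
actual = [300, 600, 1200]
predicted = [330, 540, 1500]
score = rmsle_direct(predicted, actual)
```

Because the error is taken on the log scale, the metric penalizes relative differences: being off by 30 seconds on a 300-second trip counts far more than being off by 30 seconds on a 3,000-second trip.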
Therefore, it is important to apply a log transform to the trip_duration column, as we did earlier:
df["trip_duration"] = np.log(df["trip_duration"] + 1)
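Once the target lives on the log scale, a plain RMSE on the transformed values should equal the RMSLE on the raw values. The following is a quick check of that claim using made-up numbers (the arrays are assumptions for illustration, not data from the competition); it also shows np.expm1 as the inverse of the transform.

```python
import numpy as np

# Made-up raw durations in seconds (illustrative, not from the dataset)
actual_raw = np.array([300.0, 600.0, 1200.0])
predicted_raw = np.array([330.0, 540.0, 1500.0])

# RMSLE computed on the raw scale
rmsle_raw = np.sqrt(
    np.mean((np.log(predicted_raw + 1) - np.log(actual_raw + 1)) ** 2)
)

# Apply the same transform used on trip_duration, then compute a plain RMSE
actual_log = np.log(actual_raw + 1)
predicted_log = np.log(predicted_raw + 1)
rmse_log = np.sqrt(np.mean((predicted_log - actual_log) ** 2))

# np.expm1 computes exp(v) - 1, which undoes log(v + 1)
recovered = np.expm1(predicted_log)
```

The two scores agree up to floating-point error, and recovered matches predicted_raw, so predictions made on the log scale can be mapped back to seconds when needed.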
Because the target is already log-transformed, we can now use a function that calculates RMSE rather than one that calculates RMSLE:
import math
def rmse(x,y...