Inverted validation curve #51

Open
aabk-bkaa opened this issue Aug 25, 2020 · 1 comment

Comments


aabk-bkaa commented Aug 25, 2020

After fitting our model it appears that our validation curve is inverted:

[screenshot: validation curve with the validation RMSE plotted below the training RMSE]

The validation RMSE is systematically lower than the training RMSE, which does not make intuitive sense to us.

The results were produced with the following code:

```python
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error as mse

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=1)

lambdas = np.logspace(0, 8, 12)

folds = KFold(n_splits=5)
MSE_list = []

for _lambda in tqdm(lambdas):
    pipe_preproc = make_pipeline(PolynomialFeatures(2),
                                 StandardScaler(),
                                 Lasso(alpha=_lambda, max_iter=1000))
    MSE_train = []
    MSE_list_intermediate = []

    for train_index, val_index in folds.split(X_train, y_train):
        X_tr, y_tr = X_train.iloc[train_index], y_train.iloc[train_index]
        X_val, y_val = X_train.iloc[val_index], y_train.iloc[val_index]

        pipe_preproc.fit(X_tr, y_tr)
        # validation RMSE on the held-out fold
        MSE_list_intermediate.append(mse(y_val, pipe_preproc.predict(X_val)) ** 0.5)
        # "training" RMSE -- note this is evaluated on the full X_train,
        # not only on the fold's training part X_tr
        MSE_train.append(mse(y_train, pipe_preproc.predict(X_train)) ** 0.5)

    MSE_list.append([_lambda] + MSE_list_intermediate
                    + [np.mean(MSE_list_intermediate)] + [np.mean(MSE_train)])

MSE = pd.DataFrame(MSE_list)
MSE.columns = ["Lambda", "Fold 1", "Fold 2", "Fold 3", "Fold 4", "Fold 5",
               "Mean_RMSE", "Mean_RMSE_Evaluation"]

MSE.to_excel("LASSO_output.xlsx")
```

Can anybody help us?

Kind regards Anton and Søren

jsr-p (Collaborator) commented Aug 25, 2020

Hi @aabk-bkaa,
Assuming you did not mislabel the curves when plotting, there are several legitimate reasons why the RMSE can be lower on the validation data than on the training data. See:
https://stats.stackexchange.com/questions/187335/validation-error-less-than-training-error
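As a cross-check against the hand-rolled loop, scikit-learn's built-in `validation_curve` computes per-fold training and validation scores in one call, which rules out bookkeeping mistakes (e.g. evaluating the training RMSE on the wrong subset). A minimal sketch, using synthetic data from `make_regression` since the original `X`, `y` are not shown in the issue:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the issue's data (the real X, y are not shown).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

pipe = make_pipeline(PolynomialFeatures(2),
                     StandardScaler(),
                     Lasso(max_iter=10_000))

lambdas = np.logspace(0, 8, 12)

# validation_curve fits the pipeline once per (alpha, fold) pair and
# scores it on both the training part and the held-out fold.
train_scores, val_scores = validation_curve(
    pipe, X, y,
    param_name="lasso__alpha", param_range=lambdas,
    cv=KFold(n_splits=5), scoring="neg_root_mean_squared_error",
)

# Scores are negative RMSE; flip the sign before plotting.
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)
```

Plotting `train_rmse` and `val_rmse` against `lambdas` (log scale) from this output should make it clear whether the inversion comes from the data or from the manual cross-validation loop.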
