During a regression task, I am getting low R^2 values, but elementwise difference between test set and...

I am doing a random forest regression on my dataset (which has abut 15 input features and 1 target feature). I am getting a decently low R^2 of <1 for both the train and test sets (please do let me know if <1 is not a good-enough R^2 score).

import pandas as pd

import numpy as np

from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split



# load dataset

df = pd.read_csv('Dataset.csv')



# split into input (X) and output (Y) variables

X = df.drop(['ID_COLUMN', 'TARGET_COLUMN'], axis=1)

Y = df.TARGET_COLUMN



# Split the data into 67% for training and 33% for testing

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)



# Fitting the regression model to the dataset

regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)

regressor.fit(X_train, Y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message





print("Predicting Values:")

y_pred = regressor.predict(X_test)



print("Getting Model Performance...")



# Get regression scores

print("R^2 train = ", regressor.score(X_train, Y_train))

print("R^2 test = ", regressor.score(X_test, Y_test))

This outputs the following:

Predicting Values:

Getting Model Performance...

R^2 train =  0.9791000275450427

R^2 test = 0.8577464692386905

Then, I checked the difference between the actual target column values in the test dataset versus the predicted values, like so:

diff = 

for i in range(len(y_pred)):

    if Y_test.values[i]!=0: # a few values were 0 which was causing the corresponding diff value to become inf

        diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i]) # element-wise percentage error

I found that the majority of the element-wise differences were between 40-60% and their mean was almost 50%!

np.mean(diff)

>>> 49.07580695857447

So, which one is correct? Is the regression score correct and my model is good for this data, or is the element-wise error I calculated correct and the model didn't do well for this data? If its the latter, please advise on how to increase the prediction accuracy.

I also checked the rmse score:

import math

rmse = math.sqrt(np.mean((np.array(Y_test) - y_pred)**2))

rmse

>>> 3.67328471827293

This seems quite high for the model to have done a good job, but please correct me if I'm wrong.

And I also checked the R^2 scores for different number of estimators:

import matplotlib.pyplot as plt

model = RandomForestRegressor(n_jobs=-1)

# Try different numbers of n_estimators

estimators = np.arange(10, 200, 10)

scores = 

for n in estimators:

    model.set_params(n_estimators=n)

    model.fit(X_train, Y_train)

    scores.append(model.score(X_test, Y_test))

plt.title("Effect of n_estimators")

plt.xlabel("n_estimator")

plt.ylabel("score")

plt.plot(estimators, scores)

enter image description here

Please advise.

I tried using linear regression first, and got a very high MSE (which is why I was trying out random forest):

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score



lr = LinearRegression()

lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)



# The coefficients

print('Coefficients: n', lr.coef_)

# The mean squared error

print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# Explained variance score: 1 is perfect prediction

print('Variance score: %.2f' % r2_score(y_test, y_pred))





Coefficients: 

 [ 1.93829229e-01 -4.68738825e-01  2.01635420e-01  6.35902010e-01

  6.57354434e-03  5.13180293e-03  2.84015810e-01 -1.31469084e-06

  1.95335035e+00]

Mean squared error: 86.92

Variance score: 0.08

asked Nov 29 '18 at 9:01

Kristada673

1715

bumped to the homepage by Community♦ 11 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

add a comment |

import pandas as pd

import numpy as np

from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split



# load dataset

df = pd.read_csv('Dataset.csv')



# split into input (X) and output (Y) variables

X = df.drop(['ID_COLUMN', 'TARGET_COLUMN'], axis=1)

Y = df.TARGET_COLUMN



# Split the data into 67% for training and 33% for testing

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)



# Fitting the regression model to the dataset

regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)

regressor.fit(X_train, Y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message





print("Predicting Values:")

y_pred = regressor.predict(X_test)



print("Getting Model Performance...")



# Get regression scores

print("R^2 train = ", regressor.score(X_train, Y_train))

print("R^2 test = ", regressor.score(X_test, Y_test))

This outputs the following:

Predicting Values:

Getting Model Performance...

R^2 train =  0.9791000275450427

R^2 test = 0.8577464692386905

Then, I checked the difference between the actual target column values in the test dataset versus the predicted values, like so:

diff = 

for i in range(len(y_pred)):

    if Y_test.values[i]!=0: # a few values were 0 which was causing the corresponding diff value to become inf

        diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i]) # element-wise percentage error

I found that the majority of the element-wise differences were between 40-60% and their mean was almost 50%!

np.mean(diff)

>>> 49.07580695857447

I also checked the rmse score:

import math

rmse = math.sqrt(np.mean((np.array(Y_test) - y_pred)**2))

rmse

>>> 3.67328471827293

This seems quite high for the model to have done a good job, but please correct me if I'm wrong.

And I also checked the R^2 scores for different number of estimators:

import matplotlib.pyplot as plt

model = RandomForestRegressor(n_jobs=-1)

# Try different numbers of n_estimators

estimators = np.arange(10, 200, 10)

scores = 

for n in estimators:

    model.set_params(n_estimators=n)

    model.fit(X_train, Y_train)

    scores.append(model.score(X_test, Y_test))

plt.title("Effect of n_estimators")

plt.xlabel("n_estimator")

plt.ylabel("score")

plt.plot(estimators, scores)

enter image description here

Please advise.

I tried using linear regression first, and got a very high MSE (which is why I was trying out random forest):

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score



lr = LinearRegression()

lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)



# The coefficients

print('Coefficients: n', lr.coef_)

# The mean squared error

print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# Explained variance score: 1 is perfect prediction

print('Variance score: %.2f' % r2_score(y_test, y_pred))





Coefficients: 

 [ 1.93829229e-01 -4.68738825e-01  2.01635420e-01  6.35902010e-01

  6.57354434e-03  5.13180293e-03  2.84015810e-01 -1.31469084e-06

  1.95335035e+00]

Mean squared error: 86.92

Variance score: 0.08

asked Nov 29 '18 at 9:01

Kristada673

1715

bumped to the homepage by Community♦ 11 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

add a comment |

import pandas as pd

import numpy as np

from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split



# load dataset

df = pd.read_csv('Dataset.csv')



# split into input (X) and output (Y) variables

X = df.drop(['ID_COLUMN', 'TARGET_COLUMN'], axis=1)

Y = df.TARGET_COLUMN



# Split the data into 67% for training and 33% for testing

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)



# Fitting the regression model to the dataset

regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)

regressor.fit(X_train, Y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message





print("Predicting Values:")

y_pred = regressor.predict(X_test)



print("Getting Model Performance...")



# Get regression scores

print("R^2 train = ", regressor.score(X_train, Y_train))

print("R^2 test = ", regressor.score(X_test, Y_test))

This outputs the following:

Predicting Values:

Getting Model Performance...

R^2 train =  0.9791000275450427

R^2 test = 0.8577464692386905

Then, I checked the difference between the actual target column values in the test dataset versus the predicted values, like so:

diff = 

for i in range(len(y_pred)):

    if Y_test.values[i]!=0: # a few values were 0 which was causing the corresponding diff value to become inf

        diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i]) # element-wise percentage error

I found that the majority of the element-wise differences were between 40-60% and their mean was almost 50%!

np.mean(diff)

>>> 49.07580695857447

I also checked the rmse score:

import math

rmse = math.sqrt(np.mean((np.array(Y_test) - y_pred)**2))

rmse

>>> 3.67328471827293

This seems quite high for the model to have done a good job, but please correct me if I'm wrong.

And I also checked the R^2 scores for different number of estimators:

import matplotlib.pyplot as plt

model = RandomForestRegressor(n_jobs=-1)

# Try different numbers of n_estimators

estimators = np.arange(10, 200, 10)

scores = 

for n in estimators:

    model.set_params(n_estimators=n)

    model.fit(X_train, Y_train)

    scores.append(model.score(X_test, Y_test))

plt.title("Effect of n_estimators")

plt.xlabel("n_estimator")

plt.ylabel("score")

plt.plot(estimators, scores)

enter image description here

Please advise.

I tried using linear regression first, and got a very high MSE (which is why I was trying out random forest):

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score



lr = LinearRegression()

lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)



# The coefficients

print('Coefficients: n', lr.coef_)

# The mean squared error

print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# Explained variance score: 1 is perfect prediction

print('Variance score: %.2f' % r2_score(y_test, y_pred))





Coefficients: 

 [ 1.93829229e-01 -4.68738825e-01  2.01635420e-01  6.35902010e-01

  6.57354434e-03  5.13180293e-03  2.84015810e-01 -1.31469084e-06

  1.95335035e+00]

Mean squared error: 86.92

Variance score: 0.08

asked Nov 29 '18 at 9:01

Kristada673

1715

import pandas as pd

import numpy as np

from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split



# load dataset

df = pd.read_csv('Dataset.csv')



# split into input (X) and output (Y) variables

X = df.drop(['ID_COLUMN', 'TARGET_COLUMN'], axis=1)

Y = df.TARGET_COLUMN



# Split the data into 67% for training and 33% for testing

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)



# Fitting the regression model to the dataset

regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)

regressor.fit(X_train, Y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message





print("Predicting Values:")

y_pred = regressor.predict(X_test)



print("Getting Model Performance...")



# Get regression scores

print("R^2 train = ", regressor.score(X_train, Y_train))

print("R^2 test = ", regressor.score(X_test, Y_test))

This outputs the following:

Predicting Values:

Getting Model Performance...

R^2 train =  0.9791000275450427

R^2 test = 0.8577464692386905

Then, I checked the difference between the actual target column values in the test dataset versus the predicted values, like so:

diff = 

for i in range(len(y_pred)):

    if Y_test.values[i]!=0: # a few values were 0 which was causing the corresponding diff value to become inf

        diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i]) # element-wise percentage error

I found that the majority of the element-wise differences were between 40-60% and their mean was almost 50%!

np.mean(diff)

>>> 49.07580695857447

I also checked the rmse score:

import math

rmse = math.sqrt(np.mean((np.array(Y_test) - y_pred)**2))

rmse

>>> 3.67328471827293

This seems quite high for the model to have done a good job, but please correct me if I'm wrong.

And I also checked the R^2 scores for different number of estimators:

import matplotlib.pyplot as plt

model = RandomForestRegressor(n_jobs=-1)

# Try different numbers of n_estimators

estimators = np.arange(10, 200, 10)

scores = 

for n in estimators:

    model.set_params(n_estimators=n)

    model.fit(X_train, Y_train)

    scores.append(model.score(X_test, Y_test))

plt.title("Effect of n_estimators")

plt.xlabel("n_estimator")

plt.ylabel("score")

plt.plot(estimators, scores)

enter image description here

Please advise.

I tried using linear regression first, and got a very high MSE (which is why I was trying out random forest):

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score



lr = LinearRegression()

lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)



# The coefficients

print('Coefficients: n', lr.coef_)

# The mean squared error

print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

# Explained variance score: 1 is perfect prediction

print('Variance score: %.2f' % r2_score(y_test, y_pred))





Coefficients: 

 [ 1.93829229e-01 -4.68738825e-01  2.01635420e-01  6.35902010e-01

  6.57354434e-03  5.13180293e-03  2.84015810e-01 -1.31469084e-06

  1.95335035e+00]

Mean squared error: 86.92

Variance score: 0.08

machine-learning python predictive-modeling regression random-forest

asked Nov 29 '18 at 9:01

Kristada673

1715

asked Nov 29 '18 at 9:01

Kristada673

1715

asked Nov 29 '18 at 9:01

Kristada673

1715

asked Nov 29 '18 at 9:01

Kristada673

1715

asked Nov 29 '18 at 9:01

Kristada673

1715

bumped to the homepage by Community♦ 11 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

bumped to the homepage by Community♦ 11 mins ago

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

add a comment |

1 Answer
1

active

oldest

votes

This line looks wrong to me:

diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i])

Shouldn't the abs be around the entire calculation?

diff.append(100*np.abs((y_pred[i]-Y_test.values[i])/Y_test.values[i]))

That aside, the RMSE calculation looks accurate and is in the scale of the error, and the $R^2$ is great, so all things being equal, I would lean towards looking for something you did wrong in assessing the errors. That's why I was focused on your calculation.

One other thought, have you checked for outliers? This could affect some measures and not others as drastically.

answered Nov 29 '18 at 19:24

Skiddles

700210

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f41842%2fduring-a-regression-task-i-am-getting-low-r2-values-but-elementwise-differenc%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

This line looks wrong to me:

diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i])

Shouldn't the abs be around the entire calculation?

diff.append(100*np.abs((y_pred[i]-Y_test.values[i])/Y_test.values[i]))

One other thought, have you checked for outliers? This could affect some measures and not others as drastically.

answered Nov 29 '18 at 19:24

Skiddles

700210

add a comment |

This line looks wrong to me:

diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i])

Shouldn't the abs be around the entire calculation?

diff.append(100*np.abs((y_pred[i]-Y_test.values[i])/Y_test.values[i]))

One other thought, have you checked for outliers? This could affect some measures and not others as drastically.

answered Nov 29 '18 at 19:24

Skiddles

700210

add a comment |

This line looks wrong to me:

diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i])

Shouldn't the abs be around the entire calculation?

diff.append(100*np.abs((y_pred[i]-Y_test.values[i])/Y_test.values[i]))

One other thought, have you checked for outliers? This could affect some measures and not others as drastically.

answered Nov 29 '18 at 19:24

Skiddles

700210

This line looks wrong to me:

diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i])

Shouldn't the abs be around the entire calculation?

diff.append(100*np.abs((y_pred[i]-Y_test.values[i])/Y_test.values[i]))

One other thought, have you checked for outliers? This could affect some measures and not others as drastically.

answered Nov 29 '18 at 19:24

Skiddles

700210

answered Nov 29 '18 at 19:24

Skiddles

700210

answered Nov 29 '18 at 19:24

Skiddles

700210

answered Nov 29 '18 at 19:24

Skiddles

700210

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Data Science Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

G zEX vt CDmSA9 oUVGNKDB DYpJbqbMO6QX43jP 8Nhc9UYF

搜尋此網誌

Gfyuki