During a regression task, I am getting low R^2 values, but elementwise difference between test set and...
I am doing a random forest regression on my dataset (which has about 15 input features and 1 target feature). I am getting decently high R^2 scores of just under 1 for both the train and test sets (please do let me know if these are actually not good-enough R^2 scores).
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# load dataset
df = pd.read_csv('Dataset.csv')
# split into input (X) and output (Y) variables
X = df.drop(['ID_COLUMN', 'TARGET_COLUMN'], axis=1)
Y = df.TARGET_COLUMN
# Split the data into 67% for training and 33% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators=100, random_state=50)
regressor.fit(X_train, Y_train.values.ravel()) # ravel() avoids a 'DataConversionWarning' about a column-vector y
print("Predicting Values:")
y_pred = regressor.predict(X_test)
print("Getting Model Performance...")
# Get regression scores
print("R^2 train = ", regressor.score(X_train, Y_train))
print("R^2 test = ", regressor.score(X_test, Y_test))
This outputs the following:
Predicting Values:
Getting Model Performance...
R^2 train = 0.9791000275450427
R^2 test = 0.8577464692386905
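As a side note, the train/test split above is not seeded, so these exact numbers will change on every run. A minimal reproducible variant (random_state=42 is an arbitrary choice):
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)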
Then, I checked the difference between the actual target column values in the test dataset versus the predicted values, like so:
diff = []
for i in range(len(y_pred)):
    if Y_test.values[i] != 0: # a few values were 0, which made the corresponding diff value inf
        diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i]) # element-wise percentage error
I found that the majority of the element-wise differences were between 40% and 60%, and their mean was almost 50%!
np.mean(diff)
>>> 49.07580695857447
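For what it's worth, the loop above can be vectorized with NumPy into a mean absolute percentage error (MAPE) calculation, which also makes the zero-target masking explicit. A minimal sketch, assuming the same Y_test and y_pred as above:
y_true = Y_test.values
mask = y_true != 0 # skip zero targets to avoid dividing by zero
ape = 100 * np.abs((y_pred[mask] - y_true[mask]) / y_true[mask]) # absolute percentage errors
print("MAPE:", np.mean(ape))
(Newer scikit-learn versions also ship sklearn.metrics.mean_absolute_percentage_error, if your install has it.)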
So, which one is correct? Is the regression score correct and my model is good for this data, or is the element-wise error I calculated correct and the model didn't do well for this data? If it's the latter, please advise on how to increase the prediction accuracy.
I also checked the rmse score:
import math
rmse = math.sqrt(np.mean((np.array(Y_test) - y_pred)**2))
rmse
>>> 3.67328471827293
This seems quite high for the model to have done a good job, but please correct me if I'm wrong.
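One way to judge whether the RMSE is high is to compare it to the spread of the target itself: on a test set, R^2 = 1 - MSE/Var(y), so with R^2 ≈ 0.86 the RMSE should come out to roughly sqrt(1 - 0.86) ≈ 0.38 of the target's standard deviation. A quick check, assuming the same variables as above:
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(Y_test, y_pred))
print("RMSE:", rmse)
print("target std:", np.std(Y_test))
print("RMSE / std:", rmse / np.std(Y_test)) # should be well below 1 for a useful model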
I also checked the R^2 scores for different numbers of estimators:
import matplotlib.pyplot as plt
model = RandomForestRegressor(n_jobs=-1)
# Try different numbers of n_estimators
estimators = np.arange(10, 200, 10)
scores = []
for n in estimators:
    model.set_params(n_estimators=n)
    model.fit(X_train, Y_train)
    scores.append(model.score(X_test, Y_test))
plt.title("Effect of n_estimators")
plt.xlabel("n_estimators")
plt.ylabel("R^2 score")
plt.plot(estimators, scores)
plt.show()
Please advise.
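Since a single random split can make curves like this noisy, cross-validation gives a more stable estimate of the test R^2. A minimal sketch using cross_val_score (5 folds is an arbitrary choice):
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=50), X, Y, cv=5, scoring="r2")
print("CV R^2: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))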
I tried using linear regression first, and got a very high MSE (which is why I was trying out random forest):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
lr = LinearRegression()
lr.fit(X_train, Y_train)
y_pred = lr.predict(X_test)
# The coefficients
print('Coefficients:\n', lr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Y_test, y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Y_test, y_pred))
Coefficients:
[ 1.93829229e-01 -4.68738825e-01 2.01635420e-01 6.35902010e-01
6.57354434e-03 5.13180293e-03 2.84015810e-01 -1.31469084e-06
1.95335035e+00]
Mean squared error: 86.92
Variance score: 0.08
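A test R^2 of 0.08 for linear regression versus ~0.86 for the forest suggests the relationship is strongly non-linear. If you want to probe that while staying with a linear model, adding polynomial features is one cheap check; a sketch under that assumption (degree 2 is an arbitrary choice):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# degree-2 features add squares and pairwise interactions of the ~15 inputs
poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly.fit(X_train, Y_train)
print("R^2 test:", poly.score(X_test, Y_test))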
machine-learning python predictive-modeling regression random-forest
asked Nov 29 '18 at 9:01 by Kristada673
1 Answer
This line looks wrong to me:
diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i])
Shouldn't the abs be around the entire calculation?
diff.append(100*np.abs((y_pred[i]-Y_test.values[i])/Y_test.values[i]))
That aside, your RMSE calculation looks correct, and since RMSE is in the units of the target, whether 3.67 is large depends entirely on the target's scale. The $R^2$ is good, so all things being equal, I would lean towards a mistake in how the errors were assessed, which is why I focused on your calculation.
One other thought: have you checked for outliers? Outliers can affect some of these measures far more drastically than others.
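A quick way to check both points at once (assuming your Y_test and y_pred from the question) is to compare the mean and median percentage errors and to look at the target's distribution; a few near-zero or extreme targets will pull the mean far above the median:
print(Y_test.describe()) # look for near-zero or extreme target values
ape = 100 * np.abs((y_pred - Y_test.values) / Y_test.values)
ape = ape[np.isfinite(ape)] # drop infinities from zero targets
print("mean APE:", np.mean(ape), " median APE:", np.median(ape))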
answered Nov 29 '18 at 19:24 by Skiddles