LinearRegression with multiple binary features sometimes performs poorly
I have a dataset comprising a number of binary features which are the dummies (as in, pd.get_dummies()) of categorical features. SalePrice is my target variable.
I'm fitting a sklearn LinearRegression model with that data a thousand times to get an average score, and I'm getting a weird result. The relevant bit of my code looks like this:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

scores = np.array([])  # note: np.array() with no argument raises a TypeError
for i in range(1000):
    x3_train, x3_test, y3_train, y3_test = train_test_split(
        df3.drop('SalePrice', axis=1),
        df3.SalePrice,
        test_size=0.33
    )
    lr3 = LinearRegression()
    lr3.fit(x3_train, y3_train)
    scores = np.append(scores, lr3.score(x3_test, y3_test))
print(scores.mean())
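(As an aside, the same repeated random-split averaging can be expressed with scikit-learn's ShuffleSplit and cross_val_score. A self-contained sketch on synthetic dummy-encoded data, since df3 itself isn't shown here:)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for df3: one categorical column, dummy-encoded,
# with a target that depends on the category plus a little noise.
rng = np.random.default_rng(0)
cat = rng.choice(list("abcd"), size=500)
X = pd.get_dummies(pd.Series(cat)).astype(float)
y = X.values @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(0, 0.1, size=500)

# ShuffleSplit reproduces the repeated random 67/33 splits from the loop above.
cv = ShuffleSplit(n_splits=100, test_size=0.33, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(scores.mean())
```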
Now the weird result is that the average is extremely poor, because every so often the model tanks completely, while most of the time it performs "reasonably" (still bad, but that's no surprise, as it's incredibly basic and not tuned at all; I'm just comparing the effect of treating a set of features in different ways). For example, the first 30 runs generated these scores:
0 5.907010e-01
1 6.044523e-01
2 5.178049e-01
3 5.622240e-01
4 5.810432e-01
5 5.131722e-01
6 5.772946e-01
7 4.674152e-01
8 4.962015e-01
9 4.887872e-01
10 5.144772e-01
11 5.676829e-01
12 5.122566e-01
13 5.453985e-01
14 5.355022e-01
15 5.888459e-01
16 5.552912e-01
17 5.615658e-01
18 5.472429e-01
19 5.810185e-01
20 5.334900e-01
21 5.493619e-01
22 5.567195e-01
23 5.514374e-01
24 4.916478e-01
25 4.580718e-01
26 5.286095e-01
27 5.761865e-01
28 5.638573e-01
29 -1.809208e+24
Name: lr3, dtype: float64
My question is: what is likely to be happening on that 30th run such that the model performs so poorly? I'm comparing this model to others that treat the data differently (e.g. simply encoding with .astype('category').cat.codes), and while there are relatively minor variations in the "usual" range of scores (they're all roughly 0.44 - 0.63), those other models don't show this occasional complete tanking.
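For concreteness, the two treatments being compared look like this on a small hypothetical column (not my actual data):

```python
import pandas as pd

s = pd.Series(["red", "green", "blue", "green"], dtype="category")

# Treatment 1: one binary column per category level (what df3 uses).
dummies = pd.get_dummies(s)
print(dummies.columns.tolist())  # ['blue', 'green', 'red']

# Treatment 2: a single integer code per level.
codes = s.cat.codes
print(codes.tolist())  # [2, 1, 0, 1]
```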
python scikit-learn linear-regression
asked Jan 12 at 23:18 by Dan Scally
1 Answer
You should always consider normalizing your target to some predefined range; otherwise the loss will be of very high magnitude (and, with gradient-based solvers, the gradients can explode), and it becomes hard for the model to fit such a wide output range. Try transforming your target with StandardScaler, or with RobustScaler if there are significant outliers, and try again.
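A sketch of the suggestion above: sklearn's TransformedTargetRegressor fits on the scaled target and inverse-transforms predictions automatically (synthetic binary features here, since the original df3 isn't available):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary features and a wide-range SalePrice-like target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)
y = X @ np.array([10000.0, 20000.0, 5000.0, 15000.0, 8000.0]) + 50000.0

# The regressor sees a standardized target; score() is still R^2
# on the original target scale.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),
)
model.fit(X, y)
print(model.score(X, y))
```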
Thanks for the reply. I've passed the df through both StandardScaler and RobustScaler; this seems to have no effect on the results, and I'm still experiencing the occasional run-through where the model performs extremely poorly.
– Dan Scally, Jan 13 at 8:31
answered Jan 13 at 2:00 (edited Jan 13 at 6:14) by Sridhar Thiagarajan