LinearRegression with multiple binary features sometimes performs poorly

I have a dataset comprising a number of binary features which are the dummies (as in, pd.get_dummies()) of categorical features. SalePrice is my target variable.



[Screenshot: my dataset]



I'm literally just fitting a scikit-learn LinearRegression model to that data a thousand times to get an average score, and I'm getting a weird result. The relevant bit of my code looks like this:



import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# np.array() with no argument raises a TypeError; start from an empty array
scores = np.array([])

for i in range(1000):
    x3_train, x3_test, y3_train, y3_test = train_test_split(
        df3.drop('SalePrice', axis=1),
        df3.SalePrice,
        test_size=0.33
    )

    lr3 = LinearRegression()
    lr3.fit(x3_train, y3_train)

    # prepend this run's R^2 score
    scores = np.insert(scores, 0, lr3.score(x3_test, y3_test))

print(scores.mean())


Now, the weird result is that the average score is super poor, because every so often the model tanks completely, even though most of the time it performs "reasonably" (still terrible, but that's no surprise since it's incredibly basic and not tuned at all; I'm just comparing the effect of treating a set of features in different ways). For example, the first 30 runs generated these scores:



0     5.907010e-01
1     6.044523e-01
2     5.178049e-01
3     5.622240e-01
4     5.810432e-01
5     5.131722e-01
6     5.772946e-01
7     4.674152e-01
8     4.962015e-01
9     4.887872e-01
10    5.144772e-01
11    5.676829e-01
12    5.122566e-01
13    5.453985e-01
14    5.355022e-01
15    5.888459e-01
16    5.552912e-01
17    5.615658e-01
18    5.472429e-01
19    5.810185e-01
20    5.334900e-01
21    5.493619e-01
22    5.567195e-01
23    5.514374e-01
24    4.916478e-01
25    4.580718e-01
26    5.286095e-01
27    5.761865e-01
28    5.638573e-01
29   -1.809208e+24
Name: lr3, dtype: float64


I guess my question is: what is likely to be happening on that 30th run such that the model performs so poorly? I'm comparing this model to others that treat the data differently (e.g. simply encoding with .astype('category').cat.codes), and whilst there are relatively minor variations in the "usual" range of scores (they're all roughly 0.44 - 0.63), those other models don't suffer this occasional complete tanking.
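For concreteness, a minimal sketch of the two encodings being compared; df_raw and the column name 'Neighborhood' are hypothetical stand-ins, since only the already-encoded frame appears above:

import pandas as pd

# One-hot / dummy encoding (the approach used in this question):
dummies = pd.get_dummies(df_raw['Neighborhood'], prefix='Neighborhood')

# Integer label encoding (the comparison approach mentioned above):
codes = df_raw['Neighborhood'].astype('category').cat.codes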










python scikit-learn linear-regression

asked Jan 12 at 23:18
Dan Scally

1 Answer

          You should always consider normalizing your output to some predefined range, otherwise there is a possibility of the gradients exploding as the loss will be of high magnitudes. It also becomes hard to output such a wide range. Try transforming your output using some StandardScaler, or a RobustScaler if there are significant outliers, and try again.
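A minimal sketch of the suggested transformation, reusing the variable names from the question's loop (the names are carried over as assumptions; StandardScaler and RobustScaler are standard scikit-learn preprocessors):

from sklearn.preprocessing import StandardScaler  # or RobustScaler for heavy outliers
from sklearn.linear_model import LinearRegression

# Scale the target using statistics from the training split only.
# StandardScaler expects 2-D input, hence the reshape/ravel round trip.
y_scaler = StandardScaler()
y3_train_s = y_scaler.fit_transform(y3_train.values.reshape(-1, 1)).ravel()
y3_test_s = y_scaler.transform(y3_test.values.reshape(-1, 1)).ravel()

lr3 = LinearRegression()
lr3.fit(x3_train, y3_train_s)
print(lr3.score(x3_test, y3_test_s))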






answered Jan 13 at 2:00, edited Jan 13 at 6:14
Sridhar Thiagarajan
• Thanks for the reply. I've passed the df through both StandardScaler and RobustScaler; this seems to have no effect on the results, and I'm still experiencing the occasional run-through where the model performs extremely poorly. – Dan Scally, Jan 13 at 8:31
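One pattern consistent with these symptoms: with many dummy columns, an unlucky train/test split can leave a rare category's column all zeros (or with a single 1) in the training data, making the design matrix rank-deficient or ill-conditioned; the OLS coefficients can then blow up on the few test rows where that dummy is 1. A minimal diagnostic sketch, reusing the variable names from the question (this check is an editorial suggestion, not part of the original post):

import numpy as np

# Count dummy columns that are (nearly) constant within this training split;
# such columns carry almost no information in-sample but can dominate out-of-sample.
column_sums = x3_train.sum(axis=0)
rare = (column_sums <= 1) | (column_sums >= len(x3_train) - 1)
print(int(rare.sum()), "near-constant dummy columns in this split")

# Condition number of the design matrix; very large values signal instability.
print(np.linalg.cond(x3_train.to_numpy(dtype=float)))

If rare categories turn out to be the cause, dropping them, merging them into an "other" level, or switching to Ridge regression would be natural things to try.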










