LinearRegression with multiple binary features sometimes performs poorly

I have a dataset comprising a number of binary features which are the dummies (as in, pd.get_dummies()) of categorical features. SalePrice is my target variable.



[Screenshot: my dataset]



I'm literally just fitting a scikit-learn LinearRegression model to that data a thousand times to get an average score, and I'm getting a weird result. The relevant bit of my code looks like this:



import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# np.array() with no argument raises a TypeError; start from an empty array
scores = np.array([])

for i in range(1000):
    x3_train, x3_test, y3_train, y3_test = train_test_split(
        df3.drop('SalePrice', axis=1),
        df3.SalePrice,
        test_size=0.33
    )

    lr3 = LinearRegression()
    lr3.fit(x3_train, y3_train)

    # prepend this run's R^2 score
    scores = np.insert(scores, 0, lr3.score(x3_test, y3_test))

print(scores.mean())


Now, the weird result is that the average score is super poor, because every so often the model tanks completely, even though most of the time it performs "reasonably" (still terrible, but that's no surprise since it's incredibly basic and not tuned at all; I'm just comparing the effect of treating a set of features in different ways). For example, the first 30 runs generated these scores:



0     5.907010e-01
1     6.044523e-01
2     5.178049e-01
3     5.622240e-01
4     5.810432e-01
5     5.131722e-01
6     5.772946e-01
7     4.674152e-01
8     4.962015e-01
9     4.887872e-01
10    5.144772e-01
11    5.676829e-01
12    5.122566e-01
13    5.453985e-01
14    5.355022e-01
15    5.888459e-01
16    5.552912e-01
17    5.615658e-01
18    5.472429e-01
19    5.810185e-01
20    5.334900e-01
21    5.493619e-01
22    5.567195e-01
23    5.514374e-01
24    4.916478e-01
25    4.580718e-01
26    5.286095e-01
27    5.761865e-01
28    5.638573e-01
29   -1.809208e+24
Name: lr3, dtype: float64


I guess my question is: what is likely to be happening on that 30th run such that the model performs so poorly? I'm comparing this model to others that treat the data differently (e.g. simply encoding with .astype('category').cat.codes), and whilst there are relatively minor variations in the "usual" range of scores (they're all roughly 0.44 - 0.63), those other models don't suffer this occasional complete tanking.
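For concreteness, a minimal sketch of the two encodings being compared; df_raw and the column name 'Neighborhood' are hypothetical stand-ins, since only the already-encoded frame appears above:

import pandas as pd

# One-hot / dummy encoding (the approach used in this question):
dummies = pd.get_dummies(df_raw['Neighborhood'], prefix='Neighborhood')

# Integer label encoding (the comparison approach mentioned above):
codes = df_raw['Neighborhood'].astype('category').cat.codes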










python scikit-learn linear-regression

asked Jan 12 at 23:18
Dan Scally

1 Answer

          You should always consider normalizing your output to some predefined range, otherwise there is a possibility of the gradients exploding as the loss will be of high magnitudes. It also becomes hard to output such a wide range. Try transforming your output using some StandardScaler, or a RobustScaler if there are significant outliers, and try again.
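A minimal sketch of the suggested transformation, reusing the variable names from the question's loop (the names are carried over as assumptions; StandardScaler and RobustScaler are standard scikit-learn preprocessors):

from sklearn.preprocessing import StandardScaler  # or RobustScaler for heavy outliers
from sklearn.linear_model import LinearRegression

# Scale the target using statistics from the training split only.
# StandardScaler expects 2-D input, hence the reshape/ravel round trip.
y_scaler = StandardScaler()
y3_train_s = y_scaler.fit_transform(y3_train.values.reshape(-1, 1)).ravel()
y3_test_s = y_scaler.transform(y3_test.values.reshape(-1, 1)).ravel()

lr3 = LinearRegression()
lr3.fit(x3_train, y3_train_s)
print(lr3.score(x3_test, y3_test_s))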






answered Jan 13 at 2:00, edited Jan 13 at 6:14
Sridhar Thiagarajan
• Thanks for the reply. I've passed the df through both StandardScaler and RobustScaler; this seems to have no effect on the results, and I'm still experiencing the occasional run-through where the model performs extremely poorly. – Dan Scally, Jan 13 at 8:31
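One pattern consistent with these symptoms: with many dummy columns, an unlucky train/test split can leave a rare category's column all zeros (or with a single 1) in the training data, making the design matrix rank-deficient or ill-conditioned; the OLS coefficients can then blow up on the few test rows where that dummy is 1. A minimal diagnostic sketch, reusing the variable names from the question (this check is an editorial suggestion, not part of the original post):

import numpy as np

# Count dummy columns that are (nearly) constant within this training split;
# such columns carry almost no information in-sample but can dominate out-of-sample.
column_sums = x3_train.sum(axis=0)
rare = (column_sums <= 1) | (column_sums >= len(x3_train) - 1)
print(int(rare.sum()), "near-constant dummy columns in this split")

# Condition number of the design matrix; very large values signal instability.
print(np.linalg.cond(x3_train.to_numpy(dtype=float)))

If rare categories turn out to be the cause, dropping them, merging them into an "other" level, or switching to Ridge regression would be natural things to try.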










