During a regression task, I am getting low R^2 values, but elementwise difference between test set and...












$begingroup$


I am doing a random forest regression on my dataset (which has about 15 input features and 1 target feature). I am getting a decently low R^2 of <1 for both the train and test sets (please do let me know if <1 is not a good-enough R^2 score).



import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# load dataset
df = pd.read_csv('Dataset.csv')

# split into input (X) and output (Y) variables
X = df.drop(['ID_COLUMN', 'TARGET_COLUMN'], axis=1)
Y = df.TARGET_COLUMN

# Split the data into 67% for training and 33% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, Y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message


print("Predicting Values:")
y_pred = regressor.predict(X_test)

print("Getting Model Performance...")

# Get regression scores
print("R^2 train = ", regressor.score(X_train, Y_train))
print("R^2 test = ", regressor.score(X_test, Y_test))


This outputs the following:



Predicting Values:
Getting Model Performance...
R^2 train = 0.9791000275450427
R^2 test = 0.8577464692386905


Then, I checked the difference between the actual target column values in the test dataset versus the predicted values, like so:



diff = []
for i in range(len(y_pred)):
    if Y_test.values[i]!=0: # a few values were 0 which was causing the corresponding diff value to become inf
        diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i]) # element-wise percentage error


I found that the majority of the element-wise differences were between 40% and 60%, and their mean was almost 50%!



np.mean(diff)
>>> 49.07580695857447


So, which one is correct? Is the regression score correct and my model is good for this data, or is the element-wise error I calculated correct and the model didn't do well for this data? If it's the latter, please advise on how to increase the prediction accuracy.
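A minimal sketch of how a high R^2 and a large mean percentage error can coexist, using made-up numbers rather than the dataset above (the helper names `percentage_errors` and `r_squared` are hypothetical). Percentage error divides by the true value, so small targets inflate it, while R^2 compares squared errors against the overall spread of the target.

```python
def percentage_errors(y_true, y_hat):
    """Element-wise absolute percentage errors, skipping zero targets."""
    return [100.0 * abs((p - t) / t) for t, p in zip(y_true, y_hat) if t != 0]


def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values (an assumption for illustration): two small targets, two large ones.
y_true = [1.0, 2.0, 50.0, 100.0]
y_hat = [1.5, 3.0, 49.0, 101.0]

errors = percentage_errors(y_true, y_hat)
mape = sum(errors) / len(errors)
r2 = r_squared(y_true, y_hat)
print("mean percentage error:", mape)
print("R^2:", r2)
```

In this toy case the two small targets each contribute a 50% error even though they are off by at most 1 unit, so the two metrics are not necessarily in conflict.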





I also checked the RMSE:



import math
rmse = math.sqrt(np.mean((np.array(Y_test) - y_pred)**2))
rmse
>>> 3.67328471827293


This seems quite high for the model to have done a good job, but please correct me if I'm wrong.
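One way to put the RMSE in context, sketched under the assumption that the RMSE and the test R^2 above come from the same test split: scikit-learn's R^2 is 1 - MSE / Var(y_test) (population variance), so the two numbers together imply the spread of the target. The variable names below are illustrative.

```python
import math

r2_test = 0.8577464692386905   # reported R^2 on the test set
rmse_test = 3.67328471827293   # reported RMSE on the test set

# R^2 = 1 - MSE / Var(y)  =>  std(y) = RMSE / sqrt(1 - R^2)
implied_std = rmse_test / math.sqrt(1.0 - r2_test)
rmse_to_std = rmse_test / implied_std
print(f"implied std of target: {implied_std:.2f}")
print(f"RMSE as a fraction of that std: {rmse_to_std:.2f}")
```

If the assumption holds, whether an RMSE of about 3.67 is "high" depends on the scale of TARGET_COLUMN rather than on the raw number alone.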



And I also checked the R^2 scores for different numbers of estimators:



import matplotlib.pyplot as plt
model = RandomForestRegressor(n_jobs=-1)
# Try different numbers of n_estimators
estimators = np.arange(10, 200, 10)
scores = []
for n in estimators:
    model.set_params(n_estimators=n)
    model.fit(X_train, Y_train)
    scores.append(model.score(X_test, Y_test))
plt.title("Effect of n_estimators")
plt.xlabel("n_estimator")
plt.ylabel("score")
plt.plot(estimators, scores)


[Figure: test-set score plotted against n_estimators; image not available]



Please advise.





I tried using linear regression first, and got a very high MSE (which is why I was trying out random forest):



from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, Y_train)  # same Y_train / Y_test names as in the split above
y_pred = lr.predict(X_test)

# The coefficients
print('Coefficients: \n', lr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Y_test, y_pred))
# R^2 (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Y_test, y_pred))


Coefficients:
[ 1.93829229e-01 -4.68738825e-01 2.01635420e-01 6.35902010e-01
6.57354434e-03 5.13180293e-03 2.84015810e-01 -1.31469084e-06
1.95335035e+00]
Mean squared error: 86.92
Variance score: 0.08
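For a like-for-like comparison with the random forest's RMSE, the linear-regression MSE can be put on the same scale by taking its square root. A small sketch using only the rounded numbers printed above, and assuming both models were scored on comparable test splits (variable names are illustrative):

```python
import math

lr_mse = 86.92               # linear regression MSE, as printed above
lr_r2 = 0.08                 # linear regression R^2, rounded to two decimals
rf_rmse = 3.67328471827293   # random forest RMSE reported earlier

lr_rmse = math.sqrt(lr_mse)
# R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2); approximate, since R^2 is rounded
implied_variance = lr_mse / (1.0 - lr_r2)
print(f"linear regression RMSE:  {lr_rmse:.2f}")
print(f"random forest RMSE:      {rf_rmse:.2f}")
print(f"implied target variance: {implied_variance:.1f}")
```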









$endgroup$
























      machine-learning python predictive-modeling regression random-forest






      asked Nov 29 '18 at 9:01









      Kristada673

          1 Answer
          $begingroup$

          This line looks wrong to me:



          diff.append(100*np.abs(y_pred[i]-Y_test.values[i])/Y_test.values[i])


          Shouldn't the abs be around the entire calculation?



          diff.append(100*np.abs((y_pred[i]-Y_test.values[i])/Y_test.values[i]))
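To see when the two versions differ, here is a small sketch with made-up scalar values (not from the question's data; the function names are hypothetical). The results only diverge when the true value is negative, because dividing a non-negative numerator by a negative number gives a negative "percentage".

```python
def pct_error_abs_numerator(y_true, y_hat):
    """Original form: abs() around the numerator only."""
    return 100 * abs(y_hat - y_true) / y_true


def pct_error_abs_ratio(y_true, y_hat):
    """Suggested form: abs() around the whole ratio."""
    return 100 * abs((y_hat - y_true) / y_true)


# Positive true value: both forms agree.
same_a = pct_error_abs_numerator(4.0, 2.0)
same_b = pct_error_abs_ratio(4.0, 2.0)

# Negative true value: the original form goes negative.
neg_a = pct_error_abs_numerator(-4.0, -2.0)
neg_b = pct_error_abs_ratio(-4.0, -2.0)
print(same_a, same_b, neg_a, neg_b)
```

So if the target contains negative values, the original form lets positive and negative terms cancel in the mean; if all targets are positive, the two forms give the same numbers.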


          That aside, the RMSE calculation looks correct and is on the scale of the errors, and the $R^2$ is great, so, all things being equal, I would lean towards looking for a mistake in how you assessed the errors. That is why I focused on your calculation.



          One other thought: have you checked for outliers? They can affect some measures much more drastically than others.
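As a rough illustration of that last point, with made-up residuals (four small errors and one outlier, not taken from the question's data), squared-error measures such as RMSE react much more strongly than the median absolute error:

```python
import math
import statistics

# Made-up residuals (an assumption for illustration): four small errors and one outlier.
residuals = [1.0, -1.0, 1.0, -1.0, 20.0]
abs_residuals = [abs(r) for r in residuals]

rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
mae = sum(abs_residuals) / len(abs_residuals)
median_ae = statistics.median(abs_residuals)
print(f"RMSE: {rmse:.2f}, MAE: {mae:.2f}, median AE: {median_ae:.2f}")
```

Comparing several such measures on the real residuals is one quick way to tell whether a few extreme points are driving the discrepancy.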






          $endgroup$














                answered Nov 29 '18 at 19:24









                Skiddles
