Not sure if over-fitting

I trained a model this way (a minimal sketch of the steps appears after the list). There are four classes, and the data are distributed evenly (the same number of samples per class).




  1. Scaled the features with min_max_scaler (scikit-learn's MinMaxScaler).

  2. Split the data with train_test_split(X, y, test_size=0.3, random_state=42, stratify=y).

  3. Ran GradientBoostingClassifier on the training data, once with n_estimators=32 and once with n_estimators=500.

  4. Called predict on the test data.

  5. Got accuracy = 0.94 with n_estimators=32 and accuracy = 1.0 with n_estimators=500. Precision and recall in the classification report are also 1.0 for every class.
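
A minimal, reconstructed sketch of these steps (not the original code; make_classification stands in for the real dataset, which isn't shown):

    # Reconstructed sketch of the steps above; the dataset is a stand-in.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score, classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    # Hypothetical data: make_classification balances the four classes by default.
    X, y = make_classification(n_samples=2000, n_features=20,
                               n_informative=8, n_classes=4, random_state=0)

    # Step 1: scale. (Note: fitting the scaler before the split lets the
    # test set's min/max influence the training features.)
    X = MinMaxScaler().fit_transform(X)

    # Step 2: split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    # Steps 3-5: fit, predict, score.
    for n in (32, 500):
        clf = GradientBoostingClassifier(n_estimators=n).fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(n, accuracy_score(y_test, y_pred))
        print(classification_report(y_test, y_pred))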


Seems fishy, but I can't figure out why. What am I doing wrong?

machine-learning classification scikit-learn overfitting

asked Dec 9 '18 at 14:41 – M.F

  • Are the observations independent? – Michael M, Dec 9 '18 at 15:43

  • Have you tried cross-validation? Maybe your seed creates an unusually perfect split. – Skiddles, Dec 9 '18 at 17:11

  • Sorry to ask the obvious, but is your label being used in the inputs? – Skiddles, Dec 9 '18 at 17:13

  • @MichaelM Yes, each example is independent of the others. – M.F, Dec 10 '18 at 7:05

  • Since you have split your data into a training set and a test set, it would help to report both the training accuracy and the test accuracy. You may also want to rerun step 2 with different random_state values to see whether the results hold across splits (a sketch of this check follows). – user12075, Dec 23 '18 at 8:17
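
A minimal sketch of that last suggestion, assuming X and y as in the question (or the stand-in above): refit on several different splits and report train and test accuracy side by side.

    # Sketch of user12075's suggestion: vary the split and compare
    # train vs. test accuracy. Assumes X and y are already defined.
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    for seed in (0, 1, 7, 42, 123):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y)
        clf = GradientBoostingClassifier(n_estimators=500).fit(X_tr, y_tr)
        print(f"seed={seed}  train={clf.score(X_tr, y_tr):.3f}  "
              f"test={clf.score(X_te, y_te):.3f}")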

















1 Answer

Depending on your data you may be overfitting, but that isn't necessarily the definitive answer.



Gradient-boosted trees are a powerful algorithm and were state of the art for a while. If your features encode the target in a systematic way that you haven't uncovered yet, it's likely that, with 500 trees, the algorithm found a perfect solution. It's not unheard of.



On the other hand, I don't know much about your data. How many samples do you have? 100? 100,000? The former is much easier to model perfectly; the latter may also be perfectly predictable (albeit less likely) if the classes are well separated. The number of features, and how informative each one is, may also play a role.



As suggested in the comments, cross-validation may help you discover what's going on here; a minimal sketch follows. I highly suggest reading the paper I linked above for an example of rigorous CV: look carefully at what they did and use it as a model for your own CV setup.
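
A minimal sketch of that check, assuming X and y as in the question: if every fold scores near 1.0, the perfect accuracy is not an artifact of one lucky split.

    # Cross-validation sketch (assumes X and y are already defined).
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(GradientBoostingClassifier(n_estimators=500),
                             X, y, cv=5, scoring="accuracy")
    print(scores, scores.mean(), scores.std())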



You might also check the feature importances returned by your classifier (sketch below). If one feature is overwhelmingly important, that can indicate a close correlation between that feature and the target variable, which means the feature deserves a close look.
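
A minimal sketch, assuming clf is the fitted GradientBoostingClassifier: a single feature carrying most of the importance is a common sign of target leakage.

    # Feature-importance sketch (assumes clf is a fitted classifier).
    import numpy as np

    order = np.argsort(clf.feature_importances_)[::-1]
    for i in order[:10]:
        print(f"feature {i}: importance {clf.feature_importances_[i]:.3f}")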






answered Dec 23 '18 at 0:41 – Alex L