Not sure if over-fitting
I trained a model this way:

There are four classes, and the data is evenly distributed (the same number of labels in each class).

- Used MinMaxScaler
- Used train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
- Ran GradientBoostingClassifier on the training data, once with n_estimators=32 and once with n_estimators=500
- Used predict on the test data
- Got accuracy = 0.94 with n_estimators=32 and accuracy = 1.0 with n_estimators=500. Precision and recall from the classification report are also 1.0 for every class.

Seems fishy, but I can't figure out why... what am I doing wrong?
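For reference, a minimal sketch of the pipeline described above; the synthetic X and y are stand-ins, since the real data is not shown in the question:

```python
# Minimal reconstruction of the steps above. The synthetic X and y are
# placeholders for the real data, which is not shown in the question.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_classes=4, n_informative=6,
                           random_state=0)
X = MinMaxScaler().fit_transform(X)  # scale features to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

for n in (32, 500):
    clf = GradientBoostingClassifier(n_estimators=n).fit(X_train, y_train)
    print(f"n_estimators={n}, test accuracy={clf.score(X_test, y_test):.3f}")
    print(classification_report(y_test, clf.predict(X_test)))
```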
machine-learning classification scikit-learn overfitting
asked Dec 9 '18 at 14:41 by M.F
Comments:

– Michael M (Dec 9 '18 at 15:43): Are the observations independent?

– Skiddles (Dec 9 '18 at 17:11): Have you tried cross-validation? Maybe your seed creates an unusually perfect split.

– Skiddles (Dec 9 '18 at 17:13): Sorry to ask the obvious, but is your label being used in the inputs?

– M.F (Dec 10 '18 at 7:05): @MichaelM yes, each example is independent of the others.

– user12075 (Dec 23 '18 at 8:17): Since you have split your data into a training set and a test set, it would be helpful to report both the training accuracy and the test accuracy. You may also want to report results with a different train_test_split (vary your random_state) in step 2 to see whether your observations are consistent across splits.
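A minimal sketch of that last check, reusing the X and y from the question's sketch; the seed range and n_estimators=500 are illustrative choices, not from the thread:

```python
# Sketch of the suggested check: compare train vs. test accuracy across
# several random splits to see whether the perfect score is split-specific.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = GradientBoostingClassifier(n_estimators=500).fit(X_tr, y_tr)
    print(f"seed={seed}  train={clf.score(X_tr, y_tr):.3f}  "
          f"test={clf.score(X_te, y_te):.3f}")
```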
1 Answer
Depending on your data, you may be overfitting; however, that isn't necessarily the definitive answer.
Gradient-boosted trees are a powerful algorithm that performed at a state-of-the-art level for a while. If your data represent the target in some systematic way that you haven't uncovered yet, it is quite possible that with 500 trees the algorithm found a perfect solution. It's not unheard of.
On the other hand, I don't know much about your data. How many samples do you have? 100? 100,000? The former is much easier to model perfectly. The latter may also be predictable (albeit less likely) if the variance between classes is predictable. The number of features, and how informative each one is, may also play a role.
As suggested in the comments, cross-validation may help you discover what's going on here. I highly suggest reading the paper I linked above to see an example of rigorous CV; look carefully at what the authors did so you can model your own CV setup on it.
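For instance, a minimal cross-validation sketch, reusing the X and y from the question's sketch; the fold count and n_estimators=500 are illustrative choices:

```python
# Sketch: stratified 5-fold cross-validation of the same model, to see
# whether the perfect score survives across folds.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(n_estimators=500),
                         X, y, cv=cv)
print(f"accuracy per fold: {scores}, mean={scores.mean():.3f}")
```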
You might also consider checking the feature importances returned by your classifier. If one feature is overwhelmingly important, that can indicate a close correlation between that feature and the target variable, which should prompt you to take a close look at that feature.
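A quick way to inspect that, assuming the training split from the sketches above:

```python
# Sketch: rank features by the fitted model's impurity-based importances.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=500).fit(X_train, y_train)
for i in np.argsort(clf.feature_importances_)[::-1][:5]:
    print(f"feature {i}: importance {clf.feature_importances_[i]:.3f}")
```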
answered Dec 23 '18 at 0:41 by Alex L