Dealing with unbalanced error rate in confusion matrix
Here is the confusion matrix I got when playing with the Forest Cover Type Kaggle dataset: Link. In the matrix, lighter colors and higher numbers represent higher error rates, so as you can see, a lot of the misclassification happens between class 1 and class 0.
What methods can I use to reduce these two error rates? Some improvement has already come from combining two classifiers, Random Forest and Extra Trees. Would stacking help in this case?
The data can be found at https://www.kaggle.com/c/forest-cover-type-prediction/data
machine-learning classification confusion-matrix
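For reference, a row-normalized confusion matrix of the kind described above can be computed as follows; this is a minimal sketch on toy labels, not the actual competition data:

```python
# Sketch of a row-normalized confusion matrix: each row sums to 1, so
# higher off-diagonal numbers mean higher error rates between that pair
# of classes.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # toy labels
y_pred = np.array([0, 1, 1, 1, 0, 1, 2, 2])   # toy predictions

cm = confusion_matrix(y_true, y_pred)
cm_norm = cm / cm.sum(axis=1, keepdims=True)  # normalize each row
print(np.round(cm_norm, 2))
```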
The dataset is not skewed and every class has the same amount of training instances. – Chenxiong Yi, Dec 15 '17 at 7:13
asked Dec 13 '17 at 6:19 by Chenxiong Yi on Data Science Stack Exchange (edited Dec 13 '17 at 7:59)
1 Answer
Welcome to the site!
Ensembling can be tricky: when one of the models performs poorly, the accuracy of the ensemble can drop as well. For instance, suppose you use Random Forest (RF) and rpart for classification, where RF reaches 90% accuracy and rpart only 60%; a naive ensemble of the two can score lower than RF alone.
Coming to your scenario, you need to be careful when stacking: select models that individually perform reasonably well, and then stack them to improve accuracy.
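As an illustration, this kind of stacking can be sketched with scikit-learn's StackingClassifier. The base learners below match the two the question mentions, but the parameters and the synthetic data are assumptions, not the asker's actual setup:

```python
# Minimal stacking sketch: Random Forest and Extra Trees as base learners,
# logistic regression as the meta-learner, trained on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=100, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # the meta-learner trains on out-of-fold base predictions
)
stack.fit(X_tr, y_tr)
print(confusion_matrix(y_te, stack.predict(X_te)))  # inspect per-class errors
```

Checking the stacked model's confusion matrix against each base learner's is what tells you whether stacking actually reduced the class 0/1 confusion.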
Also look at the distribution of the classes: if they are imbalanced, rebalancing the training data can improve the model. Resampling techniques such as SMOTE or ROSE (both available as R packages) are commonly used for this.
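SMOTE and ROSE are R packages; as a dependency-light illustration of the same rebalancing idea, here is plain random oversampling of the minority class with scikit-learn's resample (toy data, for illustration only):

```python
# Rebalancing by random oversampling: duplicate minority-class rows
# (sampling with replacement) until both classes have equal counts.
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = np.array([0] * 90 + [1] * 10)          # 90/10 imbalance

X_min, X_maj = X[y == 1], X[y == 0]
X_up = resample(X_min, n_samples=len(X_maj), replace=True, random_state=0)

X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))
print(np.bincount(y_bal))                   # classes now balanced
```

SMOTE goes one step further than this by interpolating between minority-class neighbors instead of duplicating rows.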
Feature engineering, such as adding external factors or deriving new features, might also help improve your model's accuracy.
Do let me know if you have any additional questions.
Thank you for your answer! All classes in this dataset actually have the same size, so 0 and 1 are not minorities in this case. I feel that since classes 0 and 1 are pretty much the same in almost all features, it is hard to classify them correctly. Do you know any way to handle this scenario? – Chenxiong Yi, Dec 13 '17 at 7:27
So the data is normally distributed; what features do you have? – Toros91, Dec 13 '17 at 7:28
kaggle.com/c/forest-cover-type-prediction/data: you can see all the features here. By the way, since all classes have the same size, shouldn't the distribution be uniform? – Chenxiong Yi, Dec 13 '17 at 7:30
Can you explain the above statement with an example? – Toros91, Dec 13 '17 at 7:35
I just mean that no class has more training data than the others. Sorry for the confusion. – Chenxiong Yi, Dec 13 '17 at 7:37
answered Dec 13 '17 at 6:37 by Toros91