Dealing with unbalanced error rate in confusion matrix


























[Image: confusion matrix for the Forest Cover Type predictions]



Here is the confusion matrix I got while experimenting with the Forest Cover Type Kaggle dataset (link below).



In the matrix, lighter colors and higher numbers indicate higher error rates, so as you can see, most of the misclassification happens between classes 1 and 0.



I wonder what kind of methods I can use to reduce these two error rates, though some improvement has already been made by combining two classifiers, Random Forest and Extra Trees. Would stacking help in this case?



The data can be found at https://www.kaggle.com/c/forest-cover-type-prediction/data
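The question does not say which library produced the matrix, but assuming scikit-learn, a row-normalized confusion matrix (where each entry is a per-class rate, matching the description above) can be computed like this; the labels below are made up purely for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration; the question's actual predictions are not shown.
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 0, 0, 1, 1, 2, 2]

# normalize="true" scales each row to sum to 1, so off-diagonal
# entries are per-class error rates.
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)
```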


























  • The dataset is not skewed and every class has the same number of training instances. – Chenxiong Yi, Dec 15 '17 at 7:13















machine-learning classification confusion-matrix






asked Dec 13 '17 at 6:19 by Chenxiong Yi, edited Dec 13 '17 at 7:59





1 Answer

Welcome to the site!



Ensembling is tricky: when one of the models performs poorly, the accuracy of the ensemble can go down with it.



For instance, suppose you are using Random Forest (RF) and rpart for classification, and RF reaches 90% accuracy while rpart only reaches 60%. Ensembling these two models can pull the combined accuracy down.



Coming to your scenario, you need to be careful when stacking: pick models that all perform reasonably well, and stack those to improve accuracy.
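As a sketch of what stacking the two classifiers from the question could look like, assuming scikit-learn (the question's toolkit is not stated) and synthetic data standing in for the Kaggle set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Forest Cover Type data (3 classes)
X, y = make_classification(
    n_samples=400, n_classes=3, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base learners are the two models from the question; a logistic
# regression meta-learner combines their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
acc = accuracy_score(y_te, stack.predict(X_te))
print(f"stacked accuracy: {acc:.3f}")
```

Whether this beats the individual models depends on how complementary their errors are, which is the point of the caution above.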



How are the 0s and 1s distributed? If they are imbalanced, you need to rebalance them to improve the model's accuracy. To handle imbalanced data, packages such as SMOTE and ROSE are commonly used.
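SMOTE and ROSE are R packages; if the pipeline is in Python instead (an assumption, since the question does not say), the imbalanced-learn library plays the same role, or class weighting can serve as a lighter built-in alternative. A sketch using scikit-learn's `class_weight` on deliberately imbalanced toy data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Deliberately imbalanced toy data (roughly 90% class 0, 10% class 1)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" reweights samples inversely to class frequency,
# a built-in alternative to oversampling with SMOTE/ROSE.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
clf.fit(X, y)
print(np.bincount(y))
```

Note that, per the comments, the asker's classes are in fact balanced, so this addresses the general imbalance point rather than this specific dataset.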



Feature engineering, such as adding external factors or deriving new features, might also improve your model's accuracy.



Do let me know if you have any additional questions.






answered Dec 13 '17 at 6:37 – Toros91

  • Thank you for your answer! All classes in this dataset actually have the same size, so 0 and 1 are not minorities in this case. I feel that, since 0 and 1 are pretty much the same across almost all features, it is hard to classify them correctly. Do you know any way to handle this scenario? – Chenxiong Yi, Dec 13 '17 at 7:27

  • So the data is normally distributed; what features do you have? – Toros91, Dec 13 '17 at 7:28

  • kaggle.com/c/forest-cover-type-prediction/data lists all the features. By the way, since all classes have the same size, shouldn't the distribution be uniform? – Chenxiong Yi, Dec 13 '17 at 7:30

  • Can you explain the above statement with an example? – Toros91, Dec 13 '17 at 7:35

  • I just mean that no class has more training data than the others. Sorry for the confusion. – Chenxiong Yi, Dec 13 '17 at 7:37

















