How reliable are model performance reports?
$begingroup$


I have a conceptual question about estimating and reporting a classification model's performance. Say my model is trained over a range of depth values and each depth gives a different test error. We choose the model whose depth has the lowest test error rate and call it M1.

Now, if we want to report our model's performance on a hidden test set, would it be wise to say that M1 will perform equally well on this new test set, with the same test error rate?










machine-learning classification performance











$endgroup$












  • $begingroup$
    Say we choose a model with depth 3 and a test error of 12.5%. Now if we evaluate this model on another hidden test set, would it again give a test error of 12.5%?
    $endgroup$
    – Shekhar Tanwar
    11 hours ago










  • $begingroup$
    It depends: Is the distribution similar? Is the test set big enough? Is the network's prediction stable enough?
    $endgroup$
    – Martin Thoma
    10 hours ago










  • $begingroup$
    Thank you, this helps a lot.
    $endgroup$
    – Shekhar Tanwar
    9 hours ago
edited 9 mins ago by Martin Thoma

asked 11 hours ago by Shekhar Tanwar
1 Answer
$begingroup$

Whether the measured test error of a classification model is reliable, that is, whether the test error on an unknown set $T_1$ will be the same as on another unknown set $T_2$, is hard to answer. It depends on the following factors:




  • How many digits of the error are reported?

  • How many samples do $T_1$ and $T_2$ have? The more digits you report, the more samples you need. As a rule of thumb, make sure that the smallest change in the reported error corresponds to at least 3 samples changing. So if you use accuracy and report two decimal places (e.g. 12.34%), then 0.01% of the test set must amount to more than 3 samples: $3 < \frac{0.01}{100} \cdot |T_1| \Leftrightarrow 30000 < |T_1|$ (see the first sketch below).

  • The distributions must be similar. The simpler part is the distribution of classes; the harder part is whether the features look alike (see the second sketch below).
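To make the first two points concrete, here is a minimal Python sketch (the function names are my own, for illustration only) that computes the rule-of-thumb lower bound on the test-set size for a given reporting precision, and a normal-approximation confidence interval that shows how much an observed error rate can be expected to wobble between test sets of a given size:

```python
import math

def min_test_size(decimal_places, min_changed_samples=3):
    """Rule-of-thumb lower bound on the test-set size: |T| should exceed this
    so that one unit in the last reported decimal place of the percentage
    corresponds to at least `min_changed_samples` samples."""
    return min_changed_samples * 100 * 10 ** decimal_places

def wald_interval(error_rate, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for an error
    rate estimated from n independent test samples."""
    half_width = z * math.sqrt(error_rate * (1 - error_rate) / n)
    return max(0.0, error_rate - half_width), min(1.0, error_rate + half_width)

print(min_test_size(2))             # 30000, matching the bound 30000 < |T_1| above
print(wald_interval(0.125, 200))    # 12.5% error on 200 samples: roughly (0.08, 0.17)
print(wald_interval(0.125, 30000))  # 12.5% error on 30000 samples: roughly (0.121, 0.129)
```

With only 200 test samples, a measured 12.5% error is consistent with anything from roughly 8% to 17%, so expecting the exact same number on a second hidden test set would be optimistic; with 30000 samples the interval narrows to less than half a percentage point on either side.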


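For the third point, the class distribution is the easy part to check; a quick sketch along these lines (again with illustrative names only) flags pairs of test sets whose label frequencies differ noticeably. Comparing the feature distributions is considerably harder and is not attempted here:

```python
from collections import Counter

def class_distribution(labels):
    """Relative class frequencies for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def max_class_frequency_gap(labels_t1, labels_t2):
    """Largest absolute difference in class frequency between two test sets.
    A large gap warns that an error rate measured on one set may not carry
    over to the other."""
    d1, d2 = class_distribution(labels_t1), class_distribution(labels_t2)
    classes = set(d1) | set(d2)
    return max(abs(d1.get(c, 0.0) - d2.get(c, 0.0)) for c in classes)

t1 = ["cat"] * 50 + ["dog"] * 50        # balanced test set
t2 = ["cat"] * 80 + ["dog"] * 20        # heavily skewed test set
print(max_class_frequency_gap(t1, t2))  # 0.3 -> the two sets differ substantially
```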
For other forms of error analysis, you might want to look into my Master's thesis, Analysis and Optimization of Convolutional Neural Network Architectures.

















$endgroup$
edited 5 mins ago

answered 49 mins ago by Martin Thoma