How to sample a statistically uniform dataset












2












$begingroup$


I have a data set of with a distribution that looks something like this:



enter image description here



I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):



enter image description here



I know how to do this by taking the data, putting into separate "buckets"
(distribute the data into X buckets, taking up to Y samples from each bucket), but I was wondering if there is there an easier way.



P.S. the result doesn't have to be 100% accurate - a good approximation is enough.










share|improve this question











$endgroup$





This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.


This question has not received enough attention.
















  • $begingroup$
    The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
    $endgroup$
    – K3---rnc
    Nov 29 '16 at 16:52








  • 1




    $begingroup$
    I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
    $endgroup$
    – traveh
    Nov 30 '16 at 8:36
















2












$begingroup$


I have a data set of with a distribution that looks something like this:



enter image description here



I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):



enter image description here



I know how to do this by taking the data, putting into separate "buckets"
(distribute the data into X buckets, taking up to Y samples from each bucket), but I was wondering if there is there an easier way.



P.S. the result doesn't have to be 100% accurate - a good approximation is enough.










share|improve this question











$endgroup$





This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.


This question has not received enough attention.
















  • $begingroup$
    The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
    $endgroup$
    – K3---rnc
    Nov 29 '16 at 16:52








  • 1




    $begingroup$
    I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
    $endgroup$
    – traveh
    Nov 30 '16 at 8:36














2












2








2


1



$begingroup$


I have a data set of with a distribution that looks something like this:



enter image description here



I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):



enter image description here



I know how to do this by taking the data, putting into separate "buckets"
(distribute the data into X buckets, taking up to Y samples from each bucket), but I was wondering if there is there an easier way.



P.S. the result doesn't have to be 100% accurate - a good approximation is enough.










share|improve this question











$endgroup$




I have a data set of with a distribution that looks something like this:



enter image description here



I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):



enter image description here



I know how to do this by taking the data, putting into separate "buckets"
(distribute the data into X buckets, taking up to Y samples from each bucket), but I was wondering if there is there an easier way.



P.S. the result doesn't have to be 100% accurate - a good approximation is enough.







dataset






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited May 13 '17 at 11:31









VividD

564517




564517










asked Nov 29 '16 at 10:09









travehtraveh

1112




1112






This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.


This question has not received enough attention.








This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.


This question has not received enough attention.














  • $begingroup$
    The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
    $endgroup$
    – K3---rnc
    Nov 29 '16 at 16:52








  • 1




    $begingroup$
    I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
    $endgroup$
    – traveh
    Nov 30 '16 at 8:36


















  • $begingroup$
    The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
    $endgroup$
    – K3---rnc
    Nov 29 '16 at 16:52








  • 1




    $begingroup$
    I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
    $endgroup$
    – traveh
    Nov 30 '16 at 8:36
















$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52






$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52






1




1




$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36




$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36










1 Answer
1






active

oldest

votes


















1












$begingroup$

There are a variety of sampling options:




  1. Sample directly from the data which perfectly model the empirical distribution.

  2. Fit a kernel density estimation (kde). Sample from the estimated kde function.

  3. Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.


You can create variations of the data or distributions:




  1. Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

  2. The histogram values could be changed to any shape.


Then the changed data or distributions could be sampled from.



The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.





share









$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "557"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f15419%2fhow-to-sample-a-statistically-uniform-dataset%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1












    $begingroup$

    There are a variety of sampling options:




    1. Sample directly from the data which perfectly model the empirical distribution.

    2. Fit a kernel density estimation (kde). Sample from the estimated kde function.

    3. Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.


    You can create variations of the data or distributions:




    1. Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

    2. The histogram values could be changed to any shape.


    Then the changed data or distributions could be sampled from.



    The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.





    share









    $endgroup$


















      1












      $begingroup$

      There are a variety of sampling options:




      1. Sample directly from the data which perfectly model the empirical distribution.

      2. Fit a kernel density estimation (kde). Sample from the estimated kde function.

      3. Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.


      You can create variations of the data or distributions:




      1. Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

      2. The histogram values could be changed to any shape.


      Then the changed data or distributions could be sampled from.



      The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.





      share









      $endgroup$
















        1












        1








        1





        $begingroup$

        There are a variety of sampling options:




        1. Sample directly from the data which perfectly model the empirical distribution.

        2. Fit a kernel density estimation (kde). Sample from the estimated kde function.

        3. Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.


        You can create variations of the data or distributions:




        1. Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

        2. The histogram values could be changed to any shape.


        Then the changed data or distributions could be sampled from.



        The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.





        share









        $endgroup$



        There are a variety of sampling options:




        1. Sample directly from the data which perfectly model the empirical distribution.

        2. Fit a kernel density estimation (kde). Sample from the estimated kde function.

        3. Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.


        You can create variations of the data or distributions:




        1. Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

        2. The histogram values could be changed to any shape.


        Then the changed data or distributions could be sampled from.



        The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.






        share











        share


        share










        answered 9 mins ago









        Brian SpieringBrian Spiering

        3,9131028




        3,9131028






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Data Science Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f15419%2fhow-to-sample-a-statistically-uniform-dataset%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Ponta tanko

            Tantalo (mitologio)

            Erzsébet Schaár