How to sample a statistically uniform dataset

I have a data set of with a distribution that looks something like this:

enter image description here

I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):

enter image description here

I know how to do this by taking the data, putting into separate "buckets"
(distribute the data into X buckets, taking up to Y samples from each bucket), but I was wondering if there is there an easier way.

P.S. the result doesn't have to be 100% accurate - a good approximation is enough.

edited May 13 '17 at 11:31

VividD

564517

asked Nov 29 '16 at 10:09

traveh

1112

This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.

This question has not received enough attention.

$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52

1

$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36

add a comment |

I have a data set of with a distribution that looks something like this:

enter image description here

I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):

enter image description here

P.S. the result doesn't have to be 100% accurate - a good approximation is enough.

edited May 13 '17 at 11:31

VividD

564517

asked Nov 29 '16 at 10:09

traveh

1112

This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.

This question has not received enough attention.

$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52

1

$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36

add a comment |

I have a data set of with a distribution that looks something like this:

enter image description here

I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):

enter image description here

P.S. the result doesn't have to be 100% accurate - a good approximation is enough.

edited May 13 '17 at 11:31

VividD

564517

asked Nov 29 '16 at 10:09

traveh

1112

I have a data set of with a distribution that looks something like this:

enter image description here

I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):

enter image description here

P.S. the result doesn't have to be 100% accurate - a good approximation is enough.

dataset

edited May 13 '17 at 11:31

VividD

564517

asked Nov 29 '16 at 10:09

traveh

1112

edited May 13 '17 at 11:31

VividD

564517

asked Nov 29 '16 at 10:09

traveh

1112

edited May 13 '17 at 11:31

VividD

564517

edited May 13 '17 at 11:31

VividD

564517

edited May 13 '17 at 11:31

VividD

564517

asked Nov 29 '16 at 10:09

traveh

1112

asked Nov 29 '16 at 10:09

traveh

1112

asked Nov 29 '16 at 10:09

traveh

1112

This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.

This question has not received enough attention.

This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.

This question has not received enough attention.

$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52

1

$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36

add a comment |

$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52

1

$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36

The buckets solution will be approximate and good enough. Why do you need a uniform distribution?

– K3---rnc
Nov 29 '16 at 16:52

I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.

– traveh
Nov 30 '16 at 8:36

add a comment |

1 Answer
1

active

oldest

votes

There are a variety of sampling options:

Sample directly from the data which perfectly model the empirical distribution.

Fit a kernel density estimation (kde). Sample from the estimated kde function.

Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.

You can create variations of the data or distributions:

Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

The histogram values could be changed to any shape.

Then the changed data or distributions could be sampled from.

The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.

answered 9 mins ago

Brian Spiering

3,9131028

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f15419%2fhow-to-sample-a-statistically-uniform-dataset%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

There are a variety of sampling options:

Sample directly from the data which perfectly model the empirical distribution.

Fit a kernel density estimation (kde). Sample from the estimated kde function.

Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.

You can create variations of the data or distributions:

Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

The histogram values could be changed to any shape.

Then the changed data or distributions could be sampled from.

The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.

answered 9 mins ago

Brian Spiering

3,9131028

add a comment |

There are a variety of sampling options:

Sample directly from the data which perfectly model the empirical distribution.

Fit a kernel density estimation (kde). Sample from the estimated kde function.

Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.

You can create variations of the data or distributions:

Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

The histogram values could be changed to any shape.

Then the changed data or distributions could be sampled from.

The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.

answered 9 mins ago

Brian Spiering

3,9131028

add a comment |

There are a variety of sampling options:

Sample directly from the data which perfectly model the empirical distribution.

Fit a kernel density estimation (kde). Sample from the estimated kde function.

Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.

You can create variations of the data or distributions:

Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

The histogram values could be changed to any shape.

Then the changed data or distributions could be sampled from.

The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.

answered 9 mins ago

Brian Spiering

3,9131028

There are a variety of sampling options:

Sample directly from the data which perfectly model the empirical distribution.

Fit a kernel density estimation (kde). Sample from the estimated kde function.

Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.

You can create variations of the data or distributions:

Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.

The histogram values could be changed to any shape.

Then the changed data or distributions could be sampled from.

The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.

answered 9 mins ago

Brian Spiering

3,9131028

answered 9 mins ago

Brian Spiering

3,9131028

answered 9 mins ago

Brian Spiering

3,9131028

answered 9 mins ago

Brian Spiering

3,9131028

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Data Science Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Gfyuki