How to sample a statistically uniform dataset
$begingroup$
I have a data set of with a distribution that looks something like this:
I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):
I know how to do this by taking the data, putting into separate "buckets"
(distribute the data into X buckets, taking up to Y samples from each bucket), but I was wondering if there is there an easier way.
P.S. the result doesn't have to be 100% accurate - a good approximation is enough.
dataset
$endgroup$
This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.
This question has not received enough attention.
add a comment |
$begingroup$
I have a data set of with a distribution that looks something like this:
I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):
I know how to do this by taking the data, putting into separate "buckets"
(distribute the data into X buckets, taking up to Y samples from each bucket), but I was wondering if there is there an easier way.
P.S. the result doesn't have to be 100% accurate - a good approximation is enough.
dataset
$endgroup$
This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.
This question has not received enough attention.
$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52
1
$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36
add a comment |
$begingroup$
I have a data set of with a distribution that looks something like this:
I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):
I know how to do this by taking the data, putting into separate "buckets"
(distribute the data into X buckets, taking up to Y samples from each bucket), but I was wondering if there is there an easier way.
P.S. the result doesn't have to be 100% accurate - a good approximation is enough.
dataset
$endgroup$
I have a data set of with a distribution that looks something like this:
I need to take random sample data from the set so that distribution will be more even. Something like this (take the data in the green area):
I know how to do this by taking the data, putting into separate "buckets"
(distribute the data into X buckets, taking up to Y samples from each bucket), but I was wondering if there is there an easier way.
P.S. the result doesn't have to be 100% accurate - a good approximation is enough.
dataset
dataset
edited May 13 '17 at 11:31
VividD
564517
564517
asked Nov 29 '16 at 10:09
travehtraveh
1112
1112
This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.
This question has not received enough attention.
This question has an open bounty worth +50
reputation from davidparks21 ending in 7 days.
This question has not received enough attention.
$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52
1
$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36
add a comment |
$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52
1
$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36
$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52
$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52
1
1
$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36
$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36
add a comment |
1 Answer
1
active
oldest
votes
$begingroup$
There are a variety of sampling options:
- Sample directly from the data which perfectly model the empirical distribution.
- Fit a kernel density estimation (kde). Sample from the estimated kde function.
- Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.
You can create variations of the data or distributions:
- Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.
- The histogram values could be changed to any shape.
Then the changed data or distributions could be sampled from.
The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f15419%2fhow-to-sample-a-statistically-uniform-dataset%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
There are a variety of sampling options:
- Sample directly from the data which perfectly model the empirical distribution.
- Fit a kernel density estimation (kde). Sample from the estimated kde function.
- Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.
You can create variations of the data or distributions:
- Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.
- The histogram values could be changed to any shape.
Then the changed data or distributions could be sampled from.
The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.
$endgroup$
add a comment |
$begingroup$
There are a variety of sampling options:
- Sample directly from the data which perfectly model the empirical distribution.
- Fit a kernel density estimation (kde). Sample from the estimated kde function.
- Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.
You can create variations of the data or distributions:
- Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.
- The histogram values could be changed to any shape.
Then the changed data or distributions could be sampled from.
The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.
$endgroup$
add a comment |
$begingroup$
There are a variety of sampling options:
- Sample directly from the data which perfectly model the empirical distribution.
- Fit a kernel density estimation (kde). Sample from the estimated kde function.
- Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.
You can create variations of the data or distributions:
- Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.
- The histogram values could be changed to any shape.
Then the changed data or distributions could be sampled from.
The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.
$endgroup$
There are a variety of sampling options:
- Sample directly from the data which perfectly model the empirical distribution.
- Fit a kernel density estimation (kde). Sample from the estimated kde function.
- Create a histogram of the data, aka bin the data. Then treat each histogram as a probability mass function (pmf). Sample from bins proportional to their frequency.
You can create variations of the data or distributions:
- Apply a transformation to the data. For example, a log transformation will transform skewed distribution to be approximately normal.
- The histogram values could be changed to any shape.
Then the changed data or distributions could be sampled from.
The most extreme option would define a uniform distribution across data domain and sample from that distribution. The green box is a uniform distribution.
answered 9 mins ago
Brian SpieringBrian Spiering
3,9131028
3,9131028
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f15419%2fhow-to-sample-a-statistically-uniform-dataset%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
The buckets solution will be approximate and good enough. Why do you need a uniform distribution?
$endgroup$
– K3---rnc
Nov 29 '16 at 16:52
1
$begingroup$
I implemented the buckets solution it's huge amounts of data and the sorting into buckets takes a long time. Also, I'm curious if there's a more "mathematical" solution.
$endgroup$
– traveh
Nov 30 '16 at 8:36