Different approaches to creating the test set
I came across different approaches to creating a test set. Theoretically, it's quite simple: just pick some instances at random, typically 20% of the dataset, and set them aside. Below are the approaches.
The naive way of creating the test set is:
import numpy as np

def split_train_test(data, test_set_ratio):
    # create shuffled indices
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_set_ratio)
    test_set_indices = shuffled_indices[:test_set_size]
    train_set_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_set_indices], data.iloc[test_set_indices]
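For example, on a tiny made-up DataFrame (not from the book), this hands back 80% of the rows as training data and 20% as test data:

import pandas as pd

data = pd.DataFrame({"median_income": [1.5, 3.0, 2.2, 5.1, 4.4]})
train_set, test_set = split_train_test(data, 0.2)
print(len(train_set), len(test_set))   # 4 1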
The above splitting mechanism works, but if the program is run again and again, it will generate a different split each time. Over time, the machine learning algorithm will get to see all the examples. The solutions suggested by the author of the book to fix this problem are:
- Save the test set on the first run and then load it in subsequent runs
- Set the random number generator's seed (np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices
But both of the above solutions break when we fetch the next updated dataset. I am still not clear about this statement.
Can someone give me an intuition for how the above two solutions break when we fetch the next updated dataset?
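To make the scenario concrete, here is a minimal sketch of what I mean by an updated dataset (my own toy example, assuming the update simply appends new rows to the same data):

import numpy as np
import pandas as pd

# First run on the original data, with a fixed seed.
np.random.seed(42)
original = pd.DataFrame({"value": range(1000)})
_, test_v1 = split_train_test(original, 0.2)

# "Next updated dataset": the same 1000 rows plus 100 new ones appended.
np.random.seed(42)
updated = pd.DataFrame({"value": range(1100)})
_, test_v2 = split_train_test(updated, 0.2)

# Does the fixed seed keep the same rows held out after the update?
still_held_out = set(test_v1.index) & set(test_v2.index)
print(len(test_v1), len(still_held_out))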
Then the author came up with another, more reliable approach to creating the test set.
def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
Approach #1
import hashlib

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio
Approach #2
from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
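For reference, both variants are used the same way: you pass a stable identifier column to split_train_test_by_id. Here I simply turn the row index into such a column (my own toy example, not the book's housing data):

import pandas as pd

data = pd.DataFrame({"median_income": [1.5, 3.0, 2.2, 5.1, 4.4]})
data_with_id = data.reset_index()   # adds an "index" column to use as the id
train_set, test_set = split_train_test_by_id(data_with_id, 0.2, "index")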
In approaches #1 and #2, why are we making use of crc32, 0xffffffff, and bytearray?
Just out of curiosity, I passed different values for the identifier variable into hash(np.int64(identifier)).digest() and got different results each time.
Is there any intuition behind these results?
machine-learning python preprocessing numpy
1 Answer
It gets a little complicated; I've attached links at the end of the answer that explain it further as well.
def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio
The hash(np.int64(identifier)).digest() part returns the hash value (digest) of the identifier, which we first cast to an 8-byte integer with np.int64().
bytearray turns that digest into an array of bytes (16 of them for MD5), and [-1] picks the last byte. This byte is a number between 0 and 255 (one byte is 11111111 in binary, i.e. at most 255 in decimal).
Assuming our test_ratio is 0.20, the threshold is 256 * 0.20 = 51.2, so any row whose last hash byte is 51 or below goes into the test set. Because that byte is effectively uniform over 0-255 and depends only on the identifier, roughly 20% of the rows end up in the test set, and a given row always lands on the same side of the split.
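A quick numeric check of that explanation (my own sketch, not from the book; the identifier values are arbitrary). As I understand it, the & 0xffffffff mask in the crc32 variant just forces an unsigned 32-bit value (zlib.crc32 could return a signed result on Python 2), so the comparison against test_ratio * 2**32 keeps roughly test_ratio of all identifiers in the same way:

import hashlib
import numpy as np
from zlib import crc32

test_ratio = 0.20
for identifier in (0, 1, 42, 999999):
    last_byte = bytearray(hashlib.md5(np.int64(identifier)).digest())[-1]
    in_test_md5 = last_byte < 256 * test_ratio                     # i.e. last_byte <= 51
    in_test_crc = crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
    print(identifier, last_byte, in_test_md5, in_test_crc)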
The rest is explained well in these links:
https://stackoverflow.com/questions/50646890/how-does-the-crc32-function-work-when-using-sampling-data
https://github.com/ageron/handson-ml/issues/71
https://docs.python.org/3/library/hashlib.html