Different approaches of creating the test set

I came across different approaches to creating a test set. In theory it is quite simple: pick some instances at random, typically 20% of the dataset, and set them aside. The approaches are below.

The naive way of creating the test set is

import numpy as np

def split_train_test(data, test_set_ratio):
    # shuffle the row positions, then take the first test_set_ratio share as the test set
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_set_ratio)
    test_set_indices = shuffled_indices[:test_set_size]
    train_set_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_set_indices], data.iloc[test_set_indices]

The above splitting mechanism works, but if the program is run again and again, it generates a different test set each time. Over time, the machine learning algorithm will get to see all the examples. The solutions suggested by the author of the book were:

Save the test set on the first run and then load it in subsequent runs.

Set the random number generator's seed (np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices.

But both of these solutions break when we fetch an updated dataset. This statement is still not clear to me.

Can someone give me an intuition for how the two solutions above break when we fetch an updated dataset?

The author then came up with a more reliable approach to creating the test set:

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

Approach #1

import hashlib

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio

Approach #2

from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

In approaches #1 and #2, why are we making use of crc32, 0xffffffff, and bytearray?

Just out of curiosity, I passed different values for the identifier variable into hash_function(np.int64(identifier)).digest() and got different results.

Is there any intuition behind these results?

asked Jun 13 '18 at 9:28

James K J

machine-learning python preprocessing numpy


1 Answer

It gets a little complicated, so I've also attached links at the end of the answer that explain it.

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio

The hash(np.int64(identifier)).digest() part returns the hash value of the identifier (which we cast to an 8-byte integer with np.int64()).

bytearray converts the hash value into an array of bytes. The [-1] takes the last byte of that array. This byte is a number between 0 and 255 (a byte with all bits set is 11111111 in binary, or 255 in decimal).
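To make the digest and "last byte" steps concrete, here is a minimal sketch. The helper name last_digest_byte is made up for illustration, and I call .tobytes() to make the 8 raw bytes explicit (an assumption on my part; the original code passes the np.int64 scalar directly). Different identifiers give unrelated-looking bytes because a hash scrambles its input, but the same identifier always gives the same byte.

```python
import hashlib
import numpy as np

def last_digest_byte(identifier):
    """Return the last byte (0-255) of the MD5 digest of an 8-byte integer id.

    Hypothetical helper for illustration only.
    """
    raw = np.int64(identifier).tobytes()   # the 8 raw bytes of the id
    digest = hashlib.md5(raw).digest()     # an MD5 digest is always 16 bytes long
    return digest[-1]                      # indexing bytes yields an int in Python 3

# Different ids produce different-looking bytes; rerunning gives the same values.
for identifier in (1, 2, 3):
    print(identifier, last_digest_byte(identifier))
```

In Python 3, indexing a bytes object already returns an integer, so the bytearray wrapper in the original mainly matters for Python 2 compatibility.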

Assuming our test_ratio is 0.20, the byte must be less than 256 * 0.20 = 51.2, which for an integer byte means less than or equal to 51. If it is, the row/instance is added to the test set.

The rest is explained well in these links:
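As for why the seed-based solution breaks on an updated dataset while the hash-based one does not, the sketch below may help. It assumes made-up integer ids 0..999 for the original data and 0..1099 after an update, and the names is_in_test_set and seeded_test_ids are hypothetical. With a hash, each id decides its own membership, so old rows never move. With a seeded shuffle, the permutation depends on the dataset length, so the same seed picks a different set of rows once rows are added.

```python
import hashlib
import numpy as np

def is_in_test_set(identifier, test_ratio):
    # Same idea as the check above: compare the last MD5 byte against the threshold.
    raw = np.int64(identifier).tobytes()
    return hashlib.md5(raw).digest()[-1] < 256 * test_ratio

def seeded_test_ids(n_rows, test_ratio, seed=42):
    # Seeded shuffle of row positions, as in the naive split.
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_rows)
    return {int(i) for i in shuffled[: int(n_rows * test_ratio)]}

old_ids = range(1000)   # hypothetical ids in the original dataset
new_ids = range(1100)   # the same ids plus 100 newly fetched rows

old_test = {i for i in old_ids if is_in_test_set(i, 0.2)}
new_test = {i for i in new_ids if is_in_test_set(i, 0.2)}

# Hash-based: every old row keeps its assignment after the update.
stable = old_test == {i for i in new_test if i < 1000}

# Seed-based: rows that were in the old test set but are not in the new one.
moved = seeded_test_ids(1000, 0.2) - seeded_test_ids(1100, 0.2)

print("hash split stable:", stable)
print("share in hash test set:", len(old_test) / 1000)
print("rows that left the seeded test set:", len(moved))
```

Saving the test set on the first run has a related problem: the saved file says nothing about which of the newly fetched rows should go to the test set.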

https://stackoverflow.com/questions/50646890/how-does-the-crc32-function-work-when-using-sampling-data

https://github.com/ageron/handson-ml/issues/71

https://docs.python.org/3/library/hashlib.html
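For the crc32 variant in the question, a minimal sketch along the same lines (the name is_in_test_set_crc32 is hypothetical, and .tobytes() is again my addition to make the bytes explicit). crc32 returns a 32-bit checksum, so the threshold is test_ratio * 2**32 instead of 256 * test_ratio. In Python 3 the result is already unsigned; the & 0xffffffff mask is there so that Python 2, where crc32 could return a negative number, gives the same value.

```python
from zlib import crc32
import numpy as np

def is_in_test_set_crc32(identifier, test_ratio):
    # 32-bit checksum of the id's 8 raw bytes, masked to an unsigned value
    checksum = crc32(np.int64(identifier).tobytes()) & 0xffffffff
    # The checksum lies in [0, 2**32), so this keeps roughly test_ratio of the ids
    return checksum < test_ratio * 2**32

share = sum(is_in_test_set_crc32(i, 0.2) for i in range(10000)) / 10000
print("share of ids in the test set:", share)
```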

edited Dec 26 '18 at 6:24

answered Dec 26 '18 at 6:18

Abhi
