Data splitting for a binary classification model

I'm trying to build a binary classification model that predicts who will buy a product and who won't. I've heard that splitting a dataset into two subsets is a common step when preparing input data.

[ ================ Training Data 80% ================= ] [ ==== Test Set 20% ==== ]

Is it just mindlessly splitting the dataset by some proportion, like above? Is it that simple?

Imagine I have this simple dataset below.

UserId,UserName,AppId,Purchased
1,Lianne,1,1
1,Lianne,2,1
1,Lianne,3,1
1,Lianne,4,1
1,Lianne,5,1
1,Lianne,6,0
1,Lianne,7,0
1,Lianne,8,0
1,Lianne,9,0
1,Lianne,10,0

Following the commonly recommended approach, I split it into two groups.

// Training Data Set
UserId,UserName,AppId,Purchased
1,Lianne,1,1
1,Lianne,2,1
1,Lianne,3,1
1,Lianne,4,1
1,Lianne,5,1
1,Lianne,6,0
1,Lianne,7,0
1,Lianne,8,0

// Test Set
UserId,UserName,AppId,Purchased
1,Lianne,9,0
1,Lianne,10,0

Would this work? It seemed not, and it turned out it actually didn't. The model was wrong when predicting AppIds 6, 7, 8, and 9: it thought user 1 would buy them with a slightly high probability. The metrics look like this:

TP : 5
FP : 4
FN : 1
Accuracy : 0.5
Auc : NaN
F1Score : NaN
Precision : 0
Negative Precision : 1
Negative Recall : 0.5

To build a proper model, what should my test dataset look like for this sample training data?

edited 17 hours ago

asked 18 hours ago

hina10531

1064

New contributor

machine-learning classification


1 Answer

My 2 cents:
The number of records in the data set used here is very small. If we look at the data set, we can see that the target variable is split exactly 50:50, which means the probability of either class is one half. It's like flipping a coin to get heads or tails.
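As a rough illustration of the class balance, the sketch below counts the labels in the question's sample data and in the sequential 80/20 split the question describes. The variable names are hypothetical and only the `Purchased` column is used; it is a quick check, not part of any particular library's workflow.

```python
from collections import Counter

# Purchased labels for AppId 1..10, copied from the question's sample data.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Sequential 80/20 split, as described in the question (assumed: first 8 rows train).
cut = int(len(labels) * 0.8)
train_labels, test_labels = labels[:cut], labels[cut:]

overall = Counter(labels)
train = Counter(train_labels)
test = Counter(test_labels)

print("overall:", dict(overall))
print("train:  ", dict(train))
print("test:   ", dict(test))
```

With this split the test set holds only rows where `Purchased` is 0, so any metric that needs both classes in the test data has nothing to compare against.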

The training set contains known outputs, and the model learns from this data so that it can generalize to other data later on. The independent variables (features) and the dependent variable (target) should be separated first, and then the data should be split into train and test sets before fitting.

You can also use scikit-learn's train_test_split for this:
from sklearn.model_selection import train_test_split
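A minimal sketch of how that function might be applied to the sample data, assuming `AppId` is the only feature and `Purchased` is the target. The `stratify` and `random_state` arguments are choices made for this illustration and are not part of the original answer; stratifying asks scikit-learn to keep the class proportions roughly equal in both subsets.

```python
from sklearn.model_selection import train_test_split

# Features (AppId) and target (Purchased) from the question's sample data.
app_ids = [[i] for i in range(1, 11)]
purchased = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Shuffled 80/20 split that preserves the 50:50 class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    app_ids,
    purchased,
    test_size=0.2,
    stratify=purchased,   # assumption: keep class proportions in train and test
    random_state=0,       # assumption: fixed seed so the split is repeatable
)

print("train labels:", y_train)
print("test labels: ", y_test)
```

Unlike the sequential split in the question, the test set here contains one purchased and one non-purchased row, although two test rows are still far too few for reliable metrics.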

answered 17 hours ago

Sunil

794
