Data splitting for a binary classification model
$begingroup$
I'm trying to build a binary classification model that predicts who will buy a product and who won't. I've heard that splitting a dataset into two subsets is a common step when preparing input data.
[ ================ Training Data 80% ================= ] [ ==== Test Set 20% ==== ]
Is it just a matter of mechanically splitting the dataset by some proportion, as above? Is it that simple?
Imagine I have this simple dataset below.
UserId,UserName,AppId,Purchased
1,Lianne,1,1
1,Lianne,2,1
1,Lianne,3,1
1,Lianne,4,1
1,Lianne,5,1
1,Lianne,6,0
1,Lianne,7,0
1,Lianne,8,0
1,Lianne,9,0
1,Lianne,10,0
Following the commonly recommended approach, I split it into two groups.
// Training Data Set
UserId,UserName,AppId,Purchased
1,Lianne,1,1
1,Lianne,2,1
1,Lianne,3,1
1,Lianne,4,1
1,Lianne,5,1
1,Lianne,6,0
1,Lianne,7,0
1,Lianne,8,0
// Test Set
UserId,UserName,AppId,Purchased
1,Lianne,9,0
1,Lianne,10,0
Would this work? It didn't seem like it would, and it turned out that it actually didn't. The model's predictions were wrong for AppId 6, 7, 8, and 9: it thought user 1 would buy them with a somewhat high probability. The metrics look like this:
- TP : 5
- FP : 4
- FN : 1
- Accuracy : 0.5
- Auc : NaN
- F1Score : NaN
- Precision : 0
- Negative Precision : 1
- Negative Recall : 0.5
To build a proper model, what should my test dataset look like, given this sample training data?
machine-learning classification
$endgroup$
asked 18 hours ago by hina10531 (edited 17 hours ago)
1 Answer
$begingroup$
My 2 cents:
The number of records in this data set is very small. Looking at the data, the target variable is split exactly 50:50, which means the probability of a purchase is one half. It's like flipping a coin to get heads or tails.
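To make the 50:50 point concrete, here is a minimal sketch that counts the labels in the full data and in the manual 8/2 split from the question. It only uses the ten `Purchased` values shown there; the variable names are mine and purely illustrative.

```python
from collections import Counter

# Purchased labels from the question, in AppId order 1..10.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# The manual split from the question: first 8 rows train, last 2 rows test.
train_labels, test_labels = labels[:8], labels[8:]

full_counts = Counter(labels)
train_counts = Counter(train_labels)
test_counts = Counter(test_labels)

print("full :", dict(full_counts))
print("train:", dict(train_counts))
print("test :", dict(test_counts))
```

The full data is balanced, but the manual test set ends up with only non-purchases. That would be consistent with the AUC being reported as NaN, since AUC is undefined when only one class is present in the evaluated labels.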
The training set contains known outputs, and the model learns from this data so that it can generalize to other data later on. The dependent variable and the independent variables should be separated first, and then split into training and test sets before fitting.
You can use the scikit-learn library for this as well:
from sklearn.model_selection import train_test_split
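As a rough sketch of how that import could be used on the data from the question: the code below treats `AppId` as the only feature (my assumption, since `UserId` and `UserName` are constant in the sample) and passes `stratify` so that both classes show up on each side of the split. `test_size=0.2` mirrors the 80/20 diagram, and `random_state=0` is an arbitrary choice for repeatability.

```python
from sklearn.model_selection import train_test_split

# Independent variable: AppId (assumed to be the only usable feature here).
app_ids = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
# Dependent variable: Purchased.
purchased = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# stratify keeps the 50:50 class ratio in both the train and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    app_ids,
    purchased,
    test_size=0.2,
    stratify=purchased,
    random_state=0,
)

print("train labels:", y_train)
print("test labels :", y_test)
```

With ten rows this gives eight training rows and two test rows, one from each class. It does not fix the underlying problem that ten records are too few to learn from, but it does avoid a test set made of a single class.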
$endgroup$
answered 17 hours ago by Sunil