Including identifier in machine learning model as feature vs separate model for every identifier
$begingroup$
I am new to machine learning and am building a model to predict the number of customers at a branch for a specific hour, season, or other feature.
I know it is generally a bad idea to put an id (branch_id in my case) into a model, but the customer count here depends heavily on which branch it is, so I cannot exclude it.
I can think of two solutions, but I am not sure which one is right or what the best practice is.
- Create dummy variables for all branch ids (one-hot encoding, to avoid weighting one id more than another). Since I have 600 unique branch ids, my feature count will go up to 600 + rest_of_features.
- Learn a separate model for each branch (600 models). I am not sure this is the right approach, I am not very familiar with it, and it would be very time consuming.
I am looking for suggestions.
An example of the data is below:
+-----------+------+-----------+-----------+-------------------+
| branch_id | hour | feature_2 | feature_3 | Count of customer |
+-----------+------+-----------+-----------+-------------------+
|         1 |   12 | ..        | ..        |                19 |
|         1 |   01 | ..        | ..        |                25 |
|         2 |   23 | ..        | ..        |                14 |
|         2 |   01 | ..        | ..        |                 5 |
+-----------+------+-----------+-----------+-------------------+
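To make the one-hot option concrete, here is a minimal sketch in plain Python using only the rows shown in the table. The helper name one_hot_encode and the column naming scheme are assumptions for illustration, and feature_2 and feature_3 are left out because their values are elided above.

```python
# Rows copied from the example table; feature_2 / feature_3 are elided there
# and therefore omitted here.
rows = [
    {"branch_id": 1, "hour": 12, "count": 19},
    {"branch_id": 1, "hour": 1, "count": 25},
    {"branch_id": 2, "hour": 23, "count": 14},
    {"branch_id": 2, "hour": 1, "count": 5},
]


def one_hot_encode(rows, id_key="branch_id"):
    """Return (column_names, feature_vectors) with one 0/1 column per distinct id.

    Hypothetical helper for illustration only.
    """
    ids = sorted({r[id_key] for r in rows})
    position = {b: i for i, b in enumerate(ids)}
    columns = [f"{id_key}_{b}" for b in ids] + ["hour"]
    vectors = []
    for r in rows:
        indicator = [0] * len(ids)          # all zeros ...
        indicator[position[r[id_key]]] = 1  # ... except the column of this row's branch
        vectors.append(indicator + [r["hour"]])
    return columns, vectors


columns, X = one_hot_encode(rows)
print(columns)  # ['branch_id_1', 'branch_id_2', 'hour']
print(X[0])     # [1, 0, 12]
```

With 600 branches the indicator part would be 600 columns wide, which is the 600 + rest_of_features count mentioned above.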
machine-learning feature-selection
$endgroup$
asked 17 mins ago
mashraf
1 Answer
$begingroup$
In my opinion, including the id as a feature does not make sense, because the model will treat the id as a numeric value. That will hurt model performance, since there should be no connection between how large the id is and how many customers that branch has.
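A small sketch of the numeric-id problem, assuming a linear model and made-up per-branch averages (the values 20, 5, 20 are hypothetical, not taken from the question). A straight line fitted on the raw id can only express "count goes up or down with the id", so here it collapses to the overall mean, while one value per id recovers each branch.

```python
# Hypothetical average customer counts per branch id (not from the question's data).
branch_ids = [1, 2, 3]
avg_counts = [20.0, 5.0, 20.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


# Treating the id as a number: one slope shared by all branches.
intercept, slope = fit_line(branch_ids, avg_counts)
numeric_id_predictions = [intercept + slope * b for b in branch_ids]
print(numeric_id_predictions)  # [15.0, 15.0, 15.0]

# Treating the id as a category: one value per branch, no ordering implied.
per_id_predictions = dict(zip(branch_ids, avg_counts))
print(per_id_predictions[2])  # 5.0
```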
Option 2 can make sense if you have enough data for every branch.
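As a rough sketch of what option 2 could look like, the code below fits one trivial "model" per branch (just the mean count) and falls back to a global mean when a branch has too few rows. The function names, the min_rows threshold, and the use of a mean instead of a real regressor are all assumptions for illustration; the rows are the ones from the question's table.

```python
from collections import defaultdict
from statistics import mean

# Rows from the question's example table (elided features omitted).
rows = [
    {"branch_id": 1, "hour": 12, "count": 19},
    {"branch_id": 1, "hour": 1, "count": 25},
    {"branch_id": 2, "hour": 23, "count": 14},
    {"branch_id": 2, "hour": 1, "count": 5},
]


def fit_per_branch(rows, min_rows=2):
    """Fit one mean-count 'model' per branch that has at least min_rows rows.

    Hypothetical helper: a real setup would fit a proper regressor per branch.
    """
    counts_by_branch = defaultdict(list)
    for r in rows:
        counts_by_branch[r["branch_id"]].append(r["count"])
    global_mean = mean(r["count"] for r in rows)
    models = {
        branch: mean(counts)
        for branch, counts in counts_by_branch.items()
        if len(counts) >= min_rows  # skip branches without enough data
    }
    return models, global_mean


def predict(models, global_mean, branch_id):
    """Use the branch's own model if it exists, otherwise the global fallback."""
    return models.get(branch_id, global_mean)


models, global_mean = fit_per_branch(rows)
print(predict(models, global_mean, 2))   # 9.5
print(predict(models, global_mean, 99))  # 15.75 (unseen branch, global fallback)
```

The fallback is where the "enough data for every branch" condition shows up: branches below the threshold get no model of their own.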
My suggestion is to look closely at your features and try to find ones that can replace the branch id, for example the number of support desks in a branch, or the location of a branch as a categorical value. If you find enough features to describe the specifics of each branch, there is no need to include the ids or to train a separate model per branch.
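One way this could look in practice, as a sketch only: keep a small lookup of branch attributes and swap the id for those attributes before training. The branch_info table, its desks and location values, and the helper name are all hypothetical; only the rows come from the question.

```python
# Rows from the question's example table (elided features omitted).
rows = [
    {"branch_id": 1, "hour": 12, "count": 19},
    {"branch_id": 1, "hour": 1, "count": 25},
    {"branch_id": 2, "hour": 23, "count": 14},
    {"branch_id": 2, "hour": 1, "count": 5},
]

# Hypothetical branch attributes; the values are invented for illustration.
branch_info = {
    1: {"desks": 4, "location": "city_center"},
    2: {"desks": 2, "location": "suburb"},
}


def replace_id_with_attributes(rows, branch_info, id_key="branch_id"):
    """Drop the id column and attach the descriptive attributes of that branch."""
    described = []
    for r in rows:
        features = {k: v for k, v in r.items() if k != id_key}
        features.update(branch_info[r[id_key]])
        described.append(features)
    return described


described_rows = replace_id_with_attributes(rows, branch_info)
print(described_rows[0])
# {'hour': 12, 'count': 19, 'desks': 4, 'location': 'city_center'}
```

A model trained on such rows can also make a prediction for a branch it has never seen, as long as that branch's attributes are known.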
$endgroup$
answered 2 mins ago
Karen Danielyan