Including identifier in machine learning model as feature vs separate model for every identifier

I am new to machine learning, and I am building a model to predict the number of customers at a given branch for a specific hour, season, or other feature.

I know it is generally a bad idea to put an ID (branch_id in my case) into a model, but the customer count depends heavily on which branch it is, so I cannot exclude it.

I can think of two solutions, but I am not sure which one is right or what the best practice is.

1. Create dummy variables (one-hot encoding, to avoid weighting one ID more than another) for all branch IDs. Since I have 600 unique branch IDs, my feature count will go up to 600 + rest_of_features.

2. Train a separate model for each branch (600 models). I am not sure whether this is the right approach; I am also not very familiar with it, and it would be very time-consuming.

Any suggestions are welcome.

An example of the data is below:

    +-----------+------+-----------+-----------+-------------------+
    | branch_id | hour | feature_2 | feature_3 | Count of customer |
    +-----------+------+-----------+-----------+-------------------+
    |         1 |   12 |        .. |        .. |                19 |
    |         1 |   01 |        .. |        .. |                25 |
    |         2 |   23 |        .. |        .. |                14 |
    |         2 |   01 |        .. |        .. |                 5 |
    +-----------+------+-----------+-----------+-------------------+

asked 17 mins ago

mashraf

machine-learning feature-selection

1 Answer

In my opinion, including the raw ID as a feature does not make sense: the model will treat the ID as a numeric value, which will hurt performance, because there should be no relationship between how large an ID is and how many customers that branch has.
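
For comparison, here is a minimal sketch of the one-hot alternative mentioned in the question, which avoids the ordering problem of a raw numeric ID. It assumes pandas is available; the toy values mirror the question's table, and the column name `customer_count` and the prefix `branch` are illustrative choices, not part of the original data.

```python
import pandas as pd

# Toy data mirroring the question's table; values are illustrative only.
df = pd.DataFrame({
    "branch_id": [1, 1, 2, 2],
    "hour": [12, 1, 23, 1],
    "customer_count": [19, 25, 14, 5],
})

# Replace branch_id with one indicator column per branch, so that no
# branch is numerically "larger" than another.
encoded = pd.get_dummies(df, columns=["branch_id"], prefix="branch")
print(encoded.columns.tolist())
```

With 600 branches this would produce 600 indicator columns, which is the trade-off the question describes.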

Option 2 can make sense if you have enough data for every branch.
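
One rough way to check whether there is "enough data for every branch" is to count the rows per branch before committing to separate models. The sketch below assumes pandas; the threshold `MIN_ROWS` is a hypothetical placeholder, and a sensible value depends on your model and features.

```python
import pandas as pd

# Toy data; in practice this would be the full training table.
df = pd.DataFrame({
    "branch_id": [1, 1, 1, 2],
    "customer_count": [19, 25, 22, 14],
})

MIN_ROWS = 3  # hypothetical threshold, chosen only for illustration

# Number of training rows available for each branch.
rows_per_branch = df.groupby("branch_id").size()

# Branches that have at least MIN_ROWS rows and could get their own model.
enough_data = rows_per_branch[rows_per_branch >= MIN_ROWS].index.tolist()
too_little = rows_per_branch[rows_per_branch < MIN_ROWS].index.tolist()
print(enough_data, too_little)
```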

My suggestion is to look closely at your features and try to find ones that can replace the branch ID: for example, the number of support desks in a branch, or the location of the branch as a categorical value. If you find enough features that describe the specifics of each branch, there is no need to include the IDs or to train separate models.
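
A minimal sketch of this idea, assuming pandas and a separate table of branch attributes. The `branches` table, its columns `num_desks` and `location`, and all values are hypothetical and only illustrate the shape of the approach.

```python
import pandas as pd

# Hourly observations, as in the question (toy values).
visits = pd.DataFrame({
    "branch_id": [1, 1, 2, 2],
    "hour": [12, 1, 23, 1],
    "customer_count": [19, 25, 14, 5],
})

# Hypothetical per-branch attributes that describe what makes branches differ.
branches = pd.DataFrame({
    "branch_id": [1, 2],
    "num_desks": [8, 3],
    "location": ["downtown", "suburb"],
})

# Attach the branch attributes to each observation, then drop the ID itself.
features = (
    visits.merge(branches, on="branch_id", how="left", validate="many_to_one")
    .drop(columns="branch_id")
)

# Encode the categorical location; far fewer columns than one per branch.
features = pd.get_dummies(features, columns=["location"])
print(features.columns.tolist())
```

A model trained on such features can also make predictions for a branch it has never seen, as long as that branch's attributes are known.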

answered 2 mins ago

Karen Danielyan
