How to incorporate an attribute that only exists in some observations?
$begingroup$
In a binary classification problem, an event occurs for only some of my observations. I can add a 1/0 flag indicating whether the event occurred ("event_occurred" in the data below). However, my intuition is that the class is also related to the number of days since the event occurred, so I'd like to include that quantity in my model as well ("days_since_event").
Example Python data:
import pandas as pd

df = pd.DataFrame({
    'event_date': pd.Series(['2019-02-25', '', '2019-01-31', '', '2019-03-03']),
    'event_occurred': pd.Series([1, 0, 1, 0, 1]),
    'days_since_event': pd.Series([42, '', 67, '', 36]),
    'class': pd.Series([1, 2, 2, 1, 1]),
})
   event_date  event_occurred  days_since_event  class
0  2019-02-25               1                42      1
1                           0                        2
2  2019-01-31               1                67      2
3                           0                        1
4  2019-03-03               1                36      1
Is this a standard missing data problem or is there a way to better represent this data in a model-friendly format? Is this a situation where I can fill the missing observations with a global value and trust that the model will learn to ignore that value if "event_occurred" is 0?
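To make the "fill with a global value" idea concrete, here is a minimal sketch of what I have in mind. It assumes the blanks are stored as NaN and uses 0 as the fill value, which is an arbitrary choice on my part; the column name `days_since_event_filled` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Same toy data as above, with the blanks encoded as NaN (an assumption)
df = pd.DataFrame({
    "event_occurred": [1, 0, 1, 0, 1],
    "days_since_event": [42, np.nan, 67, np.nan, 36],
    "class": [1, 2, 2, 1, 1],
})

# Fill the gaps with a constant and keep the indicator next to it.
# With a fill value of 0, the filled column equals
# event_occurred * days_since_event, so in a linear model its term
# contributes nothing for rows where the event never occurred.
features = pd.DataFrame({
    "event_occurred": df["event_occurred"],
    "days_since_event_filled": df["days_since_event"].fillna(0),
})

print(features)
```

With this encoding, a linear model such as logistic regression would get one coefficient for "the event happened at all" and one for "how long ago", but whether the model really ends up ignoring the filled value is exactly what I am unsure about.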
feature-extraction feature-engineering missing-data
$endgroup$
$begingroup$
Do you have any particular model in mind? That might help answer your question. Most tree-based models, for example, would be able to handle this kind of situation without having to replace the missing values.
$endgroup$
– oW_
8 hours ago
$begingroup$
I've been using logistic regression but I wanted to try gradient-boosted decision trees via LightGBM too. It looks like LightGBM can handle missing values out of the box like you suggested. I'll give that a try, thanks!
$endgroup$
– Riebeckite
4 hours ago
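For a learner that treats NaN as missing, as LightGBM does by default, the blanks in the example data first have to become real NaN values, because the empty strings make the column an object column rather than a numeric one. A minimal sketch of that conversion using pandas only; the new column name `days_since_event_num` is hypothetical, and the model call itself is left out.

```python
import pandas as pd

# Column as given in the question: integers mixed with empty strings
df = pd.DataFrame({
    "event_occurred": [1, 0, 1, 0, 1],
    "days_since_event": pd.Series([42, "", 67, "", 36]),
})

# errors="coerce" turns anything that cannot be parsed as a number
# (here the empty strings) into NaN and yields a float column
df["days_since_event_num"] = pd.to_numeric(df["days_since_event"], errors="coerce")

print(df["days_since_event_num"].dtype)
print(df["days_since_event_num"].isna().sum())
```

The numeric column can then be passed to the model unchanged, with no fill value required.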
asked 9 hours ago by Riebeckite (new contributor), edited 9 hours ago