What kind of regression model should I do?
My research question is to examine the effect of "receiving attention" from other members of an online community on "sustained participation" on the website.
I decided to measure the "sustained participation" of each user by calculating the average time difference between the user's submissions. I calculated it in the following way:
I measured "attention" as the total number of comments each user received across all of his/her submissions. I also want to consider the total number of votes and the total number of views. I am not sure whether it is a good idea to add those to the model as independent variables as well.
My problem is with the dependent variable:
Some people participated only twice, on two successive days, so their average time between submissions is 1 day. Other people participated 100 times, and their average time between submissions is also 1 day. It is obvious that the second group, not the first, showed sustained participation.
So I need to account for the number of submissions in the model too, but I do not know whether there is a way to do that. How can I handle this problem?
Should I group the users and analyze each group separately? For example, users who participated fewer than 10 times in one group, users with 10-20 submissions in another group, and so on.
I would appreciate any help! My paper is due soon and I need some preliminary results.
regression research
edited Feb 11 '18 at 7:26 by Franck Dernoncourt
asked Jan 15 '17 at 2:04 by user27954
Quantification of the dependent variable will be more complicated than you state here, because the timing of the comments received is just as important as the two dimensions of participation you've already talked about.
– Paul, Jan 16 '17 at 13:31
3 Answers
One thing you can do about your 'participation' variable is to include the beginning and end of your data window. Let's say your data ranges from $start=$ 1/1/2016 to $end=$ 1/1/2017. Instead of only calculating the difference between the second and the first post, you'd also calculate the difference between the first post and 1/1/2016. (If some people joined the platform after 1/1/2016, you'd take the later of the join date and 1/1/2016 as their start.) You'd also calculate the difference between 1/1/2017 and the last post.
So, if someone had only two posts $p_1$ and $p_2$, you would get the differences $(p_1-start,\ p_2-p_1,\ end-p_2)$. The differences $p_1-start$ and $end-p_2$ would then typically be larger than the differences $p_1-start$ and $end-p_n$ for someone with $n \gg 2$ posts.
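As a rough illustration of the window-padding idea, here is a minimal Python sketch. The function names `padded_gaps` and `mean_gap`, the optional `joined` argument, and the example dates are assumptions made for this sketch, not part of the original answer.

```python
from datetime import date, timedelta

def padded_gaps(posts, start, end, joined=None):
    """Gaps in days between consecutive posts, padded with the window edges."""
    # If the user joined after the window opened, measure from the join date.
    window_start = max(start, joined) if joined is not None else start
    points = [window_start] + sorted(posts) + [end]
    return [(b - a).days for a, b in zip(points, points[1:])]

def mean_gap(posts, start, end, joined=None):
    """Average padded gap in days."""
    gaps = padded_gaps(posts, start, end, joined)
    return sum(gaps) / len(gaps)

start, end = date(2016, 1, 1), date(2017, 1, 1)

# Hypothetical users: two posts on successive days vs. 100 posts on successive days.
two_posts = [date(2016, 6, 1), date(2016, 6, 2)]
many_posts = [date(2016, 6, 1) + timedelta(days=i) for i in range(100)]

print(padded_gaps(two_posts, start, end))   # [152, 1, 213]
print(mean_gap(two_posts, start, end))      # 122.0
print(mean_gap(many_posts, start, end))     # 366 / 101, about 3.62
```

Because the padded gaps always add up to the window length, the mean padded gap is simply the window length divided by the number of posts plus one. The padded metric therefore separates the two users mainly by encoding the number of submissions, which may or may not be what you want from a "sustained participation" measure.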
answered Jan 15 '17 at 18:56 by oW_
I would segment the users on the site by their overall activity up to the point of the test, and train a model using these segments as a categorical variable (or train a different model for each segment).
My reasoning is that two users:
a) a very active user who was on a long vacation, and
b) a new user who has had one action (only on sign-up day)
might have the same sustained-participation metric if it is measured as a function of the time passed since the last action.
But we expect the community to react differently to their actions.
A model might look like:
attention = M(segment_type, time_since_last_activity)
segment_type = G(activity_signals_until_now)
where activity_signals_until_now may consist of:
- total actions
- time since first action
- average time between actions
M can be a simple regressor.
G could be supervised (if you have prior knowledge of what segments exist) or unsupervised, using some kind of clustering algorithm.
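The M/G structure above could be sketched as follows, under several assumptions: G is a simple rule on the total number of actions (the cut-offs 10 and 20 are illustrative only), M is one least-squares line per segment, and the rows are made-up numbers rather than real data.

```python
from collections import defaultdict

def segment_type(total_actions):
    """G: rule-based segmentation; the cut-offs 10 and 20 are illustrative."""
    if total_actions < 10:
        return "low"
    if total_actions <= 20:
        return "medium"
    return "high"

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return my - slope * mx, slope

def fit_per_segment(rows):
    """M: one simple regressor per segment.

    Each row is (total_actions, time_since_last_activity, attention).
    """
    groups = defaultdict(list)
    for total_actions, time_since_last, attention in rows:
        groups[segment_type(total_actions)].append((time_since_last, attention))
    return {seg: fit_line(*zip(*pts)) for seg, pts in groups.items()}

# Made-up rows, for illustration only.
rows = [
    (3, 1, 2.0), (5, 2, 1.0), (4, 3, 0.0),        # low-activity users
    (150, 1, 9.0), (120, 2, 8.0), (200, 3, 7.0),  # high-activity users
]
models = fit_per_segment(rows)
for seg, (intercept, slope) in models.items():
    print(seg, round(intercept, 2), round(slope, 2))
```

Fitting a separate line per segment corresponds to the "different model for each" variant. Using the segment as a categorical variable in one model would instead share the slope across segments and only shift the intercept.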
answered Jan 16 '17 at 12:59 by yoav_aaa
From a stats perspective, it sounds like you have a Poisson process, where the events are user submissions. So you might represent your dependent variable as the number of events in a unit of time (say, the number of submissions per week), and set up a Poisson regression or negative binomial regression. For the independent variable, you might try the number of comments received in the previous unit of time, i.e., you might see how well the number of comments received this week predicts the number of submissions next week.
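To make the lagged setup concrete, here is a minimal sketch of a Poisson regression with a log link, fitted by Newton's method in plain Python. The function names, the single-predictor form, and the weekly counts are all assumptions made for illustration; in practice you would more likely use a statistics library's GLM routine, and a negative binomial model would need additional handling of the dispersion parameter.

```python
import math

def lagged_pairs(comments, submissions):
    """Pair comments received in week t with submissions made in week t+1."""
    return list(zip(comments[:-1], submissions[1:]))

def fit_poisson(xs, ys, iters=50, tol=1e-10):
    """Fit log(E[y]) = a + b*x by Newton's method on the Poisson log-likelihood."""
    a, b = math.log(sum(ys) / len(ys)), 0.0  # start from the intercept-only fit
    for _ in range(iters):
        mu = [math.exp(a + b * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g_a = sum(y - m for y, m in zip(ys, mu))
        g_b = sum((y - m) * x for x, y, m in zip(xs, ys, mu))
        # Fisher information matrix (2x2), inverted in closed form.
        h_aa = sum(mu)
        h_ab = sum(m * x for x, m in zip(xs, mu))
        h_bb = sum(m * x * x for x, m in zip(xs, mu))
        det = h_aa * h_bb - h_ab * h_ab
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

# Made-up weekly counts for a single user, for illustration only.
comments_per_week = [0, 2, 1, 4, 3, 0, 5, 2]
submissions_per_week = [1, 1, 2, 1, 5, 3, 0, 6]

pairs = lagged_pairs(comments_per_week, submissions_per_week)
xs, ys = zip(*pairs)
a, b = fit_poisson(xs, ys)
print(f"multiplicative change in expected submissions per extra comment: {math.exp(b):.2f}")
```

With a log link, `exp(b)` is read as the factor by which the expected number of next-week submissions changes for each additional comment received this week.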
Note that there are likely to be trends over time (e.g., a new user doesn't make many submissions at first, but gradually becomes a regular user) and autocorrelation, especially if your time scale is too small. For example, in your example data, most days have 0 submissions, so autocorrelation will be high if your units are days or smaller. So consider using time series methods or selecting a time scale large enough to get low autocorrelation.
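One way to compare time scales is to compute the lag-1 autocorrelation of the submission counts at each candidate unit. The sketch below does this for daily and weekly counts; the helper names and the daily series are made up for illustration, and whether a given value counts as "low enough" is a judgment call.

```python
def lag1_autocorrelation(series):
    """Sample autocorrelation at lag 1."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0  # constant series: autocorrelation is undefined, report 0
    return sum(dev[i] * dev[i + 1] for i in range(n - 1)) / denom

def aggregate(series, size):
    """Sum consecutive blocks of `size` values (e.g. 7 days into one week).

    A trailing incomplete block is dropped.
    """
    return [sum(series[i:i + size]) for i in range(0, len(series) - size + 1, size)]

# Made-up daily submission counts over eight weeks, for illustration only.
daily = [0, 0, 1, 2, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0,
         0, 3, 2, 0, 0, 0, 0,
         0, 0, 0, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0,
         2, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 1, 0, 0,
         0, 0, 2, 1, 0, 0, 0]
weekly = aggregate(daily, 7)

print("daily lag-1 autocorrelation: ", round(lag1_autocorrelation(daily), 3))
print("weekly lag-1 autocorrelation:", round(lag1_autocorrelation(weekly), 3))
```

Running the same comparison on each user's real series, or on the pooled series, gives a quick check of which unit of time is reasonable before fitting the count model.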
answered Aug 15 '17 at 1:31 by Dan Hicks