What kind of regression model should I do?
























$begingroup$


My research question is to examine the effect of "receiving attention" from other members of an online community on "sustained participation" on the website.



I decided to measure each user's "sustained participation" by calculating the average time difference between that user's submissions. I calculated it in the following way:



[image: calculation of the average time between submissions; not preserved]



I measured "attention" as the total number of comments each user received across all of his or her submissions. I would also like to consider the total number of votes and the total number of views. I am not sure whether it is a good idea to add those to the model as independent variables as well.



My problem is with the dependent variable:



Some people participated only twice, on two successive days, so their average time between submissions is 1 day. Other people participated 100 times, and their average time between submissions is also 1 day. But it is obvious that the second group, who participated 100 times, showed sustained participation, and the first group did not.



So I need to account for the number of submissions in the model too. Is there a way to do that? How can I handle this problem?



Should I group the users and analyze each group separately? For example, users who participated fewer than 10 times in one group, users with 10-20 submissions in another group, and so on.



I would appreciate any help! My paper is due very soon and I need some preliminary results.





















$endgroup$




bumped to the homepage by Community 9 hours ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.















  • $begingroup$
    Quantification of the dependent variable will be more complicated than you state here, because the timing of the comments received is just as important as the two dimensions of participation you've already talked about.
    $endgroup$
    – Paul
    Jan 16 '17 at 13:31

























regression research














edited Feb 11 '18 at 7:26









Franck Dernoncourt











asked Jan 15 '17 at 2:04









user27954

















3 Answers






























$begingroup$

One thing you can do about your 'participation' variable is to include the beginning and end of your data window. Let's say your data ranges from $start=$ 1/1/2016 to $end=$ 1/1/2017. Instead of only calculating the difference between the second and the first post, you'd also calculate the difference between the first post and 1/1/2016. (If some people joined the platform after 1/1/2016, you'd use the later of the join date and 1/1/2016 as their start.) And you'd also calculate the difference between 1/1/2017 and the last post.



So, if someone only had two posts $p_1$ and $p_2$, you would get the differences $(p_1-start, p_2-p_1, end-p_2)$. The boundary differences $p_1-start$ and $end-p_2$ would then tend to be larger than the corresponding differences $p_1-start$ and $end-p_n$ for someone with $n \gg 2$ posts.
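The boundary-gap idea can be sketched in a few lines of Python. This is only an illustration: the function name `gaps_with_window` and the two example users are made up, and the window dates are the ones used above.

```python
from datetime import date, timedelta

def gaps_with_window(posts, start, end):
    """Gaps in days between consecutive posts, plus the two boundary gaps."""
    points = [start] + sorted(posts) + [end]
    return [(later - earlier).days for earlier, later in zip(points, points[1:])]

def mean(values):
    return sum(values) / len(values)

start, end = date(2016, 1, 1), date(2017, 1, 1)

# Hypothetical users: both post on consecutive days, so the plain
# average gap between submissions is 1 day for each of them.
two_posts = [date(2016, 3, 1), date(2016, 3, 2)]
many_posts = [date(2016, 3, 1) + timedelta(days=i) for i in range(100)]

gaps_two = gaps_with_window(two_posts, start, end)
gaps_many = gaps_with_window(many_posts, start, end)

mean_two = mean(gaps_two)    # dominated by the two long boundary gaps
mean_many = mean(gaps_many)  # many short gaps pull the average down
print(gaps_two, mean_two, mean_many)
```

One thing worth keeping in mind: the gaps always add up to the window length, so their mean is just $(end-start)/(n+1)$ and carries the same information as the post count $n$. If you want the timing to matter as well, a different summary of the same gaps (the median or the largest gap, say) might be more informative.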















$endgroup$

































    $begingroup$

    I would segment the users on the site by their overall activity up to the point of the test, and train a model using these segments as a categorical variable (or train a separate model for each segment).



    My thinking is that two users such as:

    a) a very active user who was on a long vacation, and

    b) a new user who took only one action (on sign-up day)

    might have the same sustained-participation metric if it is measured as a function of the time passed since the last action.

    But we expect the community to react differently to their actions.



    A model might look like:

    attention = M(segment_type, time_since_last_activity)

    segment_type = G(activity_signals_until_now)



    where activity_signals_until_now may consist of:

    - total number of actions

    - time since first action

    - average time between actions



    M can be a simple regressor.

    G could be supervised (if you have prior knowledge of what segments you have) or unsupervised, using some kind of clustering algorithm.
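A minimal sketch of the M and G structure above, under loud assumptions: G is a hand-written rule on the total number of actions (the thresholds 10 and 20 are arbitrary placeholders, not recommendations), M is one least-squares line per segment, and the data rows are invented purely to show the mechanics.

```python
from collections import defaultdict

def segment_type(total_actions):
    """G: rule-based segmentation (thresholds are assumptions)."""
    if total_actions < 10:
        return "low"
    if total_actions <= 20:
        return "medium"
    return "high"

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0  # no spread in x: fall back to the mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_per_segment(rows):
    """M: one simple regressor per segment.

    Each row is (total_actions, time_since_last_activity, attention).
    """
    grouped = defaultdict(lambda: ([], []))
    for total_actions, time_since_last, attention in rows:
        xs, ys = grouped[segment_type(total_actions)]
        xs.append(time_since_last)
        ys.append(attention)
    return {segment: fit_line(xs, ys) for segment, (xs, ys) in grouped.items()}

def predict(models, total_actions, time_since_last):
    intercept, slope = models[segment_type(total_actions)]
    return intercept + slope * time_since_last

# Invented rows, for illustration only.
rows = [
    (2, 1, 3.0), (3, 5, 2.0), (5, 9, 1.0),            # low-activity users
    (50, 1, 20.0), (80, 5, 18.0), (120, 9, 16.0),     # high-activity users
]
models = fit_per_segment(rows)
print(models)
print(predict(models, 4, 3), predict(models, 100, 3))
```

Fitting a separate line per segment corresponds to the "different model for each" option; using the segment as a categorical variable in a single model would instead share some parameters across segments. A clustering algorithm could replace `segment_type` if no sensible thresholds are known in advance.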















    $endgroup$

































      $begingroup$

      From a stats perspective, it sounds like you have a Poisson process, where the events are user submissions. So you might represent your dependent variable as the number of events in a unit of time (say, number of submissions per week), and set up a Poisson regression or negative binomial regression. For the independent variable, you might try the number of comments received in the previous unit of time, i.e., you might see how well the number of comments received this week predicts the number of submissions next week.
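As a rough sketch of the data preparation this implies, the snippet below bins one user's event dates into weeks and pairs the comments received in week t with the submissions made in week t + 1. The function names, the week origin, and the event dates are all hypothetical; the regression fit itself is not shown.

```python
from collections import Counter
from datetime import date

def week_index(day, origin):
    """Number of whole weeks between origin and day."""
    return (day - origin).days // 7

def weekly_counts(days, origin, n_weeks):
    """Count events per week; events outside the window are ignored."""
    counts = Counter(week_index(day, origin) for day in days)
    return [counts.get(week, 0) for week in range(n_weeks)]

def lagged_pairs(comments_per_week, submissions_per_week):
    """Pair comments in week t with submissions in week t + 1."""
    return list(zip(comments_per_week[:-1], submissions_per_week[1:]))

origin = date(2017, 1, 2)  # start of the first week (assumed)

# Made-up events for a single user.
submissions = [date(2017, 1, 3), date(2017, 1, 5), date(2017, 1, 10), date(2017, 1, 24)]
comments = [date(2017, 1, 4), date(2017, 1, 4), date(2017, 1, 6),
            date(2017, 1, 11), date(2017, 1, 18)]

subs_weekly = weekly_counts(submissions, origin, 4)
comments_weekly = weekly_counts(comments, origin, 4)
pairs = lagged_pairs(comments_weekly, subs_weekly)
print(subs_weekly, comments_weekly, pairs)
```

Stacking such (lagged comments, submissions) pairs across users gives a table where the second column is the count outcome and the first is the predictor, which is the shape a Poisson or negative binomial regression routine in a statistics package would expect.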



      Note that there are likely to be trends over time (e.g., a new user doesn't make many submissions at first, but gradually becomes a regular user) and autocorrelation, especially if your time scale is too small. For example, in your example data, most days have 0 submissions, so autocorrelation will be high if your units are days or smaller. So consider using time series methods or selecting a time scale large enough to get low autocorrelation.
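One way to compare candidate time scales is to compute the lag-1 autocorrelation of the count series at each scale. The sketch below uses the usual sample estimator; the helper names and the daily series are made up, and a series this short only illustrates the mechanics, not a conclusion.

```python
def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation; returns 0.0 for a constant series."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0
    num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    return num / denom

def aggregate(series, width):
    """Sum consecutive blocks of `width` values (e.g. 7 days into 1 week)."""
    return [sum(series[i:i + width]) for i in range(0, len(series), width)]

# Made-up daily submission counts for four weeks.
daily = [0, 0, 0, 2, 3, 2, 0,
         0, 0, 0, 0, 0, 0, 0,
         0, 1, 2, 3, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0]

for width in (1, 7):
    counts = aggregate(daily, width)
    print(width, counts, lag1_autocorrelation(counts))
```

With real data you would run this per user (or on pooled counts) over a much longer window and pick the smallest unit of time at which the autocorrelation is acceptably low.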















      $endgroup$
























            answered Jan 15 '17 at 18:56









            oW_

































                    answered Jan 16 '17 at 12:59









                    yoav_aaa

































                            answered Aug 15 '17 at 1:31









                            Dan Hicks





























