Why we should not feed LDA with tfidf












4












$begingroup$


Can someone explain why we cannot feed TF-IDF values to an LDA topic model? What is conceptually wrong with this approach?










$endgroup$








  • 1




    $begingroup$
    Because LDA is based on term counts and document counts.
    $endgroup$
    – Blue482
    Aug 6 '17 at 15:16










  • $begingroup$
    @Blue482 Many thanks for the answer :). Could you explain more? I know the concepts behind TF-IDF and LDA, but I can't see what goes wrong if we feed LDA a vector of term counts multiplied by each term's weight in the document.
    $endgroup$
    – sariii
    Aug 6 '17 at 18:01










  • $begingroup$
    @Blue482 Could you also post your answer in the second post, datascience.stackexchange.com/questions/21947/…? I can't comment on the first post because I created it as a guest, and guests can't comment, so I created an account and a new post. I would really appreciate your insight on the result. Thanks :)
    $endgroup$
    – sariii
    Aug 6 '17 at 18:18










  • $begingroup$
    I've created another post on Stack Overflow, since I could not follow your answer. Could you follow up there? (I updated my answer there and it seems to work, but I still have some questions about the output.) It would be hard to manage the discussion this way. I really appreciate your help in advance: stackoverflow.com/questions/45535277/…
    $endgroup$
    – sariii
    Aug 6 '17 at 19:18










  • $begingroup$
    @Blue482 I see now why that is incorrect :|. It would be great if you could add your explanation as an answer, so I can accept it.
    $endgroup$
    – sariii
    Aug 7 '17 at 4:55
















machine-learning python topic-model lda






asked Aug 4 '17 at 3:56 by sariii, edited Aug 4 '17 at 5:06








1 Answer


















1












$begingroup$

Since the StackOverflow link in the question comments seems broken, here is another reply that addresses the same question: https://stackoverflow.com/a/44789327/6470915



Direct quote:




In fact, Blei (who developed LDA), points out in the introduction of the paper of 2003 (entitled "Latent Dirichlet Allocation") that LDA addresses the shortcomings of the TF-IDF model and leaves this approach behind. LSA is completely algebraic and generally (but not necessarily) uses a TF-IDF matrix, while LDA is a probabilistic model that tries to estimate probability distributions for topics in documents and words in topics. The weighting of TF-IDF is not necessary for this.




That sums it up at a high level. It would be interesting to understand, in more technical terms, why the model would perform worse when TF-IDF is used. In fact, another reply under the same SO link claims that LDA can be improved with TF-IDF.
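One way to see the conceptual mismatch is to compare what the two kinds of input look like. LDA's generative story draws word tokens one at a time, so what it expects to observe per document is a vector of non-negative integer counts that sum to the document length. The sketch below is only an illustration under stated assumptions: the toy corpus and the helper names `count_vector` and `tfidf_vector` are made up for this example, and it uses the plain unsmoothed weight `count * log(N / df)`, whereas libraries typically apply smoothing and normalization.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})


def count_vector(tokens):
    """Raw term counts: the kind of observation LDA's generative model assumes."""
    c = Counter(tokens)
    return [c[w] for w in vocab]


def tfidf_vector(tokens):
    """Unsmoothed TF-IDF, count * log(N / df); real libraries often smooth this."""
    c = Counter(tokens)
    n_docs = len(tokenized)
    weights = []
    for w in vocab:
        df = sum(1 for doc in tokenized if w in doc)  # document frequency
        weights.append(c[w] * math.log(n_docs / df))
    return weights


counts = [count_vector(t) for t in tokenized]
tfidf = [tfidf_vector(t) for t in tokenized]

# Each count row sums to the document length, so it can be read as
# "how many tokens of each word were drawn".
for tokens, row in zip(tokenized, counts):
    assert sum(row) == len(tokens)

# "the" occurs in every document, so its unsmoothed TF-IDF weight is 0
# and the word effectively disappears from the input.
the_idx = vocab.index("the")
print("counts of 'the':", [row[the_idx] for row in counts])  # [2, 2, 1]
print("tf-idf of 'the':", [row[the_idx] for row in tfidf])   # [0.0, 0.0, 0.0]
```

The TF-IDF rows are real-valued weights rather than token counts, so they no longer correspond to "a document of N drawn words". In practice, with scikit-learn one would typically pass the output of `CountVectorizer` to `LatentDirichletAllocation`; whether reweighted input helps or hurts in a given application is an empirical question that this sketch does not settle.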












$endgroup$














answered 8 hours ago by Lazer



