Why should we not feed LDA with TF-IDF?

Can someone explain why we cannot feed an LDA topic model with TF-IDF weights? What is conceptually wrong with this approach?

edited Aug 4 '17 at 5:06

asked Aug 4 '17 at 3:56

sariii


Because LDA is based on term counts and document counts.
– Blue482
Aug 6 '17 at 15:16

@Blue482 Many thanks for the answer :) May I ask you to explain more? I know the concepts behind TF-IDF and LDA, but I can't understand what goes wrong if we feed LDA a vector in which each term's count is multiplied by its weight in the document.
– sariii
Aug 6 '17 at 18:01

@Blue482 Could you also post your answer on the second post, datascience.stackexchange.com/questions/21947/…? I cannot comment on the first post, as I explained: I created it as a guest, and guests cannot comment, so I created an account and opened another post. I would really appreciate your help and your insight into the result. Thanks :)
– sariii
Aug 6 '17 at 18:18

I've created another post on Stack Overflow. As I cannot follow your answer, may I ask you to follow up there? (I updated my answer there and it seems to be working, but I still have some questions about the output.) It would be hard to manage it this way. I really appreciate your help in advance: stackoverflow.com/questions/45535277/…
– sariii
Aug 6 '17 at 19:18

@Blue482 I now understand why that is incorrect :| It would be really good if you could add your explanation as an answer, so that I can accept it.
– sariii
Aug 7 '17 at 4:55

machine-learning python topic-model lda


1 Answer

Since the StackOverflow link in the question comments seems broken, here is another reply that addresses the same question: https://stackoverflow.com/a/44789327/6470915

Direct quote:

In fact, Blei (who developed LDA) points out in the introduction of the paper of 2003 (entitled "Latent Dirichlet Allocation") that LDA addresses the shortcomings of the TF-IDF model and leaves this approach behind. LSA is completely algebraic and generally (but not necessarily) uses a TF-IDF matrix, while LDA is a probabilistic model that tries to estimate probability distributions for topics in documents and words in topics. The weighting of TF-IDF is not necessary for this.

That sums it up at a high level. It would be interesting to understand, more technically, why the model would perform worse if TF-IDF is used. In fact, there is another reply at the SO link which claims that LDA can be improved with TF-IDF.
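
One way to see the conceptual mismatch, offered as a minimal sketch rather than a definitive argument: LDA's generative story draws each token of a document from a topic's word distribution, so its natural input is a table of non-negative integer counts that sum to the document length. The toy corpus, the `tfidf` helper, and the plain `tf * log(N / df)` weighting below are assumptions chosen for illustration (libraries use several variants), and the snippet uses only the Python standard library.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "topic models find topics in documents",
    "lda is a probabilistic topic model",
    "tfidf weights words in documents",
]
tokenized = [doc.split() for doc in docs]

# Bag-of-words counts: the quantity LDA's generative story describes.
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)


def tfidf(doc_counts):
    """Plain tf * log(N / df) weighting (an assumed variant, one of several)."""
    return {t: c * math.log(n_docs / df[t]) for t, c in doc_counts.items()}


weights = [tfidf(c) for c in counts]

# Counts are integers that add up to the document length, so each row can be
# read as "how many tokens were drawn for each word".
counts_are_integer = all(
    isinstance(c, int) for row in counts for c in row.values()
)
lengths_match = all(
    sum(row.values()) == len(tokens) for row, tokens in zip(counts, tokenized)
)

# TF-IDF weights are generally fractional, so that reading no longer applies.
weights_are_integer = all(
    float(w).is_integer() for row in weights for w in row.values()
)

print("counts  :", dict(counts[0]))
print("tf-idf  :", {t: round(w, 3) for t, w in weights[0].items()})
print("integer counts:", counts_are_integer, "| lengths match:", lengths_match)
print("integer tf-idf:", weights_are_integer)
```

In this toy corpus the count rows are integers that add up to each document's length, whereas the TF-IDF rows are fractional and can no longer be read as "the number of times a word was drawn". Some implementations may still accept such real-valued input, but the result is then better viewed as a heuristic than as inference in the model LDA describes.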

answered 8 hours ago

Lazer


