Should I rescale tfidf features?
I have a dataset that contains both text and numeric features. I have encoded the text features with sklearn's TfidfVectorizer, and I would now like to apply logistic regression to the resulting dataframe. My issue is that the numeric features aren't on the same scale as the tf-idf features.

I'm unsure whether to:

- scale the whole dataframe with StandardScaler before passing it to the classifier, or
- scale only the numeric features and leave the tf-idf features as they are.
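For concreteness, here is a minimal sketch of the current setup; `df` and the column names ("text", "age", "income") are placeholders for the real data:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder data standing in for the real dataset.
df = pd.DataFrame({
    "text": ["some document", "another, much longer document"],
    "age": [23, 57],
    "income": [30_000, 120_000],
})

X_text = TfidfVectorizer().fit_transform(df["text"])  # entries roughly in [0, 1]
X_num = df[["age", "income"]].to_numpy()              # arbitrary scales
X = hstack([X_text, X_num])                           # combined feature matrix
```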
nlp feature-engineering feature-scaling tfidf
asked Jun 27 '18 at 16:30 by ignoring_gravity (163)
1 Answer
The most widely accepted practice is to leave bag-of-words, tf-idf, and similar transformations as they are. Standardizing categorical variables is arguably unnatural, and the same goes for tf-idf, because tf-idf already performs its own normalization. As an answer on stats.stackexchange puts it:

> (It) usually is a two-fold normalization. First, each document is normalized to length 1, so there is no bias for longer or shorter documents. This equals taking the relative frequencies instead of the absolute term counts. This is "TF". Second, IDF then is a cross-document normalization that puts less weight on common terms and more weight on rare terms, by normalizing (weighting) each word with the inverse in-corpus frequency.
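With sklearn's defaults this holds in the final output too: TfidfVectorizer uses norm='l2', so every row of the resulting matrix has unit length (sklearn applies the length normalization after the IDF weighting, but the end result is the same). A quick check, with throwaway example documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["short text", "a much much longer piece of text"]
X = TfidfVectorizer().fit_transform(docs)  # norm='l2' is the default

# The squared L2 norm of every row is 1: short and long documents
# already sit on the same scale, with no extra rescaling needed.
print(np.asarray(X.multiply(X).sum(axis=1)).ravel())  # [1. 1.]
```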
Tf-idf is meant to be used in its raw form by the algorithm. The other numeric values are the ones that may need scaling, if the algorithm requires it or the dataset is simply too small. Other options are algorithms that tolerate different ranges and distributions, such as tree-based models, or simply using regularization; ultimately it comes down to the cross-validation results. But sparse text features such as bag-of-words, tf-idf, or other NLP transformations should be left alone for better results.
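In sklearn terms, that recommendation looks roughly like the sketch below: standardize only the plain numeric columns and pass the tf-idf block through untouched (the column names are placeholders, matching the question's setup):

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

preprocess = ColumnTransformer([
    # The text column goes through tf-idf and is left as-is.
    ("tfidf", TfidfVectorizer(), "text"),
    # Only the ordinary numeric columns are standardized.
    ("num", StandardScaler(), ["age", "income"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    # L2 regularization (the default) also helps absorb scale differences.
    ("clf", LogisticRegression(max_iter=1000)),
])
# model.fit(df_train, y_train)  # df_train holds the columns named above
```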
However, there is also the view that normalizing one-hot encoded variables is something that can be done as a standard step, the same as in any other dataset, and it is presented by a prominent figure in the field of statistics: https://stats.stackexchange.com/a/120600/90513
answered Dec 23 '18 at 0:13 by wacax (1,910); edited Dec 23 '18 at 0:21