Should I rescale tfidf features?

I have a dataset that contains both text and numeric features.

I have encoded the text features using the TfidfVectorizer from sklearn.

I would now like to apply logistic regression to the resulting dataframe.

My issue is that the numeric features aren't on the same scale as the ones produced by tf-idf. I'm unsure whether to:

  • scale the whole dataframe with StandardScaler before passing it to the classifier; or
  • scale only the numeric features and leave the tf-idf ones as they are (see the sketch after this list).
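
Roughly, the two options look like this; text_col, num_a, and num_b are placeholders for my real columns:

    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Option 1 would standardize everything after vectorizing; note that
    # tf-idf output is sparse, so StandardScaler would need with_mean=False
    # there to avoid densifying the matrix.

    # Option 2: standardize only the numeric columns, leave tf-idf as-is.
    preprocess = ColumnTransformer([
        ("text", TfidfVectorizer(), "text_col"),        # placeholder text column
        ("num", StandardScaler(), ["num_a", "num_b"]),  # placeholder numeric columns
    ])

    clf = Pipeline([
        ("features", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    # clf.fit(df, y) then applies each transformer to its own columns.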











Tags: nlp, feature-engineering, feature-scaling, tfidf






asked Jun 27 '18 at 16:30 by ignoring_gravity (163)





1 Answer

The most widely accepted practice is to leave bag-of-words, tf-idf, and similar transformations as they are. Standardizing categorical variables is arguably unnatural, and the same argument applies to standardizing tf-idf, because, as an answer on Cross Validated explains, tf-idf:




    (it's) (...) usually is a two-fold normalization.

    First, each document is normalized to length 1, so there is no bias for longer or shorter documents. This equals taking the relative frequencies instead of the absolute term counts. This is "TF".

    Second, IDF then is a cross-document normalization, that puts less weight on common terms, and more weight on rare terms, by normalizing (weighting) each word with the inverse in-corpus frequency.
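
To see the first normalization concretely: sklearn's TfidfVectorizer L2-normalizes each document vector by default (norm='l2'), so every row of its output already has length 1. A quick check on a toy corpus (the documents here are made up):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the cat sat on the mat", "dogs and cats"]

    X = TfidfVectorizer().fit_transform(docs)       # sparse, one row per document
    row_norms = np.sqrt(X.multiply(X).sum(axis=1))  # Euclidean norm of each row
    print(row_norms)                                # every norm is 1.0 (up to float error)

Standardizing these columns afterwards would destroy exactly this property.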




Tf-idf is meant to be fed to an algorithm in its raw form. The remaining numeric features are the ones you may standardize, if the algorithm needs it or the dataset is simply too small. Other options are to use algorithms that are robust to differing ranges and distributions, such as tree-based models, or to simply rely on regularization; ultimately it comes down to your cross-validation results (see the sketch after this paragraph). In any case, sparse text features such as bag-of-words, tf-idf, and other NLP transformations are best left alone for better results.
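
As an illustration of the regularization route, you can let cross-validation pick the strength directly; a minimal sketch, assuming X is your combined tf-idf-plus-numeric matrix and y your labels:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # C is the inverse regularization strength of LogisticRegression:
    # smaller C means stronger regularization.
    grid = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=5,
    )
    # grid.fit(X, y)            # X and y come from your own pipeline
    # grid.best_params_["C"]    # the strength chosen by cross-validation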



However, there is also the view that one-hot encoded variables can be normalized as a standard step, the same as in any other dataset, and it is put forward by a prominent figure in statistics:

https://stats.stackexchange.com/a/120600/90513






answered Dec 23 '18 at 0:13 by wacax (1,910), edited Dec 23 '18 at 0:21





























