tensorflow categorical data with vocabulary list - Expected binary or Unicode string, got [0,1,2,…]
$begingroup$
I'm brand new to machine learning (having just completed the Google Machine Learning Crash Course) and thought a Kaggle competition would be a good way to start on some real problem solving. I'm using TensorFlow and Python 3, all up to date (the Kaggle online Jupyter notebook).
The data is formatted in a dataframe like the one below:
|Identity | Cuisine | Ingredients |
|---------|---------|----------------------------|
|1 | italian | [beans, milk,..., tomatoes]|
|2 | indian | [chicken, curry leaf,...] |
I have written a vocabulary list generator that builds a vocabulary set and replaces each word in the ingredients array with its index in that set, so my data now looks like this:
|Identity | Cuisine | Ingredients |
|---------|---------|-------------|
|1 | italian |[0, 1,..., 4]|
|2 | indian |[5, 6,...] |
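For illustration only, a vocabulary step of this kind might look roughly like the sketch below. The helper names `build_vocabulary` and `encode` are hypothetical and are not taken from the original code; the sketch assumes indices are assigned in order of first appearance.

```python
from typing import Dict, List


def build_vocabulary(recipes: List[List[str]]) -> Dict[str, int]:
    """Map each distinct ingredient to an index, in order of first appearance."""
    vocabulary: Dict[str, int] = {}
    for ingredients in recipes:
        for ingredient in ingredients:
            if ingredient not in vocabulary:
                vocabulary[ingredient] = len(vocabulary)
    return vocabulary


def encode(recipes: List[List[str]], vocabulary: Dict[str, int]) -> List[List[int]]:
    """Replace every ingredient name with its vocabulary index."""
    return [[vocabulary[name] for name in ingredients] for ingredients in recipes]


# Toy data standing in for the real ingredients column.
recipes = [["beans", "milk", "tomatoes"], ["chicken", "curry leaf", "milk"]]
vocabulary = build_vocabulary(recipes)
encoded = encode(recipes, vocabulary)
```

Note that each encoded row is still a variable-length Python list, which is the shape of data the rest of the question is about.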
I separate the labels (cuisine) and the features (ingredients) into two separate dataframes for ease, and I am using a `tf.feature_column.categorical_column_with_vocabulary_list` and a subsequent `tf.feature_column.indicator_column` for the ingredients array.
However, my model is now unable to read the ingredients column, and I get this error:
TypeError: Expected binary or unicode string, got [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
My input function is as follows:
def input_fn(features,labels,batch_size,num_epochs=None,shuffle=True):
    ds = Dataset.from_tensor_slices((features,labels))
    ds = ds.batch(batch_size).repeat(num_epochs)
    if shuffle:
        ds = ds.shuffle(10000)
    feature_batch, label_batch = ds.make_one_shot_iterator().get_next()
    return feature_batch, label_batch
It is wrapped and passed to the classifier as below:
training_func = lambda: input_fn(training_example,training_target,batch_size)
validati_func = lambda: input_fn(validation_example,validation_target,batch_size)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(optimizer, 5.0)
classifier.train(
    input_fn=training_func,
    steps=steps_per_period
)
My urgent question is: how do I fix this TypeError?
In addition, I would like to know whether there is a best practice for handling data in this format (and whether there is any built-in functionality for it).
python tensorflow dataset linear-regression categorical-data
$endgroup$
1
$begingroup$
Since this might be a code-heavy question, I added my entire code to an online Pastebin paste so you can check it out. The dataset I am using is from the Kaggle What's Cooking competition.
$endgroup$
– Byren Higgin
Aug 9 '18 at 3:07
asked Aug 9 '18 at 3:04 by Byren Higgin
1 Answer
$begingroup$
I'm not completely familiar with the TF API, but here's what I think is happening.
The error message says the library expected a binary (bytes) or Unicode string, but each cell of your ingredients column holds a whole list. Converting the ingredient labels to integers therefore does not help.
You can instead create one column per possible ingredient and set it to 1 if that ingredient is present and 0 if it is absent. For example, Italian cuisine will have the column for tomatoes or garlic set to 1 for many records.
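The one-column-per-ingredient idea can be sketched in plain Python as follows. This is only a minimal illustration: the function name `multi_hot` and the toy recipes are assumptions made up for the example, not part of the question's code.

```python
from typing import List


def multi_hot(recipes: List[List[str]], vocabulary: List[str]) -> List[List[int]]:
    """Return one row per recipe with a 1 in every column whose ingredient is present."""
    column_of = {ingredient: i for i, ingredient in enumerate(vocabulary)}
    rows = []
    for ingredients in recipes:
        row = [0] * len(vocabulary)
        for ingredient in ingredients:
            row[column_of[ingredient]] = 1
        rows.append(row)
    return rows


# Toy data; the real vocabulary would come from the full training set.
recipes = [["tomatoes", "garlic", "milk"], ["chicken", "garlic"]]
vocabulary = sorted({name for ingredients in recipes for name in ingredients})
matrix = multi_hot(recipes, vocabulary)
```

Every row now has the same fixed length (the vocabulary size), which is generally easier to feed to a model than variable-length lists.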
You can read more about the get_dummies function in the pandas library. If the original ingredient list comes in the form of text, you can read up on the text feature extraction / bag-of-words APIs in the scikit-learn library.
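As a rough sketch of how `get_dummies` could be applied to a column of lists, one option is to explode the lists first and then collapse the indicators back to one row per recipe. The dataframe and its column names here are made-up stand-ins for the real data, and this is one possible approach rather than the only one.

```python
import pandas as pd

# Toy stand-in for the real dataframe; column names are assumptions.
recipes = pd.DataFrame({
    "cuisine": ["italian", "indian"],
    "ingredients": [["tomatoes", "garlic"], ["chicken", "garlic"]],
})

# One row per (recipe, ingredient) pair; the original row index is preserved.
exploded = recipes["ingredients"].explode()

# One 0/1 column per ingredient, then merge the rows belonging to the same recipe.
indicators = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
```

The resulting `indicators` frame has one row per recipe and one numeric column per ingredient, so it can be joined back onto the labels by index.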
$endgroup$
answered Aug 9 '18 at 15:11 by hssay