
tensorflow categorical data with vocabulary list - Expected binary or Unicode string, got...

Discussion in 'Computer Science' started by Byren Higgin, Oct 8, 2018.

  1. Byren Higgin

    I'm brand new to machine learning (having just completed the Google Machine Learning Crash Course) and thought a Kaggle competition would be a good first attempt at some real problem solving. I'm using TensorFlow and Python 3, both up to date (in the Kaggle online Jupyter notebook).

    The data is formatted in a dataframe like below

    |Identity | Cuisine | Ingredients |
    |1 | italian | [beans, milk,..., tomatoes]|
    |2 | indian | [chicken, curry leaf,...] |

    I wrote a vocabulary generator that builds a vocabulary set and replaces each word in the ingredients array with the index of that ingredient in the vocabulary set, so my encoded data looks like this:

    |Identity | Cuisine | Ingredients |
    |1 | italian |[0, 1,..., 4]|
    |2 | indian |[5, 6,...] |
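
For illustration, the vocabulary step described above might look roughly like the sketch below. The helper names `build_vocabulary` and `encode` and the sample recipes are hypothetical; this is not the original notebook code, only one plausible way to produce the encoded table.

```python
# Hypothetical helpers: a sketch of the vocabulary step, not the original code.
def build_vocabulary(recipes):
    """Map each distinct ingredient to an integer index, in first-seen order."""
    vocab = {}
    for ingredients in recipes:
        for item in ingredients:
            if item not in vocab:
                vocab[item] = len(vocab)
    return vocab


def encode(recipes, vocab):
    """Replace every ingredient name with its vocabulary index."""
    return [[vocab[item] for item in ingredients] for ingredients in recipes]


# Made-up sample data in the same shape as the Ingredients column.
recipes = [
    ["beans", "milk", "tomatoes"],
    ["chicken", "curry leaf", "milk"],
]
vocab = build_vocabulary(recipes)
encoded = encode(recipes, vocab)
print(encoded)
```

Note that the encoded rows can have different lengths, because recipes do not all use the same number of ingredients.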

    For convenience, I separate the labels (cuisine) and the features (ingredients) into two dataframes. For the ingredients array I use a tf.feature_column.categorical_column_with_vocabulary_list wrapped in a tf.feature_column.indicator_column.
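
Conceptually, an indicator column turns each recipe's list of category ids into a multi-hot vector with one slot per vocabulary entry. The plain-Python sketch below only illustrates that idea; `multi_hot` is a hypothetical helper, not part of TensorFlow, and the vocabulary size of 7 is an assumed value.

```python
# Hypothetical helper illustrating the multi-hot idea; not a TensorFlow API.
def multi_hot(indices, vocab_size):
    """Return a vector holding a count at each position listed in `indices`."""
    row = [0] * vocab_size
    for i in indices:
        row[i] += 1
    return row


encoded_recipe = [0, 1, 4]   # ingredient ids for one recipe
vector = multi_hot(encoded_recipe, vocab_size=7)
print(vector)
```

Every recipe ends up with a vector of the same length, regardless of how many ingredients it lists.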

    However, my model cannot read the ingredients column, and I get this error:

    TypeError: Expected binary or unicode string, got [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    My input function is as follows:

    def input_fn(features,labels,batch_size,num_epochs=None,shuffle=True):
        ds = Dataset.from_tensor_slices((features,labels))
        ds = ds.batch(batch_size).repeat(num_epochs)

        if shuffle:
            ds = ds.shuffle(10000)

        feature_batch, label_batch = ds.make_one_shot_iterator().get_next()
        return feature_batch, label_batch

    The input function is then wrapped in lambdas for training and validation, alongside the optimizer setup:

    training_func = lambda: input_fn(training_example,training_target,batch_size)
    validati_func = lambda: input_fn(validation_example,validation_target,batch_size)

    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    optimizer = tf.contrib.estimator.clip_gradients_by_norm(optimizer, 5.0)


    My urgent question is: how do I fix this TypeError?

    I would also like to know whether there is a best practice for handling data in this format, and whether there is any built-in functionality for it.
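
One possible direction, offered as an assumption rather than a confirmed fix: a dataframe column that holds Python lists of different lengths cannot be converted into a single rectangular tensor, which may be why the list itself appears in the error message. Padding every row to the same length is one common workaround. In the sketch below, `pad_rows` is a hypothetical helper, and the pad value of -1 is an assumption based on TensorFlow's categorical columns documenting -1 as the missing-value marker for integer inputs; it has not been tested against this model.

```python
PAD = -1  # assumed padding id, chosen so that padded slots can be treated as missing


def pad_rows(rows, pad_value=PAD):
    """Right-pad each row so that all rows share the length of the longest one."""
    width = max((len(r) for r in rows), default=0)
    return [list(r) + [pad_value] * (width - len(r)) for r in rows]


# Made-up encoded recipes of unequal length.
encoded = [[0, 1, 4], [5, 6]]
padded = pad_rows(encoded)
print(padded)
```

With integer ids as features, the vocabulary list passed to the categorical column would presumably also need to contain integers rather than ingredient names.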
