Tokenize text with both American and British English words

Discussion in 'Computer Science' started by user3259111, Oct 8, 2018.

  user3259111

    I need to tokenize a corpus of abstracts from an international conference. The abstracts are usually written in American English, but sometimes in British English.

    Consequently, I get two different tokens for the same word, for example “organization” and “organisation”, or “color” and “colour”. Examples: https://en.oxforddictionaries.com/spelling/british-and-spelling

    Do you know of a (Python) library that converts British English to American English (or vice versa)?

    I would be happy to do that ... (but I am French and my English is not so good)
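
    To make the goal concrete, here is a minimal sketch of the kind of normalization I have in mind. It assumes a small hand-written mapping; the dictionary `BRITISH_TO_AMERICAN` and the function `tokenize_normalized` are only illustrative names, not taken from any existing library, and a real corpus would need a far larger word list.

```python
import re

# Hypothetical, hand-written mapping from British to American spellings.
# Only the two example pairs from the question are listed here.
BRITISH_TO_AMERICAN = {
    "organisation": "organization",
    "colour": "color",
}

# Very simple tokenizer: runs of ASCII letters only.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+")


def tokenize_normalized(text, mapping=BRITISH_TO_AMERICAN):
    """Lowercase the text, split it into alphabetic tokens,
    and replace any token found in the mapping."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [mapping.get(token, token) for token in tokens]


print(tokenize_normalized("The organisation chose a colour."))
```

    With this, both spellings end up as a single token, but only for words that appear in the mapping, which is why a ready-made library or word list would be preferable.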


