Why do we have to remove most common words for text analysis?

Discussion in 'Computer Science' started by Sam, Oct 8, 2018.

  1. Sam

    Sam Guest

    I am trying to do sentiment analysis; the task is to separate racist tweets from other tweets. I have read many articles, and many of them recommend removing the 10 most common words from the text column because their presence will not be of any use in classifying the text data.

    These are the 10 most common words in my dataset:

    [('love', 4271),
    ('day', 3572),
    ('amp', 2709),
    ('happy', 2651),
    ('u', 1840),
    ('time', 1771),
    ('im', 1770),
    ('life', 1756),
    ('like', 1700),
    ('today', 1591)]
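
    For reference, a list like the one above could be produced with a sketch along these lines. It assumes the tweets are already cleaned, lowercased strings; the `tweets` list and the `top_words` helper are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical sample data standing in for the real tweet column.
tweets = [
    "love this day",
    "happy day today",
    "love life",
]

def top_words(texts, n=10):
    """Return the n most frequent whitespace-separated tokens with their counts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return counts.most_common(n)

top = top_words(tweets, n=2)
print(top)
```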

    If I remove these will my classification model be more accurate?

    Similarly, they also recommend removing the 10 rarest words from the column.
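
    The removal step those articles describe might look roughly like the sketch below. It is only an illustration of the mechanics, assuming whitespace tokenization; `remove_frequent_and_rare` and its parameters are hypothetical names, and the sample tweets are made up.

```python
from collections import Counter

def remove_frequent_and_rare(texts, n_frequent=10, n_rare=10):
    """Drop the n_frequent most common and n_rare least common tokens from each text."""
    tokenized = [text.lower().split() for text in texts]
    counts = Counter(word for tokens in tokenized for word in tokens)

    # Words ordered from most to least frequent.
    ranked = [word for word, _ in counts.most_common()]
    frequent = set(ranked[:n_frequent])
    rare = set(ranked[-n_rare:]) if n_rare > 0 else set()
    drop = frequent | rare

    return [" ".join(w for w in tokens if w not in drop) for tokens in tokenized]

# Made-up sample data, not taken from the real dataset.
sample = ["love this day", "happy day today", "love life day"]
cleaned = remove_frequent_and_rare(sample, n_frequent=1, n_rare=1)
print(cleaned)
```

    This shows only how the words would be dropped; whether accuracy changes would have to be checked by scoring the model on held-out data with and without the removal.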

    I want to know why. Any help would be appreciated.
