May-15-2018, 12:10 AM
I need help excluding named entities (NE) and proper nouns (NNP) from text.
Thanks community!
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

def clean_article(articleName):
    token_frequency_dic = {}
    # the with-block closes the file, so no explicit f.close() is needed
    with open(articleName, 'r') as f:
        text = f.read()
    # split into words
    tokens = word_tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # tag parts of speech and chunk named entities
    # tagged = nltk.pos_tag(words)
    # namedEd = nltk.ne_chunk(tagged, binary=True)
    # filter out stop words and sort
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    words.sort()
    freq = nltk.FreqDist(words)
    for k, v in freq.items():
        token_frequency_dic[str(k)] = v
    return token_frequency_dic

test = clean_article('sample_article.txt')
print(test)

sample_article:
The Supreme Court on Monday struck down a federal law that effectively banned commercial sports betting in most states, boosting the prospect of such gambling across the nation. The case concerned New Jersey, but the court’s decision opened the door for other states that are eager to allow and tax sports gambling. Americans are estimated to annually place $150 billion in illegal wagers on sports, and many states seem interested in making such wagers legal and reaping tax revenues from them. State officials and representatives of the casino industry greeted the ruling with something like glee.

Output:
{'across': 1, 'allow': 1, 'americans': 1, 'annually': 1, 'banned': 1, 'betting': 1, 'billion': 1, 'boosting': 1, 'case': 1, 'casino': 1, 'commercial': 1, 'concerned': 1, 'court': 2, 'decision': 1, 'door': 1, 'eager': 1, 'effectively': 1, 'estimated': 1, 'federal': 1, 'gambling': 2, 'glee': 1, 'greeted': 1, 'illegal': 1, 'industry': 1, 'interested': 1, 'jersey': 1, 'law': 1, 'legal': 1, 'like': 1, 'making': 1, 'many': 1, 'monday': 1, 'nation': 1, 'new': 1, 'officials': 1, 'opened': 1, 'place': 1, 'prospect': 1, 'reaping': 1, 'representatives': 1, 'revenues': 1, 'ruling': 1, 'seem': 1, 'something': 1, 'sports': 3, 'state': 1, 'states': 3, 'struck': 1, 'supreme': 1, 'tax': 2, 'wagers': 2}

How can I get the same output without named entities and proper nouns? Thanks, community!
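
One possible approach, offered as a rough sketch rather than a definitive answer: NLTK's tagger and NE chunker rely heavily on capitalization, so the text probably needs to be tagged before it is lowercased and stripped of punctuation. After tagging, tokens tagged NNP or NNPS (the Penn Treebank proper-noun tags) and any subtree produced by nltk.ne_chunk(..., binary=True) can be skipped. The function names count_common_words and clean_article_text are made up for illustration, and the sketch assumes the NLTK tokenizer, tagger, chunker, and stopwords data have already been downloaded.

```python
import string
from collections import Counter

PROPER_NOUN_TAGS = {"NNP", "NNPS"}  # Penn Treebank tags for proper nouns
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def count_common_words(chunked, stop_words=frozenset()):
    """Count lowercased words, skipping proper nouns and named-entity chunks.

    `chunked` may be a plain list of (word, tag) pairs from nltk.pos_tag,
    or the tree returned by nltk.ne_chunk(tagged, binary=True).
    """
    counts = Counter()
    for node in chunked:
        if hasattr(node, "label"):
            # A subtree from ne_chunk marks a named entity: skip all of it.
            continue
        word, tag = node
        if tag in PROPER_NOUN_TAGS:
            continue
        # Only now lowercase and strip punctuation, as in the original code.
        word = word.lower().translate(PUNCT_TABLE)
        if word.isalpha() and word not in stop_words:
            counts[word] += 1
    return dict(sorted(counts.items()))


def clean_article_text(text):
    """Hypothetical wrapper: tag the original-case text, then filter and count."""
    import nltk  # imported here so the helper above also works without NLTK
    from nltk.corpus import stopwords

    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    chunked = nltk.ne_chunk(tagged, binary=True)
    return count_common_words(chunked, set(stopwords.words("english")))
```

Calling clean_article_text(text) on the contents of sample_article.txt should give a dictionary in the same shape as the output above. Which words actually get dropped depends on the tagger and chunker models, so the result is worth checking by eye.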