Python Forum

Need help excluding named entities (NE) and proper nouns (NNP) from text analysis
I need help excluding named entities (NE) and proper nouns (NNP) from a text analysis. Here is my current code:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

def clean_article(articleName):
    token_frequency_dic = {}
    with open(articleName,'r') as f:
        text = f.read()

        # split into words
        tokens = word_tokenize(text)

        # convert to lower case
        tokens = [w.lower() for w in tokens]

        # remove punctuation from each word
        table = str.maketrans('', '', string.punctuation)
        stripped = [w.translate(table) for w in tokens]

        # remove remaining tokens that are not alphabetic
        words = [word for word in stripped if word.isalpha()]
        
        # tag parts of speech and chunk named entities (currently disabled)
        # tagged = nltk.pos_tag(words)
        # namedEd = nltk.ne_chunk(tagged, binary=True)
        
        # filter out stop words and sort
        stop_words = set(stopwords.words('english'))
        words = [w for w in words if w not in stop_words]
        words.sort()
        freq = nltk.FreqDist(words)
        for k, v in freq.items():
            token_frequency_dic[str(k)] = v
        return token_frequency_dic

test = clean_article('sample_article.txt')
print(test)
sample_article: The Supreme Court on Monday struck down a federal law that effectively banned commercial sports betting in most states, boosting the prospect of such gambling across the nation. The case concerned New Jersey, but the court’s decision opened the door for other states that are eager to allow and tax sports gambling. Americans are estimated to annually place $150 billion in illegal wagers on sports, and many states seem interested in making such wagers legal and reaping tax revenues from them. State officials and representatives of the casino industry greeted the ruling with something like glee.

Output:
{'across': 1, 'allow': 1, 'americans': 1, 'annually': 1, 'banned': 1, 'betting': 1, 'billion': 1, 'boosting': 1, 'case': 1, 'casino': 1, 'commercial': 1, 'concerned': 1, 'court': 2, 'decision': 1, 'door': 1, 'eager': 1, 'effectively': 1, 'estimated': 1, 'federal': 1, 'gambling': 2, 'glee': 1, 'greeted': 1, 'illegal': 1, 'industry': 1, 'interested': 1, 'jersey': 1, 'law': 1, 'legal': 1, 'like': 1, 'making': 1, 'many': 1, 'monday': 1, 'nation': 1, 'new': 1, 'officials': 1, 'opened': 1, 'place': 1, 'prospect': 1, 'reaping': 1, 'representatives': 1, 'revenues': 1, 'ruling': 1, 'seem': 1, 'something': 1, 'sports': 3, 'state': 1, 'states': 3, 'struck': 1, 'supreme': 1, 'tax': 2, 'wagers': 2}
How can I get the same output without named entities and proper nouns?

Thanks community!
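
One possible approach, as a rough sketch rather than a definitive answer: run `nltk.pos_tag` and `nltk.ne_chunk` on the original-case tokens, before lowercasing, since the tagger and chunker lean heavily on capitalisation to spot proper nouns. Then skip every subtree that `ne_chunk` produces (with `binary=True` these are labelled `NE`) and every remaining token tagged `NNP` or `NNPS`. The helper name `content_words` and the use of `collections.Counter` in place of `FreqDist` are choices made for this sketch and are not part of the original code. It also assumes the NLTK resources for the tokenizer, tagger, NE chunker, and stop words have already been fetched with `nltk.download()`; the exact resource names depend on the NLTK version.

```python
import string
from collections import Counter


def content_words(chunked, stop_words):
    """Yield cleaned, lowercased words, skipping named entities and proper nouns.

    `chunked` is expected to look like the result of nltk.ne_chunk: an iterable
    whose items are either (word, tag) tuples or subtrees with a .label() method.
    """
    table = str.maketrans('', '', string.punctuation)
    for node in chunked:
        # Named entities come back as subtrees; plain tokens are (word, tag) tuples.
        if hasattr(node, 'label'):
            continue
        word, tag = node
        # NNP / NNPS are the Penn Treebank tags for proper nouns.
        if tag in ('NNP', 'NNPS'):
            continue
        # Only now lowercase and strip punctuation, as in the original code.
        word = word.lower().translate(table)
        if word.isalpha() and word not in stop_words:
            yield word


def clean_article(article_name):
    # Imported here so that content_words() can be tried without NLTK data.
    import nltk
    from nltk.corpus import stopwords

    with open(article_name, 'r') as f:
        text = f.read()

    # Tag and chunk the original-case tokens.
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    chunked = nltk.ne_chunk(tagged, binary=True)

    stop_words = set(stopwords.words('english'))
    counts = Counter(content_words(chunked, stop_words))
    # Sort by word so the dict prints in alphabetical order.
    return dict(sorted(counts.items()))
```

Calling `clean_article('sample_article.txt')` should then return a dictionary of the same shape as the one above. The tagger is statistical, so an occasional common noun may be dropped, or a proper noun kept, depending on how each token is tagged in context.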