Python Forum
Need help excluding named entities (NE) and proper nouns (NNP) from text analysis
I need help excluding named entities (NE) and proper nouns (NNP) from a text before counting word frequencies.

import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def clean_article(article_name):
    """Return a {token: count} dict for the article, without punctuation or stop words."""
    # the with-block closes the file on exit, so no explicit f.close() is needed
    with open(article_name, 'r') as f:
        text = f.read()

    # split into words
    tokens = word_tokenize(text)

    # convert to lower case
    tokens = [w.lower() for w in tokens]

    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]

    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]

    # tag parts of speech and chunk named entities (my attempt, currently disabled)
    # tagged = nltk.pos_tag(words)
    # named_ents = nltk.ne_chunk(tagged, binary=True)

    # filter out stop words and sort
    stop_words = set(stopwords.words('english'))
    words = sorted(w for w in words if w not in stop_words)

    # count how often each remaining word occurs
    return dict(nltk.FreqDist(words))


test = clean_article('sample_article.txt')
print(test)
sample_article: The Supreme Court on Monday struck down a federal law that effectively banned commercial sports betting in most states, boosting the prospect of such gambling across the nation. The case concerned New Jersey, but the court’s decision opened the door for other states that are eager to allow and tax sports gambling. Americans are estimated to annually place $150 billion in illegal wagers on sports, and many states seem interested in making such wagers legal and reaping tax revenues from them. State officials and representatives of the casino industry greeted the ruling with something like glee.

Output:
{'across': 1, 'allow': 1, 'americans': 1, 'annually': 1, 'banned': 1, 'betting': 1, 'billion': 1, 'boosting': 1, 'case': 1, 'casino': 1, 'commercial': 1, 'concerned': 1, 'court': 2, 'decision': 1, 'door': 1, 'eager': 1, 'effectively': 1, 'estimated': 1, 'federal': 1, 'gambling': 2, 'glee': 1, 'greeted': 1, 'illegal': 1, 'industry': 1, 'interested': 1, 'jersey': 1, 'law': 1, 'legal': 1, 'like': 1, 'making': 1, 'many': 1, 'monday': 1, 'nation': 1, 'new': 1, 'officials': 1, 'opened': 1, 'place': 1, 'prospect': 1, 'reaping': 1, 'representatives': 1, 'revenues': 1, 'ruling': 1, 'seem': 1, 'something': 1, 'sports': 3, 'state': 1, 'states': 3, 'struck': 1, 'supreme': 1, 'tax': 2, 'wagers': 2}
How can I get the same output without named entities and proper nouns?

Thanks community!
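
One possible direction, offered as a rough sketch and not a tested solution: run nltk.pos_tag and nltk.ne_chunk on the original-case tokens, before lower-casing and stripping punctuation, since the tagger and chunker appear to rely partly on capitalization. Then skip every chunk that ne_chunk grouped as a named entity, plus any leftover token tagged NNP or NNPS. The names keep_common_words, clean_text, and PROPER_NOUN_TAGS are hypothetical helpers introduced for this illustration, and clean_text assumes the required NLTK data packages have already been fetched with nltk.download.

```python
from collections import Counter

# Penn Treebank tags for singular and plural proper nouns
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def keep_common_words(chunked, stop_words=frozenset()):
    """Count words in an ne_chunk-style result, skipping named entities and proper nouns.

    `chunked` is iterated as a flat sequence whose items are either
    (word, tag) tuples or subtrees (anything with a .label() method),
    which is the shape nltk.ne_chunk(tagged, binary=True) produces.
    """
    counts = Counter()
    for node in chunked:
        if hasattr(node, "label"):
            # a grouped chunk such as Tree('NE', [...]): drop it entirely
            continue
        word, tag = node
        if tag in PROPER_NOUN_TAGS:
            # proper noun that the chunker did not group into an entity
            continue
        word = word.lower()
        if word.isalpha() and word not in stop_words:
            counts[word] += 1
    return dict(counts)


def clean_text(text):
    """Tag the original-case text first, then filter (needs NLTK and its data)."""
    import nltk
    from nltk.corpus import stopwords

    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    chunked = nltk.ne_chunk(tagged, binary=True)
    return keep_common_words(chunked, set(stopwords.words("english")))
```

Calling sorted(clean_text(text).items()) would give the alphabetical ordering shown in the output above. The counts may differ slightly from the original function, because this sketch drops tokens that contain punctuation instead of stripping the punctuation out of them, and because the tagger can mislabel a capitalized sentence-initial word as a proper noun.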