Python Forum
Clean Data using NLTK
#1
I need help creating a function that cleans text data and puts the word frequencies in a dictionary.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# clean the text in a file and return a dictionary of token frequencies
def clean_data(filename):
    # load data; the with block closes the file automatically
    with open(filename, 'r') as article:
        text = article.read()

    # split into words
    tokens = word_tokenize(text)

    # convert to lower case
    tokens = [w.lower() for w in tokens]

    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]

    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]

    # filter out stop words and sort
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    words.sort()

    # build the frequency dictionary from the cleaned words
    token_frequency_dic = dict(nltk.FreqDist(words))
    return token_frequency_dic

# print frequency distribution
for k, v in clean_data('sample_data.txt').items():
    print(str(k) + ': ' + str(v))
Can this be condensed into a for loop?
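One possible way to fold the cleaning steps into a single loop is sketched below. So that it runs without any NLTK data downloads, it makes two assumptions that are not in the post above: a plain regex stands in for word_tokenize, and a tiny hypothetical STOP_WORDS set stands in for stopwords.words('english'). The name clean_text is also just illustrative. Because of the regex, contractions such as "don't" are split differently than NLTK would split them, so treat the counts as approximate rather than identical.

```python
import re
from collections import Counter

# Hypothetical stand-in for set(stopwords.words('english'))
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in"}

def clean_text(text, stop_words=STOP_WORDS):
    """Return a dict mapping each cleaned word to its frequency."""
    counts = Counter()
    # Assumption: a simple regex replaces word_tokenize. It lowercases the
    # text and keeps only runs of ASCII letters, which drops punctuation
    # and non-alphabetic tokens in one step.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in stop_words:
            counts[word] += 1
    # Sort alphabetically, mirroring words.sort() in the original code
    return dict(sorted(counts.items()))

sample = "The cat and the hat. A cat is in the hat!"
for word, count in clean_text(sample).items():
    print(f"{word}: {count}")
```

To use the real NLTK pieces instead, the loop could iterate over word_tokenize(text.lower()) and the stop_words argument could be set(stopwords.words('english')); the counting logic stays the same.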