Clean Data using NLTK - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Clean Data using NLTK (/thread-10108.html)
Clean Data using NLTK - disruptfwd8 - May-12-2018

Need help creating a function that cleans data and puts the word frequencies in a dictionary.

    import nltk
    import string
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords

    # create a function that builds the frequency dictionary
    def clean_data(filename):
        token_frequency_dic = {}
        # load data
        with open(filename, 'r') as article:
            text = article.read()
        # split into words
        tokens = word_tokenize(text)
        # convert to lower case
        tokens = [w.lower() for w in tokens]
        # remove punctuation from each word
        table = str.maketrans('', '', string.punctuation)
        stripped = [w.translate(table) for w in tokens]
        # remove remaining tokens that are not alphabetic
        words = [word for word in stripped if word.isalpha()]
        # filter out stop words and sort
        stop_words = set(stopwords.words('english'))
        words = [w for w in words if w not in stop_words]
        words.sort()
        # store the frequency distribution in the dictionary and print it
        freq = nltk.FreqDist(words)
        for k, v in freq.items():
            token_frequency_dic[k] = v
            print(str(k) + ': ' + str(v))
        return token_frequency_dic

    clean_data('sample_data.txt')

Can this be condensed into a for loop...
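One way the cleaning steps might be condensed is to fold the lowercasing, punctuation stripping, and filtering into a single pass over the tokens, with collections.Counter (a dict subclass) playing the role of the frequency dictionary. This is only a sketch, not the original poster's code: the function name `token_frequencies` and its parameters are made up here, and it takes the token list and stop-word set as arguments so it does not depend on the file or on NLTK data being downloaded.

```python
import string
from collections import Counter

def token_frequencies(tokens, stop_words):
    """Lowercase, strip punctuation, drop non-alphabetic tokens and
    stop words, and count what is left -- the whole pipeline in one pass."""
    table = str.maketrans('', '', string.punctuation)
    return Counter(w for w in (t.lower().translate(table) for t in tokens)
                   if w.isalpha() and w not in stop_words)
```

To use it with the original setup, you would pass in `word_tokenize(text)` for the tokens and `set(stopwords.words('english'))` for the stop words; the returned Counter behaves like a plain dict of word counts.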