Aug-01-2020, 08:54 PM
(This post was last modified: Aug-01-2020, 08:54 PM by ateestructural.)
I have the following code:
import nltk
nltk.download('stopwords')
import nltk.corpus
import re
import string

from load_file_with_function import load_doc

# turn a doc into clean tokens
def clean_doc(doc):
    # split the tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop-words
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    print(tokens)

It works because it is someone else's code - I have to work further on it.
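For context, here is the same filtering pipeline run on a small made-up sample string, with the NLTK stop-word step left out so it runs standalone (the sample text and variable names are my own, not from the original code):

```python
import re
import string

doc = "Hello, world! It's 2020 - a test."
# split on whitespace
tokens = doc.split()
# strip punctuation characters from each token
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
tokens = [re_punc.sub('', w) for w in tokens]
# drop tokens that are not purely alphabetic (numbers, empty strings)
tokens = [w for w in tokens if w.isalpha()]
# drop single-character tokens
tokens = [w for w in tokens if len(w) > 1]
print(tokens)  # -> ['Hello', 'world', 'Its', 'test']
```

Note how '2020' is dropped by isalpha(), '-' becomes an empty string after punctuation removal and is dropped too, and 'a' is removed by the length filter.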
I'm unable to understand how the statement below filters non-alphabetic tokens out of my set of words (tokens):
tokens = [word for word in tokens if word.isalpha()]

I know about the string method isalpha(), but I don't follow how this single statement produces a "new" tokens list with the non-alphabetic words removed. Can anyone please explain?