Python Forum

I have the following code:

import nltk

nltk.download('stopwords')

import nltk.corpus
import re
import string

# turn a doc into clean tokens
from load_file_with_function import load_doc


def clean_doc(doc):
    # split the tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop-words
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    print(tokens)

It works, but it is someone else's code that I have to build on.

I can't understand how the statement below filters the non-alphabetic tokens out of my list of words (tokens):

tokens = [word for word in tokens if word.isalpha()]

I know about the string method isalpha(), but I don't follow how the "new" tokens list gets rid of non-alphabetic words in a single statement like this. Can anyone please explain?

This is a list comprehension. It is a compact way of writing this:

temp = []
for word in tokens:
    if word.isalpha():
        temp.append(word)
tokens = temp

tokens = [...] says the resulting list is assigned to "tokens".
[word for word in tokens ...] says the list is built from the words in the original "tokens".
if word.isalpha() says to include only the words for which isalpha() returns True, that is, words that are non-empty and consist entirely of letters.
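
To see that the two forms do the same thing, here is a small sketch you can run. The sample words are made up for illustration and are not from the original document; "temp" and "filtered" are just names chosen for the comparison.

```python
# Made-up sample tokens (an assumption for illustration only)
tokens = ["hello", "world42", "", "its", "2019", "NLP"]

# Loop form: build a new list, keeping only the words that pass the test
temp = []
for word in tokens:
    if word.isalpha():
        temp.append(word)

# Comprehension form: the same filtering written as one expression
filtered = [word for word in tokens if word.isalpha()]

print(temp == filtered)  # True
print(filtered)          # ['hello', 'its', 'NLP']
```

Note that the comprehension builds a brand-new list. The original list is not modified in place; assigning the result back to "tokens" simply makes that name refer to the new list.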

ateestructural

deanhystad