Thread: split and test tweet data - Python Forum (https://python-forum.io), General Coding Help
split and test tweet data - Jmekubo - May-08-2019

Hi guys, I have a Twitter dataset that I want to train and test with NB and SVM. After cleaning and vectorizing, I am stuck on the following:
1. Splitting the data 80/20.
2. Fitting it to a classifier.
Your guidance will be highly appreciated.

# tokenize helper function
def text_process(raw_text):
    # Check punctuation
    nopunc = [char for char in list(raw_text) if char not in string.punctuation]
    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    # remove any stopwords
    return [word for word in nopunc.lower().split() if word.lower() not in stopwords.words('english')]

def remove_words(word_list):
    remove = ['he','and','...','“','”','’','…','and’']
    return [w for w in word_list if w not in remove]

# tokenize message column and create a column for tokens
df_marathon = df_marathon.copy()
df_marathon['tokens'] = df_marathon['Text'].apply(text_process)  # step 1
df_marathon['tokenized_tweet'] = df_marathon['tokens'].apply(remove_words)  # step 2
df_marathon.head(10)

# vectorize
bow_transformer = CountVectorizer(analyzer=text_process).fit(df_marathon['tokenized_tweet'])
# print total number of vocab words
print(len(bow_transformer.vocabulary_))

# example of vectorized text
sample_tweet = df_marathon['tokenized_tweet'][16]
print(sample_tweet)
print('\n')

# vector representation
bow_sample = bow_transformer.transform([sample_tweet])
print(bow_sample)
print('\n')

# transform the entire DataFrame of messages
X = bow_transformer.transform(df_marathon['Text'])
# check out the bag-of-words counts for the entire corpus as a large sparse matrix
print('Shape of Sparse Matrix: ', X.shape)
print('Amount of Non-Zero occurrences: ', X.nnz)

# convert values obtained using the bag of words model into TFIDF values
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X)
X.toarray()

Error on splitting: ValueError
RE: split and test tweet data - michalmonday - May-08-2019

It would be helpful if you posted the whole thing so it could be tested...
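
The two steps asked about in the first post (an 80/20 split, then fitting NB and SVM) could look roughly like the sketch below. It is not the original poster's code: the thread shows neither the tweets nor a label column, so `texts` and `labels` are made-up toy data, and the helper `build_pipeline` is a hypothetical name. It also assumes scikit-learn's default tokenizer rather than the custom `text_process` analyzer, so that it runs without NLTK. The split is done on the raw texts and labels together, which keeps both the same length, and the vectorizer is then fit on the training part only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for df_marathon['Text'] and a label column.
texts = [
    "great race today, loved the marathon",
    "amazing crowd and perfect weather",
    "so proud of every runner out there",
    "what a fantastic finish line party",
    "best marathon I have ever run",
    "terrible traffic because of the marathon",
    "roads closed all day, so annoying",
    "worst organisation, no water at the stops",
    "stuck for hours, hate this event",
    "awful delays and rude marshals",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = positive, 0 = negative (assumed)

# 80/20 split. Texts and labels are split together so they stay aligned;
# stratify keeps the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)


def build_pipeline(classifier):
    """Bag of words -> TF-IDF -> classifier, fitted as one unit."""
    return Pipeline([
        ("bow", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", classifier),
    ])


models = {
    "NB": build_pipeline(MultinomialNB()),
    "SVM": build_pipeline(LinearSVC()),
}

scores = {}
for name, model in models.items():
    # fit() learns the vocabulary and IDF weights from the training texts only
    model.fit(X_train, y_train)
    # score() returns mean accuracy on the held-out 20%
    scores[name] = model.score(X_test, y_test)
    print(name, "accuracy:", scores[name])
```

With a real DataFrame, `texts` would be the text column and `labels` the target column, both with one entry per tweet. To reuse the custom tokenizer from the first post, `CountVectorizer(analyzer=text_process)` could be passed in the pipeline instead, applied to the raw text column.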