Python Forum
split and test tweet data
#1
Hi guys, I have a Twitter dataset that I want to train and test with Naive Bayes (NB) and SVM classifiers. After cleaning and vectorizing, I am stuck on the following:
1. Splitting the data 80/20 into training and test sets.
2. Fitting the training data to a classifier.

Your guidance will be highly appreciated.

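# imports assumed from the names used below (the original post did not show them)
import string

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# tokenize helper function
def text_process(raw_text):
    # strip punctuation characters, then rejoin the rest into a single string
    nopunc = ''.join(char for char in raw_text if char not in string.punctuation)

    # lowercase, split on whitespace, and drop English stopwords
    # (build the stopword set once per call instead of once per word)
    stop_words = set(stopwords.words('english'))
    return [word for word in nopunc.lower().split() if word not in stop_words]

def remove_words(word_list):
    # extra tokens to drop that the stopword list and string.punctuation miss
    remove = ['he','and','...','“','”','’','…','and’']
    return [w for w in word_list if w not in remove]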

# tokenize message column and create a column for tokens
df_marathon = df_marathon.copy()
df_marathon['tokens'] = df_marathon['Text'].apply(text_process) # step 1
df_marathon['tokenized_tweet'] = df_marathon['tokens'].apply(remove_words) # step 2
df_marathon.head(10)

# vectorize: the analyzer receives raw tweet text, so fit on the 'Text' column
# (fitting on the token lists would make text_process glue the tokens together)
def analyze_tweet(raw_text):
    return remove_words(text_process(raw_text))

bow_transformer = CountVectorizer(analyzer=analyze_tweet).fit(df_marathon['Text'])
# print total number of vocab words
print(len(bow_transformer.vocabulary_))

# example of vectorized text
sample_tweet = df_marathon['Text'][16]
print(analyze_tweet(sample_tweet))
print('\n')
# vector representation
bow_sample = bow_transformer.transform([sample_tweet])
print(bow_sample)
print('\n')

# transform the entire DataFrame of messages
X = bow_transformer.transform(df_marathon['Text'])
# check out the bag-of-words counts for the entire corpus as a large sparse matrix
print('Shape of Sparse Matrix: ', X.shape)
print('Amount of Non-Zero occurrences: ', X.nnz)

#convert values obtained using the bag of words model into TFIDF values
tfidfconverter = TfidfTransformer()  
X = tfidfconverter.fit_transform(X)

X.toarray()
Error on splitting:

Error:
ValueError                                Traceback (most recent call last)
~\Anaconda3\envs\Ipython.display\lib\site-packages\scipy\sparse\csr.py in asindices(x)
    243             if idx_dtype != x.dtype:
--> 244                 x = x.astype(idx_dtype)
    245         except:

ValueError: invalid literal for int() with base 10: 'tokenized_tweet'

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
<ipython-input-37-da320e3c2eee> in <module>
      1 # split data into rain and test, we are creating train set 80% and test set 20%
----> 2 Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(X['tokenized_tweet'],ML_Corpus['label'],test_size=0.2)

~\Anaconda3\envs\Ipython.display\lib\site-packages\scipy\sparse\csr.py in __getitem__(self, key)
    333             row, col = self._index_to_arrays(row, col)
    334
--> 335         row = asindices(row)
    336         col = asindices(col)
    337         if row.shape != col.shape:

~\Anaconda3\envs\Ipython.display\lib\site-packages\scipy\sparse\csr.py in asindices(x)
    244                 x = x.astype(idx_dtype)
    245         except:
--> 246             raise IndexError('invalid index')
    247         else:
    248             return x

IndexError: invalid index

Error on fitting classifier:

ValueError                                Traceback (most recent call last)
<ipython-input-38-6fc83fb90fa2> in <module>
----> 1 classifier = MultinomialNB().fit(X, df_marathon)

~\Anaconda3\envs\Ipython.display\lib\site-packages\sklearn\naive_bayes.py in fit(self, X, y, sample_weight)
    583         self : object
    584         """
--> 585         X, y = check_X_y(X, y, 'csr')
    586         _, n_features = X.shape
    587

~\Anaconda3\envs\Ipython.display\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    759                         dtype=None)
    760     else:
--> 761         y = column_or_1d(y, warn=True)
    762         _assert_all_finite(y)
    763     if y_numeric and y.dtype.kind == 'O':

~\Anaconda3\envs\Ipython.display\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
    795         return np.ravel(y)
    796
--> 797     raise ValueError("bad input shape {0}".format(shape))
    798
    799

ValueError: bad input shape (200, 3)
#2
It would be helpful if you posted the whole thing so it could be tested...
Error:
NameError: name 'df_marathon' is not defined
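
That said, going only by the two tracebacks above, a couple of things stand out. X is a SciPy sparse matrix, so it has no named columns and X['tokenized_tweet'] cannot work; the matrix itself is what gets split. And fit() was handed the whole DataFrame as y (hence the bad input shape (200, 3)), where it expects a single 1-D column of labels. Here is a minimal sketch of the split-and-fit step under those assumptions. The toy DataFrame and its 'Text' and 'label' columns are hypothetical stand-ins for your data, and LinearSVC is just one possible choice of SVM:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical stand-in for df_marathon: one text column, one 1-D label column.
df = pd.DataFrame({
    "Text": [
        "great race today", "loved the marathon", "amazing crowd support",
        "best run ever", "so proud of the runners",
        "terrible weather ruined it", "worst organisation ever",
        "roads closed all day, awful", "hated the long queues",
        "bad signage and no water",
    ],
    "label": ["pos"] * 5 + ["neg"] * 5,
})

# Bag-of-words counts, then TF-IDF weighting, as in the original post.
counts = CountVectorizer().fit_transform(df["Text"])
X = TfidfTransformer().fit_transform(counts)  # sparse matrix, one row per tweet
y = df["label"]                               # 1-D labels, same length as X

# Split the matrix itself (not X['some_column']) together with the labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit both classifiers on the training split only.
nb = MultinomialNB().fit(X_train, y_train)
svm = LinearSVC().fit(X_train, y_train)

print("NB accuracy: ", nb.score(X_test, y_test))
print("SVM accuracy:", svm.score(X_test, y_test))
```

One caveat: vectorizing before splitting lets the test tweets influence the vocabulary and the IDF weights. Splitting the raw text first and fitting the vectorizer on the training part only (for example inside a sklearn Pipeline) avoids that.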