Error on Python Version? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Error on Python Version? (/thread-31391.html)
Error on Python Version? - ErnestTBass - Dec-08-2020

    # for the SGD classifier the data must be numerically encoded, not dicts
    from sklearn.linear_model import SGDClassifier
    clf = SGDClassifier()
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    valid_score = clf.score(X_valid, y_valid)

I get the following error. I believe that this is an error from using the wrong version of Python. I use Python 3.8.3 on Windows 10. I am not sure how to fix it. Any help appreciated. Thanks in advance.

Respectfully,
ErnestTBass

RE: Error on Python Version? - perfringo - Dec-08-2020

The error message says that the float() argument must be a string or a number, and this holds for all Python versions. From the float() documentation:

Quote: Return a floating point number constructed from a number or string x.

It can easily be demonstrated in Python code:

    >>> float('42')
    42.0
    >>> float(42)
    42.0
    >>> float({42: 0})
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: float() argument must be a string or a number, not 'dict'

RE: Error on Python Version? - ErnestTBass - Dec-08-2020

Okay, I am sure that you are right. In my case it seems that X_train and y_train (or both) are numbers. How did they become dicts? That error just makes no sense. It seems that somehow they went from float or int to dict. But where did this happen? I believe that casting both of them to floats should work, should it not? It seems this is the solution. Any help appreciated. Thanks in advance.

Respectfully,
ErnestTBass

RE: Error on Python Version? - bowlofred - Dec-08-2020

No, you can't cast a dict to a float. Right before your call of the function, add some info to see what the variables are. Find out if the problem is in your code.

    x_train = 42
    y_train = {"a": 1}
    print(f"x_train is of type {type(x_train)} and y_train is of type {type(y_train)}")

RE: Error on Python Version? - deanhystad - Dec-08-2020

What type are you supposed to pass? I thought fit and score expected 2D arrays.
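A minimal sketch of why the type check above can be misleading, using made-up data shaped like the X_train built later in this thread: type() only reports the outer container, so X_train can report as a list while the innermost elements that scikit-learn actually tries to convert to float are dicts.

```python
# type() reports only the outermost container; the TypeError names the
# innermost object that float() rejects. Made-up data shaped like this
# thread's X_train: a list of one-element lists, each holding a token dict.
X_train = [[{"The": True, "cat": True}], [{"cute": True}]]

print(type(X_train))        # <class 'list'>
print(type(X_train[0]))     # <class 'list'>
print(type(X_train[0][0]))  # <class 'dict'>  <- this is what float() rejects
```

So both checks "X_train is a list" and "the error says dict" can be true at the same time: the dict is nested inside the list.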
RE: Error on Python Version? - ErnestTBass - Dec-09-2020

I put the print statement that you gave me directly in the code:

    # for the SGD classifier the data must be numerically encoded, not dicts
    from sklearn.linear_model import SGDClassifier
    clf = SGDClassifier()
    print(f"X_train is of type {type(X_train)} and y_train is of type {type(y_train)}")
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    valid_score = clf.score(X_valid, y_valid)

It produced an error, as you can see. It said both X_train and y_train are lists. That is exactly what I want them to be. Still, the error insists that at least one of them is a dict. I assume this means dictionary. If both are lists, what happened to make at least one a dict? I can post all the code that precedes the error. I am really confused and have no idea how it came to call one of them a dict. Any help appreciated. Thanks in advance.

Respectfully,
ErnestTBass

I can post a screenshot of the printout of the line

    print(f"X_train is of type {type(X_train)} and y_train is of type {type(y_train)}")

I am just not sure how to do it.

RE: Error on Python Version? - ErnestTBass - Dec-09-2020

Okay, to put things in context, I am posting the code for the program. You can see where it fails.
    #!/usr/bin/env python
    # coding: utf-8

    # https://opendatascience.com/intro-to-natural-language-processing/

    # In[ ]:

    #!pip install nltk
    import nltk
    # to work around a bug, otherwise it raises an error
    nltk.download('punkt')

    # tokenizer
    def format_sentence(sent):
        return {word: True for word in nltk.word_tokenize(sent)}

    # #Tweets

    # In[ ]:

    print(nltk.word_tokenize("The cat is very cute"))

    # ##X_train, y_train, X_test, y_test

    # In[ ]:

    # X + y
    # if we call this outside of this cell it doesn't work
    total = open('pos_tweets.txt')

    X_pos = list()
    y_pos = list()
    # word tokenization
    for sentence in total:
        #print(sentence)
        X_pos.append([format_sentence(sentence)])
        y_pos.append(0)
        # saves the sentence in format: [{tokenized sentence}]
    #X_pos

    # In[ ]:

    # X + y
    # if we call this outside of this cell it doesn't work
    total = open('neg_tweets.txt')

    X_neg = list()
    y_neg = list()
    # word tokenization
    for sentence in total:
        #print(sentence)
        X_neg.append([format_sentence(sentence)])
        y_neg.append(1)
        # saves the sentence in format: [{tokenized sentence}]
    #X_neg

    # In[ ]:

    X_pos[0]

    # In[ ]:

    X = X_pos + X_neg
    y = y_pos + y_neg
    print(len(X), len(y))

    # In[ ]:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    print(len(X_train), len(X_test), len(y_train), len(y_test))

    # In[ ]:

    # we can use Embedding layers
    # we can use a ML algorithm that takes X_train, y_train, X_test, y_test

    # In[ ]:

    # for the SGD classifier the data must be numerically encoded, not dicts
    from sklearn.linear_model import SGDClassifier
    clf = SGDClassifier()
    print(f"X_train is of type {type(X_train)} and y_train is of type {type(y_train)}")
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    valid_score = clf.score(X_valid, y_valid)

    # ##Xy_train, Xy_test

    # In[ ]:

    # X + y
    # if we call this outside of this cell it doesn't work
    total = open('/content/drive/My Drive/Colab Notebooks/Projects/20200602_Twitter_Sentiment_Analysis/pos_tweets.txt')

    Xy_pos = list()
    # word tokenization
    for sentence in total:
        #print(sentence)
        Xy_pos.append([format_sentence(sentence), 'pos'])
        # saves the sentence in format: [{tokenized sentence}, 'pos']
    #Xy_pos

    # In[ ]:

    # X + y
    # if we call this outside of this cell it doesn't work
    total = open('/content/drive/My Drive/Colab Notebooks/Projects/20200602_Twitter_Sentiment_Analysis/neg_tweets.txt')

    Xy_neg = list()
    # word tokenization
    for sentence in total:
        #print(sentence)
        Xy_neg.append([format_sentence(sentence), 'neg'])
        # saves the sentence in format: [{tokenized sentence}, 'neg']
    #Xy_neg

    # In[ ]:

    len(Xy_neg)

    # In[ ]:

    Xy_pos[0]

    # In[ ]:

    def split(pos, neg, ratio):
        train = pos[:int((1 - ratio) * len(pos))] + neg[:int((1 - ratio) * len(neg))]
        test = pos[int((1 - ratio) * len(pos)):] + neg[int((1 - ratio) * len(neg)):]
        return train, test

    Xy_train, Xy_test = split(Xy_pos, Xy_neg, 0.1)

    # In[ ]:

    from nltk.classify import NaiveBayesClassifier
    # encoded through dictionaries
    classifier = NaiveBayesClassifier.train(Xy_train)
    classifier.show_most_informative_features()

    # In[ ]:

    example2 = "beautiful"
    print(classifier.classify(format_sentence(example2)))

    # In[ ]:

    from nltk.classify.util import accuracy
    print(accuracy(classifier, Xy_test))

    # ##Movies

    # In[ ]:

    import pandas as pd
    total = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Projects/20200602_Twitter_Sentiment_Analysis/movie_review.csv')
    total

    # In[ ]:

    total_positive = total.copy()
    total_positive.columns
    total_positive = total_positive.loc[total_positive['tag'] == 'pos']
    #total_positive = total_positive.pop('text')
    total_positive = total_positive.drop(['fold_id', 'cv_tag', 'html_id', 'sent_id'], axis=1)
    total_positive

    # In[ ]:

    total_negative = total.copy()
    total_negative.columns
    total_negative = total_negative.loc[total_negative['tag'] == 'neg']
    #total_negative = total_negative.pop('text')
    total_negative = total_negative.drop(['fold_id', 'cv_tag', 'html_id', 'sent_id'], axis=1)
    total_negative

    # In[ ]:

    format_sentence('how are you')

    # In[ ]:

    # tokenizer
    # input: series, ?lists?
    def create_dict(total_positive, total_negative):
        positive_reviews = list()
        # word tokenization
        for sentence in list(total_positive.values):
            positive_reviews.append([format_sentence(sentence[0]), 'pos'])
            # saves the sentence in format: [{tokenized sentence}, 'pos']

        negative_reviews = list()
        # word tokenization
        for sentence in list(total_negative.values):
            #print(sentence)
            negative_reviews.append([format_sentence(sentence[0]), 'neg'])
            # saves the sentence in format: [{tokenized sentence}, 'neg']

        return positive_reviews, negative_reviews

    positive_reviews, negative_reviews = create_dict(total_positive, total_negative)

    # In[ ]:

    X = pd.concat([total_positive, total_negative], axis=0)
    X.columns = ['text', 'sentiment']
    X

    # In[ ]:

    import seaborn as sns
    sns.countplot(x='sentiment', data=X)

    # In[ ]:

    y = pd.DataFrame(X.pop('sentiment'))
    y

    # In[ ]:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

    # In[ ]:

    X_train['text'][0]

    # In[ ]:

    ## del?
    import re

    def preprocess_text(sen):
        # Removing html tags
        sentence = remove_tags(sen)
        # Remove punctuations and numbers
        sentence = re.sub('[^a-zA-Z]', ' ', sentence)
        # Single character removal
        sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
        # Removing multiple spaces
        sentence = re.sub(r'\s+', ' ', sentence)
        return sentence

    TAG_RE = re.compile(r'<[^>]+>')

    # replaces anything between <> with an empty string
    def remove_tags(text):
        return TAG_RE.sub('', text)

    # In[ ]:

    from keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(X_train)

    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)

    # In[ ]:

    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts('come stai')
    tokenizer.texts_to_sequences('come stai')

    # In[ ]:

    #del?
    X = []
    sentences = list(movie_reviews['review'])
    for sen in sentences:
        X.append(preprocess_text(sen))

    # In[ ]:

    print(len(positive_reviews))
    print(len(negative_reviews))

    # In[ ]:

    train = positive_reviews[:int(.9 * len(positive_reviews))] + negative_reviews[:int(.9 * len(negative_reviews))]
    test = positive_reviews[int(.9 * len(positive_reviews)):] + negative_reviews[int(.9 * len(negative_reviews)):]
    print(len(train), len(test))

    # In[ ]:

    print(train[0])

    # In[ ]:

    from nltk.classify import NaiveBayesClassifier
    classifier = NaiveBayesClassifier.train(train)
    classifier.show_most_informative_features()

    # In[ ]:

    example2 = "mulan"
    print(classifier.classify(format_sentence(example2)))

    # In[ ]:

    from nltk.classify.util import accuracy
    print(accuracy(classifier, test))

    # In[ ]:

    get_ipython().system('python -V')

    # In[ ]:

I cannot understand where it gets the error. Both X_train and y_train are lists, as the code says, so where does it get dict from? Any help appreciated. Thanks in advance.

Respectfully,
ErnestTBass
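For completeness, one way the dict-encoded features could be made to work with SGDClassifier is scikit-learn's DictVectorizer, which turns token dicts into a numeric matrix. This is a minimal sketch with toy data standing in for format_sentence() output, not the thread's tweet files:

```python
# Sketch: convert token dicts to a numeric matrix before SGDClassifier.
# The toy dicts below are made up; real ones would come from format_sentence().
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X = [{"good": True, "movie": True}, {"bad": True, "movie": True},
     {"great": True, "film": True}, {"awful": True, "film": True}]
y = [0, 1, 0, 1]  # 0 = positive, 1 = negative, matching the thread's labels

vec = DictVectorizer()
X_num = vec.fit_transform(X)  # sparse numeric matrix, one column per token

X_train, X_test, y_train, y_test = train_test_split(
    X_num, y, test_size=0.5, stratify=y, random_state=0)

clf = SGDClassifier(random_state=0)
clf.fit(X_train, y_train)  # works: the input is numeric now
print(clf.score(X_train, y_train))
```

Note this assumes each sample is the dict itself, i.e. appending format_sentence(sentence) rather than [format_sentence(sentence)]; with the extra list wrapper, DictVectorizer would reject the input for the same reason SGDClassifier does.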