Python Forum
Keras: tweets classification
Hello dear forum members,

I have a data set of 20 million randomly collected individual tweets (no two tweets come from the same account). I will refer to it as the "general" data set. I also have a "specific" data set of 100,000 tweets collected from drug (opioid) abusers. Each tweet in the "specific" data set has at least one tag associated with it, e.g., opioids, addiction, overdose, hydrocodone, etc. (max 25 tags). My goal is to train a Keras model on the "specific" data set and then use it to tag tweets in the "general" data set, in order to identify tweets that might have been written by drug abusers.

Following the examples in source1 and source2, I managed to build a simple working version of such a model:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix  # not used yet, see question 4 below
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.preprocessing import text

# load the opioid-specific data set, where post is a tweet and tags is a single tag associated with a tweet
# how would I include multiple tags to be used in training?
data = pd.read_csv("filename.csv")

# make sure columns are strings (before splitting, so the slices below are affected)
data['post'] = data['post'].astype(str)
data['tags'] = data['tags'].astype(str)

# 80/20 train/test split in file order
train_size = int(len(data) * .8)
train_posts = data['post'][:train_size]
train_tags = data['tags'][:train_size]
test_posts = data['post'][train_size:]
test_tags = data['tags'][train_size:]

# tokenize tweets
vocab_size = 100000 # what does vocabulary size really mean?
tokenize = text.Tokenizer(num_words=vocab_size)
tokenize.fit_on_texts(train_posts)
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)

# labeling
# is this where I add more columns with tags for training?
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)

# model building
batch_size = 32
# one output unit per distinct tag, instead of a hard-coded 1865
# (the tutorial used np.max(y_train) + 1 here; what does this +1 really mean?)
num_labels = len(encoder.classes_)
model = Sequential()
model.add(Dense(512, input_shape=(vocab_size,)))
model.add(Activation('relu'))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
# LabelBinarizer produces one-hot rows, so the loss is categorical, not sparse
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=2, verbose=1, validation_split=0.1)

# test prediction accuracy
score = model.evaluate(x_test, y_test, 
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

# make predictions using a test set
text_labels = encoder.classes_
for i in range(min(1000, len(x_test))):
    prediction = model.predict(np.array([x_test[i]]))
    predicted_label = text_labels[np.argmax(prediction[0])]
    print(test_posts.iloc[i][:50], "...")
    print('Actual label: ' + test_tags.iloc[i])
    print("Predicted label: " + predicted_label)
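
The two questions left as comments in the code (what the vocabulary size means, and what the `+ 1` is for) can be illustrated with a rough, library-free sketch. The helper names `build_word_index` and `texts_to_binary_matrix` are made up for illustration and only approximate what `fit_on_texts` and `texts_to_matrix` do; the real tokenizer also strips punctuation, and the alphabetical tie-break between equally frequent words is an assumption of this sketch.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency: index 1 is the most frequent, 0 stays reserved."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # Ties are broken alphabetically here (an assumption of this sketch).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def texts_to_binary_matrix(texts, word_index, num_words):
    """One row per text, one column per word index below num_words."""
    rows = []
    for t in texts:
        row = [0] * num_words
        for word in t.lower().split():
            idx = word_index.get(word)
            # Words ranked at or beyond num_words are silently dropped.
            if idx is not None and idx < num_words:
                row[idx] = 1
        rows.append(row)
    return rows


# Toy tweets (hypothetical data).
tweets = ["pain pills again", "pills pills pills", "need sleep"]
word_index = build_word_index(tweets)
matrix = texts_to_binary_matrix(tweets, word_index, num_words=3)

# Integer class ids run from 0 to K-1, so the number of classes is max id + 1.
class_ids = [0, 2, 1, 2]
num_labels = max(class_ids) + 1
```

With `num_words=3` only the two most frequent words survive, because index 0 is reserved; that is why each row has three columns and the first one is always zero. The `+ 1` idiom only makes sense for integer class ids: `LabelBinarizer` output is one-hot, so `np.max(y_train) + 1` on it would evaluate to 2, whereas `len(encoder.classes_)` gives the number of tags.
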
In order to move forward, I would like to clarify a few things:

1. Let's say all my training tweets have a single tag: opioids. If I then pass the non-tagged tweets through the model, isn't it likely to tag all of them as opioids, since it doesn't know anything else? Should I train on a variety of different tweets/tags instead? Are there any general guidelines for selecting tweets/tags for training?
2. How can I add more columns with tags for training (rather than the single one used in the code above)?
3. Once I train the model and achieve acceptable accuracy, how do I pass non-tagged tweets through it to make predictions?
4. How do I add a confusion matrix?
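
To make questions 2 to 4 concrete, here is a small library-free sketch, assuming each tweet carries a list of tags rather than a single one. The function names, the toy tags, the stand-in probabilities, and the 0.5 threshold are all illustrative assumptions, not part of the code above. In practice scikit-learn's `MultiLabelBinarizer` and `multilabel_confusion_matrix` cover the same ground, and a Keras model for this setup would typically end in a sigmoid layer trained with `binary_crossentropy` instead of softmax.

```python
def fit_tag_index(tag_lists):
    """Map every tag seen in training to a column position."""
    tags = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: col for col, tag in enumerate(tags)}


def to_multi_hot(tag_lists, tag_index):
    """Encode each tweet's tag list as a 0/1 vector; unknown tags are skipped."""
    rows = []
    for tags in tag_lists:
        row = [0] * len(tag_index)
        for tag in tags:
            if tag in tag_index:
                row[tag_index[tag]] = 1
        rows.append(row)
    return rows


def apply_threshold(prob_rows, threshold=0.5):
    """Turn per-tag probabilities into independent 0/1 decisions."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_rows]


def per_tag_confusion(y_true, y_pred, tag_index):
    """Count tp/fp/fn/tn separately for every tag."""
    result = {}
    for tag, col in tag_index.items():
        counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
        for true_row, pred_row in zip(y_true, y_pred):
            t, p = true_row[col], pred_row[col]
            key = ("tp" if p else "fn") if t else ("fp" if p else "tn")
            counts[key] += 1
        result[tag] = counts
    return result


# Toy data: three tweets with one or more tags each (hypothetical).
train_tags = [["opioids", "overdose"], ["addiction"], ["opioids"]]
tag_index = fit_tag_index(train_tags)
y_true = to_multi_hot(train_tags, tag_index)

# Stand-in for model.predict output: one probability per tag per tweet.
fake_probs = [[0.1, 0.9, 0.4], [0.8, 0.2, 0.1], [0.6, 0.7, 0.2]]
y_pred = apply_threshold(fake_probs)
confusion = per_tag_confusion(y_true, y_pred, tag_index)
```

For untagged tweets the same idea applies: they would go through the tokenizer that was fitted on the training posts (`texts_to_matrix`) and then through `model.predict`, with the threshold deciding which tags, if any, are assigned.
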

Any other relevant feedback is also greatly appreciated.

Thanks!