Python Forum
Text classification
#1
Hi, I'm exploring machine learning further and would like to build a classifier for sentiment analysis of tweets using supervised text classification. The data source I'm using comprises over 1.6 million tweets, which I've reduced to 250,000. I'm hoping to go through each step of the process: pre-processing, data cleansing, feature extraction, class labels, training (Naive Bayes), and testing. I've been reading up but I'm not sure where to start.

The original data set is a single CSV file with three columns: "userID", "Sentiment" (0 for negative and 1 for positive), and the "tweet" text itself. I have it open and working in my IDE as a list-of-lists structure.

The first problem I'm having is cleaning the tweets column so I can properly tokenize it and create a bag of words for term frequencies, labeling, etc. (I'm not even sure I have that part right.) I'm trying to run a regular expression on the tweets, but it extracts that column out on its own, which I don't think is what I want. I want to keep all three columns together, right? But then how do I process the strings inside each inner list? Should I even be doing this, or am I starting out all wrong? I'm really hoping someone could point me in the right direction on how to go about all of this.

import re

clean = []
for row in annotated_data[:10]:
    # row = [userID, sentiment, tweet]; tokenize only the tweet text
    m = re.findall(r'''\d+(?:,\d*)*(?:\.\d*)?|\w+(?:[-/]\w+)?|[^\s]+''', row[2])
    clean.append(m)
print(clean)
Printout of a portion of the untouched original data:
Output:
[['1', '0', '                     is so sad for my APL friend.............'], ['2', '0', '                   I missed the New Moon trailer...'], ['3', '1', '              omg its already 7:30 :O'], ['4', '0', "          .. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)..."], ['5', '0', '         i think mi bf is cheating on me!!!       T_T'], ['6', '0', '         or i just worry too much?        '], ['7', '1', '       Juuuuuuuuuuuuuuuuussssst Chillin!!'], ['8', '0', '       Sunny Again        Work Tomorrow  :-|       TV Tonight'], ['9', '1', '      handed in my uniform today . i miss you already'], ['10', '1', '      hmmmm.... i wonder how she my number @-)'], ['11', '0', '      I must think about positive..'], ['12', '1', '      thanks to all the haters up in my face all day! 112-102'], ['13', '0', '      this weekend has sucked so far'], ['14', '0', '     jb isnt showing in australia any more!'], ['15', '0', '     ok thats it you win.'], ['16', '0', '    <-------- This is the way i feel right now...'], ['17', '0', "    awhhe man.... I'm completely useless rt now. Funny, all I can do is twitter. http://myloc.me/27HX"]]
Output after applying the regular expression: the tokens look better, but the other two columns are gone.
Output:
[['is', 'so', 'sad', 'for', 'my', 'APL', 'friend', '.............'], ['I', 'missed', 'the', 'New', 'Moon', 'trailer', '...'], ['omg', 'its', 'already', '7', ':30', ':O'], ['..', 'Omgaga', '.', 'Im', 'sooo', 'im', 'gunna', 'CRy', '.', 'I', "'ve", 'been', 'at', 'this', 'dentist', 'since', '11.', '.', 'I', 'was', 'suposed', '2', 'just', 'get', 'a', 'crown', 'put', 'on', '(30mins)...'], ['i', 'think', 'mi', 'bf', 'is', 'cheating', 'on', 'me', '!!!', 'T_T'], ['or', 'i', 'just', 'worry', 'too', 'much', '?'], ['Juuuuuuuuuuuuuuuuussssst', 'Chillin', '!!'], ['Sunny', 'Again', 'Work', 'Tomorrow', ':-|', 'TV', 'Tonight'], ['handed', 'in', 'my', 'uniform', 'today', '.', 'i', 'miss', 'you', 'already'], ['hmmmm', '....', 'i', 'wonder', 'how', 'she', 'my', 'number', '@-)'], ['I', 'must', 'think', 'about', 'positive', '..'], ['thanks', 'to', 'all', 'the', 'haters', 'up', 'in', 'my', 'face', 'all', 'day', '!', '112', '-102'], ['this', 'weekend', 'has', 'sucked', 'so', 'far'], ['jb', 'isnt', 'showing', 'in', 'australia', 'any', 'more', '!'], ['ok', 'thats', 'it', 'you', 'win', '.'], ['<--------', 'This', 'is', 'the', 'way', 'i', 'feel', 'right', 'now', '...'], ['awhhe', 'man', '....', 'I', "'m", 'completely', 'useless', 'rt', 'now', '.', 'Funny', ',', 'all', 'I', 'can', 'do', 'is', 'twitter', '.', 'http', '://myloc.me/27HX']]
Reply
#2
The first two columns are gone because you loop over the elements of your list of lists and then process only the third element of each row with the regex, by supplying row[2] as the second argument to re.findall.

If the CSV file is large, I would iterate over it and read it row by row instead of reading everything into memory.
Also check out NLTK instead of using a regex to tokenize and analyse the tweets.
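A minimal sketch of both suggestions, streaming rows with the stdlib csv module. The sample string stands in for the real file (its name and path are unknown here), and NLTK's TweetTokenizer is shown only in a comment since it is a third-party dependency:

```python
import csv
import io

# A couple of rows in the same userID,sentiment,tweet shape as the dataset
# (a stand-in for the real file, which you would pass to open() instead).
sample = "1,0,is so sad for my APL friend\n3,1,omg its already 7:30\n"

# Reading row by row keeps memory usage flat even with 250k+ rows.
# With the real file:
#     with open("tweets.csv", newline="", encoding="utf-8") as f:
#         for user_id, sentiment, tweet in csv.reader(f): ...
for user_id, sentiment, tweet in csv.reader(io.StringIO(sample)):
    # NLTK's TweetTokenizer copes with emoticons, @handles and elongated words:
    #     from nltk.tokenize import TweetTokenizer
    #     tokens = TweetTokenizer().tokenize(tweet)
    tokens = tweet.strip().split()  # simple stand-in tokenizer
    print(user_id, sentiment, tokens)
```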
Reply
#3
If you want the user ID and sentiment together in the one list, you can modify your code so they are added to the clean list:
m = row[:2] + [re.findall(r'''\d+(?:,\d*)*(?:\.\d*)?|\w+(?:[-/]\w+)?|[^\s]+''', row[2])]
And when you are done deleting rows, you can keep sentiment and user ID in separate lists/arrays, process your words, and concatenate them back together at the end (if needed).
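Putting that together, a runnable sketch (the toy annotated_data rows are placeholders for your real list of lists): it keeps userID and sentiment alongside the token list, then splits labels and documents back out, with a Counter giving per-tweet term frequencies for the bag of words:

```python
import re
from collections import Counter

pattern = re.compile(r'''\d+(?:,\d*)*(?:\.\d*)?|\w+(?:[-/]\w+)?|[^\s]+''')

# Toy rows in the same [userID, sentiment, tweet] shape as the dataset.
annotated_data = [
    ['1', '0', 'is so sad for my APL friend'],
    ['3', '1', 'omg its already 7:30'],
]

# Keep userID and sentiment, replace the raw tweet with its token list.
clean = [row[:2] + [pattern.findall(row[2])] for row in annotated_data]

# After any row filtering, split back out for training:
labels = [row[1] for row in clean]        # class labels ("0"/"1")
documents = [row[2] for row in clean]     # token lists, one per tweet

# Per-tweet term frequencies (the bag-of-words counts for Naive Bayes).
term_freqs = [Counter(doc) for doc in documents]
```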
Reply


