Apr-11-2017, 10:20 AM
Hi, I'm further exploring machine learning and would like to build a classifier for sentiment analysis of tweets using supervised text classification. The data source I'm using contains over 1.6 million tweets, which I've reduced to 250,000. I'm hoping to work through each step of the process: pre-processing, data cleansing, feature extraction, class labelling, training (Naive Bayes), and testing. I've been reading up but I'm not sure where to start.

The original data set is a single CSV file with three columns: "userID", "Sentiment" (0 for negative and 1 for positive), and the "tweet" text itself. I have it loaded in my IDE as a list of lists.

The first problem I'm having is cleaning the tweet column so I can properly tokenize it and build a bag of words for term frequencies, labelling, etc. (I'm not entirely sure I have that part right.) When I run a regular expression over the tweets, it extracts that column out on its own, which I don't think is what I want; I want to keep all three columns together, right? But then how do I process the strings inside each inner list? Should I even be doing this, or am I starting out all wrong? I'm really hoping someone can point me in the right direction.
import re

clean = []
for row in annotated_data[:10]:
    m = re.findall(r'''\d+(?:,\d*)*(?:\.\d*)?|\w+(?:[-/]\w+)?|[^\s]+''', row[2])
    clean.append(m)
print(clean)

Printout of a portion of the untouched original data:
Output:
[['1', '0', ' is so sad for my APL friend.............'], ['2', '0', ' I missed the New Moon trailer...'], ['3', '1', ' omg its already 7:30 :O'], ['4', '0', " .. Omgaga. Im sooo im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)..."], ['5', '0', ' i think mi bf is cheating on me!!! T_T'], ['6', '0', ' or i just worry too much? '], ['7', '1', ' Juuuuuuuuuuuuuuuuussssst Chillin!!'], ['8', '0', ' Sunny Again Work Tomorrow :-| TV Tonight'], ['9', '1', ' handed in my uniform today . i miss you already'], ['10', '1', ' hmmmm.... i wonder how she my number @-)'], ['11', '0', ' I must think about positive..'], ['12', '1', ' thanks to all the haters up in my face all day! 112-102'], ['13', '0', ' this weekend has sucked so far'], ['14', '0', ' jb isnt showing in australia any more!'], ['15', '0', ' ok thats it you win.'], ['16', '0', ' <-------- This is the way i feel right now...'], ['17', '0', " awhhe man.... I'm completely useless rt now. Funny, all I can do is twitter. http://myloc.me/27HX"]]
Printout after the regular expression; it looks better, but the other two columns are gone:
Output:
[['is', 'so', 'sad', 'for', 'my', 'APL', 'friend', '.............'], ['I', 'missed', 'the', 'New', 'Moon', 'trailer', '...'], ['omg', 'its', 'already', '7', ':30', ':O'], ['..', 'Omgaga', '.', 'Im', 'sooo', 'im', 'gunna', 'CRy', '.', 'I', "'ve", 'been', 'at', 'this', 'dentist', 'since', '11.', '.', 'I', 'was', 'suposed', '2', 'just', 'get', 'a', 'crown', 'put', 'on', '(30mins)...'], ['i', 'think', 'mi', 'bf', 'is', 'cheating', 'on', 'me', '!!!', 'T_T'], ['or', 'i', 'just', 'worry', 'too', 'much', '?'], ['Juuuuuuuuuuuuuuuuussssst', 'Chillin', '!!'], ['Sunny', 'Again', 'Work', 'Tomorrow', ':-|', 'TV', 'Tonight'], ['handed', 'in', 'my', 'uniform', 'today', '.', 'i', 'miss', 'you', 'already'], ['hmmmm', '....', 'i', 'wonder', 'how', 'she', 'my', 'number', '@-)'], ['I', 'must', 'think', 'about', 'positive', '..'], ['thanks', 'to', 'all', 'the', 'haters', 'up', 'in', 'my', 'face', 'all', 'day', '!', '112', '-102'], ['this', 'weekend', 'has', 'sucked', 'so', 'far'], ['jb', 'isnt', 'showing', 'in', 'australia', 'any', 'more', '!'], ['ok', 'thats', 'it', 'you', 'win', '.'], ['<--------', 'This', 'is', 'the', 'way', 'i', 'feel', 'right', 'now', '...'], ['awhhe', 'man', '....', 'I', "'m", 'completely', 'useless', 'rt', 'now', '.', 'Funny', ',', 'all', 'I', 'can', 'do', 'is', 'twitter', '.', 'http', '://myloc.me/27HX']]
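One way to keep all three columns while tokenizing is to replace only the tweet field in each row instead of collecting bare token lists. A minimal sketch, assuming `annotated_data` holds rows in the same [id, sentiment, tweet] shape as the CSV data above (the two sample rows here are just copied from that printout):

```python
import re

# Two sample rows in the same [id, sentiment, tweet] shape as the CSV data
annotated_data = [
    ['1', '0', ' is so sad for my APL friend.............'],
    ['3', '1', ' omg its already 7:30 :O'],
]

# Same token pattern as in the snippet above, compiled once for reuse
token_re = re.compile(r'''\d+(?:,\d*)*(?:\.\d*)?|\w+(?:[-/]\w+)?|[^\s]+''')

clean = []
for row in annotated_data:
    tokens = token_re.findall(row[2])        # tokenize only the tweet text
    clean.append([row[0], row[1], tokens])   # keep id and sentiment alongside

print(clean)
```

Each output row then still carries its id and sentiment label next to the token list, which is what the later labelling and training steps need.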
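For the bag-of-words step, one common representation is a term-frequency dict per tweet paired with its class label; `collections.Counter` builds that directly. A minimal sketch, using hypothetical already-tokenized rows in [id, sentiment, tokens] form (NLTK's `NaiveBayesClassifier.train`, for example, accepts a list of such (feature-dict, label) pairs):

```python
from collections import Counter

# Hypothetical tokenized rows in [id, sentiment, tokens] form
rows = [
    ['1', '0', ['is', 'so', 'sad', 'for', 'my', 'apl', 'friend']],
    ['5', '0', ['i', 'think', 'mi', 'bf', 'is', 'cheating', 'on', 'me']],
    ['3', '1', ['omg', 'its', 'already', 'late']],
]

# One (term-frequency dict, label) pair per tweet
labeled_features = [(Counter(tokens), sentiment) for _, sentiment, tokens in rows]

print(labeled_features[0])
```

This is only the feature-extraction sketch; splitting the pairs into training and test sets and feeding them to a Naive Bayes trainer would come after.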