@HaomingJiang 2016-07-08

Tweets Analysis

Tweets Text Mining



Tools

R
R packages: RSQLite, tm, stringr, stringi, etc.
Dataset: Twitter US Airline Sentiment
https://www.kaggle.com/crowdflower/twitter-airline-sentiment


Problem

Supervised tweet sentiment analysis.

Challenges: noisy, informal text (URLs, mentions, hashtags, emoticons) and a heavily skewed class distribution (see the Prediction section below).


Workflow

  1. preprocessing
  2. tokenizing
  3. feature selection
  4. train and evaluate classifier

Preprocessing and Constructing Features

Set the encoding to "UTF-8":

  Encoding(texts) <- "UTF-8"

Characteristics of tweets:

url:
It is rubbish information, unless we can fetch and analyze the linked content. Mining the frequencies also seems useless: the most common URL appears only 4 times.
The URLs are shortened URLs (e.g. http://t.co/9xkiy0kq2j), so they can be matched by the regular expression:

  'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-f][0-9a-f]))+'
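A minimal sketch of applying it in R (assuming the texts character vector from the encoding step above; perl = TRUE is needed because the pattern uses (?:...) groups):

  url_pattern <- 'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-f][0-9a-f]))+'
  texts <- gsub(url_pattern, "", texts, perl = TRUE)  # drop the t.co-style short urls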

emoji:
Useful characters.

emoticons ( ':)' ':(' ):
Useful characters.

#hash-tags:
Hashtags can be useful; should each one be extracted as an individual feature? That is probably not feasible: the most common hashtag (#destinationdragons) appears only 78 times among 14,485 records, a frequency of about 0.5%. So I just treat them as normal words.

@mentions:
Useless; basically they mention another airline or another user.
@airline:
It may be useful, but not every tweet has one, and some tweets have two when expressing comparative opinions.
As a result, remove all mentions with the regular expression:

  '@\\w+'

Numbers:
They relate to time (fast: positive, slow: negative), dates (not very useful), flight numbers (not very useful), money (cheap: positive, expensive: negative), etc. As a result, any token attached to a number is removed with the following regular expression:

  "\\S*\\d+\\S*"

Finally, remove unhelpful tokens: stopwords, punctuation (with care, since blanket punctuation removal would damage emoticons such as ":)" and ":("), and meaningless words ('get', 'cant', 'can', 'now', 'just', 'will', 'dont', 'ive', 'got', 'much', 'amp' (the last one is really the HTML entity '&amp;')).

(PS: Stemming is a common technique in NLP (e.g. 'doing', 'did', 'done', 'does' all derive from 'do'), but here it actually damages overall performance. My own tests show it costs about 0.3% accuracy, which agrees with the study at http://sentiment.christopherpotts.net/stemming.html.)
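A sketch of this cleanup with the tm package, using the custom word list above; punctuation removal is deliberately left out so that emoticons like ":)" survive:

  library(tm)
  corpus <- VCorpus(VectorSource(texts))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords,
                   c("get", "cant", "can", "now", "just", "will",
                     "dont", "ive", "got", "much", "amp"))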

Use a tokenizer to tokenize the tweets. Here I extract both bigrams (two-word sequences, so that features like "not good" can be captured) and unigrams (single words):

  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
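To plug it into tm (NGramTokenizer and Weka_control come from the RWeka package), the tokenizer is passed as a control option when building the document-term matrix:

  library(RWeka)
  datmat <- DocumentTermMatrix(corpus,
                               control = list(tokenize = BigramTokenizer))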

Simply use the R function removeSparseTerms to select features by dropping sparse terms:

  datmat <- removeSparseTerms(datmat, 0.999)

(Problem: after removing sparse terms, some tweets' feature vectors are all zero.)
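One possible workaround (not part of the original workflow, just a sketch) is to drop those all-zero rows before training, remembering to subset the labels with the same mask:

  row_totals <- apply(as.matrix(datmat), 1, sum)
  keep <- row_totals > 0
  datmat <- datmat[keep, ]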

How to represent text messages as vectors for learning?

Methods:
TF-IDF, frequencies, and occurrence (if a word appears, the corresponding feature is 1, otherwise 0).
Word occurrence may matter more than word frequency; this conclusion comes from the reference paper and my own attempts (using frequencies costs about 1% accuracy).
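Occurrence vectorization is simply binary weighting, which tm supports directly; a sketch reusing the tokenizer above:

  datmat <- DocumentTermMatrix(corpus,
                               control = list(tokenize = BigramTokenizer,
                                              weighting = weightBin))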

How to evaluate the quality of the extracted features?

See Chapter 6.2 of Mining Text Data (http://charuaggarwal.net/text-content.pdf); the same material also introduces some good feature selection methods.


Prediction

The data is skewed:
positive 2334
negative 9082
neutral 3069

Models:
Naive Bayes, Maximum Entropy, AdaBoost, SVM, etc.

How to tackle the class imbalance issue in social media data?

Oversampling or undersampling?
AdaBoost.NC?

How to evaluate the performance of the predictive models?

Confusion Matrix
Accuracy
G-mean
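G-mean here means the geometric mean of the per-class recalls, which is more robust to class imbalance than plain accuracy. A sketch of computing both metrics from a confusion matrix whose rows are predictions and whose columns are the true labels:

  eval_cm <- function(cm) {
    acc    <- sum(diag(cm)) / sum(cm)    # overall accuracy
    recall <- diag(cm) / colSums(cm)     # recall of each class
    gmean  <- prod(recall)^(1 / length(recall))
    list(accuracy = acc, g_mean = gmean)
  }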


Models

Model (corpus-based sentiment classification)

Description: a crude model without fine parameter tuning
Environment: R
Method: occurrence vectorization & a Naive Bayes classifier
Training set size: 10000
Testing set size: 4485
Result:
Confusion Matrix

prediction   negative   neutral   positive
negative         2272       215         92
neutral           379       571        118
positive          178       136        524

Accuracy: 75.07%, i.e. the diagonal sum (2272 + 571 + 524 = 3367) over the total (4485). This is the highest accuracy I have achieved so far.
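A sketch of the training and evaluation loop, using e1071's naiveBayes as one concrete Naive Bayes implementation (the split indices train_idx and test_idx and the factor vector labels are assumed, not part of the original write-up):

  library(e1071)
  X <- as.matrix(datmat)                         # binary occurrence features
  model <- naiveBayes(X[train_idx, ], labels[train_idx])
  pred  <- predict(model, X[test_idx, ])
  cm    <- table(prediction = pred, truth = labels[test_idx])
  sum(diag(cm)) / sum(cm)                        # accuracy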

Potential improvements (to be done):
1. Handle multiclass imbalance (OOB, UOB, AdaBoost.NC)
2. A better tokenizer (extract emoji and emoticons from tweets). Sentiment-aware tokenizers are already available in Python and have been shown to produce better performance; some of the preprocessing steps can also be folded into the tokenizer.
3. Better features (Latent Semantic Indexing)
4. A better classifier (several sources report that SVM usually outperforms Naive Bayes)


Further Discussion

Besides the classifier-based approach (Naive Bayes) presented above, lexicon-based classification is another basic approach. Some pre-assembled sentiment lexicons are available online (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), and the corpus can also be used to enrich the lexicons.

This week I mainly went through the whole text mining procedure with some exploration. It was not systematic; in the following weeks I want to do it in a more systematic way.

I think I might concentrate on feature extraction first, with classical classifiers (Naive Bayes, SVM, etc.). Next, keeping the same preprocessing, I will handle the imbalance problem with a well-designed classifier (AdaBoost.NC). After that, I want to explore handling the question-specific imbalance problem, which means not merely changing the classifier but designing a whole text mining process including the preprocessing procedure and the classifier. Finally, if possible, I might add the consideration of time, which turns this into an online-learning text mining task.


Reference

https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Danneman, Nathan, and Richard Heimann. Social Media Mining with R. Birmingham, UK: Packt Publishing, 2014.
Mining Text Data (http://charuaggarwal.net/text-content.pdf)
http://sentiment.christopherpotts.net/index.html
