**Doc2Vec Sentiment Analysis Tutorial on the Airline Tweets**

@HaomingJiang 2016-08-12T14:29:41.000000Z

Tweets Textmining Doc2Vec

This tutorial walks through the procedure for using doc2vec to do sentiment analysis on airline tweets. Although the results are not especially strong, you can still learn the full sentiment analysis procedure with Gensim Doc2Vec from it.

The main procedure is:
1. Text Preprocessing
2. Building Doc2Vec Model
3. Building Sentiment Classifier

Data Preparation

The airline tweets data can be collected from Kaggle. It is a small corpus, and a larger data set is needed to get good document vectors. A larger data set, which contains about 55,000 airline tweets, can be downloaded here. In addition, I collected about 55,000 unlabeled tweets from the Twitter API. They can be downloaded here.

PS: Here I renamed the labeled data to "Tweets_NAg.csv" for simplicity.

Text Preprocessing

A good introduction to preprocessing tweet data can be found here. The main purpose of this step is to split the original tweet into meaningful word units (tokens). Starting from the provided sentiment tokenizer code, I made a slight modification that normalizes URLs and usernames. The file you need is ReadACleanT.py

Put the files in the same folder. Let's load and clean the data.
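As a rough illustration of what normalizing URLs and usernames can look like, here is a minimal sketch. It is not the contents of ReadACleanT.py: the function name `normalize_tweet`, the placeholder tokens `URL` and `USERNAME`, and the regular expressions are all assumptions made for this example, and it does not handle emoticons the way a dedicated sentiment tokenizer would.

```python
import re

# Hypothetical placeholder tokens; the author's ReadACleanT.py may use different ones.
URL_TOKEN = "URL"
USER_TOKEN = "USERNAME"

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
# Words (optionally a hashtag or with an apostrophe part) or single punctuation marks.
TOKEN_RE = re.compile(r"#?\w+(?:'\w+)?|[^\w\s]")

def normalize_tweet(text):
    """Replace URLs and @mentions with placeholders, tokenize, and lowercase the rest."""
    text = URL_RE.sub(URL_TOKEN, text)
    text = USER_RE.sub(USER_TOKEN, text)
    return [tok if tok in (URL_TOKEN, USER_TOKEN) else tok.lower()
            for tok in TOKEN_RE.findall(text)]

tokens = normalize_tweet("@united Thanks! See http://t.co/abc")
```

Collapsing every link and every mention into one shared token keeps the vocabulary small, so the model does not spend capacity on thousands of one-off strings.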

from ReadACleanT import clean_tweet
import pandas as pd
import random
# Labeled tweets: keep only the sentiment label and the raw text
df = pd.read_csv('Tweets_NAg.csv')
df = df[[u'airline_sentiment',u'text']].copy()
df.loc[:,'text'] = df.loc[:,'text'].map(clean_tweet)
# Unlabeled tweets (adjust the path to wherever you saved the file)
udf = pd.read_csv('../data/Tweets_Unlabeled.csv')
udf = udf[[u'text']].copy()
udf.loc[:,'text'] = udf.loc[:,'text'].map(clean_tweet)

Constructing Doc2Vec Model

First we need to turn our data into TaggedDocument form, which is the input form of Doc2Vec.

from gensim.models.doc2vec import TaggedDocument,Doc2Vec
# Use len() so the counts are integers on both Python 2 and Python 3
TotalNum = len(df)
TotalNum_Unlabed = len(udf)
TestNum = 3000
TrainNum = TotalNum-TestNum
# Labeled tweets get tags 0..TotalNum-1; unlabeled tweets continue from TotalNum
documents = [TaggedDocument(list(df.loc[i,'text']),[i]) for i in range(0,TotalNum)]
documents_unlabeled = [TaggedDocument(list(udf.loc[i,'text']),[i+TotalNum]) for i in range(0,TotalNum_Unlabed)]
documents_all = documents+documents_unlabeled
# list(...) is needed on Python 3, where a range object cannot be shuffled in place
Doc2VecTrainID = list(range(0,TotalNum+TotalNum_Unlabed))
random.shuffle(Doc2VecTrainID)
trainDoc = [documents_all[id] for id in Doc2VecTrainID]
Labels = df.loc[:,'airline_sentiment']
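The tagging scheme used above can be sketched without gensim. The snippet below uses a stand-in namedtuple with the same two fields (`words` and `tags`) as gensim's `TaggedDocument`; the helper name `tag_corpus` and the toy token lists are assumptions made for this illustration.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument, which is also a
# namedtuple with the fields `words` and `tags`.
TaggedDocument = namedtuple("TaggedDocument", "words tags")

def tag_corpus(labeled_tokens, unlabeled_tokens):
    """Tag labeled tweets 0..N-1 and unlabeled tweets N..N+M-1."""
    n_labeled = len(labeled_tokens)
    labeled = [TaggedDocument(words, [i])
               for i, words in enumerate(labeled_tokens)]
    unlabeled = [TaggedDocument(words, [n_labeled + j])
                 for j, words in enumerate(unlabeled_tokens)]
    return labeled + unlabeled

# Toy data: two labeled tweets and one unlabeled tweet.
docs = tag_corpus([["late", "flight"], ["great", "crew"]], [["boarding", "now"]])
```

Because a labeled tweet's tag equals its row index in the labeled data frame, the vector learned for tag `i` can later be paired with the sentiment label in row `i`.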

After that we can construct the Doc2Vec model. Following the paper, I train two separate models (i.e. DM and DBOW). Later I will concatenate the two vectors into one. For the model parameter settings, you can refer to the Gensim documentation.

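import multiprocessing
cores = multiprocessing.cpu_count()
# dm=1 trains the distributed memory (DM) model; dm=0 trains distributed bag of words (DBOW).
# `size` is the vector dimensionality; gensim 4.0 renamed this argument to `vector_size`.
model_DM = Doc2Vec(size=400, window=8, min_count=1, sample=1e-4, negative=5, workers=cores,  dm=1, dm_concat=1 )
model_DBOW = Doc2Vec(size=400, window=8, min_count=1, sample=1e-4, negative=5, workers=cores, dm=0)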

Next, we build the vocabulary for both models.

model_DM.build_vocab(trainDoc)
model_DBOW.build_vocab(trainDoc)

Now we are ready to train the Doc2Vec models. Here I use both the testing data and the training data to learn the vectors. This may seem to violate the normal process, but this step is unsupervised: we don't use the true labels of the testing set.
Training takes a long time, so this is a good moment for a break.

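# Reshuffle the documents before every pass so the models see them in a new order.
# Note: newer gensim releases require explicit total_examples and epochs arguments to train().
for it in range(0,10):
    random.shuffle(Doc2VecTrainID)
    trainDoc = [documents_all[id] for id in Doc2VecTrainID]
    model_DM.train(trainDoc)
    model_DBOW.train(trainDoc)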

Building Classifier

Here I simply use softmax regression (multinomial logistic regression) to build the classifier. A neural network or an SVM may achieve higher performance. Try them if you want.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import statsmodels.api as sm
# Fixed seed so the train/test split is reproducible
random.seed(1212)
newindex = random.sample(range(0,TotalNum),TotalNum)
# The last TestNum shuffled IDs form the test set; the rest form the training set
testID = newindex[-TestNum:]
trainID = newindex[:-TestNum]
# Each feature row is the DM vector concatenated with the DBOW vector of the same tweet
train_targets, train_regressors = zip(*[(Labels[id], list(model_DM.docvecs[id])+list(model_DBOW.docvecs[id])) for id in trainID])
train_regressors = sm.add_constant(train_regressors)
predictor = LogisticRegression(multi_class='multinomial',solver='lbfgs')
predictor.fit(train_regressors,train_targets)

Let's use G-mean (Geometric mean of recalls) and accuracy to evaluate the model.

accus=[]
Gmeans=[]
test_regressors = [list(model_DM.docvecs[id])+list(model_DBOW.docvecs[id]) for id in testID]
test_regressors = sm.add_constant(test_regressors)
test_predictions = predictor.predict(test_regressors)
# Initialize the counter of correct test predictions before the loop
accu=0
for i in range(0,TestNum):
    if test_predictions[i]==df.loc[testID[i],u'airline_sentiment']:
        accu=accu+1
accus=accus+[1.0*accu/TestNum]
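# Predictions are passed first, so rows are predicted classes and columns are true classes;
# each recall below is a diagonal entry divided by its column sum.
confusionM = confusion_matrix(test_predictions,(df.loc[testID,u'airline_sentiment']))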
Gmeans=Gmeans+[pow(((1.0*confusionM[0,0]/(confusionM[1,0]+confusionM[2,0]+confusionM[0,0]))*(1.0*confusionM[1,1]/(confusionM[1,1]+confusionM[2,1]+confusionM[0,1]))*(1.0*confusionM[2,2]/(confusionM[1,2]+confusionM[2,2]+confusionM[0,2]))), 1.0/3)]
train_predictions = predictor.predict(train_regressors)
accu=0
for i in range(0,len(train_targets)):
    if train_predictions[i]==train_targets[i]:
        accu=accu+1
accus=accus+[1.0*accu/len(train_targets)]
confusionM = confusion_matrix(train_predictions,train_targets)
Gmeans=Gmeans+[pow(((1.0*confusionM[0,0]/(confusionM[1,0]+confusionM[2,0]+confusionM[0,0]))*(1.0*confusionM[1,1]/(confusionM[1,1]+confusionM[2,1]+confusionM[0,1]))*(1.0*confusionM[2,2]/(confusionM[1,2]+confusionM[2,2]+confusionM[0,2]))), 1.0/3)]
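The two metrics can also be computed directly from the label lists. The following is a minimal sketch using only the standard library; the helper name `accuracy_and_gmean` and the toy labels are assumptions made for this example. Unlike the code above, which indexes a 3x3 confusion matrix by hand, it works for any number of classes.

```python
from collections import Counter

def accuracy_and_gmean(y_true, y_pred):
    """Return (accuracy, geometric mean of the per-class recalls)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    # Number of examples per true class
    total = Counter(y_true)
    # Number of correctly predicted examples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    accuracy = sum(hits.values()) / float(len(y_true))
    product = 1.0
    for label, count in total.items():
        product *= hits[label] / float(count)  # recall of this class
    return accuracy, product ** (1.0 / len(total))

# Toy example: one of the two negative tweets is misclassified as neutral.
y_true = ["negative", "negative", "neutral", "positive"]
y_pred = ["negative", "neutral", "neutral", "positive"]
acc, gmean = accuracy_and_gmean(y_true, y_pred)
```

The G-mean drops to zero as soon as one class is never predicted correctly, so it penalizes a classifier that ignores a minority class much more strongly than accuracy does.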

On my machine, the performance is:

data	G-Mean	Accuracy
testing set	0.6225	0.7687
training set	0.6297	0.7755

You can set different random seeds for multiple runs, or use cross-validation, to evaluate the model.
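One way to set up cross-validation is to split the labeled tweet IDs into folds and refit only the classifier on each split, reusing the document vectors that were already trained. The sketch below shows just the index bookkeeping; the function name `kfold_indices` and the choice of 5 folds are assumptions for illustration, not part of the original experiment.

```python
import random

def kfold_indices(n_samples, n_folds, seed):
    """Yield (train_ids, test_ids) pairs; every sample is in the test set exactly once."""
    ids = list(range(n_samples))
    # A private Random instance keeps the global random state untouched
    random.Random(seed).shuffle(ids)
    for k in range(n_folds):
        # Every n_folds-th shuffled ID, starting at offset k, goes to the test fold
        test_ids = ids[k::n_folds]
        test_set = set(test_ids)
        train_ids = [i for i in ids if i not in test_set]
        yield train_ids, test_ids

folds = list(kfold_indices(10, 5, seed=1212))
```

In the setting of this tutorial, `n_samples` would be the number of labeled tweets, and each `(train_ids, test_ids)` pair would take the place of `trainID` and `testID`; averaging the per-fold G-mean and accuracy gives a more stable estimate than a single split.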