In social media, individuals generate many types of nontabular data, such as text, voice, or video. These are first converted to tabular data. For example, the fast Fourier transform (FFT) is used to process voice, and vectorization is used to process text; one common vectorization approach is the vector space model.
Vector Space Model
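We are given a set of documents D. The dictionary of all possible words serves as the set of features, and each document is a set of words. The goal is to represent each document i by a feature vector d_i = (w_{1,i}, w_{2,i}, ..., w_{N,i}), where w_{j,i} is the weight of word j in document i and N is the number of words in the dictionary.
Naive way: We can set w_{j,i} = 1 when word j exists in document i, or 0 when it doesn't.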
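The naive scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name binary_vectors and the toy documents are hypothetical, and the dictionary is built from the words that occur in the given documents rather than from all possible words.

```python
def binary_vectors(documents):
    """Map each document (a list of words) to a 0/1 vector over the vocabulary."""
    # Assumption: the dictionary is the set of words seen in the corpus.
    # Sorting fixes the order of the features.
    vocabulary = sorted({word for doc in documents for word in doc})
    vectors = []
    for doc in documents:
        present = set(doc)
        # Weight is 1 when the word exists in the document, 0 otherwise.
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, vectors


docs = [
    "apple banana apple".split(),
    "banana cherry".split(),
]
vocabulary, vectors = binary_vectors(docs)
print(vocabulary)  # ['apple', 'banana', 'cherry']
print(vectors)     # [[1, 1, 0], [0, 1, 1]]
```

Note that the repeated word "apple" in the first document still yields a weight of 1, so this representation ignores how often a word occurs.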
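General way: term frequency-inverse document frequency (TF-IDF)
w_{j,i} = tf_{j,i} × idf_j
where w_{j,i} is the weight of word j in document i, tf_{j,i} is the frequency of word j in document i, and
idf_j = log(|D| / |{d in D : word j appears in d}|) is the inverse frequency of word j across all documents, with |D| the total number of documents.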
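A minimal TF-IDF sketch follows, assuming the common formulation in which the weight is the term frequency multiplied by the logarithm of (number of documents / number of documents containing the word). The function name tf_idf, the log_base parameter (defaulting to 2), the use of raw counts as term frequency, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents, log_base=2):
    """Return the vocabulary and one TF-IDF vector per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vocabulary = sorted(doc_freq)
    # Inverse document frequency; a word found in every document gets 0.
    idf = {w: math.log(n_docs / doc_freq[w], log_base) for w in vocabulary}
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # raw counts; missing words count as 0
        vectors.append([tf[w] * idf[w] for w in vocabulary])
    return vocabulary, vectors


docs = [
    "apple banana apple".split(),
    "banana cherry".split(),
]
vocabulary, vectors = tf_idf(docs)
print(vocabulary)  # ['apple', 'banana', 'cherry']
print(vectors)     # [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

Here "banana" appears in both documents, so its inverse document frequency is 0 and it contributes nothing to either vector, while "apple" is weighted up because it is frequent in one document and absent from the other.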
Naive Bayes Classifier
Nearest Neighbor Classifier
Classification with Network Information
Suppose we only have information about the connections between nodes and their class labels (e.g., bought or will not buy).
Let P(y_i=1|N(v_i)) denote the probability of node v_i having class attribute value 1 given its neighbors N(v_i). We can approximate this probability from the neighbors' information.
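For example, the weighted-vote relational-neighbor (wvRN) classifier estimates it as the average over the neighbors: P(y_i=1|N(v_i)) = (1/|N(v_i)|) × (sum over v_j in N(v_i) of P(y_j=1|N(v_j))).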
The computation can start with an initial probability for each node (for unlabeled data, the probability is 0.5) and update the estimates repeatedly. Eventually, they converge to a distribution.
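The iteration can be sketched as follows. This is a hedged illustration rather than a reference implementation: it assumes unweighted edges, so each unlabeled node takes the plain average of its neighbors' current probabilities, labeled nodes stay fixed at 0 or 1, and updates are applied in place. The function name wvrn, the parameters max_iterations and tolerance, and the four-node path graph are hypothetical.

```python
def wvrn(neighbors, labels, max_iterations=100, tolerance=1e-6):
    """Estimate P(y=1) for every node from its neighbors' estimates.

    neighbors: dict mapping each node to a list of adjacent nodes.
    labels:    dict mapping labeled nodes to their class (0 or 1).
    """
    # Labeled nodes keep their known class; unlabeled nodes start at 0.5.
    prob = {v: float(labels[v]) if v in labels else 0.5 for v in neighbors}
    for _ in range(max_iterations):
        max_change = 0.0
        for v, adjacent in neighbors.items():
            if v in labels or not adjacent:
                continue  # skip known labels and isolated nodes
            # Unweighted average of the neighbors' current probabilities.
            new_p = sum(prob[u] for u in adjacent) / len(adjacent)
            max_change = max(max_change, abs(new_p - prob[v]))
            prob[v] = new_p
        if max_change < tolerance:
            break  # estimates have stopped changing
    return prob


# Path graph a - b - c - d, where a bought (1) and d will not buy (0).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = wvrn(graph, labels={"a": 1, "d": 0})
print(result)
```

In this toy graph the estimates for b and c approach 2/3 and 1/3, the values that satisfy both averaging equations at once: the node closer to the buyer is judged more likely to buy.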