@HaomingJiang 2016-06-24T15:33:55.000000Z

Chp5 Data Mining Essentials

Notes SMM (Social Media Mining)

5.1 Data

In social media, individuals generate many types of nontabular data, such as text, voice, or video. These are first converted to tabular data (e.g., the fast Fourier transform (FFT) is used to process voice, and vectorization is used to process text; one common form of vectorization is the vector-space model).

Vector Space Model
We are given a set of documents $D$. The dictionary of all possible words serves as the feature set, and each document is treated as a set of words. The goal is to represent each document as a feature vector:
$d_i=(w_{1,i},w_{2,i},...,w_{n,i})$
Naive way: set $w_{j,i}=1$ when the $j^{th}$ word occurs in document $i$, and 0 otherwise.
General way: term frequency-inverse document frequency (TF-IDF)
$w_{j,i}=tf_{j,i} \times idf_{j}$
where $tf_{j,i}$ is the frequency of word $j$ in document $i$.
$idf_{j}=\log_2\left(\frac{|D|}{|\{document \in D \mid j \in document\}|}\right)$ is the inverse frequency of word $j$ across all documents.
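
As a rough illustration of the TF-IDF weighting, the sketch below computes $w_{j,i}$ for a small hypothetical corpus. It assumes that $tf_{j,i}$ is the raw count of word $j$ in document $i$ and that documents are already tokenized into word lists; the function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document, with weight = tf * log2(|D| / df)."""
    n_docs = len(documents)
    # df[word] = number of documents that contain the word at least once
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # assumption: tf is the raw count of the word in this document
        vectors.append({w: c * math.log2(n_docs / df[w]) for w, c in tf.items()})
    return vectors


# Hypothetical toy corpus of four tokenized documents
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "cherry", "date"],
    ["banana"],
]
weights = tf_idf(docs)
```

Only words that occur in a document are stored, so each dict is a sparse view of the full vector $d_i$. A word that appears in every document gets weight 0, since $\log_2(1)=0$.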

5.1.1 Data quality

Noise
Outliers
Missing Values
Duplicate Data

5.2 Data Preprocessing

Aggregation
Discretization
Feature Selection
Feature Extraction
Sampling (e.g., stratified sampling can deal with imbalanced data)
In social media, much data comes in the form of a network. We can sample nodes and edges directly using the aforementioned sampling methods. An alternative is to sample a small set of nodes (seed nodes) and then expand the sample to one of the following:
(a) the connected components the seed nodes belong to
(b) the set of nodes (and edges) connected to them directly
(c) the set of nodes and edges that are within n-hop distance from them
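
The n-hop variant of seed-based sampling can be sketched as a breadth-first search from the seed nodes. This is a minimal sketch, assuming an undirected graph stored as an adjacency dict; the function name `n_hop_sample` and the toy graph are hypothetical.

```python
from collections import deque


def n_hop_sample(adjacency, seeds, n):
    """Return the nodes within n hops of any seed, plus the edges among those nodes."""
    distance = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == n:
            continue  # do not expand beyond n hops
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    nodes = set(distance)
    # Keep every edge whose two endpoints were both sampled
    edges = {frozenset((u, v)) for u in nodes for v in adjacency[u] if v in nodes}
    return nodes, edges


# Hypothetical toy graph: a path 1-2-3-4-5 and an isolated node 6
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4], 6: []}
# Seeds are fixed here for clarity; in practice they could be drawn with random.sample
nodes, edges = n_hop_sample(graph, seeds=[1], n=2)
```

Setting `n = 1` yields the nodes directly connected to the seeds, and a sufficiently large `n` yields the connected components the seeds belong to.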

5.3 Supervised Learning

Decision Tree
Naive Bayes Classifier
Nearest Neighbor Classifier

Classification with Network Information
Suppose we only have information about connections and class labels (e.g., will buy or will not buy).
Let $P(y_i=1|N(v_i))$ denote the probability of node $v_i$ having class attribute value 1 given its neighbors $N(v_i)$. We can approximate this probability from the neighbors' information.
e.g. (weighted-vote relational-neighbor, wvRN)
$P(y_i=1|N(v_i))=\frac{1}{|N(v_i)|}\sum_{v_j\in N(v_i)}P(y_j=1|N(v_j))$
The computation can start from an initial probability for each node (for unlabeled data, the probability is 0.5) and is then repeated. Eventually, it will converge to a distribution.
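
The iterative wvRN update can be sketched as follows. This is a minimal sketch under a few assumptions not stated in the notes: labeled nodes keep their known value (0 or 1), unlabeled nodes start at 0.5, updates are applied in place, and iteration stops once the largest change falls below a tolerance. The function name `wvrn` and the toy graph are hypothetical.

```python
def wvrn(adjacency, labels, iterations=100, tol=1e-9):
    """Repeatedly set each unlabeled node's P(y=1) to the mean of its neighbors' values."""
    # Labeled nodes use their known class; unlabeled nodes start at 0.5
    prob = {v: float(labels[v]) if v in labels else 0.5 for v in adjacency}
    for _ in range(iterations):
        max_change = 0.0
        for v in adjacency:
            if v in labels or not adjacency[v]:
                continue  # keep labeled nodes fixed; skip nodes with no neighbors
            new = sum(prob[u] for u in adjacency[v]) / len(adjacency[v])
            max_change = max(max_change, abs(new - prob[v]))
            prob[v] = new  # in-place update, so later nodes see the new value
        if max_change < tol:
            break
    return prob


# Hypothetical toy graph: a path a-b-c-d where a is labeled 1 and d is labeled 0
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
prob = wvrn(graph, labels={"a": 1, "d": 0})
```

On this path the fixed point satisfies $P_b=(1+P_c)/2$ and $P_c=P_b/2$, so the values approach $P_b=2/3$ and $P_c=1/3$: the node closer to the positive label ends up more likely to be positive.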

Regression

5.4 Unsupervised Learning

Clustering