@xmruibi 2015-07-13T17:25:21.000000Z · Word count: 12467 · Reads: 727

Information Retrieval (Part One)

Machine_Learning


Chapter One

1. Boolean Query

Query using Boolean expressions over the query terms, combined with operators such as AND, OR, and NOT.

2. Ad Hoc Retrieval Task

The query need not contain the exact words of the relevant documents: any related word or similar concept should also retrieve the result we want.
Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture.

3. Evaluating Effectiveness

4. Postings List / Inverted List (倒排表)

Term (Term Doc Frequency): Doc ID list (Size == Term Doc Frequency)
e.g. 'term' -> doc2, doc3, doc10, doc11

*Term doc frequency: the number of documents in which a given term appears.
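A minimal sketch of building such an index in Python (the documents and whitespace tokenization are illustrative assumptions, not from these notes):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of documents containing it.

    `docs` is a dict of {doc_id: text}; tokenization is a naive lowercase
    whitespace split.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Postings are kept sorted so lists can later be intersected by merging.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in sales"}
index = build_inverted_index(docs)
# Term doc frequency of a term == length of its postings list.
print(index["sales"])   # [1, 2, 3]
print(index["home"])    # [1, 2]
```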


Chapter Two

1. Tokenization

Most tokenizers rely on whitespace, but this can cause confusion; a dash, for example, is not always a segmentation symbol.

2. Stop Word

a, the, and, be ...

3. Token Normalization

4. Skip List

For a postings list of length P, a common heuristic is to use √P evenly spaced skip pointers, i.e. to place a skip pointer at every √P-th entry.
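The intersection of two postings lists with skip pointers can be sketched as follows (the doc-ID lists are made up; the √length skip step is the common heuristic):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists; each side skips ahead by
    sqrt(len) entries when the skip target does not overshoot the other."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip pointer only if it lands at or before p2[j].
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 32, 64, 128],
                           [1, 2, 3, 5, 8, 13, 21, 34, 128]))  # [2, 8, 128]
```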

5. Biword

6. Positional index

Includes the position information:

  1. term, collection frequency:
  2. {docID, term frequency in doc: [position1, position2, ...]};
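A sketch of this structure as nested dictionaries (toy documents and whitespace tokenization assumed):

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}.  Collection frequency of a term is
    the total number of positions recorded across all documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
print(dict(index["to"]))  # {1: [0, 4], 2: [0, 3]}
cf_be = sum(len(p) for p in index["be"].values())  # collection frequency of "be"
print(cf_be)  # 3
```

Positional postings are what make phrase queries possible: two terms form a phrase in a document exactly when their position lists contain adjacent offsets.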

Chapter Three

1. Wildcard Query

Postings list of an n-gram, over the vocabulary {metric, retrieval, petrify, beetric}:

etr → metric, retrieval, petrify, beetric
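A sketch of a k-gram index that supports such lookups (the vocabulary is the one above; the `$` boundary marker is a common convention, not from these notes):

```python
from collections import defaultdict

def build_kgram_index(vocab, k=3):
    """Map each k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocab:
        padded = f"${term}$"   # $ marks the start and end of the term
        for i in range(len(padded) - k + 1):
            index[padded[i:i + k]].add(term)
    return index

vocab = ["metric", "retrieval", "petrify", "beetric"]
index = build_kgram_index(vocab)
print(sorted(index["etr"]))  # ['beetric', 'metric', 'petrify', 'retrieval']
# A wildcard query like *etr* fetches candidates from the 'etr' postings
# list; a post-filtering step then checks each candidate against the
# full wildcard pattern.
```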

2. Spelling Correction

1. Isolated-term

2. Context-sensitive
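Isolated-term correction typically ranks vocabulary terms by edit distance to the misspelled query term; a sketch of the standard Levenshtein dynamic program (the vocabulary here is illustrative):

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming
    (unit-cost insertions, deletions, and substitutions)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# Pick the vocabulary term closest to a misspelled query term.
vocab = ["retrieval", "metric", "petrify"]
print(min(vocab, key=lambda w: edit_distance("retreival", w)))  # retrieval
```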


Chapter Four

1. Map-Reduce Inverted Indexing:
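A single-process sketch of the MapReduce flow for inverted indexing (hypothetical documents; a real job would distribute the map and reduce steps across machines):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs -- the 'map' step."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Group pairs by term and emit sorted postings -- the 'reduce' step."""
    pairs.sort()  # the shuffle/sort phase brings equal terms together
    return {term: sorted({d for _, d in group})
            for term, group in groupby(pairs, key=itemgetter(0))}

docs = {1: "caesar came", 2: "caesar died"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index["caesar"])  # [1, 2]
```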


Chapter Five: Index Compression

0. Preliminaries

1. Heaps' Law: Estimating the Number of Terms

M = k · T^b

T: number of tokens;
k: typically 30 ≤ k ≤ 100;
b: ≈ 0.5;
M: number of distinct terms (vocabulary size)
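A quick sketch of the estimate (k = 44 and b = 0.49 are illustrative values; typical fits have 30 ≤ k ≤ 100 and b ≈ 0.5):

```python
# Heaps' law: M = k * T**b estimates vocabulary size M from token count T.
def heaps_vocabulary_size(tokens, k=44, b=0.49):
    return k * tokens ** b

# A million tokens yields roughly 38,000 distinct terms under these values.
print(round(heaps_vocabulary_size(1_000_000)))
```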

2. Zipf's Law: Modeling the Distribution of Terms

cf_i ∝ 1/i

So if the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on. The intuition is that frequency decreases very rapidly with rank. The equation above is one of the simplest ways of formalizing this.
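A quick numeric illustration of the law (cf_1 = 60,000 is a made-up figure):

```python
# Zipf's law: the i-th most frequent term occurs about cf1 / i times,
# where cf1 is the collection frequency of the most frequent term.
def zipf_expected_cf(cf1, rank):
    return cf1 / rank

print([round(zipf_expected_cf(60_000, i)) for i in (1, 2, 3)])
# [60000, 30000, 20000]
```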

3. Dictionary Compression


Chapter Six: Scoring, Term Weighting, and Vector Space Model

1. Concept of Field and Zone:

Title, author, creation date, etc. are what we call fields or zones.
A field holds short text, while a zone may contain larger free text.

2. Weight the field or zone:

Weight learning: use a training set with manual annotations.

Training example:

e.g. T for title, B for body;

s_T  s_B  Score
0    0    0
0    1    1 − g
1    0    g
1    1    1
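The table can be reproduced by the weighted zone scoring formula score = g·s_T + (1 − g)·s_B; a sketch with an illustrative g = 0.75:

```python
def weighted_zone_score(s_title, s_body, g=0.75):
    """Weighted zone scoring with two zones: score = g*sT + (1-g)*sB,
    where sT and sB are 1 if the query matches the zone, else 0."""
    return g * s_title + (1 - g) * s_body

# Reproduces the training table: (sT, sB) -> score
for s_t, s_b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(s_t, s_b, weighted_zone_score(s_t, s_b))
```

Weight learning then picks the g that minimizes error against the annotated training examples.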

3. Term Frequency and Weight Calculation

From Document Frequency to Collection Frequency

Inverse Document Frequency:

N: the number of documents in the entire collection
df_t: document frequency of term t (the number of documents containing t)

idf_t = log(N / df_t)

tf-idf assigns to term t a weight in document d given by:

tf-idf_{t,d} = tf_{t,d} × idf_t
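A sketch of the weight computation (base-10 logarithm assumed; the counts are illustrative):

```python
import math

def idf(N, df_t):
    """Inverse document frequency: idf_t = log10(N / df_t)."""
    return math.log10(N / df_t)

def tf_idf(tf_td, N, df_t):
    """tf-idf weight of term t in document d: tf_{t,d} * idf_t."""
    return tf_td * idf(N, df_t)

# A term appearing in 1 of 1000 documents is highly discriminative...
print(idf(1000, 1))     # ≈ 3.0
# ...while a term in every document carries no signal.
print(idf(1000, 1000))  # 0.0
print(tf_idf(3, 1000, 100))  # ≈ 3.0
```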

4. Vector Space Model
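In the vector space model, documents and queries are represented as term-weight vectors and ranked by cosine similarity; a minimal sketch with made-up weights:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 3-term vocabulary; the weights stand in for tf-idf values.
query = [1.0, 1.0, 0.0]
doc_a = [2.0, 2.0, 0.0]   # same direction as the query
doc_b = [0.0, 0.0, 3.0]   # no shared terms
print(cosine_similarity(query, doc_a))  # ≈ 1.0
print(cosine_similarity(query, doc_b))  # 0.0
```

Cosine similarity depends only on vector direction, so long documents are not favored merely for containing more words.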


Chapter Eight: Evaluation

Overfitting

occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the model or algorithm shows low bias but high variance. Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Overfitting also corresponds to a polynomial of excessively high degree: N points on a graph can always be fit exactly by a polynomial of degree N − 1.

Underfitting

occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Specifically, underfitting occurs if the model or algorithm shows low variance but high bias. Underfitting is often a result of an excessively simple model.

Precision:

Precision = |Retrieved ∩ Relevant| / |Retrieved|

Recall Rate:

Recall = |Retrieved ∩ Relevant| / |Relevant|

Accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F-Score:

F = 2 · Recall · Precision / (Precision + Recall)
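These set-based measures can be computed directly; a sketch (the retrieved and relevant sets are illustrative):

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and balanced F-score from two sets."""
    hits = len(retrieved & relevant)
    p = hits / len(retrieved)
    r = hits / len(relevant)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7, 8}
p, r, f = precision_recall_f1(retrieved, relevant)
print(p)  # 0.5  (2 of the 4 retrieved docs are relevant)
print(r)  # ≈ 0.333  (2 of the 6 relevant docs were retrieved)
```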

Confusion Matrix

True positive rate (TPR):

Sensitivity, Recall = Σ True positive / Σ Condition positive

False positive rate (FPR):

Fallout = Σ False positive / Σ Condition negative

True negative rate (TNR):

Specificity (SPC) = Σ True negative / Σ Condition negative

ROC

Receiver Operating Characteristic curve.
It plots the true positive rate against the false positive rate: the y-axis is the true positive rate (recall), and the x-axis is the false positive rate (fallout).

It shows the performance of a binary classifier system as its discrimination threshold is varied; in fact, an information retrieval system can be interpreted as a binary classifier system. The curve is created by plotting the true positive rate against the false positive rate at various threshold settings.

P@10, P@30 ...

This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to as “Precision at k”, for example “Precision at 10”. It has the advantage of not requiring any estimate of the size of the set of relevant documents but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.

R-Precision


Chapter Nine: Relevance Feedback and Query Expansion

Query Refinement:

Global Method: (Independent, without document results)

Techniques for expanding or reformulating query terms independent of the query and results returned from it:
- Query Expansion/Reformulation with a thesaurus or WordNet
- via automatic thesaurus generation
- spelling correction

Local Method: (dependent, relies on document results)

Adjust a query relative to the documents that initially appear to match the query.
- Relevance Feedback
- Pseudo Relevance Feedback
- Global indirect relevance feedback

Relevance Feedback:

Involve the user in the retrieval process so as to improve the final result set.

Pseudo Relevance Feedback:

An automatic local analysis. The method is to do normal retrieval to find an initial set of most relevant documents, then to assume that the top k ranked documents are relevant, and finally to do relevance feedback as before under this assumption.

Indirect relevance feedback (Implicit)

Belongs to clickstream mining.
Using click-through data: a document should be ranked more highly when users choose to look at it more often.

Rocchio Algorithm
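The Rocchio update moves the query vector toward the centroid of known relevant documents and away from the centroid of non-relevant ones: q⃗_m = α·q⃗_0 + β·centroid(D_r) − γ·centroid(D_nr). A sketch (the α, β, γ values and the toy vectors are illustrative):

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(xs) / n for xs in zip(*vectors)]

def rocchio(q0, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback on plain-list term-weight vectors."""
    c_rel, c_non = centroid(rel), centroid(nonrel)
    # Negative weights are clipped to 0: terms cannot have negative weight.
    return [max(0.0, alpha * q + beta * r - gamma * s)
            for q, r, s in zip(q0, c_rel, c_non)]

q0 = [1.0, 0.0, 0.0]                       # original query
rel = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]   # judged relevant
nonrel = [[0.0, 0.0, 1.0]]                 # judged non-relevant
print(rocchio(q0, rel, nonrel))  # query gains weight on relevant-doc terms
```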

Query Expansion

The most common form of query expansion uses some form of thesaurus. For each term in the query, the query can be automatically expanded with synonyms and related words from the thesaurus; however, the added terms should be weighted less than the original query terms.

Query expansion is often effective at increasing recall. However, there is a high cost to manually producing a thesaurus and then updating it for scientific and terminological developments. At the same time, query expansion may also significantly decrease precision, particularly when the query contains ambiguous terms.


Chapter Eleven: Probabilistic Information Retrieval

Bayes' Theorem:

P(B|A) = P(A|B)·P(B) / [P(A|B)·P(B) + P(A|¬B)·P(¬B)]

Collapsed, this is:

P(B|A) = P(A ∩ B) / P(A)

which is equivalent to:

P(B|A)·P(A) = P(A ∩ B)

Chain Rule

P(A, B) = P(A ∩ B) = P(B|A)·P(A) = P(A|B)·P(B)

Binary Independent Model

We model the probability P(R|d,q) that a document is relevant via the probability in terms of term incidence vectors, P(R|x⃗,q⃗). Using Bayes' rule, we have:

P(R=1|x⃗,q⃗) = P(x⃗|R=1,q⃗)·P(R=1|q⃗) / P(x⃗|q⃗)

P(R=0|x⃗,q⃗) = P(x⃗|R=0,q⃗)·P(R=0|q⃗) / P(x⃗|q⃗)

And also:

P(R=0|x⃗ ,q⃗ )+P(R=1|x⃗ ,q⃗ )=1

Deriving a ranking function for query terms

Odds of relevance:

O(R|x⃗,q⃗) = P(R=1|x⃗,q⃗) / P(R=0|x⃗,q⃗) = [P(x⃗|R=1,q⃗)·P(R=1|q⃗)] / [P(x⃗|R=0,q⃗)·P(R=0|q⃗)]

(the common factor P(x⃗|q⃗) cancels.)

                        Relevant (R=1)          Non-relevant (R=0)
Term present (x_t=1)    p_t  (count: s)         u_t  (count: df_t − s)
Term absent (x_t=0)     1 − p_t  (count: S − s) 1 − u_t  (count: (N − df_t) − (S − s))

O(R|x⃗,q⃗) = O(R|q⃗) · ∏_{t: x_t=q_t=1} (p_t / u_t) · ∏_{t: x_t=0, q_t=1} (1 − p_t) / (1 − u_t)

Maximum Likelihood Estimate

For trials with categorical outcomes (such as noting the presence or absence of a term), one way to estimate the probability of an event from data is:

(number of times the event occurred) / (total number of trials)

This is referred to as the relative frequency of the event.
Estimating the probability as the relative frequency is the maximum likelihood estimate (or MLE), because this value makes the observed data maximally likely. However, if we simply use the MLE, then the probability given to events we happened to see is usually too high, whereas other events may be completely unseen and giving them as a probability estimate their relative frequency of 0 is both an underestimate, and normally breaks our models, since anything multiplied by 0 is 0. Simultaneously decreasing the estimated probability of seen events and increasing the probability of unseen events is referred to as smoothing. One simple way of smoothing is to add a number α to each of the observed counts.
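Add-α smoothing as described can be sketched as follows (the counts, vocabulary size, and α are illustrative):

```python
def smoothed_estimate(count, total, vocab_size, alpha=0.5):
    """Add-alpha smoothed probability: (count + alpha) / (total + alpha*B),
    where B is the number of distinct possible outcomes."""
    return (count + alpha) / (total + alpha * vocab_size)

# An unseen event no longer gets probability 0:
print(smoothed_estimate(0, 100, 10))   # 0.5 / 105, small but positive
# A frequently seen event is pulled slightly below its relative frequency:
print(smoothed_estimate(40, 100, 10))  # 40.5 / 105, just under 0.4
```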

RSV: Retrieval Status Value

BM25

Roughly speaking, it is a method that weights query terms using both term frequency and document length.
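A sketch of the per-term BM25 weight (k1 = 1.2 and b = 0.75 are typical defaults; the counts below are made up, and a simple log(N/df) idf is assumed rather than the full Robertson/Spärck Jones form):

```python
import math

def bm25_term_weight(tf, dl, avgdl, N, df, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document: idf scaled by a
    saturating function of term frequency, normalized by document length
    (dl) relative to the average document length (avgdl)."""
    idf = math.log(N / df)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))

# Term frequency saturates: ten occurrences weigh more than one,
# but far less than ten times as much.
w1 = bm25_term_weight(tf=1, dl=100, avgdl=100, N=10_000, df=100)
w2 = bm25_term_weight(tf=10, dl=100, avgdl=100, N=10_000, df=100)
print(w1 < w2 < 10 * w1)  # True
```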


Chapter Twelve: Language Model

Probability Model vs Language Model

Language Model

Multiple-Bernoulli Model vs Multinomial Model:
