@xmruibi 2015-07-13T17:25:21.000000Z · Word count: 12467 · Reads: 727

Information Retrieval (Part One)

Machine_Learning


Chapter One

1. Boolean Query

Query using Boolean expressions over the query terms, combined with operators such as AND, OR, and NOT.

2. Ad Hoc Retrieval Task

The query need not contain the exact words of the relevant documents: any related word or similar concept should also retrieve the result we want.
Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture.

3. Evaluating Effectiveness

4. Postings List / Inverted List (倒排表)

Term (Term Doc Frequency): Doc ID list (Size == Term Doc Frequency)
e.g. 'term' -> doc2, doc3, doc10, doc11

*Term doc frequency: the number of documents in which a given term appears.
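A minimal sketch of building such an index in Python (the documents and whitespace tokenization are illustrative assumptions, not from these notes):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of documents containing it.

    `docs` is a dict of {doc_id: text}; tokenization is a naive lowercase
    whitespace split.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Postings are kept sorted so lists can later be intersected by merging.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in sales"}
index = build_inverted_index(docs)
# Term doc frequency of a term == length of its postings list.
print(index["sales"])   # [1, 2, 3]
print(index["home"])    # [1, 2]
```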


Chapter Two

1. Tokenization

Most tokenizers rely on whitespace, but this can cause confusion; a dash, for example, is not always a segmentation symbol.

2. Stop Word

a, the, and, be ...

3. Token Normalization

4. Skip List

For a postings list of length P, a common heuristic is to use √P evenly spaced skip pointers, i.e. to place a skip pointer at every √P-th entry.
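The intersection of two postings lists with skip pointers can be sketched as follows (the doc-ID lists are made up; the √length skip step is the common heuristic):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists; each side skips ahead by
    sqrt(len) entries when the skip target does not overshoot the other."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip pointer only if it lands at or before p2[j].
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 32, 64, 128],
                           [1, 2, 3, 5, 8, 13, 21, 34, 128]))  # [2, 8, 128]
```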

5. Biword

6. Positional index

Includes the position information:

  1. term, collection frequency:
  2. {docID, term frequency in doc: [position1, position2, ...]};
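A sketch of this structure as nested dictionaries (toy documents and whitespace tokenization assumed):

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}.  Collection frequency of a term is
    the total number of positions recorded across all documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
print(dict(index["to"]))  # {1: [0, 4], 2: [0, 3]}
cf_be = sum(len(p) for p in index["be"].values())  # collection frequency of "be"
print(cf_be)  # 3
```

Positional postings are what make phrase queries possible: two terms form a phrase in a document exactly when their position lists contain adjacent offsets.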

Chapter Three

1. Wildcard Query

Postings list of an n-gram, over the vocabulary {metric, retrieval, petrify, beetric}:

etr → metric, retrieval, petrify, beetric
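A sketch of a k-gram index that supports such lookups (the vocabulary is the one above; the `$` boundary marker is a common convention, not from these notes):

```python
from collections import defaultdict

def build_kgram_index(vocab, k=3):
    """Map each k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocab:
        padded = f"${term}$"   # $ marks the start and end of the term
        for i in range(len(padded) - k + 1):
            index[padded[i:i + k]].add(term)
    return index

vocab = ["metric", "retrieval", "petrify", "beetric"]
index = build_kgram_index(vocab)
print(sorted(index["etr"]))  # ['beetric', 'metric', 'petrify', 'retrieval']
# A wildcard query like *etr* fetches candidates from the 'etr' postings
# list; a post-filtering step then checks each candidate against the
# full wildcard pattern.
```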

2. Spelling Correction

1. Isolated-term

2. Context-sensitive
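Isolated-term correction typically ranks vocabulary terms by edit distance to the misspelled query term; a sketch of the standard Levenshtein dynamic program (the vocabulary here is illustrative):

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming
    (unit-cost insertions, deletions, and substitutions)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# Pick the vocabulary term closest to a misspelled query term.
vocab = ["retrieval", "metric", "petrify"]
print(min(vocab, key=lambda w: edit_distance("retreival", w)))  # retrieval
```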


Chapter Four

1. Map-Reduce Inverted Indexing:
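A single-process sketch of the MapReduce flow for inverted indexing (hypothetical documents; a real job would distribute the map and reduce steps across machines):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs -- the 'map' step."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Group pairs by term and emit sorted postings -- the 'reduce' step."""
    pairs.sort()  # the shuffle/sort phase brings equal terms together
    return {term: sorted({d for _, d in group})
            for term, group in groupby(pairs, key=itemgetter(0))}

docs = {1: "caesar came", 2: "caesar died"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index["caesar"])  # [1, 2]
```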


Chapter Five: Index Compression

0. Preliminaries

1. Heaps' Law: Estimating the Number of Terms

M = k · T^b

T: number of tokens;
k: typically 30 ≤ k ≤ 100;
b: ≈ 0.5;
M: number of distinct terms (vocabulary size)
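A quick sketch of the estimate (k = 44 and b = 0.49 are illustrative values; typical fits have 30 ≤ k ≤ 100 and b ≈ 0.5):

```python
# Heaps' law: M = k * T**b estimates vocabulary size M from token count T.
def heaps_vocabulary_size(tokens, k=44, b=0.49):
    return k * tokens ** b

# A million tokens yields roughly 38,000 distinct terms under these values.
print(round(heaps_vocabulary_size(1_000_000)))
```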

2. Zipf's Law: Modeling the Distribution of Terms

cf_i ∝ 1/i

So if the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on. The intuition is that frequency decreases very rapidly with rank. The equation above is one of the simplest ways of formalizing this.
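A quick numeric illustration of the law (cf_1 = 60,000 is a made-up figure):

```python
# Zipf's law: the i-th most frequent term occurs about cf1 / i times,
# where cf1 is the collection frequency of the most frequent term.
def zipf_expected_cf(cf1, rank):
    return cf1 / rank

print([round(zipf_expected_cf(60_000, i)) for i in (1, 2, 3)])
# [60000, 30000, 20000]
```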

3. Dictionary Compression


Chapter Six: Scoring, Term Weighting, and Vector Space Model

1. Concept of Field and Zone:

Title, author, creation date, etc. are what we call fields or zones.
A field holds short text, while a zone may contain larger free text.

2. Weight the field or zone:

Weight learning: use a training set with manual annotations.

Training example:

e.g. T for title, B for body;

s_T  s_B  Score
0    0    0
0    1    1 − g
1    0    g
1    1    1
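The table can be reproduced by the weighted zone scoring formula score = g·s_T + (1 − g)·s_B; a sketch with an illustrative g = 0.75:

```python
def weighted_zone_score(s_title, s_body, g=0.75):
    """Weighted zone scoring with two zones: score = g*sT + (1-g)*sB,
    where sT and sB are 1 if the query matches the zone, else 0."""
    return g * s_title + (1 - g) * s_body

# Reproduces the training table: (sT, sB) -> score
for s_t, s_b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(s_t, s_b, weighted_zone_score(s_t, s_b))
```

Weight learning then picks the g that minimizes error against the annotated training examples.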

3. Term Frequency and Weight Calculation

From Document Frequency to Collection Frequency

Inverse Document Frequency:

N: the number of documents in the entire collection
df_t: document frequency of term t (the number of documents containing t)

idf_t = log(N / df_t)

tf-idf assigns to term t a weight in document d given by:

tf-idf_{t,d} = tf_{t,d} × idf_t
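A sketch of the weight computation (base-10 logarithm assumed; the counts are illustrative):

```python
import math

def idf(N, df_t):
    """Inverse document frequency: idf_t = log10(N / df_t)."""
    return math.log10(N / df_t)

def tf_idf(tf_td, N, df_t):
    """tf-idf weight of term t in document d: tf_{t,d} * idf_t."""
    return tf_td * idf(N, df_t)

# A term appearing in 1 of 1000 documents is highly discriminative...
print(idf(1000, 1))     # ≈ 3.0
# ...while a term in every document carries no signal.
print(idf(1000, 1000))  # 0.0
print(tf_idf(3, 1000, 100))  # ≈ 3.0
```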

4. Vector Space Model
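In the vector space model, documents and queries are represented as term-weight vectors and ranked by cosine similarity; a minimal sketch with made-up weights:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 3-term vocabulary; the weights stand in for tf-idf values.
query = [1.0, 1.0, 0.0]
doc_a = [2.0, 2.0, 0.0]   # same direction as the query
doc_b = [0.0, 0.0, 3.0]   # no shared terms
print(cosine_similarity(query, doc_a))  # ≈ 1.0
print(cosine_similarity(query, doc_b))  # 0.0
```

Cosine similarity depends only on vector direction, so long documents are not favored merely for containing more words.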


Chapter Eight: Evaluation

Overfitting

occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the model or algorithm shows low bias but high variance. Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Overfitting also corresponds to a polynomial of excessively high degree: N points on a graph can always be fit exactly by a polynomial of degree N − 1.

Underfitting

occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Specifically, underfitting occurs if the model or algorithm shows low variance but high bias. Underfitting is often a result of an excessively simple model.

Precision:

Precision = |Retrieved ∩ Relevant| / |Retrieved|

Recall Rate:

Recall = |Retrieved ∩ Relevant| / |Relevant|

Accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F-Score:

F = 2 · Recall · Precision / (Precision + Recall)
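These set-based measures can be computed directly; a sketch (the retrieved and relevant sets are illustrative):

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and balanced F-score from two sets."""
    hits = len(retrieved & relevant)
    p = hits / len(retrieved)
    r = hits / len(relevant)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7, 8}
p, r, f = precision_recall_f1(retrieved, relevant)
print(p)  # 0.5  (2 of the 4 retrieved docs are relevant)
print(r)  # ≈ 0.333  (2 of the 6 relevant docs were retrieved)
```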

Confusion Matrix

True positive rate (TPR):

Sensitivity, Recall = Σ True positive / Σ Condition positive

False positive rate (FPR):

Fallout = Σ False positive / Σ Condition negative

True negative rate (TNR):

Specificity (SPC) = Σ True negative / Σ Condition negative

ROC

Receiver Operating Characteristic curve.
It plots the true positive rate against the false positive rate: the y-axis is the true positive rate (recall), and the x-axis is the false positive rate (fallout).

It shows the performance of a binary classifier system as its discrimination threshold is varied; in fact, an information retrieval system can be interpreted as a binary classifier system. The curve is created by plotting the true positive rate against the false positive rate at various threshold settings.

P@10, P@30 ...

This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to as “Precision at k”, for example “Precision at 10”. It has the advantage of not requiring any estimate of the size of the set of relevant documents but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.

R-Precision


Chapter Nine: Relevance Feedback and Query Expansion

Query Refinement:

Global Method: (Independent, without document results)

Techniques for expanding or reformulating query terms independent of the query and results returned from it:
- Query Expansion/Reformulation with a thesaurus or WordNet
- via automatic thesaurus generation
- spelling correction

Local Method: (dependent, relies on document results)

Adjust a query relative to the documents that initially appear to match the query.
- Relevance Feedback
- Pseudo Relevance Feedback
- Global indirect relevance feedback

Relevance Feedback:

Involve the user in the retrieval process so as to improve the final result set.

Pseudo Relevance Feedback:

An automatic local analysis. The method is to do normal retrieval to find an initial set of most relevant documents, then to assume that the top k ranked documents are relevant, and finally to do relevance feedback as before under this assumption.

Indirect relevance feedback (Implicit)

Belongs to clickstream mining.
Using click-through data: a document should be ranked more highly when users choose to look at it more often.

Rocchio Algorithm
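The Rocchio update moves the query vector toward the centroid of known relevant documents and away from the centroid of non-relevant ones: q⃗_m = α·q⃗_0 + β·centroid(D_r) − γ·centroid(D_nr). A sketch (the α, β, γ values and the toy vectors are illustrative):

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(xs) / n for xs in zip(*vectors)]

def rocchio(q0, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback on plain-list term-weight vectors."""
    c_rel, c_non = centroid(rel), centroid(nonrel)
    # Negative weights are clipped to 0: terms cannot have negative weight.
    return [max(0.0, alpha * q + beta * r - gamma * s)
            for q, r, s in zip(q0, c_rel, c_non)]

q0 = [1.0, 0.0, 0.0]                       # original query
rel = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]   # judged relevant
nonrel = [[0.0, 0.0, 1.0]]                 # judged non-relevant
print(rocchio(q0, rel, nonrel))  # query gains weight on relevant-doc terms
```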

Query Expansion

The most common form of query expansion uses some form of thesaurus. For each term in the query, the query can be automatically expanded with synonyms and related words from the thesaurus; however, the added terms should be weighted less than the original query terms.

Query expansion is often effective at increasing recall. However, there is a high cost to manually producing a thesaurus and then updating it for scientific and terminological developments. At the same time, query expansion may also significantly decrease precision, particularly when the query contains ambiguous terms.


Chapter Eleven: Probabilistic Information Retrieval

Bayes' Theorem:

P(B|A) = P(A|B)·P(B) / [P(A|B)·P(B) + P(A|¬B)·P(¬B)]

Collapsed, this is:

P(B|A) = P(A ∩ B) / P(A)

which is equivalent to:

P(B|A)·P(A) = P(A ∩ B)

Chain Rule

P(A, B) = P(A ∩ B) = P(B|A)·P(A) = P(A|B)·P(B)

Binary Independent Model

We model the probability P(R|d,q) that a document is relevant via the probability in terms of term incidence vectors, P(R|x⃗,q⃗). Using Bayes' rule, we have:

P(R=1|x⃗,q⃗) = P(x⃗|R=1,q⃗)·P(R=1|q⃗) / P(x⃗|q⃗)

P(R=0|x⃗,q⃗) = P(x⃗|R=0,q⃗)·P(R=0|q⃗) / P(x⃗|q⃗)

And also:

P(R=0|x⃗ ,q⃗ )+P(R=1|x⃗ ,q⃗ )=1

Deriving a ranking function for query terms

Odds of relevance:

O(R|x⃗,q⃗) = P(R=1|x⃗,q⃗) / P(R=0|x⃗,q⃗) = [P(x⃗|R=1,q⃗)·P(R=1|q⃗)] / [P(x⃗|R=0,q⃗)·P(R=0|q⃗)]

(the common factor P(x⃗|q⃗) cancels.)

                        Relevant (R=1)          Non-relevant (R=0)
Term present (x_t=1)    p_t  (count: s)         u_t  (count: df_t − s)
Term absent (x_t=0)     1 − p_t  (count: S − s) 1 − u_t  (count: (N − df_t) − (S − s))

O(R|x⃗,q⃗) = O(R|q⃗) · ∏_{t: x_t=q_t=1} (p_t / u_t) · ∏_{t: x_t=0, q_t=1} (1 − p_t) / (1 − u_t)

Maximum Likelihood Estimate

For trials with categorical outcomes (such as noting the presence or absence of a term), one way to estimate the probability of an event from data is:

(number of times the event occurred) / (total number of trials)

This is referred to as the relative frequency of the event.
Estimating the probability as the relative frequency is the maximum likelihood estimate (or MLE), because this value makes the observed data maximally likely. However, if we simply use the MLE, then the probability given to events we happened to see is usually too high, whereas other events may be completely unseen and giving them as a probability estimate their relative frequency of 0 is both an underestimate, and normally breaks our models, since anything multiplied by 0 is 0. Simultaneously decreasing the estimated probability of seen events and increasing the probability of unseen events is referred to as smoothing. One simple way of smoothing is to add a number α to each of the observed counts.
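Add-α smoothing as described can be sketched as follows (the counts, vocabulary size, and α are illustrative):

```python
def smoothed_estimate(count, total, vocab_size, alpha=0.5):
    """Add-alpha smoothed probability: (count + alpha) / (total + alpha*B),
    where B is the number of distinct possible outcomes."""
    return (count + alpha) / (total + alpha * vocab_size)

# An unseen event no longer gets probability 0:
print(smoothed_estimate(0, 100, 10))   # 0.5 / 105, small but positive
# A frequently seen event is pulled slightly below its relative frequency:
print(smoothed_estimate(40, 100, 10))  # 40.5 / 105, just under 0.4
```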

RSV: Retrieval Status Value

BM25

Roughly speaking, it is a method that weights query terms using both term frequency and document length.
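A sketch of the per-term BM25 weight (k1 = 1.2 and b = 0.75 are typical defaults; the counts below are made up, and a simple log(N/df) idf is assumed rather than the full Robertson/Spärck Jones form):

```python
import math

def bm25_term_weight(tf, dl, avgdl, N, df, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document: idf scaled by a
    saturating function of term frequency, normalized by document length
    (dl) relative to the average document length (avgdl)."""
    idf = math.log(N / df)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))

# Term frequency saturates: ten occurrences weigh more than one,
# but far less than ten times as much.
w1 = bm25_term_weight(tf=1, dl=100, avgdl=100, N=10_000, df=100)
w2 = bm25_term_weight(tf=10, dl=100, avgdl=100, N=10_000, df=100)
print(w1 < w2 < 10 * w1)  # True
```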


Chapter Twelve: Language Model

Probability Model vs Language Model

Language Model

Multiple-Bernoulli Model vs Multinomial Model:
