@xmruibi
2015-07-13T17:25:21.000000Z
Machine_Learning
Boolean retrieval: query using Boolean expressions such as AND, OR, and NOT over the query terms.
Ideally, we should not need the exact words in our query to retrieve information; any related word or similar concept should also surface the results we want.
Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture.
Precision: What fraction of the returned results are relevant to the information need?
Recall Rate: What fraction of the relevant documents in the collection were returned
by the system?
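In symbols (the standard definitions):

```latex
\text{Precision} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|},
\qquad
\text{Recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}
```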
Term (Term Doc Frequency): Doc ID list (Size == Term Doc Frequency)
e.g. 'term' -> doc2, doc3, doc10, doc11
*Term Doc Frequency: the number of documents in which a given term appears.
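A minimal sketch of building such an inverted index (the whitespace tokenizer and toy documents are illustrative, not from the notes):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs containing it.
    len(postings) equals the term's document frequency."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenizer
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "pipeline leaks found", 2: "pipeline rupture reported"}
index = build_inverted_index(docs)
print(index["pipeline"])   # [1, 2] -> document frequency 2
```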
Tokenization usually relies on whitespace, but this also causes confusion: a dash, for instance, is not always a segmentation symbol.
Stop words: a, the, and, be ...
Stemming
Lemmatization
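For illustration, a small NLTK example (the library choice is an assumption, the notes don't name one; requires nltk plus its WordNet data). Stemming chops suffixes heuristically, while lemmatization maps a word to its dictionary form:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Setup once: pip install nltk; then nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["operating", "running", "flies"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
# e.g. 'operating' -> stem 'oper' (not a real word) vs lemma 'operate'
```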
Positional Index: used when we need phrase or proximity queries.
Includes the position info:
word, collection frequency: { [docID, docFrequency: [docPosition1, docPosition2, ...]] };
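A minimal positional-index sketch in roughly that format, plus a simple two-word phrase lookup (tokenization and documents are made-up):

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {docID: [position1, position2, ...]}; the per-doc list length
    is the term frequency in that doc, len(index[term]) the doc frequency."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_match(index, w1, w2):
    """DocIDs where w2 occurs immediately after w1 (a two-word phrase query)."""
    hits = []
    for doc_id, positions in index[w1].items():
        if any(p + 1 in index[w2].get(doc_id, []) for p in positions):
            hits.append(doc_id)
    return hits

docs = {1: "pipeline leaks near the old pipeline", 2: "gas leaks"}
idx = build_positional_index(docs)
print(phrase_match(idx, "pipeline", "leaks"))   # [1]
```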
Strategy One: Permuterm Index.
Space-consuming: for hello, index every rotation of hello$: hello$ -> ello$h -> llo$he -> lo$hel -> o$hell -> $hello
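A quick sketch that generates exactly those rotations; a wildcard such as hel*o is then rotated so the * falls at the end (prefix query o$hel*):

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation is indexed and
    points back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
```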
Strategy Two: k-gram Index.
One word split into k-grams (k = 3):
castle: $ca, cas, ast, stl, tle, le$
Posting List of n-gram:
metric, retrieval, petrify, beetric
etr: metric, retrieval, petrify, beetric
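A minimal 3-gram index over those example words ($ marks word boundaries, as in the castle example):

```python
from collections import defaultdict

def kgram_index(words, k=3):
    """k-gram -> sorted list of words containing that k-gram."""
    index = defaultdict(set)
    for w in words:
        padded = f"${w}$"
        for i in range(len(padded) - k + 1):
            index[padded[i:i + k]].add(w)
    return {g: sorted(ws) for g, ws in index.items()}

idx = kgram_index(["metric", "retrieval", "petrify", "beetric"])
print(idx["etr"])   # ['beetric', 'metric', 'petrify', 'retrieval']
```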
Edit Distance
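Edit distance is typically computed with the standard Levenshtein dynamic program; a minimal sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

print(edit_distance("castle", "cast"))   # 2
```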
K-gram Index
Map:
Documents -> list{term, docId}
Reduce:
{(term1, {docId...}), (term2, {docId...})} -> {term, (docId1:docFreq, docId2:docFreq)}
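A single-machine sketch of this map/reduce pair (a real MapReduce job would shuffle the pairs by term across machines):

```python
from collections import Counter, defaultdict

def map_phase(doc_id, text):
    """Emit one (term, docId) pair per token, as in the Map step above."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Group by term and count per-document frequencies:
    term -> {docId: docFreq}, matching the Reduce output above."""
    postings = defaultdict(Counter)
    for term, doc_id in pairs:
        postings[term][doc_id] += 1
    return postings

pairs = map_phase(1, "to be or not to be") + map_phase(2, "be quick")
print(reduce_phase(pairs)["be"])   # Counter({1: 2, 2: 1})
```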
Lossless Compression
Lossy Compression
e.g. case folding (uppercase/lowercase conversion), lemmatization, stop-word removal;
Distinct Term:
Nonpositional Posting Record:
Token:
Heaps' law estimates the number of distinct terms as a function of collection size: M = k·T^b
T: total token count;
M: the distinct term count (vocabulary size);
k, b: parameters; typically 30 ≤ k ≤ 100 and b ≈ 0.5.
So if the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on. The intuition is that frequency decreases very rapidly with rank. This is Zipf's law, cf_i ∝ 1/i (the collection frequency of the i-th most common term is proportional to 1/i), one of the simplest ways of formalizing that intuition.
Title, author, creation date, etc. are what we call fields or zones.
A field holds short, structured text, while a zone may contain longer free text;
Weight learning: learn the zone weights from a training set with manual relevance annotations.
e.g. T for title, B for body;
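Weighted zone scoring combines the per-zone match scores s_i(d, q) using the learned weights g_i; in the standard formulation:

```latex
\text{score}(d, q) = \sum_{i=1}^{\ell} g_i \, s_i(d, q),
\qquad g_i \in [0, 1], \quad \sum_{i=1}^{\ell} g_i = 1
```

With two zones this reduces to g · s_T(d, q) + (1 − g) · s_B(d, q).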
From Document Frequency to Collection Frequency
Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the model or algorithm shows low bias but high variance. Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.
It also corresponds to an overly high polynomial degree: through N points you can always fit a polynomial of order N−1 exactly.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Specifically, underfitting occurs if the model or algorithm shows low variance but high bias. Underfitting is often a result of an excessively simple model.
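A small numpy illustration of that polynomial point (the data here is made up): a degree-(N−1) fit interpolates all N points exactly, fitting the noise, while a degree-1 fit is too simple for the trend:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)                                   # N = 8 points
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(8)   # trend + noise

over = np.polyfit(x, y, 7)    # degree N-1 = 7: passes through every point
under = np.polyfit(x, y, 1)   # degree 1: high bias, misses the curvature

print(np.max(np.abs(np.polyval(over, x) - y)))   # ~0: fits the noise exactly
print(np.max(np.abs(np.polyval(under, x) - y)))  # large residual: underfitting
```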
Receiver Operating Characteristic Curve
It plots the true positive rate (recall, on the y-axis) against the false positive rate (on the x-axis) at various settings of the classifier's discrimination threshold.
It shows the performance of a binary classifier system; in fact, an information retrieval system can be interpreted as a binary classifier.
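A minimal sketch computing those points from scores and binary labels (the scores, labels, and thresholds are illustrative):

```python
def roc_points(scores, labels, thresholds):
    """(FPR, TPR) at each threshold; predict positive when score >= t."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(thresholds, reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))   # x = FPR, y = TPR (recall)
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
print(roc_points(scores, labels, thresholds=[0.85, 0.5, 0.2]))
# [(0.0, 0.33...), (0.5, 0.66...), (1.0, 1.0)]
```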
This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to as “Precision at k”, for example “Precision at 10”. It has the advantage of not requiring any estimate of the size of the set of relevant documents but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.
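A direct sketch of the measure (the ranking and relevance judgments are illustrative):

```python
def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

ranked = [3, 1, 7, 5, 9, 2, 8, 4, 6, 10]
print(precision_at_k(ranked, relevant={1, 2, 3, 4}, k=10))   # 0.4
```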
Techniques for expanding or reformulating query terms independent of the query and results returned from it:
- Query Expansion/Reformulation with a thesaurus or WordNet
- via automatic thesaurus generation
- spelling correction
Adjust a query relative to the documents that initially appear to match the query.
- Relevance Feedback
- Pseudo Relevance Feedback
- Global indirect relevance feedback
Involve the user in the retrieval process so as to improve the final result set.
An automatic local analysis: do normal retrieval to find an initial set of most relevant documents, then assume that the top k ranked documents are relevant, and finally do relevance feedback as before under this assumption.
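Relevance feedback is classically implemented with the Rocchio formula (the standard formulation, not spelled out in these notes): move the query vector toward the centroid of the known-relevant documents D_r and away from the non-relevant ones D_nr:

```latex
\vec{q}_m = \alpha \vec{q}_0
  + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
  - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j
```

Pseudo relevance feedback applies the same update with the top-k results standing in for D_r.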
Belongs to clickstream mining: using click data, a result should be ranked more highly when users choose to look at it more often.
The most common form of query expansion uses some form of thesaurus. For each term in the query, the query can be automatically expanded with synonyms and related words from the thesaurus. However, the added terms should be weighted less than the original query terms.
Query expansion is often effective in increasing recall. However, there is a high cost to manually producing a thesaurus and then updating it for scientific and terminological developments. At the same time, query expansion may also significantly decrease precision, particularly when the query contains ambiguous terms.
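A minimal sketch of thesaurus-based expansion with down-weighted synonyms, as described above (the thesaurus dict, weight, and query are made-up examples):

```python
# Hypothetical toy thesaurus for illustration only.
THESAURUS = {"leak": ["rupture", "spill"], "pipeline": ["pipe"]}

def expand_query(terms, syn_weight=0.5):
    """Return (term, weight) pairs: original terms at weight 1.0,
    thesaurus synonyms at a lower weight."""
    weighted = [(t, 1.0) for t in terms]
    for t in terms:
        weighted += [(syn, syn_weight) for syn in THESAURUS.get(t, [])]
    return weighted

print(expand_query(["pipeline", "leak"]))
# [('pipeline', 1.0), ('leak', 1.0), ('pipe', 0.5), ('rupture', 0.5), ('spill', 0.5)]
```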
Chain Rule: P(A, B) = P(A) · P(B | A)
We model the probability of relevance P(R = 1 | d, q) of a document d to a query q.
And also: P(R = 0 | d, q) = 1 − P(R = 1 | d, q)
Odds of relevance: O(R | d, q) = P(R = 1 | d, q) / P(R = 0 | d, q)
| Document | Relevant (R = 1) | Non-relevant (R = 0) |
|---|---|---|
| Term present (x_t = 1) | p_t | u_t |
| Term absent (x_t = 0) | 1 − p_t | 1 − u_t |
For trials with categorical outcomes (such as noting the presence or absence of a term), one way to estimate the probability of an event from data is simply to count the number of times the event occurred and divide by the total number of trials (the maximum likelihood estimate).
RSV - Retrieval Status Value
Basically, it is a method that weights query terms using term frequency and document length.
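This description matches Okapi BM25; its standard RSV formula (k_1 and b are tuning parameters, L_d and L_ave the document and average document lengths, N the collection size, df_t the document frequency of term t):

```latex
RSV_d = \sum_{t \in q} \log\frac{N}{\mathit{df}_t} \cdot
  \frac{(k_1 + 1)\,\mathit{tf}_{t,d}}
       {k_1\big((1 - b) + b \cdot L_d / L_{\mathit{ave}}\big) + \mathit{tf}_{t,d}}
```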
Comparison between the Probability Model and the Language Model
Similar in some ways:
- Term weights based on frequency
- Terms often used as if they were independent
- Inverse document/collection frequency used
- Some form of length normalization useful

Different in others:
- Based on probability rather than similarity
- Intuitions are probabilistic rather than geometric
- Details of use of document length and term, document, and collection frequency differ
Language Model:
Approaches using language models can be categorized into three types:
query likelihood, document likelihood, and model comparison.
Query likelihood: rank each document d by P(q | M_d), the probability that the document's language model generates the query.
Document likelihood: rank by P(d | M_q), the probability that a language model built from the query generates the document.
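A minimal query-likelihood sketch using a unigram model with linear (Jelinek-Mercer) smoothing against the collection model (λ and the toy documents are made-up; smoothing keeps query terms unseen in a document from zeroing its score):

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(q | M_d) under a unigram model, linearly smoothed:
    P(t | d) = lam * tf/|d| + (1 - lam) * cf/|collection|."""
    d = Counter(doc)
    c = Counter(collection)
    dl, cl = len(doc), len(collection)
    score = 0.0
    for t in query:
        p = lam * d[t] / dl + (1 - lam) * c[t] / cl
        score += math.log(p)
    return score

docs = [["pipeline", "leaks", "pipeline"], ["gas", "prices", "rise"]]
collection = [t for d in docs for t in d]
q = ["pipeline", "leaks"]
for d in docs:
    print(query_likelihood(q, d, collection))
# The first document scores higher for this query.
```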