@xtccc 2015-12-08T01:52:16.000000Z

Mapping, Index and Analyzers

ElasticSearch

References
+ Mapping
+ Data In, Data Out
+ Analysis and Analyzers

Mapping

What is a mapping?

Types are also called mapping types because they’re typically used as containers for different types of documents—documents with different structures. The definition of fields in each type is called a mapping.

The mapping is automatically created with your new document, and it automatically detects your fields' types. If you add a new document with yet another new field, Elasticsearch guesses its type too, and appends the new field to the mapping.
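
To make the dynamic detection concrete, here is a minimal Python sketch of how a type could be guessed from a JSON value and appended to a mapping. The names `guess_type` and `update_mapping` and the sample documents are hypothetical, and the rules are a deliberate simplification, not Elasticsearch's actual detection logic (which also handles dates, nested objects, and more).

```python
def guess_type(value):
    """Guess a field type from a JSON value (simplified, hypothetical rules)."""
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def update_mapping(mapping, document):
    """Add any field not yet in the mapping; existing fields keep their type."""
    for field, value in document.items():
        mapping.setdefault(field, {"type": guess_type(value)})
    return mapping


mapping = {}
update_mapping(mapping, {"name": "group one", "organizer": "someone"})
# A later document with a previously unseen field extends the mapping.
update_mapping(mapping, {"name": "group two", "members": 42})
print(mapping)
```

The second call leaves `name` untouched and only adds `members`, which mirrors the "appends the new field to the mapping" behaviour described above.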

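The mapping process defines how a document is mapped to the search engine, for example which fields are searchable and how they are tokenized.

An explicit mapping is defined at the index/type level. By default we do not need to define an explicit mapping, because ES creates one for us behind the scenes, but we can also override the default mapping that ES creates.

With the Put Mapping API we can create a mapping; we can also add multiple mappings when creating an index.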

Viewing a mapping

$ curl 'ecs1:9200/get-together/group/_mapping?pretty'
{
  "get-together" : {
    "mappings" : {
      "group" : {
        "properties" : {
          "name" : {
            "type" : "string"
          },
          "organizer" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

Document

In Elasticsearch, all data in every field is indexed by default. That is, every field has its own dedicated inverted index.
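
As a rough illustration of "one inverted index per field", the following Python sketch builds a separate term-to-document lookup for each field. The function name `build_inverted_indexes` and the sample documents are made up for this example, and the analysis step is a naive lowercase-and-split, far simpler than what Elasticsearch really does.

```python
from collections import defaultdict


def build_inverted_indexes(docs):
    """Return {field: {term: set of doc ids}}, one inverted index per field."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in docs.items():
        for field, text in doc.items():
            # Naive analysis: lowercase and split on whitespace.
            for term in text.lower().split():
                indexes[field][term].add(doc_id)
    return indexes


docs = {
    1: {"name": "Denver Search", "organizer": "Lee"},
    2: {"name": "Search Meetup", "organizer": "Andy"},
}
indexes = build_inverted_indexes(docs)
print(sorted(indexes["name"]["search"]))    # doc ids whose name contains "search"
print(sorted(indexes["organizer"]["lee"]))  # doc ids whose organizer contains "lee"
```

Because each field has its own index, the term `lee` is found through `organizer` but not through `name`.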

A document can be expressed as a JSON object.
A document contains not only data but also metadata; at minimum it carries the following three metadata elements:

_index: where the document lives, comparable to a database in a traditional DB.

_type: the class of object that the document represents, comparable to a table in a traditional DB. Each type has its own mapping (schema definition), which defines the data structure for documents of that type.

_id: the unique identifier of the document.

Index

Create an Index

When creating an index, you can specify settings specific to that index, in either YAML or JSON, as in Example 1 and Example 2 below:

Example 1:

curl -XPUT localhost:9200/twitter -d '
index:
    number_of_shards: 3
    number_of_replicas: 2
'

Example 2:

curl -XPUT 'localhost:9200/twitter' -d ' 
{ 
 "settings" : {
     "index" : {
         "number_of_shards" : 3,
         "number_of_replicas" : 2
   }
 }
} '
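
Both requests ask for 3 primary shards with 2 replicas each. As a quick check of what that implies, the sketch below rebuilds the same JSON body in Python and counts the total number of shard copies. `build_settings` and `total_shard_copies` are hypothetical helper names for illustration, not part of any Elasticsearch client.

```python
import json


def build_settings(shards, replicas):
    """Build the same JSON body as the JSON request above."""
    return {"settings": {"index": {"number_of_shards": shards,
                                   "number_of_replicas": replicas}}}


def total_shard_copies(body):
    """Every primary shard gets `number_of_replicas` replica copies."""
    index = body["settings"]["index"]
    return index["number_of_shards"] * (1 + index["number_of_replicas"])


body = build_settings(3, 2)
print(json.dumps(body, indent=2))
print(total_shard_copies(body))  # 3 primaries + 6 replicas = 9
```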

For other index settings, see Index Modules.

Mappings

Creating a Mapping

A mapping can be specified when creating an index:

curl -XPOST 'localhost:9200/twitter?pretty' -d ' {
  "settings" : {
      "number_of_shards" : 1
  },
  "mappings": {
      "type_1" : {
          "_source" : { "enabled" : false},
          "properties" : {
              "field_1" : { "type" : "string", "index" : "not_analyzed" }
          }
      }
  }
} '

A mapping can also be created for an existing index:

curl -XPUT 'localhost:9200/twitter?pretty'
curl -XPUT 'localhost:9200/twitter/_mapping/tweet' -d ' 
{
    "tweet" : {
        "properties" : {
            "message" : { "type" : "string", "store" :  true}
        }
    }
} '

The example above creates a mapping named tweet in the twitter index. The mapping specifies that the message field should be stored (by default, fields are not stored, just indexed).

Querying a Mapping

curl -XGET 'localhost:9200/twitter/_mapping/tweet?pretty'

Analysis and Analyzers

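Analysis consists of the following steps (performed by an analyzer):

Tokenizing a block of text

Normalizing the resulting terms

An analyzer is made up of the following components:

Character filter: filters the raw string, e.g. converts & to and

Tokenizer: splits the string into tokens

Token filter: filters the terms produced by the tokenizer, e.g. lowercases them and removes stop words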
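
The three components can be pictured as a small pipeline. Below is a minimal Python sketch of that idea; the function names and the tiny stop-word list are assumptions made for illustration and do not correspond to Elasticsearch's real implementations.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "is"}  # tiny illustrative list


def char_filter(text):
    """Character filter: rewrite the raw string, e.g. '&' -> 'and'."""
    return text.replace("&", " and ")


def tokenizer(text):
    """Tokenizer: split the string into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Token filters: lowercase each term, then drop stop words."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOP_WORDS]


def analyze(text):
    """Run the three stages in order."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("Tom & Jerry is THE Best"))  # ['tom', 'jerry', 'best']
```

Note the order: the character filter sees the whole string before tokenization, while token filters only ever see individual terms.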

Some of the analyzers built into ES:
· Standard analyzer
· Simple analyzer
· Whitespace analyzer
· Language analyzer

Language analyzers are available for many languages, including Chinese.

When analyzers are used

When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search on a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those that exist in the index.

Full-text queries understand how each field is defined, so they can do the right thing:

When you query a full-text field, the query will apply the same analyzer to the query string to produce the correct list of terms to search for.

When you query an exact-value field, the query will not analyze the query string, but instead search for the exact value that you have specified.
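
The difference between the two cases can be sketched in a few lines of Python. The field names, the `FIELDS` table, and the stand-in `analyze` function are all hypothetical; the point is only that the query string goes through the analyzer for a full-text field and is used verbatim for an exact-value field.

```python
def analyze(text):
    """Stand-in analyzer: lowercase and split on whitespace."""
    return text.lower().split()


# Hypothetical field definitions: which fields hold analyzed full text.
FIELDS = {"message": "full_text", "status": "exact_value"}


def index_value(field, value):
    """Terms stored in the inverted index for a field value."""
    return analyze(value) if FIELDS[field] == "full_text" else [value]


def query_terms(field, query):
    """Terms searched for: analyzed for full text, verbatim otherwise."""
    return analyze(query) if FIELDS[field] == "full_text" else [query]


def matches(field, value, query):
    """True if any query term is present among the indexed terms."""
    indexed = set(index_value(field, value))
    return any(term in indexed for term in query_terms(field, query))


print(matches("message", "Quick Brown Fox", "QUICK"))  # True
print(matches("status", "Published", "published"))     # False
print(matches("status", "Published", "Published"))     # True
```

The second call fails to match precisely because neither the indexed value nor the query is normalized for an exact-value field.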

What are full-text fields and exact-value fields?

Basic analyzer usage

Below, we use the standard analyzer to tokenize a sentence:

1. Without specifying an index

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'hello 100, this is me!'
--> returns the following result
{
  "tokens" : [ {
    "token" : "hello",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "100",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "<NUM>",
    "position" : 2
  }, {
    "token" : "this",
    "start_offset" : 11,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "is",
    "start_offset" : 16,
    "end_offset" : 18,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "me",
    "start_offset" : 19,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}
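
To see where the offsets and positions in this response come from, here is a rough Python approximation. It is only a sketch: `simple_standard_analyze` is a hypothetical name, the regex is much cruder than the real standard tokenizer, and positions are numbered from 1 only to match the response shown above.

```python
import re


def simple_standard_analyze(text):
    """Approximate the response shape: lowercase word tokens with offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        token = match.group().lower()
        tokens.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            # Crude type guess; the real tokenizer knows more token types.
            "type": "<NUM>" if token.isdigit() else "<ALPHANUM>",
            "position": position,
        })
    return tokens


for t in simple_standard_analyze("hello 100, this is me!"):
    print(t)
```

Punctuation such as `,` and `!` never becomes a token, which is why the offsets jump from 9 to 11 between `100` and `this`.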

2. Tokenizing HTML

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip&pretty' -d 'this is a <b>test</b>'
--> returns the following result
{
  "tokens" : [ {
    "token" : "this is a test",
    "start_offset" : 0,
    "end_offset" : 21,
    "type" : "word",
    "position" : 1
  } ]
}
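
The request above combines a character filter, a tokenizer, and a token filter by name. A minimal Python sketch of the same chain follows; the three functions are simplified stand-ins (the tag-stripping regex in particular is naive) and are assumptions for illustration, not the real html_strip, keyword, or lowercase implementations.

```python
import re


def html_strip(text):
    """Character filter: remove HTML tags (naive regex, for illustration)."""
    return re.sub(r"<[^>]+>", "", text)


def keyword_tokenizer(text):
    """Keyword tokenizer: emit the entire input as a single token."""
    return [text]


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


original = "this is a <b>test</b>"
tokens = lowercase(keyword_tokenizer(html_strip(original)))
print(tokens)         # ['this is a test']
print(len(original))  # 21
```

The single token is 14 characters long, yet the response reports an end_offset of 21, which equals the length of the original input; this suggests the offsets refer to the text before the character filter ran.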

3. Analyzing against a specific index

curl -XGET 'localhost:9200/test/_analyze?text=100+this+is+a+test+!'

Of course, this requires that the test index exists and that it has an analyzer (every index has a default analyzer).