Xiaopi's Blog

78-ElasticSearch: combining pinyin, Chinese word segmentation, and autocomplete

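The goal is to combine several Elasticsearch features: Chinese word segmentation, pinyin matching, autocomplete, and correction of Chinese typos or English misspellings. Hot (trending) words must be supported as well.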

IK Chinese word segmentation

Installation

The IK plugin version must match the Elasticsearch version exactly (5.5.3 here).

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.3/elasticsearch-analysis-ik-5.5.3.zip
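Since the plugin and Elasticsearch versions have to agree, it can help to derive the download URL from a single version string. The following is only a sketch: `ik_plugin_url` is a hypothetical helper, and it assumes other releases follow the same naming pattern as the v5.5.3 URL above.

```shell
# Hypothetical helper: print the IK release URL for a given Elasticsearch version.
# Assumption: releases follow the same naming pattern as the v5.5.3 URL.
ik_plugin_url() {
  printf 'https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v%s/elasticsearch-analysis-ik-%s.zip\n' "$1" "$1"
}

# The running node reports its version as "version.number" in: curl -s localhost:9200
ES_VERSION="5.5.3"
ik_plugin_url "$ES_VERSION"

# To install with the derived URL:
#   ./bin/elasticsearch-plugin install "$(ik_plugin_url "$ES_VERSION")"
```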

Configuration

IK's configuration files live in es/config/analysis-ik:
vim IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Custom extension dictionaries can be configured here -->
    <entry key="ext_dict"></entry>
    <!-- Custom extension stopword dictionaries can be configured here -->
    <entry key="ext_stopwords"></entry>
    <!-- A remote extension dictionary can be configured here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- A remote extension stopword dictionary can be configured here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
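One way to support the hot words mentioned in the introduction is a custom dictionary referenced from `ext_dict`. The sketch below is an assumption, not the author's setup: the file name `custom/hotwords.dic`, the `IK_CONF_DIR` variable, and the `write_hotwords` helper are all hypothetical. An IK dictionary is a UTF-8 text file with one word per line, and a local dictionary is only picked up after Elasticsearch restarts.

```shell
# Hypothetical location; point this at your own es/config/analysis-ik directory.
IK_CONF_DIR="${IK_CONF_DIR:-./config/analysis-ik}"

# Hypothetical helper: write each argument as one line of a custom dictionary.
write_hotwords() {
  mkdir -p "$IK_CONF_DIR/custom"
  printf '%s\n' "$@" > "$IK_CONF_DIR/custom/hotwords.dic"
}

write_hotwords "刘德华" "国歌"

# Then reference the file in IKAnalyzer.cfg.xml (path relative to analysis-ik):
#   <entry key="ext_dict">custom/hotwords.dic</entry>
```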

Basic operations

Testing the analyzers

  • ik_max_word: the finest-grained segmentation, producing as many tokens as possible
curl -XGET 'http://localhost:9200/_analyze/?pretty' -d '{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}'
  • Response:
{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中华人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "华人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民共和国",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共和国",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "共和",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "国歌",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 9
    }
  ]
}
  • ik_smart: smart segmentation, the coarsest-grained split
curl -XGET 'http://localhost:9200/_analyze/?pretty' -d '{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}'
  • Response:
{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "国歌",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
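To compare the two analyzers side by side, it helps to reduce each response to just its token strings. Here is a rough sketch using only grep, sed, and paste. `tokens` is a hypothetical helper name, and the pattern assumes token text never contains a double quote.

```shell
# Hypothetical helper: print the "token" values of an _analyze response on one line.
# Reads the JSON response from stdin; works on pretty-printed or compact output.
tokens() {
  grep -o '"token" *: *"[^"]*"' | sed 's/^"token" *: *"//; s/"$//' | paste -sd' ' -
}

# Offline check against a trimmed copy of the ik_smart response shown above.
printf '%s\n' '{"tokens":[{"token":"中华人民共和国","position":0},{"token":"国歌","position":1}]}' | tokens

# Against a live node (assumes Elasticsearch on localhost:9200):
#   curl -s -XGET 'http://localhost:9200/_analyze' -d '{"analyzer":"ik_max_word","text":"中华人民共和国国歌"}' | tokens
```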

Creating an index that uses the Chinese analyzer

Listing the current indices

curl -XGET 'localhost:9200/_cat/indices?v&pretty'

Creating an empty index

curl -XPUT 'localhost:9200/zhongwen/?pretty'

Setting the mapping

curl -XPOST 'localhost:9200/zhongwen/news/_mapping?pretty' -d '{
  "news": {
    "_all": {
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word",
      "term_vector": "no",
      "store": false
    },
    "properties": {
      "content": {
        "type": "text",
        "store": false,
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": true,
        "boost": 8
      }
    }
  }
}'

Testing

Inserting data

curl -XPOST 'http://localhost:9200/zhongwen/news/?pretty' -d '{"content":"刘德华"}'
curl -XPOST 'http://localhost:9200/zhongwen/news/?pretty' -d '{"content":"中华人民共和国国歌"}'

Querying

TODO: to be added.
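The query step is still a TODO. As a placeholder, one possible check is a `match` query on `content` with highlighting. This is a sketch under assumptions rather than the author's own test: `search_body` is a hypothetical helper, and the keyword `国歌` is just one of the tokens ik_max_word produced earlier.

```shell
# Hypothetical helper: build a match query on "content" with highlighting.
# Assumes the keyword contains no double quotes or backslashes (no JSON escaping).
search_body() {
  printf '{"query":{"match":{"content":"%s"}},"highlight":{"pre_tags":["<em>"],"post_tags":["</em>"],"fields":{"content":{}}}}\n' "$1"
}

search_body "国歌"

# Against a live node (assumes Elasticsearch on localhost:9200):
#   curl -XPOST 'http://localhost:9200/zhongwen/news/_search?pretty' -d "$(search_body "国歌")"
```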

CRF Chinese word segmentation

TODO: to be added.

pinyin plugin

Installation

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v5.5.3/elasticsearch-analysis-pinyin-5.5.3.zip

Configuration

TODO: no configuration appears to be needed.

Basic operations

Creating a pinyin index
curl -XPUT 'http://localhost:9200/pinyin_plug_test_index/?pretty' -d '
{
  "index": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}'

Verification

Basic test

curl -XGET 'http://localhost:9200/pinyin_plug_test_index/_analyze?pretty' -d '
{
"text": ["刘德华"],
"analyzer": "pinyin_analyzer"
}'

Everything else is still TODO; see the official documentation: https://github.com/medcl/elasticsearch-analysis-pinyin

Autocomplete

Hot words

Typo-correction plugin

Combining pinyin and IK

Installation and configuration

See the sections above.

Creating and using the index

Creating the index

curl -XPUT 'http://localhost:9200/ik_plus_pinyin/?pretty' -d '
{
  "index": {
    "analysis": {
      "analyzer": {
        "ngram_pinyin_analyzer": {
          "tokenizer": "keyword",
          "filter": ["full_pinyin_with_space", "word_delimiter", "shingle", "remove_whitespace"]
        },
        "my_pinyin_analyzer": {
          "tokenizer": "keyword",
          "filter": ["full_pinyin_no_space"]
        }
      },
      "filter": {
        "full_pinyin_no_space": {
          "type": "pinyin",
          "first_letter": "none",
          "padding_char": ""
        },
        "full_pinyin_with_space": {
          "type": "pinyin",
          "first_letter": "none",
          "padding_char": " "
        },
        "my_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "5",
          "token_chars": ["letter", "digit"]
        },
        "remove_whitespace": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        }
      }
    }
  }
}'

Creating the mapping for the type

curl -XPOST 'http://localhost:9200/ik_plus_pinyin/keyword/_mapping?pretty' -d '
{
  "properties": {
    "name1": {
      "type": "text",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "ngram_pinyin_analyzer"
        },
        "full_pinyin": {
          "type": "text",
          "analyzer": "my_pinyin_analyzer"
        },
        "first_letter": {
          "type": "text",
          "analyzer": "pinyin"
        },
        "name1": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}'

Basic operations

Inserting sample data

curl -XPOST 'http://localhost:9200/ik_plus_pinyin/keyword/?pretty' -d '{"name1":"刘德华"}'
curl -XPOST 'http://localhost:9200/ik_plus_pinyin/keyword/?pretty' -d '{"name1":"中华人民共和国国歌"}'

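Building a test query

A weighted test query

  • The query covers several cases: full pinyin, first letters, and mixed Chinese and pinyin input. Try setting the query field to any of
    ["ldh", "ldehua", "liu", "dehua", "liu hua", "hua", "刘", "德华", "华", "l德hua"]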
curl -XPOST 'http://localhost:9200/ik_plus_pinyin/keyword/_search?size=50&pretty' -d '
{
"query": {
"query_string": {
"fields": ["name1^100","name1.full_pinyin^30","name1.pinyin^20","name1.first_letter^10"],
"query": "l德hua",
"default_operator": "OR"
}
}
}'
  • Response:
{
"took" : 92,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 80.25915,
"hits" : [
{
"_index" : "ik_plus_pinyin",
"_type" : "keyword",
"_id" : "AWbF8CkihRb6RLuoMFEG",
"_score" : 80.25915,
"_source" : {
"name1" : "刘德华"
}
},
{
"_index" : "ik_plus_pinyin",
"_type" : "keyword",
"_id" : "AWbF8CyChRb6RLuoMFEH",
"_score" : 4.598851,
"_source" : {
"name1" : "中华人民共和国国歌"
}
}
]
}
}
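To try every sample term from the list above rather than editing the request by hand, the weighted query can be wrapped in a small loop. This is only a sketch: `build_query`, `run_all`, and the `ES_URL` variable are hypothetical names, and no JSON escaping is done, so it only suits simple terms like the ones listed.

```shell
# Hypothetical helper: build the weighted query_string body for one search term.
# Assumes the term contains no double quotes or backslashes (no JSON escaping).
build_query() {
  printf '{"query":{"query_string":{"fields":["name1^100","name1.full_pinyin^30","name1.pinyin^20","name1.first_letter^10"],"query":"%s","default_operator":"OR"}}}\n' "$1"
}

# Hypothetical helper: run every sample term against a live cluster.
# ES_URL is an assumed variable; it defaults to the localhost node used in this post.
run_all() {
  for q in "ldh" "ldehua" "liu" "dehua" "liu hua" "hua" "刘" "德华" "华" "l德hua"; do
    echo "== $q"
    curl -s -XPOST "${ES_URL:-http://localhost:9200}/ik_plus_pinyin/keyword/_search?size=50&pretty" -d "$(build_query "$q")"
  done
}

# Print one request body without contacting the cluster.
build_query "l德hua"
```

Call `run_all` once the index and sample documents exist; comparing the `_score` values across terms shows how the field boosts interact.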

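Copyright notice

Title: 78-ElasticSearch: combining pinyin, Chinese word segmentation, and autocomplete

Author: 盛领

Published: 2018-10-29 23:48:59

Original link: http://blog.xiaoyuyu.net/post/dd5c815a.html

License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Please keep the original link and author when reposting.

For commercial cooperation or licensing inquiries, please contact: sunsetxiao@126.com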
