Key concepts in full-text search: Analyzer, tokenizer, token filter, char filter

2012-06-26 19:44

Analyzer:

The index analysis module acts as a configurable registry of Analyzers that can be used in order to both break indexed (analyzed) fields when a document is indexed and process query strings. It maps to the Lucene Analyzer.

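In other words, the index analysis module is a configurable registry that holds many Analyzers. Each Analyzer can be used at index time to break a document's analyzed fields into tokens (terms), and at search time to process the query string. I could not tell whether the "It" in the last sentence refers to the analysis module or to an analyzer; either way, the analyzers registered here are Lucene Analyzers.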

Analyzers are (generally) composed of a single Tokenizer and zero or more TokenFilters.

An Analyzer is usually made up of exactly one Tokenizer plus zero or more TokenFilters.

A set of CharFilters can be associated with an analyzer to process the characters prior to other analysis steps.
An Analyzer can have several CharFilters attached. The CharFilters run before every other analysis step.

Char filters allow one to filter out the stream of text before it gets tokenized (used within an Analyzer).
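
To make the order of the steps concrete, here is a minimal sketch in plain Python. It is not Lucene or Elasticsearch code; the names analyze, spell_out_ampersand and lowercase are made up for illustration. It only shows the shape of the pipeline described above: char filters first, then a single tokenizer, then the token filters in order.

```python
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            tokenizer: Tokenizer,
            char_filters: Iterable[CharFilter] = (),
            token_filters: Iterable[TokenFilter] = ()) -> List[str]:
    # 1. Char filters rewrite the raw text before it is tokenized.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. Exactly one tokenizer splits the text into tokens.
    tokens = tokenizer(text)
    # 3. Zero or more token filters transform the token list, in order.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


def spell_out_ampersand(text: str) -> str:
    # Hypothetical char filter: replace "&" with the word "and".
    return text.replace("&", " and ")


def lowercase(tokens: List[str]) -> List[str]:
    # Hypothetical token filter: lowercase every token.
    return [t.lower() for t in tokens]


print(analyze("Tom & Jerry", str.split, [spell_out_ampersand], [lowercase]))
# ['tom', 'and', 'jerry']
```

Here str.split stands in for a whitespace tokenizer. Swapping in a different tokenizer or a different list of filters yields a different analyzer, which is the idea behind the registry.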

Whitespace Analyzer:
An analyzer of type whitespace that is built using a Whitespace Tokenizer.
A tokenizer of type whitespace that divides text at whitespace.
The whitespace analyzer is simply built on top of the whitespace tokenizer.
The whitespace tokenizer splits text wherever it finds whitespace.
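
As a rough illustration (plain Python, not the Lucene implementation), splitting at whitespace behaves like this. The function name whitespace_tokenize is my own.

```python
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    # str.split() with no argument splits on runs of whitespace
    # (spaces, tabs, newlines) and never returns empty strings.
    return text.split()


print(whitespace_tokenize("The quick  brown-fox\tjumps"))
# ['The', 'quick', 'brown-fox', 'jumps']
```

Note that case and punctuation are left alone: "The" keeps its capital letter and "brown-fox" stays a single token.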

Snowball Analyzer
An analyzer of type snowball that uses the standard tokenizer, with standard filter, lowercase filter, stop filter, and snowball filter.

The Snowball Analyzer is a stemming analyzer from Lucene that is originally based on the snowball project from snowball.tartarus.org.

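Snowball analyzer: uses the standard tokenizer together with the standard filter, lowercase filter, stop filter and snowball filter.
The Snowball analyzer is a stemming analyzer that comes from Lucene; it was originally based on the Snowball project at snowball.tartarus.org.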

keyword analyzer:
An analyzer of type keyword that “tokenizes” an entire stream as a single token. This is useful for data like zip codes, ids and so on.
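It treats the whole input string as one single token. This analyzer works well for values such as zip codes and ids, which should be matched exactly rather than split up.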
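
A tiny sketch of the idea in plain Python (keyword_analyze is an illustrative name, not a real API): the input is returned as one token, whereas splitting at whitespace would break a value like a postal code in two.

```python
from typing import List


def keyword_analyze(text: str) -> List[str]:
    # The entire input becomes exactly one token, unchanged.
    return [text]


print(keyword_analyze("SW1A 1AA"))  # ['SW1A 1AA']
print("SW1A 1AA".split())           # ['SW1A', '1AA']
```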

Stop Analyzer:
An analyzer of type stop that is built using a Lower Case Tokenizer, with Stop Token Filter.
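Stop analyzer: uses the Lower Case tokenizer and the Stop token filter. The tokenizer splits and lowercases the text first; the stop filter then removes stop words from the resulting tokens.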
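
The two stages can be sketched in plain Python as follows. This is only an approximation: I am assuming the Lower Case tokenizer splits at non-letter characters and lowercases each token, and the STOP_WORDS set is a small made-up subset, not the real default list.

```python
import re
from typing import List

# Illustrative subset only; the real default stop word list is longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def lower_case_tokenize(text: str) -> List[str]:
    # [^\W\d_]+ matches runs of letters, so digits and punctuation
    # act as separators. Each token is lowercased.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def stop_filter(tokens: List[str]) -> List[str]:
    # Drop every token that is a stop word.
    return [t for t in tokens if t not in STOP_WORDS]


def stop_analyze(text: str) -> List[str]:
    return stop_filter(lower_case_tokenize(text))


print(stop_analyze("The Quick fox is 2 fast"))
# ['quick', 'fox', 'fast']
```

Lowercasing before the stop filter matters: "The" would not match the lowercase entry "the" otherwise.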

Language Analyzers:
A set of analyzers aimed at analyzing specific language text. The following types are supported: arabic, ... chinese, ... thai.
A group of analyzers, each aimed at text in one specific language.

Custom Analyzer
An analyzer of type custom that allows to combine a Tokenizer with zero or more Token Filters, and zero or more Char Filters. The custom analyzer accepts a logical/registered name of the tokenizer to use, and a list of logical/registered names of token filters.

Custom analyzer: made up of one tokenizer, any number of token filters, and any number of char filters.
For an example, see: http://www.elasticsearch.org/guide/reference/index-modules/analysis/custom-analyzer.html
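
For orientation, a custom analyzer definition might look roughly like the structure below, written here as a Python dict and printed as JSON. The analyzer name my_custom_analyzer is hypothetical, and the choice of the standard tokenizer, the lowercase and stop token filters and the html_strip char filter is just an example. The exact place this block goes in your index settings depends on the Elasticsearch version and on how the configuration is supplied, so treat the linked page as the authority.

```python
import json

# Hypothetical analyzer definition: one tokenizer, a list of token
# filters, and a list of char filters, all referenced by registered name.
analysis_settings = {
    "analysis": {
        "analyzer": {
            "my_custom_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "stop"],
                "char_filter": ["html_strip"],
            }
        }
    }
}

print(json.dumps(analysis_settings, indent=2))
```

The tokenizer entry takes a single name, while filter and char_filter take lists, matching the "one tokenizer, zero or more filters" rule.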
