reindexer

Full text search with Reindexer

Reindexer has builtin full text search engine. This document describes usage of full text search.

LIKE

For simple search in text can be used operator LIKE. It search strings which match a pattern. In the pattern _ means any char and % means any sequence of chars.

    In Go:
    query := db.Query("items").
        Where("field", reindexer.LIKE, "pattern")

    In SQL:
    SELECT * FROM items WHERE fields LIKE 'pattern'
    'me_t' corresponds to 'meet', 'meat', 'melt' and so on
    '%tion' corresponds to 'tion', 'condition', 'creation' and so on

Define full text index fields

Full text search is performed in fields marked with text tag:

type Item struct {
    ID          int64  `reindex:"id,,pk"`
    Description string `reindex:"description,text"`
}

Full text search is also available for multiple fields of composite index marked with text tag:

type Item struct {
    ID          int64  `reindex:"id,,pk"`
    Name        string `reindex:"name,-"`
    Description string `reindex:"description,-"`
    _ struct{}         `reindex:"name+description=text_search,text,composite`
}

In this example full text index will include fields name and description,text_search is short alias of composite index name for using in Queries.

Full text index is case insensitive. The source text is tokenized to set of words. Word is sequence of any UTF-8 letters, digits or +, - or / symbols. First symbol of word is letter or digit.

Query to full text index

Queries to full text index are constructed by usual query interface

    query := db.Query ("items").
        Match ("name+description","text query","<stemmers>")

Or equivalent query using name alias:

    query := db.Query ("items").
        Match ("text_search","text query","<stemmers>")

Queries to full text index can be combined with conditions on another fields. e.g:

    query := db.Query ("items").
        Match ("description","text query").
        WhereInt("year",reindexer.GT,2010)

Each result of query contains rank of match. Rank is integer from 0 to 255. 0 - lowest relevancy, 255 - best relevancy. The query Iterator has method Rank(), which returns rank of current result

Note: <stemmers> see more Stemming.

Text query format

The format of query is:

query := [@[+]field[^boost][,field2[^boost]]] [=][*]term1[*][~][^boost] [+|-][*]term2[*][~][@][^boost] ...

Patterns

Field selection

Binary operators

Escape character

\ character allows to include any special character (+,-,@,*,^,~) to a word to be searched: \*crisis means to search literally *crisis and not all words that end with ‘crisis’.

The DSL’s operand can be a phrase: "word1 word2 ..."[~N]. In this case word1, word2, … are the words that make up the phrase in the given sequence. Words order is matter, i.e. sequence word2 word1 will not be found with "word1 word2" DSL.

~N - max distance (in words) between adjacent word positions. This argument is optional, default value for N is 1.

Synonyms of multiple words are not supported in the phrase.

Examples of text queris

Natural language processing

There are built in stemmers support in full text search. It enables natural language search of words with same stem. For example, query users will also match user. Stemmer is language specific, so it is neccessary to specify language of used stemmer.

All the available stemmers are in this [directory].(cpp_src/vendor/libstemmer/src_c)

Merging queries results

It is possible to merge multiple queries results and sort final result by relevancy.

    query := db.Query ("items").
        Match ("description","text query1")
    q2 := db.Query ("another_items").
        Match ("description","text query2")
    query.Merge (q2)
    iterator = query.Exec ()
    // Check the error
    if err := iterator.Error(); err != nil {
        panic(err)
    }
    // Iterate over results
    for iterator.Next() {
        // Get the next document and cast it to a pointer
        switch elem := iterator.Object().(type) {
            case Item:
                fmt.Printf ("%v,rank=%d\n",*elem,iterator.Rank())
            case AnotherItem:
                fmt.Printf ("%v,rank=%d\n",*elem,iterator.Rank())
        }
    }

Using select functions

It is possible to use select functions to process result data. For now you can use snippet, snippet_n and highlight, debug_rank. For composite indexes the result of the function will be written in to corresponding subfields. You can not put [,)\0] symbols in functions params. If the value contains special characters, it must be enclosed in single quotes.

Notice: although text indexes may be created over numeric fields, select functions can not be applied to any non-string field.

For all the functions there are two types of supported syntax with the same behavior: field.func_name(...) and field = func_name(...).

Highlight

This functions just highlights text area that was found. It has two arguments -

Example: word: “some text”

b.Query("items").Match("text", query).Limit(limit).Offset(offset).Functions("text.highlight(<b>,</b>)")

result: “some text

Snippet

Snippet highlights text area and erase other text. It has six arguments - last two is default

Example: word: “some text”

b.Query("items").Match("text", query).Limit(limit).Offset(offset).Functions("text.snippet(<b>,</b>,2,0)")

result: “e text

Snippet_n

Snippet_n highlights text area and erase other text. More flexible version of the snippet function. It has 4 position arguments and 5 named arguments. The named arguments are optional and can be passed in any order.

String values must be enclosed into single quotes.

Parameters’ names may be specified without quotes or in double quotes.

Numbers may be passed without quotes or in single quotes.

Examples: word: “some text string”

b.Query("items").Match("text", query).Limit(limit).Offset(offset).Functions("text.snippet_n('<b>','</b>',2,2,pre_delim='{',post_delim='}',with_area=1)")

result: “{[3,11]e text s}”

b.Query("items").Match("text", query).Limit(limit).Offset(offset).Functions("text.snippet_n('<b>','</b>',5,5,pre_delim='{',post_delim='}',left_bound='o',right_bound='i')")

result: “{me text str}”

Debug_rank

This function outputs additional information about ranking of the found word in the text in the key-value format. Returned format and content may vary depending on reindexer’s version. Works with text-index only.

Typos algorithm

Reindexer finds misspelled words by matching terms with deleted symbols. MaxTypos in fulltext index configuration limits how many symbols could be deleted in two words (in query and searching document). In each word could be deleted up to MaxTypos / 2 with rounding to a large value.

Typos handling details

Typos handling algorithm is language independant and simply checks possible word permutations. There are a banch of parameters to tune it:

More examples

MaxTypos = 1 - 1 symbol could be deleted. black and blaack matches if delete excess a in the second word. black and block do not match.

MaxTypos = 2 - by 1 symbol could be deleted in each word (up to 2 symbols in sum). black and blaack match if delete excess a in the second word. black and block match if delete a in the first word and o in the second word. black and blok do not match.

MaxTypos = 3 - 1 symbol in one word and 2 symbols in another one could be deleted (up to 3 symbols in sum). black and blok match if delete ac in the first word and o in the second word.

Performance and memory usage

Internally reindexer uses enhanced suffix array of unique words, and compressed reverse index of documents. Typically size of index is about 30%-80% of source text. But can vary in corner cases.

The Upsert operation does not perform actual indexing, but just stores text. There are lazy indexing is implemented. So actually, full text index is building on first Query on fulltext field. The indexing is uses several threads, so it is efficiently utilizes resources of modern multi core CPU. Therefore the indexing speed is very high. On modern hardware indexing speed is about ~50MB/sec

But on huge text size lazy indexing can seriously slow down first Query to text index. To avoid this side-effect it is possible to warmup text index: just by dummy Query after last Upsert

Configuration

Several parameters of full text search engine can be configured from application side. To setup configuration use db.AddIndex or db.UpdateIndex methods:

...
    ftconfig := reindexer.DefaultFtFastConfig()
    // Setup configuration
    ftconfig.LogLevel = reindexer.TRACE
    // Setup another parameters
    // ...
    // Create index definition
    indexDef := reindexer.IndexDef {
        Name: "description",
        JSONPaths: []string{"description"},
        IndexType: "text",
        FieldType: "string",
        Config: ftconfig,
    }
    // Add index with configuration
    return db.AddIndex ("items",indexDef)

Base config parameters

  Parameter name Type Description Default value
  Bm25Boost float Boost of bm25 ranking 1
  Bm25Weight float Weight of bm25 rank in final rank 0: bm25 will not change final rank. 1: bm25 will affect to fin l rank in 0 - 100% range. 0.1
  DistanceBoost float Boost of search query term distance in found document. 1
  DistanceWeight float Weight of search query terms distance in found document in final rank 0: distance will not change final rank. 1: distance will affect to final rank in 0 - 100% range. 0.5
  TermLenBoost float Boost of search query term length 1
  TermLenWeight float Weight of search query term length in final rank. 0: term length will not change final rank. 1: term length will affect to final rank in 0 - 100% range 0.3
  PositionBoost float Boost of search query term position 1.0
  PositionWeight float Weight of search query term position in final rank. 0: term position will not change final rank. 1: term position will affect to final rank in 0 - 100% range 0.1
  FullMatchBoost float Boost of full match of search phrase with doc 1.1
  PartialMatchDecrease int Decrease of relevancy in case of partial match by value: partial_match_decrease * (non matched symbols) / (matched symbols) 15
  MinRelevancy float Minimum rank of found documents. 0: all found documents will be returned 1: only documents with relevancy >= 100% will be returned 0.05
  MaxTypos int Maximum possible typos in word. 0: typos are disabled, words with typos will not match. N: words with N possible typos will match. Check typos handling section for detailed description. 2
  MaxTyposInWord int Deprecated, use MaxTypos instead of this. Cannot be used with MaxTypos. Maximum possible typos in word. 0: typos is disabled, words with typos will not match. N: words with N possible typos will match. It is not recommended to set more than 1 possible typo -It will seriously increase RAM usage, and decrease search speed -
  MaxTypoLen int Maximum word length for building and matching variants with typos. 15
  FtTyposDetailedConfig struct Config for more precise typos algorithm tuning  
  MaxRebuildSteps int Maximum steps without full rebuild of ft - more steps faster commit slower select - optimal about 15. 50
  MaxStepSize int Maximum unique words to step 4000
  MergeLimit int Maximum documents count which will be processed in merge query results. Increasing this value may refine ranking of queries with high frequency words, but will decrease search speed 20000
  Stemmers []string List of stemmers to use. Available values: “en”, “ru”, “nl”, “fin”, “de”, “da”, “fr”, “it”, “hu”, “no”, “pt”, “ro”, “es”, “sv”, “tr” “en”,”ru”
  EnableTranslit bool Enable russian translit variants processing. e.g. term “luntik” will match word “лунтик” true
  EnableKbLayout bool Enable wrong keyboard layout variants processing. e.g. term “keynbr” will match word “лунтик” true
  StopWords []struct List of objects of stopwords. Words from this list will be ignored when building indexes, but may be used in fulltext queries (such as ‘word*’, ‘word~’ etc) and produce non-empty search results. More…  
  SumRanksByFieldsRatio float Ratio of summation of ranks of match one term in several fields 0.0
  LogLevel int Log level of full text search engine 0
  FieldsCfg []struct Configs for certain fields. Overlaps parameters from main config. Contains parameters: FieldName, Bm25Boost, Bm25Weight, TermLenBoost, TermLenWeight, PositionBoost, PositionWeight. empty
  ExtraWordSymbols string Extra symbols, which will be threated as parts of word to addition to letters and digits ”+/-“
  MaxAreasInDoc int Max number of highlighted areas for each field in each document (for snippet() and highlight()). ‘-1’ means unlimited 5
  MaxTotalAreasToCache int Max total number of highlighted areas in ft result, when result still remains cacheable. ‘-1’ means unlimited -1
  Optimization string Optimize the index by ‘memory’ or by ‘cpu’ “memory”
  FtBaseRanking struct Relevance of the word in different forms  
  Bm25Config struct Document ranking function parameters More…  
  SplitterType string Text breakdown algorithm. Available values: ‘mmseg_cn’ and ‘fast’ “fast”

Stopwords details

The list item can be either a string or a structure containing a string (the stopword) and a bool attribute (is_morpheme) indicating whether the stopword can be part of a word that can be shown in query-results. If the stopword is set as a string, then the is_morpheme attribute is false by default and following entries are equivalent:

"stop_words":[
    {
        "word": "some_word",
        "is_morpheme": false
    },
    ///...
]

,

"stop_words":[
    "some_word",
    ///...
]

Example:

If the list of stopwords looks like this:

"stop_words":[
    {
        "word": "under",
        "is_morpheme": true
    },
    ///...
]

and there are pair of documents containing this word: {"...under the roof ..."}, {"... to understand and forgive..."}. Then for the query ‘under*’ we will get as a result only document {"... to understand and forgive..."} and for the query ‘under’ we will get nothing as a result.

If the “StopWords” section is not specified in the config, then the default_en and default_ru stopwords list will be used, and if it is explicitly specified empty, it means that there are no stopwords.

Detailed typos config

FtTyposDetailedConfig: config for more precise typos algorithm tuning.

  Parameter name Type Description Default value
  MaxTypoDistance int Maximum distance between symbols in initial and target words to perform substitution. Check typos handling section for detailed description. 0
  MaxSymbolPermutationDistance int aximum distance between same symbols in initial and target words to perform substitution (to handle cases, when two symbolws were switched with each other). Check typos handling section for detailed description. 1
  MaxMissingLetters int Maximum number of symbols, which may be removed from the initial term to transform it into the result word. Check typos handling section for detailed description. 2
  MaxExtraLetters int Maximum number of symbols, which may be added to the initial term to transform it into the result word. Check typos handling section for detailed description. 2

Base ranking config

FtBaseRanking: config for the base relevancy of the word in different forms.

  Parameter name Type Description Default value
  FullMatch int Relevancy of full word match 100
  PrefixMin int Mininum relevancy of prefix word match. 50
  SuffixMin int Mininum relevancy of suffix word match. 10
  Typo int Base relevancy of typo match 85
  TypoPenalty int Extra penalty for each word’s permutation (addition/deletion of the symbol) in typo algorithm. The minimum rank after applying penalties will be at least 1. 15
  StemmerPenalty int Penalty for the variants, created by stemming. The minimum rank after applying penalties will be at least 1. 15
  Kblayout int Relevancy of the match in incorrect kblayout 90
  Translit int Relevancy of the match in translit 90
  Synonyms int Relevancy of synonyms match 95

Text splitters

Reindexer supports two algorithms to break texts into words: fast and mmseg_cn.

Default fast algorithm is based on the definition of a word in the form of a alpha (from supported Unicode subset), number and an extended character, everything else (whitespaces, special characters, unsopported Unicode subsets, etc) will be threated as a delimiters.

Reindexer supports the following unicode block codes / extra symbols:

This algorithm is simple and provides high performance, but it can not handle texts without delimiters (for example, Chinese language does not requires whitespaces between words, so fast-splitter will not be able to index it properly).

Alternative mmseg_cn-splitter is based on friso implementation of mmseg algorithm and uses dictionaries for tokenization. Currently this splitter supports only Chinese and English languages.

Basic document ranking algorithms

For basic document ranking, one of the following algorithms can be used:

Calculation formula for bm25:

R = (log(totalDocCount / (matchedDocCount + 1)) + 1) * termCountInDoc / wordsInDoc * (k1 + 1.0) / (termCountInDoc / wordsInDoc + k1 * (1.0 - b_ + b_ * wordsInDoc / avgDocLen))

Calculation formula for bm25_rx:

R = (log(totalDocCount / (matchedDocCount + 1)) + 1) * termCountInDoc * (k1 + 1.0) / (termCountInDoc + k1 * (1.0 - b_ + b_ * wordsInDoc / avgDocLen))

Calculation formula for word_count:

R = termCountInDoc
  Parameter name Type Description Default value
  Bm25k1 float Coefficient k1 in the formula for calculating bm25 (used only for rx_bm25, bm25). 2.0
  Bm25b float Coefficient b in the formula for calculating bm25 (used only for rx_bm25, bm25). 0.75
  Bm25Type string Formula for calculating document relevance (rx_bm25, bm25, word_count) “rx_bm25”

Limitations and know issues