Elasticsearch special chars / umlauts

Hello everybody,

I’ve installed Elasticsearch on our website and almost everything works well now, except for search terms that contain special characters like ä, ö or ü. If the user searches for «Qualität» there are no results, even though this word definitely exists. I could replace the special characters with question marks as wildcards, but that would not be satisfactory, because a search for «für» would then also match words like «Anforderungen».

I can’t find anything about special characters in the readme of the Elasticsearch extension. I would appreciate it if someone could help me to fix this problem. :slight_smile: Is it an indexing problem or a search (term) problem?

Hey Michael,
Elasticsearch stores everything in UTF-8, and the package doesn’t change the encoding at all. I’ve never needed to handle such characters specially, so I guess the issue is somewhere in your project code.

Which Elasticsearch version are you using?
Have you checked if the data is correctly ingested and indexed?
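For a quick check you can also query Elasticsearch directly, e.g. open something like http://localhost:9200/_search?q=Qualität in a browser or with curl, and see whether documents containing the term come back at all.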

Cheers,
Daniel

Hi Daniel,

I’ve narrowed down the problem. I’m adding an asterisk in front of the search term and one at the end. I’m doing this because when the user searches for «Medien», related words like «Printmedien» or «Medienproduktion» should also be found. Without the asterisks these words won’t be found. This is a requirement from our marketing department.

This is how I add the asterisks:

searchTerm = ${'*' + request.arguments.search + '*'}

Unfortunately this does not work with the search term «Qualität», and I don’t know why. If I remove my asterisk wrapping and type the asterisks manually into the search input, it behaves like this:

medien
Pages containing «Medien», or «Medien» with a hyphen directly in front of or after it, will be found.

*medien*
Everything will be found. «Medien», «Medienproduktion», «Printmedien», …

medien*
Pages with «Medien» and e.g. «Medienproduktion» will be found.

*medien
Pages with e.g. «Printmedien» and «Medien» will be found.

qualität
Pages containing «Qualität», or «Qualität» with a hyphen directly in front of or after it, will be found.

*qualität*
No results.

*qualität
No results. Even though e.g. the word «Proofqualität» exists.

qualität*
Pages with «Qualität» and e.g. «Qualitätssicherung» will be found.

When I access Elasticsearch via the browser and e.g. open «localhost:9200/_search?q=*qualität», it finds a total of 94 hits, including words like «Qualität», «Qualitätssicherung» and «Proofqualität». But searching with the same term in the Neos frontend returns no results.

I’m wondering why the results are so different. Why do I find everything with «medien» but not everything with «qualität»? I can’t believe that I’m the only one with this problem. How do you (or others) deal with this?

We’re using Neos 4.3 and Elasticsearch 5.6.16 with the standard configuration. I’ve also re-indexed the data.

Can you please tell me how to check this? Furthermore: I know how to index the data, but how do I ingest it?

To debug this I would take the following steps:

  1. Check the ingested data with tools like Head or Kibana.
  2. Query the data manually (which you have already done, and which also means #1 is fine).
  3. Activate the query log of the CR adaptor by adding a .log() to the Eel expression that sends the query (see the sketch right after this list). This writes the query that is actually sent to Elasticsearch to the query log. Take that query, execute it manually and check the result.
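
For #3, in Fusion that looks roughly like this (just a sketch; adjust it to however your search prototype builds the query, and searchTerm stands for whatever variable you pass in):

searchResults = ${Search.query(site).fulltext(searchTerm).log().execute()}

With that in place the compiled JSON query ends up in the query log, and you can copy it into Kibana or curl and run it unchanged.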

As #1 and #2 seem correct, it might be that the compiled query is somehow broken.

I’ve installed Kibana, activated the logging and ran some tests.

If I search for «qualität» in our frontend I get 37 results. I took the query that was sent to Elasticsearch (from the log) and ran it in Kibana, and I also got 37 results. The only difference in the Elasticsearch query is that the «ä» is unicode-escaped («\\u00e4»). In this case that’s not a problem. But when searching for «*qualität*» it is a problem: if I change the unicode escape back to «ä», I get results again. Please have a look at my screenshots. Maybe this is a bug?

I’m still using Elasticsearch 5.6.16 and Kibana 5.6.16.

Off the top of my head I see two possible solutions.

  1. Find out why the ‘ä’ character is passed as a unicode escape sequence. That could indeed be a bug in the CR adaptor (although I actually do not see where this might happen), or in your application code.
  2. Do not use wildcard search, which can be a performance issue anyway, but use an n-gram analyzer instead.

Point 1:
I’ve located the problem. It’s the fulltext method in the ElasticSearchQueryBuilder class. The search term is fine up to that point, but this is the line that produces the error:

$this->request->fulltext(trim(json_encode($searchWord), '"'), $options);

As a matter of fact, it’s json_encode that changes the search term: after this call it contains the unicode escape sequence. You have to pass JSON_UNESCAPED_UNICODE as an option to prevent this. I’ve extended this class and fixed it. Should I open a new issue on GitHub?
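
A minimal illustration of the escaping (plain PHP, nothing beyond the standard json_encode behaviour); presumably the escape sequence then gets escaped a second time when the whole request body is serialized, which would explain the «\\u00e4» I saw in the logged query:

$searchWord = '*qualität*';

// Default behaviour: non-ASCII characters become \uXXXX escape sequences.
echo json_encode($searchWord), PHP_EOL;                          // "*qualit\u00e4t*"

// With JSON_UNESCAPED_UNICODE the umlaut is kept as-is.
echo json_encode($searchWord, JSON_UNESCAPED_UNICODE), PHP_EOL;  // "*qualität*"

With the flag set, the term reaches Elasticsearch with a real «ä» and the wildcard search matches again.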

Point 2:
How do I use an n-gram analyzer? Is there any documentation?

Adding issues is always nice. Providing PRs is even nicer :wink:
As you already found the solution, I added that easy bugfix and will do the upmerges and releases as soon as the tests pass. (https://github.com/Flowpack/Flowpack.ElasticSearch.ContentRepositoryAdaptor/pull/317)

See: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
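
A rough sketch of the index settings along the lines of that page (the analyzer and tokenizer names are just examples, and you still have to assign the analyzer to the fields you search; how to wire this into the Neos adaptor’s mapping configuration depends on your setup):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "trigram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}

With the same analyzer applied at query time (the default), a plain match query for «qualität» is broken into the same 3-grams and therefore also hits «Proofqualität» or «Qualitätssicherung», so the leading and trailing asterisks become unnecessary (at the cost of somewhat fuzzier result sets).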

Thank you! :slight_smile:

Okay, but how do I set up and execute this kind of search in Neos? :slightly_frowning_face:

Same problem here with Flow queries when searching for documents whose title contains non-ASCII characters.