ElasticSearch and file-contents

Hi guys

As far as I understand it’s already possible to index the entities of the media-package via Flowpack.ElasticSearch. So the metadata of files can be indexed, right?

Is there also a solution ready to index the content of the referenced file-resources (like pdf, word, …)?

Peter

I am not aware of any. You need a text extractor (see for example https://tika.apache.org/) to do this.

Hello Christian

Thank you. I am aware of that, and it is the same server component our customer uses right now (they have a TYPO3/Solr/Tika solution running). We are offering a Neos solution for the relaunch and I need to estimate the effort to provide the same functionality.

I assume the way to go is to extend the code that indexes the media-asset-model to call tika and add the extracted file-contents to the index, too. Hope I am right with that?
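
To make the idea above concrete, here is a minimal sketch of the two steps: sending a file to a Tika server and merging the returned text into the document that gets indexed. It is written in Python for illustration only, since the real integration would live in the Flow/Neos PHP indexer. It assumes a Tika server listening on its default port 9998 and uses its `PUT /tika` endpoint; the function names and the `content` field name are hypothetical.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build the HTTP request that asks Tika for plain-text extraction."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data: bytes, tika_url: str = TIKA_URL) -> str:
    """Send the raw file bytes to Tika and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(data, tika_url)) as response:
        return response.read().decode("utf-8")


def build_asset_document(metadata: dict, content: str) -> dict:
    """Combine asset metadata with extracted text into one index document.

    "content" is a hypothetical field name; use whatever the mapping defines.
    """
    document = dict(metadata)  # copy, so the caller's dict is left untouched
    document["content"] = content.strip()
    return document


# Usage (requires a running Tika server):
#   with open("report.pdf", "rb") as handle:
#       text = extract_text(handle.read())
#   doc = build_asset_document({"title": "Report"}, text)
```

Keeping extraction separate from document building means the extractor could be swapped out later without touching the indexing code.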

Peter

Yes, sounds good. Depends obviously on how you want to access and use that data. You can also create a secondary index for file content.

Hi Peter

Have you done more research on this topic in the meantime? This is a very common requirement and I think an effort to have one community package to extend the Flowpack Search stuff would be great.

Have you read about the “ingest attachment plugin”? https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

Maybe that could be a way to go, but I didn’t invest a lot of time in research.
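
For anyone evaluating that route, the sketch below shows roughly what the plugin expects: an ingest pipeline with an `attachment` processor, and documents that carry the file as a base64 string in the field the processor reads. It only builds the JSON bodies and does not talk to a cluster. The pipeline name `attachment`, the index name `assets`, and the field names `data` and `filename` are assumptions chosen for illustration, and the exact index URL depends on the Elasticsearch version in use.

```python
import base64
import json

PIPELINE_ID = "attachment"  # hypothetical pipeline name
INDEX_NAME = "assets"       # hypothetical index name


def attachment_pipeline() -> dict:
    """Body for PUT /_ingest/pipeline/<PIPELINE_ID>."""
    return {
        "description": "Extract text from uploaded files",
        "processors": [
            # The attachment processor reads base64 from "data" and, by
            # default, writes its results under the "attachment" field.
            {"attachment": {"field": "data"}}
        ],
    }


def build_source(filename: str, raw: bytes) -> dict:
    """Document body to index with ?pipeline=<PIPELINE_ID>."""
    return {
        "filename": filename,
        "data": base64.b64encode(raw).decode("ascii"),
    }


if __name__ == "__main__":
    print("PUT /_ingest/pipeline/" + PIPELINE_ID)
    print(json.dumps(attachment_pipeline(), indent=2))
    print("PUT /" + INDEX_NAME + "/_doc/1?pipeline=" + PIPELINE_ID)
    print(json.dumps(build_source("example.pdf", b"%PDF-1.4 example bytes")))
```

With this approach the extraction runs inside Elasticsearch, so no separate Tika server is needed, but every file has to be sent to the cluster base64-encoded.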

Hey guys,

Hint: when you are using Neos and the file is referenced within a node, this is already possible: https://github.com/Flowpack/Flowpack.ElasticSearch.ContentRepositoryAdaptor#working-with-assets--attachments

@lorenzulrich, I did not do any further research because none of our customers has booked the feature yet. But meanwhile I have already offered it in like 3 projects.
Once the first customer gives the go-ahead, I will proceed with this.

@daniellienert, thank you for that hint. I saw that already; it did not work on our first try, but maybe we need to try again. I cannot remember what went wrong. We tested the feature in a timeboxed session where we wanted to find out “what is possible within like 2-3 hours”. We ended up with a working search (searching over nodes) but without asset content search, suggestions (autocomplete in the search field), or facets (I think these are called aggregations in ES).

As said, just waiting for the first customer to buy it. Or more time to try it out (which will not happen this year).