ElasticSearch and file-contents

drillsgt · September 22, 2016, 5:35am

Hi guys

As far as I understand, it’s already possible to index the entities of the Media package via Flowpack.ElasticSearch. So the metadata of files can be indexed, right?

Is there also a ready-made solution for indexing the content of the referenced file resources (PDF, Word, …)?

Peter

christianm · September 22, 2016, 9:05am

I am not aware of any. You need a text extractor (see for example https://tika.apache.org/) to do this.

drillsgt · September 22, 2016, 10:11am

Hello Christian

Thank you. I am aware of that, and it is the same server component that our customer uses right now (a TYPO3/Solr/Tika solution is running). We are offering a Neos solution for the relaunch and I need to estimate the effort to provide the same functionality.

I assume the way to go is to extend the code that indexes the media asset model so that it calls Tika and adds the extracted file contents to the index, too. I hope I am right about that?
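
A minimal sketch of that idea in Python, not tied to any Neos or Flowpack API. It assumes a Tika Server is reachable at its default address (`http://localhost:9998`), where `PUT /tika` with `Accept: text/plain` returns the extracted text. The helper names and the `fileContent` field name are hypothetical.

```python
import urllib.request

# Assumption: a Tika Server runs locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(file_bytes: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Prepare the HTTP request that asks Tika for the plain text of a file."""
    return urllib.request.Request(
        tika_url,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(file_bytes: bytes, tika_url: str = TIKA_URL) -> str:
    """Send the file to Tika and return the extracted text (needs a running server)."""
    request = build_tika_request(file_bytes, tika_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def build_index_document(asset_metadata: dict, extracted_text: str) -> dict:
    """Merge the asset metadata with the extracted text into one document.

    "fileContent" is a hypothetical field name chosen for this sketch.
    """
    document = dict(asset_metadata)
    document["fileContent"] = extracted_text.strip()
    return document
```

In a real setup, the equivalent of `extract_text` would run inside the asset indexer, and the resulting document would be sent to the search index together with the existing metadata.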

Peter

christianm · September 22, 2016, 2:20pm

Yes, sounds good. Depends obviously on how you want to access and use that data. You can also create a secondary index for file content.
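
A rough sketch of what such a secondary index could look like, written as plain Python dictionaries that would be sent to Elasticsearch as JSON. The index name and field names are hypothetical, and the typeless `mappings.properties` layout with `text` and `keyword` fields assumes a recent Elasticsearch version; older versions nest the properties under a type name.

```python
import json

# Hypothetical name for the secondary index that holds only file content.
FILE_CONTENT_INDEX = "neos-file-content"


def build_index_definition() -> dict:
    """Body for creating the secondary index (PUT /<index name>)."""
    return {
        "mappings": {
            "properties": {
                # Exact-match key that points back to the asset in the primary index.
                "assetIdentifier": {"type": "keyword"},
                # Analyzed full text extracted from the file.
                "content": {"type": "text"},
            }
        }
    }


def build_content_query(search_term: str) -> dict:
    """Full-text query against the file content, returning only the asset key."""
    return {
        "_source": ["assetIdentifier"],
        "query": {"match": {"content": search_term}},
    }


if __name__ == "__main__":
    print(json.dumps(build_index_definition(), indent=2))
    print(json.dumps(build_content_query("annual report"), indent=2))
```

Keeping the file content separate means the primary index stays small, at the cost of a second lookup to resolve `assetIdentifier` back to the asset or node.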

lorenzulrich · October 30, 2017, 7:26pm

Hi Peter

Have you done more research on this topic in the meantime? This is a very common requirement and I think an effort to have one community package to extend the Flowpack Search stuff would be great.

Have you read about the “ingest attachment” plugin? https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

Maybe that could be a way to go, but I didn’t invest a lot of time in research.
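
For orientation, a small sketch of how the ingest attachment approach works, assuming the plugin is installed. A pipeline with an `attachment` processor reads a base64-encoded field, and the extracted text ends up in `attachment.content`. The field name `data` is a choice made for this sketch, and the helper names are hypothetical.

```python
import base64
import json


def build_attachment_pipeline(source_field: str = "data") -> dict:
    """Body for PUT _ingest/pipeline/<pipeline id>."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": source_field}}],
    }


def build_attachment_document(file_bytes: bytes, source_field: str = "data") -> dict:
    """Document body to index with ?pipeline=<pipeline id>.

    The attachment processor expects the file as a base64 string.
    """
    return {source_field: base64.b64encode(file_bytes).decode("ascii")}


def build_attachment_query(search_term: str) -> dict:
    """Full-text query on the text that the processor extracted."""
    return {"query": {"match": {"attachment.content": search_term}}}


if __name__ == "__main__":
    print(json.dumps(build_attachment_pipeline(), indent=2))
    print(json.dumps(build_attachment_document(b"hello"), indent=2))
```

With this approach the text extraction happens inside Elasticsearch, so no separate Tika server is needed, but every file has to be sent to the cluster in full.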

daniellienert · October 30, 2017, 7:39pm

Hey guys,

Hint: when you are using Neos and the file is referenced within a node, this is already possible: https://github.com/Flowpack/Flowpack.ElasticSearch.ContentRepositoryAdaptor#working-with-assets--attachments

drillsgt · November 3, 2017, 2:30pm

@lorenzulrich, I did not do any further research because none of our customers has booked the feature yet. In the meantime I have already offered it in about 3 projects. Once the first customer gives the go-ahead, I will pick this up.

@daniellienert, thank you for that hint. I had already seen that; it did not work on our first try, but maybe we need to try again. I cannot remember what went wrong. We tested the feature in a timeboxed session where we wanted to find out “what is possible within like 2-3 hours”. We ended up with a working search (searching over nodes) but without asset content search, suggestions (autocomplete in the search field), or facets (I think these are called aggregations in ES).

As I said, I am just waiting for the first customer to buy it, or for more time to try it out (which will not happen this year).