Search: prevent particular page from indexing

Hi all,
I'm looking for a best practice to keep a particular page (for example a «thank you for contacting us» page) out of the search index. (I'm using flowpack/simplesearch.)

###Questions

  • Does anyone know a simple way to exclude a group of pages?
    The easiest would be to skip every page that has the (default Page property) “_hiddenInIndex == true” during indexing.

  • Is there a built-in way to exclude all pages marked «Hide from search engines (noindex)» from internal indexing?

  • Does anyone know a way to exclude pages from the indexing process based on a particular node type property?
    (That way it would be possible to extend the default ‘Neos.NodeTypes:Page’ with a property like «Hide from internal indexing».)
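To make the last question concrete, such a property could be declared roughly as follows. This is only a sketch: the property name `hideFromInternalIndex` is made up, and the inspector group `visibility` is assumed to be available on the page type. The property on its own does not change indexing; something in the indexing or query step would still have to evaluate it.

```yaml
# Hypothetical extension of the default page type (sketch, not tested).
'Neos.NodeTypes:Page':
  properties:
    # Made-up property name, used for illustration only
    hideFromInternalIndex:
      type: boolean
      defaultValue: false
      ui:
        label: 'Hide from internal indexing'
        inspector:
          # Assumes the "visibility" inspector group exists for this node type
          group: 'visibility'
```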


###Current approach
I have defined a node type in PageNotInSearchIndex.yaml:

```yaml
# Document type like the default page, but excluded from search indexing
'Vendor.Site:PageNotInSearchIndex':
  superTypes:
    'Neos.NodeTypes:Page': true
  ui:
    label: 'Page* (prevent from search-indexing)'
    icon: 'icon-search-minus'
    group: general
  search:
    fulltext:
      # Do not treat this page as a fulltext root and do not index its content
      isRoot: false
      enable: false
```
Maybe this solution is too simplistic (or wrong), but currently it looks like these pages are skipped by `./flow nodeindex:build`.

Neos.Seo has a metaRobotsNoindex property that I use for such purposes, since you usually want to exclude content from internal and external indexing at the same time.

Hi @mficzel,
how do you use the metaRobotsNoindex property («Hide from search engines (noindex)») to exclude pages from internal indexing?

Yes … I basically exclude the same content from internal indexing as from external indexing, so editors only have to learn one thing. I cannot remember a sane reason for separating the two :slight_smile:

Me too :wink:.
Could you explain how you exclude those «metaRobotsNoindex == true» pages from internal indexing? Sorry, but I can't find a solution. Do you pass extra arguments to ./flow nodeindex:build?

If you use Elasticsearch, you can add a must_not clause to the default query configuration.

That way those documents are excluded from the search results, but they are still indexed.
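As a rough illustration, the clause could look like the fragment below, written in YAML because the adaptor is configured through settings files. This is a sketch under assumptions: it only shows the shape of an Elasticsearch bool query with a must_not term filter, it assumes the property is indexed under the field name `metaRobotsNoindex`, and the exact settings path where the default query lives is not given in this thread, so it has to be looked up in the adaptor package.

```yaml
# Sketch only: shape of the query clause, not a complete settings file.
# Where this belongs in Settings.yaml depends on the adaptor version.
query:
  bool:
    must_not:
      # Assumes the Neos.Seo property is indexed under its own name
      - term:
          metaRobotsNoindex: true
```

Because the filter is applied at query time, it takes effect without rebuilding the index, provided the field is already part of the index.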

For SimpleSearch you would probably have to add this condition to the query manually.

Thank you @mficzel!

@christianm sorry for asking you directly, but I'm having trouble understanding the plugin code.

  • Does SimpleSearch.ContentRepositoryAdaptor support a default query configuration?

Currently not; that would be a new feature to add. I didn't know this existed for Elasticsearch.

@christianm: thanks for the answer.
I will stick with my approach, or maybe test filtering the query manually.
Excluding pages at indexing time might also be cheaper: the index is only rebuilt once a night, so indexing exactly the pages that are needed could give a smaller index that builds faster, and searches would not need an extra query filter. Maybe :pensive: