Discussion: editing and storing thousands of blog posts/news/... entries as nodes in Neos 9.0

Suppose you have a large set of nodes of a certain type, like blog or news entries.
These entries are likely direct children of the respective blog or news root node.
Further hierarchy might be introduced by adding categories, but those are better handled by references, which solves the problem of one entry belonging to two categories.

In NeosIo we handle it exactly like that ourselves: flat and simple. That works well for everyone until the number of entries grows to the extent that the Neos UI cannot load them all at once and either crashes or becomes too slow to be usable.

Current state

A plain Neos solution would involve creating document nodes to group posts by year, or even by month and day. By adjusting the loading depth in the Neos UI so that only the folders are initially visible, the UI feels navigable again, and entries are only loaded on demand when a folder node is opened in the tree. These kinds of folders might be shortcut nodes that link back to the actual blog, or they could even show an overview of all the entries for that date and thus add value when using the blog.
The problem with the additional hierarchy is that it has to be maintained by the editor. This is especially painful when the entries should be ordered by publishing date, which is not known during creation, so one might need to shuffle a post around multiple times.

To solve this problem Neos 8 has a few third-party solutions:

1. Create the folder structure during item creation

The plugin punktDe/archivist (GitHub: "A NeosCMS package that automatically sorts nodes into a predefined structure which is created on the fly") automates the step of sorting the nodes correctly.

Automating the sorting seems like a good idea, but only if you know that the nodes would otherwise have to be sorted by hand. Seen from the outside, the magic sorting is an anti-pattern in the Neos UI in the following ways:

  • During node creation one has to select a parent and also a position (into, after or before). Both decisions are now irrelevant and will be overridden by the sorter, yet the Neos UI still asks for them.
  • Re-sorting a node after changing the property it is sorted by might not be reflected correctly in the document tree, and if it is, it is counterintuitive that the node got physically moved.
  • There are problems with the parent folder structure, which is created in the user workspace (as it has to be in Neos 9 to avoid having to resync from the base live workspace):
    • It has to be published together with the newly created entry. A new child cannot be published without its new parent.
    • If another editor also creates a post that requires the same new structure, there might be duplicate folder nodes, or conflicts when using deterministic node ids.
  • The automatically created folder structure is mutable, so an editor can accidentally move a folder, rename it or duplicate (copy-paste) it to where it doesn't belong.

2. Keep a flat navigation but paginate → Exclude items from normal document tree

The plugin psmb/Psmb.FlatNav (GitHub: "Custom flat navigation component for Neos CMS") offers a dedicated tab next to the document tree that only shows nodes of a certain kind. They are paginated to some extent (I believe just in PHP itself, so there are no database advantages), but that way the Neos UI is not overloaded with data and stays usable.

Note: the plugin https://github.com/Sebobo/Shel.Neos.SubTree is similar but instead shows the full subtree from a given point, which would not make a huge list any easier to navigate.

Discussing the future UX with Neos 9…

Where do we want to go with that? We should be able to handle a large number of entries … right?

From the editor's point of view we can be sure that with thousands of posts there needs to be a folder-like navigation. Pagination could also make sense on top of that, but one needs to be able to navigate from the entries of one year to the entries of another year, then go to a month and paginate from there.

But this raises the question: do the nodes have to be stored in that year/month hierarchy?
Inside the team - and also on stage at the con - we discussed the idea of “virtual nodes” to extend the tree and group nodes. We have since discarded that exact implementation idea, but we are still experimenting with and discussing the notion that the tree does not necessarily have to be made up of actual nodes. If we could tell the UI to show entries for all the years and let it know that these entries are not real nodes and cannot really be navigated to, we would have exactly that.

Neos 9 content graph performance

The navigation should be performant so that it also scales when up to hundreds of thousands of nodes are used.

So another aspect we might not be able to leave out is that the Doctrine content graph is not built for optimised pagination and might not be able to efficiently implement virtual navigation along a structure imposed by node properties like a postPublishDate property.

First, for editing, we would need to find out all possible years, as they are the topmost navigation level. But to answer that exactly we would need to fetch ALL entries from the content graph and determine the years involved in PHP. As an alternative we could just show navigation entry points for the last 5 or more years.

This is the visual structure:

blog-root/
├─ 2025/
├─ 2024/
/// ...
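
The fallback of fixed entry points for the last few years needs no graph query at all. A minimal sketch, assuming a hypothetical helper recentYears() (not part of Neos) and a default of 5 years:

```php
<?php
declare(strict_types=1);

/**
 * Returns the most recent years as fixed navigation entry points, newest first.
 * Hypothetical helper for illustration only.
 *
 * @return list<int>
 */
function recentYears(int $currentYear, int $count = 5): array
{
    if ($count < 1) {
        return [];
    }

    // range() counts downwards when the start is larger than the end
    return range($currentYear, $currentYear - $count + 1);
}

echo implode(', ', recentYears(2025)), PHP_EOL;
```

The trade-off is that older years with posts stay unreachable unless the number of years is raised or determined once by other means.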

Now upon opening a certain year we might want to show all months.
If we only wanted to show months that contain posts, we would need to query all entries of this year. For now we would again simply show all 12 predefined months.

blog-root/
├─ 2025/
   ├─ 01/
   ├─ 02/
   ├─ 03/
   ├─ 04/
   ├─ 05/
    /// ...
├─ 2024/
/// ...

Upon opening a month, however, we can no longer avoid querying the graph.

We can use a findChildNodes() query with a propertyValue filter.

$itemsFromJanuary2025 = $subgraph->findChildNodes(
    parentNodeAggregateId: NodeAggregateId::fromString('blog-root'),
    filter: FindChildNodesFilter::create(
        // half-open range: from the first day of the month up to (excluding) the first day of the next month
        propertyValue: 'postPublishDate >= "2025-01-01" AND postPublishDate < "2025-02-01"',
        ordering: Ordering::byProperty(PropertyName::fromString('postPublishDate'), OrderingDirection::ASCENDING),
        // pagination with limit
    )
);

This would hopefully give us the entries:

blog-root/
├─ 2025/
   ├─ 01/
       ├─ Difference of bears to beer or berries explained
       ├─ How to cuddle with bears
       ├─ How to hug bears
       ├─ How to learn bearish
       ├─ How to make honey
       ├─ Baking together with a bear
        /// ... possibly a pagination if too many entries
   ├─ 02/
   ├─ 03/
   ├─ 04/
   ├─ 05/
    /// ...
├─ 2024/
/// ...
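
The date range in the propertyValue filter above follows directly from the year and month of the opened folder. A minimal sketch, assuming a hypothetical helper monthFilter() (not part of Neos) and that the property is compared as an ISO date string, as in the example:

```php
<?php
declare(strict_types=1);

/**
 * Builds a half-open date range filter string for one month.
 * Hypothetical helper; the property name and date format are assumptions.
 */
function monthFilter(string $propertyName, int $year, int $month): string
{
    $start = (new DateTimeImmutable())->setDate($year, $month, 1)->setTime(0, 0);
    // first day of the following month, also across the year boundary
    $end = $start->modify('+1 month');

    return sprintf(
        '%1$s >= "%2$s" AND %1$s < "%3$s"',
        $propertyName,
        $start->format('Y-m-d'),
        $end->format('Y-m-d')
    );
}

echo monthFilter('postPublishDate', 2025, 1), PHP_EOL;
```

Using an exclusive upper bound avoids having to know how many days the month has.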

But is this performant?

For this query the DBAL content graph joins the node and hierarchy tables and has to dig through the JSON property fields of all child nodes to determine whether each one should be included in the response.

As we don't (or can't even?) apply indexes to the generic properties, I wouldn't expect this query to perform well.

The resulting query would be this:

SELECT n.*, h.subtreetags
FROM cr_default_p_graph_node pn
         INNER JOIN cr_default_p_graph_hierarchyrelation h ON h.parentnodeanchor = pn.relationanchorpoint
         INNER JOIN cr_default_p_graph_node n ON h.childnodeanchor = n.relationanchorpoint
WHERE (pn.nodeaggregateid = :parentNodeAggregateId)
  AND (h.contentstreamid = :contentStreamId)
  AND (h.dimensionspacepointhash = :dimensionSpacePointHash)
  AND ((JSON_EXTRACT(n.properties, '$."postPublishDate".value') >= '2025-01-01') AND
       (JSON_EXTRACT(n.properties, '$."postPublishDate".value') < '2025-02-01'))
  AND (NOT JSON_CONTAINS_PATH(h.subtreetags, 'one', '$."disabled"'))
  AND (NOT JSON_CONTAINS_PATH(h.subtreetags, 'one', '$."removed"'))
ORDER BY JSON_EXTRACT(n.properties, '$."postPublishDate".value') ASC, h.position ASC

From that perspective we might as well query the graph right away to find all the possible months via propertyValue: 'postPublishDate >= "2025-01-01" AND postPublishDate < "2026-01-01"', as that might only be a little slower because more data is returned, but the database has to do an equal amount of work, right?
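
If the whole year is fetched anyway, the months that actually contain posts can be derived in PHP. A minimal sketch, assuming a hypothetical helper postsPerMonth() that receives the publish dates as plain ISO date strings (in practice they would be read from the nodes returned by the year query):

```php
<?php
declare(strict_types=1);

/**
 * Counts posts per month from the publish dates of a single year.
 * Hypothetical helper for illustration only.
 *
 * @param list<string> $publishDates ISO dates (Y-m-d)
 * @return array<int, int> month (1-12) => number of posts, sorted by month
 */
function postsPerMonth(array $publishDates): array
{
    $counts = [];
    foreach ($publishDates as $date) {
        // 'n' is the month number without leading zero
        $month = (int) (new DateTimeImmutable($date))->format('n');
        $counts[$month] = ($counts[$month] ?? 0) + 1;
    }
    ksort($counts);

    return $counts;
}

print_r(postsPerMonth(['2025-03-05', '2025-01-03', '2025-01-20']));
```

Only the months present as keys would then be rendered as folders, and the counts could double as a hint for whether pagination is needed.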

Using an actual node structure of year and month nodes here should perform better, as we would simply fetch all children of a month node without needing to filter anything.

Final question of performance

Can we implement a virtual navigation efficiently? Querying would always have to do a lot of work. The alternative is to promote structured nodes by year/month and introduce a command hook that sorts nodes into the right folder upon creation (see 1.). In your opinion, what is the way to go for 9.0, and where do we want to go in the future?

I have experimented with real node structures with 10K items in one dimension and the performance is okay. I will attempt to benchmark the queries alone for the case that all nodes are flat.

I’ve already given this some thought as well, although in our projects the things our editors manage and the things we have many of usually are completely disparate things.

First, I’d like to distinguish between nodes and document/content tree items; they match pretty well by default, but in the case described here the folder items are actually just query filters. They should not have any effect on the blog posting URIs (unless they do, but then they are real folders and we have no problem) and should not appear in the frontend view either.
The way stuff is structured in the CR should solely be determined by semantics, while the UI should be structured the way editors can work with best.

Which leads us to query performance.
Everything we want to do, be it sorting, filtering, pagination (which is basically just sorting and filtering) relies on one database feature: indexes.

Indexes

The first thing about indexes is that we need some (easy!) way to declare them. ORM does this via PHP attributes, we could do the same in the NodeType config, be it YAML or PHP attributes as well (I’m looking at you, OPGM).
From there the natural way would be to teach ./flow cr:setup to setup/remove the declared indexes. The rest is then up to the adapters, which have to apply and actually use these indexes; their API could stay the same.

MariaDB / MySQL

Indexing JSON is almost impossible here, unless you do some dark magic as Eric demonstrated in his KISSSearch talk at NeosCon. Whether we use virtual fields or let the adapter create new real indexed columns on demand is tbd.
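
To make the "virtual fields" option a bit more concrete, here is a minimal sketch of how an adapter could generate the DDL for one declared property index. Everything here is an assumption for illustration: the helper buildPropertyIndexDdl(), the prop_/idx_ naming scheme and the VARCHAR(64) type are hypothetical, and only the table name, the properties column and the JSON path are taken from the query shown earlier in this thread.

```php
<?php
declare(strict_types=1);

/**
 * Builds MariaDB DDL for a virtual generated column over one JSON property,
 * plus an index on that column. Hypothetical helper, not an existing Neos API.
 */
function buildPropertyIndexDdl(string $table, string $propertyName): string
{
    // Only allow plain identifiers, as they are concatenated into SQL.
    foreach ([$table, $propertyName] as $identifier) {
        if (preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $identifier) !== 1) {
            throw new InvalidArgumentException('Invalid identifier: ' . $identifier);
        }
    }
    $column = 'prop_' . strtolower($propertyName);

    return sprintf(
        "ALTER TABLE %s\n"
        . "  ADD COLUMN %s VARCHAR(64) AS (JSON_VALUE(properties, '$.%s.value')) VIRTUAL,\n"
        . "  ADD INDEX idx_%s (%s)",
        $table,
        $column,
        $propertyName,
        $column,
        $column
    );
}

echo buildPropertyIndexDdl('cr_default_p_graph_node', 'postPublishDate'), PHP_EOL;
```

For the index to help, the adapter would presumably also have to filter and sort on the generated column instead of the JSON_EXTRACT expression, which is the part that still needs to be worked out.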

Postgres

Could have an easier time with this because of it being awesome and having JSONB indexes.

Neo4J / graph databases in general

Should have no issue with this as they pretty much exactly match our (property) graph data model and support indexes as well.

UI integration

From there, we “just” need a proper way to declare custom tree / pagination / you name it views. Ideally from YAML/PHP, as in my experience this is much easier than writing UI plugins, and the use cases should be pretty well defined by now.

Benching the performance (Neos 9.0.7)

What I feared is true. On MariaDB (11.0.5-MariaDB), tested locally, I get the following results:

Import of a subset of 30,000 posts into live without any existing content or other workspaces.
Each document has a main content collection, so every post always consists of two nodes.

benchmark for import:

deep structure by year and month: 2m 8s
flat structure: 3m 4s

It's ironic that the flat structure already takes longer to import, even though we have to create additional folder nodes in the deep structure.


Now the cruel query test.

In my dataset the month June of 2025 contains 294 posts.

For the case with the deep structure I just fetch all children of the node for the month of June.
For the flat case I use findChildNodes, filtering by the date property.

[1ws] deep structure by year and month: 0.014-0.020 seconds
[1ws] flat structure: 0.338-0.351 seconds

The nested structure is up to about 24 times faster to query (0.014 s vs. 0.338 s for the best runs)!

Add 6 workspaces = 7 with live in total

[7ws] deep structure by year and month: 0.016-0.022 seconds
[7ws] flat structure: 0.498-0.753 seconds

Add 6 workspaces again = 13 with live in total

[13ws] deep structure by year and month: 0.015-0.028 seconds
[13ws] flat structure: 0.679-0.753 seconds
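
To see how the gap develops with the number of workspaces, the ratios can be computed from the ranges listed above. A small sketch that compares the best flat time with the best deep time per run (the choice of best against best is my assumption; other pairings give different factors):

```php
<?php
declare(strict_types=1);

// Measured query time ranges in seconds, copied from the benchmark lists above.
$timings = [
    '1ws'  => ['deep' => [0.014, 0.020], 'flat' => [0.338, 0.351]],
    '7ws'  => ['deep' => [0.016, 0.022], 'flat' => [0.498, 0.753]],
    '13ws' => ['deep' => [0.015, 0.028], 'flat' => [0.679, 0.753]],
];

/** Ratio of the best flat time to the best deep time of one run. */
function speedup(array $run): float
{
    return $run['flat'][0] / $run['deep'][0];
}

foreach ($timings as $label => $run) {
    printf("[%s] deep structure is about %.0fx faster\n", $label, speedup($run));
}
```

With these numbers the factor grows from roughly 24x with one workspace to roughly 31x with 7 and 45x with 13 workspaces, because the deep query time stays almost constant while the flat one keeps growing.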


How can we make the flat structure faster?

So the above timings show it's not a good idea to put a lot of nodes on the same level.
And the query time gets really bad with more and more editors: it already approaches a full second with 12 additional workspaces at 60,000 nodes each.

It's not so much the property filter that makes the querying slow. The slowness is also noticeable when simply querying the entries with a LIMIT to get the first 294 ones:

$items = $subgraph->findChildNodes(
    parentNodeAggregateId: NodeAggregateId::fromString('blog-root'),
    filter: FindChildNodesFilter::create(
        // first page only: limit 294, offset 0
        pagination: Pagination::fromLimitAndOffset(294, 0)
    )
);

A benchmark shows that this is even worse than the property filter:

[13ws] flat structure (get arbitrary first 294 entries): 0.810-0.853 seconds

Edit:
Querying the first 294 entries is slow, but only because of the default ordering ORDER BY h.position ASC. If we omit it in a custom query we get:

[13ws] flat structure (get arbitrary first 294 entries without ordering): pure SQL query time ~0.300 seconds

Finding all entries is still a little slow, but the time does not scale with the result size: querying about one hundredth of the entries already takes more than half the time of fetching all of them.

[13ws] flat structure (get all 30,000 entries): 1.362-1.411 seconds

[13ws] flat structure (COUNT(*) all 30,000 entries): 0.460-0.479 seconds

Thanks for bringing this up. I cannot contribute to the technical side of this, however, I’d like to point out that the given example of News and Blogs is only a subset of the cases in which you need a “flat navigation”.

While News and Blogs could be grouped by year/month etc., there are other cases where you might have many document nodes, e.g.

  • People’s profile pages
  • Courses

So I think the case of having a performant, paginated, flat navigation for entries that would make the normal document tree unusable is valid in any case. As you know, we do this with Psmb.FlatNav. This solution is OK, but of course it misses some useful features like filtering for properties.

I think automated grouping for News records can be a nice feature, but not every integrator will want to use this even for news. It reflects the news date in the URL structure, which is something for example most newspapers don’t do. They normally use the combination of title and an ID.

(As mentioned in another context, hiding such nodes from the document tree will lead to them not being linkable from the new link editor. With the old link editor this worked, because it just used auto-suggest, which was inconvenient but completely independent of the tree configuration.)
