RFC: Non-UUID node identifiers

Hi there,

Motivation

In recent projects, I regularly and increasingly encountered the need for meaningful, predictable node identifiers. Say for example you modeled your product catalog as nodes (which imho makes total sense regarding translations and other dimensions, workspaces etc.) and have related entities (import commands from external systems, quotable offers etc.) which need to be able to retrieve nodes from a given context (if not even directly from the NodeDataRepository) but are unaware of internal node identifiers that have been defined as UUIDs.

Status quo

In the current implementation, it is possible to use custom identifiers, although by some ugly means (AOP, custom reference inspector, custom LinkingService to name some), because here and there, some rather half-heartedly implemented restrictions are in place. The reference editor shortens IDs to 40 characters, the NodeData implementation validates strings in reference properties against the UUID pattern (but only on publish, not on setProperty) etc.

The plan

I’d love to see those restrictions gone for good. While UUIDs are a totally sane, sensible fallback if no other logic is in place, I don’t see the need for them being enforced. Node identifiers only have to be unique within the current subgraph (aka context), for uniqueness in the whole graph we have the variant identifier (currently persistence object identifier). The other way around, during import I can easily use the same UUID twice to break the system anyways. UUIDs by themselves are no guarantee for uniqueness whatsoever.

Thoughts? Opinions?
Should be non-breaking and easily implemented, I’d like to target 3.1 for this

1 Like

Just a quick question: When i have to find imported nodes again i usually use a custom node-name instead of modifying the identifier. That worked well for me until now. Would that cover your use case aswell?

Doesn’t help if you don’t know where in the structure the node is. I have had that case as well.

I would like to have this as well, but I say this needs to be tackled very carefully. Needs tests and overall a good thinking through. But yes, would make a lot of things easier.

I like the idea, but to me it feels like an overlap between node name and -identifier and I hope that we can unify these somehow.

RE: Node names:
What Christian said.
Unlike node identifiers, node names (aka the last segment of a node’s path, there is no queryable name property in the node data model) are unique only among a node’s siblings. Moving nodes when only knowing their name thus is not possible.

RE: Carefulness:
I agree. I think a good start would be to check those restrictions that are already in place and find out why they were implemented in the first place.

RE: Overlap:
SPOILER: In the graph-based node model, nodes no longer have names - with one select exception for nodes within node aggregations (e.g. content collections in documents). No more overlap there anymore :slight_smile:

Mmm, did we already discuss how in the graph based model these names are then replaced?

How would I q(node).children('main') ?

That’s the select exception :slight_smile:

To clear this up: Yes, there is a planned property “name” in the node model, but it is usually empty. Content collections for example, that have to be fetchable like you posted, are the only nodes that actually need a name - so they get one

Mmmm… Not sure I entirely agree yet, but that’s another discussion for a different place.

…leading to some kind of overlap again (@christianm I’m with you and I think this discussion is related somehow).

What do you guys think of this approach (just a rough idea, didn’t think it through yet):

  1. Less restrictions upon the node identifiers (they must be globally unique in the CR, have a max length and might not contain certain characters)
  2. Get rid of the “name” property
  3. Instead add named child node ports (i.e. by default child nodes would be under a port “main” but you could have different ones with different constraints)

Well, you can’t get rid of a minimal degree of overlap.
The node name (I’ve tested moving the name from the node to the connecting edge, which probably matches the idea of a “port” closely, works just as well) must be unique among the siblings and the same for e.g. all pages, while the identifier must be unique within the whole subgraph / tree / content context and thus cannot be the same for all pages.
These are conflicting requirements and thus cannot be handled by a single field. Where we put the second one may be open for discussion, but we won’t be able to get rid of it.

Yes, that’s what I had in mind, cool stuff! :wink:

Absolutely, you’re right! The API would be the same anyways.
To find a node anywhere in the tree by its unique identifier:

q(rootNode).find('#<identifier>')

and to find one relative to a given node:

q(someNode).find('<portName>')

I only find it confusing that in our current implementation the child node defines the “port”. I think the other way around it would be less confusing.
And then I’d be fine with having NodeIdentifier a ValueObject that can be more than just a UUID (if well testet ofc, like Christian wrote).

This would be quite breaky since all current examples and documentation let people expect that this will end up with a collectionNode in flowQuery-Context.

I do’nt say we should not do this. Just that we have to be aware that it is breaking some things. Removing the duplication of concepts between nodeName and identifier would be a good thing.

Can you specify what you mean with “port”?

Oh, the result of q(someNode).find('<portName>') would still be a content collection. What changes here is not the API, just the name of things (and probably where stuff is stored)

1 Like

I’ve given what-could-possibly-go-wrong a little thought. Basically it can be broken down to three insights:

1) Whether a node can be resolved by this identifier is widely unrelated to the identifier format.
If you look at the four possible cases:

  • A uuid is given and a node cannot be resolved (e.g. we requested a node with a uuid identifier)
  • A uuid is given and no node can be resolved (e.g. we requested a node with an asset uuid)
  • A non-uuid is given and a node can be resolved (e.g. we requested a node with a non-uuid identifier)
  • A non-uuid is given and no node can be resolved (e.g. we requested a node with an arbitrary string)

it becomes rather obvious that the only relevant thing here is whether the requested node exists or not. If we need validation at that point, we should not use a UUID but a NodeExists validator

2) Strange things can happen if nodes are requested by identifiers that were meant to be something completely different
The worst case scenario that I can think of is that by some means (Node migration, NodeType change, someone messing with the database like I am right now to prompt this), data inserted via some special editor ends up as a node identifier. Say for example someone used something similar to the HTML node type to insert <script>alert("hi");</script> into a property and then changes it to type reference.
I’m not exactly sure if this is possible, but it it was, it ends up like this:

This happens in the current UI too, btw.

I think the best solution would be to distinguish two different actions:

  • If a node’s identifier is explicitly set/changed: less rigid validation, e.g. lowercase, dashes and a maximum of 255 characters. Maybe a NodeIdentifier value object would be the right approach?
  • If a node is to be resolved but cannot be (e.g. because the requested identifier is <script>alert("hi");</script> ), then don’t evaluate it at all. This way, if we want to, we can even leave the data there to support back-switching the type instead of throwing stuff away on publish.

3) We need to collect requirements for node identifier format
Node identifiers are not only used for node retrieval, but also as cache entry identifiers. All those requirements should be modeled somewhere central, e.g. in said NodeIdentifier value object for the upcoming CR rewrite

For the time being, I’d recommend sticking to validation, but in a less restrictive way.
/^([a-z0-9\-]{1,255})$/
should be still be completely viable for all use cases (it also includes all UUIDs) and could replace
/^([a-f0-9]){8}-([a-f0-9]){4}-([a-f0-9]){4}-([a-f0-9]){4}-([a-f0-9]){12}$/ (UUID)