RFC: Use Doctrine guid type for persistence_object_identifier

Motivation

Flow magically injects a property called persistence_object_identifier into entities or value objects that don’t define an ID column on their own. This property is defined as a string, and Doctrine will store it in a column of type varchar with length 40. Flow automatically stores a UUID in it.

When digging into the depths of Doctrine, I stumbled upon a field type called guid, designed to handle UUIDs.
Some database platforms have a column type specifically for UUIDs. Doctrine will use that type to store guid fields if it runs on such a platform. If the platform has no specific type for UUIDs, Doctrine will fall back to varchar(36) instead.
For example, Doctrine will use uuid as column type on PostgreSQL. Since MySQL has no specific type, it will default to varchar(36) there. Other platforms supported by Doctrine with a specific guid column type are SQLAnywhere and MS SQL Server.

Why not use guid as persistence_object_identifier field type?

I know this is a huge change to Flow, but it might be worth it. Please tell me what you think.

Comparison between varchar(40) and guid on a PostgreSQL machine

Two basic tables, each has two columns: A primary key, varchar(40) on table 1, uuid on table 2. Each table has an index on the primary column. The second column is just a data payload.
Each test was run five times, time figures are average.

INSERT of 1.000.000 rows, values generated with Flow’s Algorithms::generateUUID():
Table 1: 3899 seconds
Table 2: 3910 seconds

SELECT COUNT (DISTINCT col1):
Table 1: 629 ms
Table 2: 208 ms

1.000.000 times SELECT * using a WHERE clause with previously selected IDs:
Table 1: 112 seconds
Table 2: 109 seconds

100.000 UPDATEs on the payload column using a WHERE clause with previously selected IDs:
Table 1: 391 seconds
Table 2: 381 seconds

I know that these figures only show a very moderate performance gain. The test cases are constructed and not all of them are close to what happens in the real world.

Case Study on PostgreSQL

I’m working on a Flow project that has a very large and very complex PostgreSQL database with millions of datasets. I use a custom PHP trait to change the column type of persistence_object_identifier to guid.
This brought a considerable performance boost.

I don’t have solid figures on the improvement, only that the number of queries slower than 100ms is now at least 50% lower than before.
There are lots of JOINed queries in my project, and I assume that this is common to most, if not all, Flow-based projects. In my project there is a huge performance gain for queries with complex JOINs. Unfortunately I don’t have any before-and-after figues on them. But think of some web requests taking 1/4 less time and importer processes improving by 1/4 to 1/3.

Caveats

When working with the guided persistence_object_identifier on PostgreSQL, I noticed that there are issues when Flow directly uses identifiers based on user input. PostgreSQL checks incoming UUIDs for wellformedness, and if they aren’t well-formed, the query fails.

For example, if you use routes with custom routeParts with objectType, but no special uriPattern, the IdentityRoutePart will use the object’s identifier as the URI segment, resulting in UUIDs in the URL.
When matching an incoming URL (and there is no cache yet), Flow may query the database for objects that are identified by that segment, even if it’s not a UUID. It’s not unlikely that the incoming string is not a UUID, making Flow fail the request with error 500.
To prevent that, I had to implement a custom child class of IdentityRoutePart that checks if the incoming segment really is a UUID first.

If Flow opts for a transition to a guided persistence_object_identifier, a similar functionality would have to be implemented in IdentityRoutePart and other classes that use foreign data as an object identifier.

Migrating an existing Flow instance

PostgreSQL

Doctrine builds a migration to change the column type from varchar(40) to uuid. You have to manually add a USING persistence_object_identifier::uuid type cast to the ALTER TABLE statement. Each existing foreign key/constraint and index on a changed column has to be dropped and re-created. Doctrine does not regard these changes in migrations on its own, so they have to be put there manually.

Consequences of the guid change to MySQL-based installations

On migration, each persistence_object_identifier column will change its varchar length from 40 to 36. In most cases that shouldn’t be a problem, since canonical string representations of UUIDs are 36 chars long.
Only for some edge cases when there are non-GUID-persistence_object_identifiers this can cause trouble.

See also

http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/basic-mapping.html#doctrine-mapping-types

https://www.postgresql.org/docs/9.6/static/datatype-uuid.html

2 Likes

Another detail in favor of UUID in PostgreSQL is the fact the varchar includes collation and text related validations, which in the case of UUID is native/binary, PostgreSQL is slightly faster with UUID and column size is noticable smaller with indexes and column sizes (e.g. with my DB including about 1 million of records)

1 Like

Hey, thank you for the write up, this really helps the drive to make a decision on this topic.

The numbers only show a small performance benefit, except for the SELECT COUNT (DISTINCT col1) . I’d guess this more shows the potential for queries involving more than a single identifier. So the numbers are probably scewing perspective a bit, but are nonetheless helpful, so thank you again!

To prevent that, I had to implement a custom child class of IdentityRoutePart that checks if the incoming segment really is a UUID first.

This does not sound like an unreasonable thing to do generally, so I’d be in support of getting this into Flow as a first step.

Migrating an existing Flow instance

I guess that’s probably the most crucial thing about this change, so this has to be planned and documented very well if this is supposed to go in. So it would be worth spending some time on thinking about how to solve this as smoothly as possible (maybe provide a script that helps in preparing the migrations).