Serialization layer as serious bottleneck

Description

From External Contributors
https://github.com/corda/corda/issues/5725

We have been investigating the performance of our Corda node. After optimizing a great many things we achieved some "ok-ish" numbers. A closer look now reveals that about 80% of the remaining time is spent in the serialization layer. This is rather unexpected, as I would have expected the database, hashing and asymmetric crypto to be the main bottlenecks. The situation is aggravated by the fact that every transaction has a transaction id, which is computed as the hash of all its elements (input states, notaries, output states, time windows, etc.), each of which triggers a serialization again.
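To illustrate why the id computation multiplies the serialization cost, here is a minimal sketch. It is a simplification and not Corda's actual algorithm: the class TxIdSketch, the stand-in serialize method and the flat "hash of leaf hashes" scheme are assumptions made for illustration only. The point it shows is that every component is serialized once more just to obtain the id.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class TxIdSketch {
    /** Counts how often the stand-in serializer is invoked. */
    static final AtomicInteger serializationCalls = new AtomicInteger();

    /** Hypothetical serializer; a real node would run AMQP serialization here. */
    static byte[] serialize(Object component) {
        serializationCalls.incrementAndGet();
        return component.toString().getBytes(StandardCharsets.UTF_8);
    }

    static MessageDigest newSha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Simplified id: hash each serialized component, then hash the leaf hashes together. */
    static byte[] transactionId(List<?> components) {
        MessageDigest root = newSha256();
        for (Object component : components) {
            byte[] leaf = newSha256().digest(serialize(component));
            root.update(leaf);
        }
        return root.digest();
    }

    public static void main(String[] args) {
        List<String> components = List.of("inputState", "notary", "outputState", "timeWindow");
        byte[] id = transactionId(components);
        System.out.println("id bytes: " + id.length);
        System.out.println("serializations for one id: " + serializationCalls.get());
    }
}
```

In this sketch a transaction with n components costs n serializations for the id alone, on top of whatever is serialized for storage or transport.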

For testing purposes to get a closer view we made use of:

and serialized a single state with about two dozen fields. The resulting byte array was 3869 bytes long. One CPU core managed to serialize 2800 of those objects per second. If we assume that a great many objects are part of a transaction, the picture becomes clearer why it takes this amount of time.

As a reference, we serialized the same object with Jackson's ObjectMapper, first constructing a writer for the desired state type and then measuring the performance of serializing that state object. Jackson managed to serialize 99500 objects per second, roughly a factor of 36 compared to AMQP. The JSON result had a length of 1065. I consider JSON rather inefficient, yet it managed to be about 72% smaller than the AMQP output while still being "standalone", not requiring an external model to deserialize. Protocol Buffers and friends would be another order of magnitude, but at the cost of an external model.
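A comparison of this kind can be sketched with a small single-threaded harness. This is not the code used for the numbers above: the class SerializationThroughput and its method names are hypothetical, and the serializer is passed in as a plain function so that either an AMQP or a Jackson call (for example an ObjectWriter obtained via writerFor and invoked with writeValueAsBytes) could be plugged in. For numbers worth publishing, a dedicated tool such as JMH would be the safer choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

class SerializationThroughput {
    /** Serializes the same value repeatedly and returns objects per second. */
    static <T> double objectsPerSecond(T value, Function<T, byte[]> serializer,
                                       int warmup, int iterations) {
        long sink = 0; // consume the results so the calls cannot be optimized away
        for (int i = 0; i < warmup; i++) {
            sink += serializer.apply(value).length;
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += serializer.apply(value).length;
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        if (sink < 0) {
            throw new IllegalStateException("lengths are never negative");
        }
        return iterations * 1e9 / elapsedNanos;
    }

    /** How many times faster the first rate is than the second. */
    static double speedup(double fasterPerSecond, double slowerPerSecond) {
        return fasterPerSecond / slowerPerSecond;
    }

    /** Fraction by which the smaller encoding undercuts the larger one. */
    static double sizeReduction(long largerBytes, long smallerBytes) {
        return 1.0 - (double) smallerBytes / largerBytes;
    }

    public static void main(String[] args) {
        // Stand-in serializer; replace with the serializer under test.
        Function<String, byte[]> toUtf8 = s -> s.getBytes(StandardCharsets.UTF_8);
        double rate = objectsPerSecond("example state", toUtf8, 10_000, 100_000);
        System.out.printf("stand-in serializer: %.0f objects/s%n", rate);

        // Figures reported in this ticket.
        System.out.printf("Jackson vs AMQP speedup: %.1f%n", speedup(99_500, 2_800));
        System.out.printf("JSON size reduction: %.1f%%%n", 100 * sizeReduction(3_869, 1_065));
    }
}
```

With the figures from this ticket, the two helper methods give a speedup of about 35.5 and a size reduction of about 72.5%.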

When looking at it with a profiler, one sees:

Heavy work is needed to serialize a great number of DescribedTypeElement instances. A closer look at the implementation shows, for example:

see https://github.com/corda/corda/blob/4dd51de5c1d14901ce143502c21b87ac0863543f/serialization/src/main/kotlin/net/corda/serialization/internal/amqp/SerializationOutput.kt

A first measure might be to cache the serialization of the schema part, so that the byte array can be obtained directly from a cached schema history. This may provide a decent speed bump. From a database perspective it may also prove worthwhile to store the data and the model separately, avoiding redundant storage of the model part.

For Corda to move towards more high-volume applications, this ticket feels rather important. Alternatively, it would also be nice to see plain JSON support (or something similar): there is widespread support across all devices, it is easy to read and write, there are standards for computing a signature, and there are very performant implementations.
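The caching idea could look roughly like the following sketch. It does not reflect how SerializationOutput is structured: the class SchemaBytesCache, the encodeSchema step and keying the cache by Class are all assumptions chosen for illustration. It only shows the principle that the schema bytes are encoded once per type and then reused as a ready-made byte array.

```java
import java.io.ByteArrayOutputStream;
import java.lang.reflect.Field;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class SchemaBytesCache {
    /** Example state type, used only for the demonstration. */
    static class State {
        final String owner;
        final int amount;
        State(String owner, int amount) { this.owner = owner; this.amount = amount; }
    }

    /** Counts how often the schema is actually encoded. */
    static final AtomicInteger schemaEncodings = new AtomicInteger();

    private static final Map<Class<?>, byte[]> CACHE = new ConcurrentHashMap<>();

    /** Hypothetical expensive step: describe the type's fields as bytes. */
    static byte[] encodeSchema(Class<?> type) {
        schemaEncodings.incrementAndGet();
        StringBuilder sb = new StringBuilder(type.getName());
        for (Field field : type.getDeclaredFields()) {
            sb.append(';').append(field.getName()).append(':').append(field.getType().getName());
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Prepends the cached schema bytes to an already encoded payload. */
    static byte[] serialize(Object value, byte[] payload) {
        byte[] schema = CACHE.computeIfAbsent(value.getClass(), SchemaBytesCache::encodeSchema);
        ByteArrayOutputStream out = new ByteArrayOutputStream(schema.length + payload.length);
        out.write(schema, 0, schema.length);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            serialize(new State("alice", i), new byte[] {(byte) i});
        }
        System.out.println("schema encodings for 1000 objects: " + schemaEncodings.get());
    }
}
```

Storing the data and the model separately, as suggested above, would go one step further: the envelope would carry a reference to the schema instead of the schema bytes themselves.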

Assignee

Unassigned

Reporter

David Rapacchiale

Priority

Medium