Fix race in AppendOnlyPersistentMap

Description

There's a race condition in the logic of AppendOnlyPersistentMap when rare duplicates are encountered. An example that might occur is duplicate (Corda) transactions being recorded from transaction resolution. Indeed we've seen bugs due to this in the past. Here's the scenario, for which I think we can build a test. The race can only occur in the multi-threaded state machine, so the test will need at least need two open database connections/transactions which may require two threads.

Flow Thread 1 - Start DB transaction
Flow Thread 2 - Start DB transaction
Flow Thread 1 - txMap.put("TX1", <transaction1>)
^^^ This line sees nothing in memory and nothing in database, so generates insert statement
Flow Thread 2 - txMap.put("TX1", <transaction1>)
^^^ This line sees something in memory, so does not insert
Flow Thread 2 - Commit DB transaction
<--- Imagine JVM dies here.... Flow Thread 2 thinks we have recorded TX1, but because Flow Thread 1 database transaction is not committed, there is no TX1 in the database. Restarting Flow Thread 2 on JVM restart, the cache is empty and the flow could error due to not being able to find TX1.

Restart JVM (in test, throw away map instance and rollback Flow Thread 1 database transaction)
Flow Thread 2 - Start DB transaction
Flow Thread 2 - txMap.get("TX1") <---- Should return something, but currently returns null

Eventually, after the restart, Flow Thread 1 should continue and insert TX1, but there's a window where Flow Thread 2 could see something unexpected and not be able to find TX1.

A similar situation could apply to e.g. identities (think that uses the same map implementation) etc.

Off the top of my head (so don't assume this is the only approach), I think we could change the append only map somewhat to a) avoid this issue and b) make it simpler by avoiding all the logic around is there or isn't there already a row in the database. We add some additional column to the database primary key. For flows, this would be the flow ID, and when not on a flow we use either a random number or the current time (providing we don't use the same time twice, otherwise we'll get duplicates, especially on windows with it's low resolution clock). We would then get genuine duplicates in the database - but hopefully rarely.

This column should be the last column of the primary key, so that we can continue to select rows without knowing the random number/flow ID (i.e. using primary key of the map, rather than the table). The select logic would need to be modified to restrict the select result to one row in case there are duplicates in the database.

One additional benefit of this for the transactions is that we maintain a separate table to record what flow recorded what transaction, and we would now have all this in one table, simplifying the database and boosting performance due to less overall inserts, and especially because we no longer need to check for existence of entries before inserting.

Assignee

Rick Parker

Reporter

Rick Parker

Labels

Priority

High

Fix versions

None

Ported to...

None

Feature Team

Performance and Platform Sustainability

CVSS Vector

None

Engineering Teams

None

Severity

Medium

Affects versions

Configure