Hibernate Query Cache with Snapshot Isolation considered Harmful.

Poison for your cache.

magnayn

--

The Hibernate 2nd level cache

Hibernate has a 2nd level cache, with various implementations. We use EHCache, but there are others.

There are 2 cache ‘regions’ that Hibernate uses to keep track of the cache operation that are important. The first is the UpdateTimestampsCache, which tracks the timestamps of the most recent updates to particular tables. I.E: it looks something like this

UpdateTimestampsCache — Stores the last time each of these tables were touched

Secondly, there is the query cache. It stores the text of a query, the parameters used, and maps to the results (the IDs of the entities) and the timestamp for when the query was done. E.g:

Query Cache

When looking at the above cache, if we perform the first query again, the results can be pulled from cache — because the cache value timestamp is after any of the tables mentioned in the query were updated. The second query in the cache is no longer valid — the timestamp is before TABLE_2 was known to have been updated — so the results may now be different.

The way in which Hibernate decides value to use for “Timestamp” when writing these tables has changed from earlier versions — and therein lies a problem.

Timestamps

In 3.2.7, Hibernate picked the timestamp to record from the current session, using session.getTimestamp(). This would result in a timestamp of ‘when the current session started’.

In 3.6.10, the behaviour changed, and the RegionFactory was asked for ‘nextTimestamp’. This would typically result in a timestamp of ‘now’ (with mechanisms to guarantee unique values).

In 4.1.7, the behaviour further changed, and the cache region itself was asked for the ‘nextTimestamp’. This again typically results in a timestamp of ‘now’.

This works fine if your transactions are in SERIALIZABLE isolation, as database locks will ensure that things work correctly — indeed, the choice of timestamp probably improves the performance of the cache, as the database has guaranteed that changes could not have affected the result.

Serializable

A transaction schedule is serializable if its outcome (e.g., the resulting database state) is equal to the outcome of its transactions executed serially, i.e., sequentially without overlapping in time. 

Snapshots

However… the change is a disaster for SNAPSHOT isolation.

SNAPSHOT isolation is a database mode that has many of the benefits of SERIALIZABLE, but does not need read-locks in order to return consistent results. This has the nice effect that it can prevent readers from blocking writers.

SNAPSHOT isolation is best thought of as exactly that — assume the entire database is ‘frozen in time’ at the start point of the transaction, and the results are always the same regardless of any updates to the database. This is good for readers, but for writers we want the additional consistency guarantees that SERIALIZABLE offers, so we configure hibernate to use SNAPSHOT connections for pure, read-only transactions, but SERIALIZABLE for read-write modify operations.

This causes a problem, because when a SNAPSHOT transaction writes a timestamp as ‘now’, this is not correct — the values it has just read (and not locked) are not guarded against change, and the cache can be poisoned. This occurs when you have overlapping transactions — particularly under load, when the scheduler may start a transaction, the Java code fall idle for a while, then be reactivated.

Activity that poisons the cache

The following describes two concurrent transactions that result in the cache being poisoned:

SNAPSHOTS and a poisoned cache

At this point, the query cache is poisoned. Transaction A at timecode 500 is returned a result of ‘old’ from the database because it is in SNAPSHOT isolation. (A serializable transaction would have returned new). Re-running the SELECT transaction will get a cache hit, and a result of old, even though the database has changed.

I.E: The result ‘old’ is not true for timecode 500 — it’s true for timecode 100 — where the SNAPSHOT transaction began.

NB: The same transactions within SERIALIZABLE

The same transactions, but no poisoning

This is safe, because Transaction A @ timecode 500 knows it will block until the outcome of any concurrent transactions touching that row (id=2) are known.

What to do:

I’m quite surprised that there is little information about this ‘in the wild’ — especially since some databases (Oracle) only do snapshot isolation, and snapshot-like behaviour is common in many other environments (e.g: MVCC in HSQLDB).

There’s various different approaches.

  • Disable query cache if transaction is snapshot (works, but ends up disabling a lot)
  • Disable query cache altogether
  • Revert to old (3.2.7) behaviour

I chose the last. This has been done by producing a new set of Cache Regions / RegionFactory that act to wrap the underlying (EHCache) region and regionfactories. If it is detected that the transaction is within a SNAPSHOT, then the timestamp is returned from the Hibernate Session (as it was in 3.2.7), rather than from what the region determines.

There’s a bit of hackery around making sure that the RegionFactory is aware of the SessionFactory — so it can make that determination — but other than that it’s relatively simple. Let me know if you want the source.

Conclusion

I think Hibernate needs an option in RegionFactories to control this behaviour.

--

--