FAIL

Hibernate Query Cache with Snapshot Isolation still harmful!

magnayn

--

A tale of L2 Cache misses and ObjectNotFoundException

In what seems a lifetime ago, I wrote about a nasty situation that can lead to applications poisoning their cache, which will not clear until the cache times out the query entry (which if it’s heavily used, might be “never”).

To be fair, that and the below may only apply to older versions of Hibernate. The ticket I raised for the poisoning issue was closed, and I wasn’t able to recreate it by using a test program in Hibernate 5 with spring boot. Lots has happened between that post and now, so modern stacks may be OK.

However, back in $dayjob, we are still on the Hibernate 4.2 line. And this week we discovered another nasty interaction that gets you errors that look a lot like

org.hibernate.ObjectNotFoundException: No row with the given identifier exists: [my.corp.object.type.thing.BlahBlah#679bac6b-a20d-11e8–9e02–6ff4eec01fb5]

This occurs particularly under load, and the reason it occurs is quite subtle, and will only happen if you are using snapshot isolation.

Here’s what happens:

Session A starts, with timestamp 100. It may do some work, but then goes to sleep or is descheduled.

Session B starts, with timestamp 200. It writes a new object

{id:123,user:1000,value:’foo’}

into the database. Hibernate writes the change timestamp for the table as 200, and places the entity state into the L2 entity cache. It commits it, and ends.

Session C starts, with timestamp 300. It performs a query of ‘find me all the objects where user=1000’, and gets a response of [{id:123,user:1000,value:’foo’}]. Hibernate puts the query into the L2 query cache. The session ends.

Session A re-awakens. It also performs the query ‘find me all the objects where user=1000’

Hibernate looks in the L2 cache. A result is found! Is the result valid? Well, none of the tables mentioned in the query have a timestamp of after when the query was run, so yes — the results are valid.

The Query Cache results just contain the IDs, so it now needs to convert the ID 123 into an actual object. So first of all, the entity cache is consulted, and an entry is found!

However, Hibernate looks at the timestamp of the current session (A) and asks “was this entity written after the session started?”. As the answer is “yes”, it rejects the cache entry.

The cache “miss” then falls through to the database now being asked to return that row.

If we were in SERIALIZABLE, this would work. However, in SNAPSHOT, the results we get will be consistent as from the point of when we started the transaction, so the database says “nope, no row for that”. The Select for ID 123 will return no rows, our database view is frozen in carbonite.

Boom — ObjectNotFoundException.

Fortunately, there are workarounds.

One would be if the internals to hibernate, when asking “is this L2 query cache entry valid”, if in addition to asking “did any table change after this entry”, could also ask “are we in snapshot isolation, and did this entry get stored after our session started?”, that would not then return the unusable response. (Hibernate5 may in fact do this). This would be somewhat similar to what is happening in the entity cache.

You could regard this as an acceptable thing — much like deadlocks — they happen, you just back off and retry.

You could catch the exception, and retry the query without the cache turned on.

Hopefully if you’re seeing the same issue, this may give you some hints as to what it might be!

--

--