Friday, June 24, 2011

Isolation levels

Working hard at the Till
Ok so let's say your music shop, 'Breako's Beats' has a very simple database to capture all the CDs for sale in the shop.  The database is just one simple table where each row represents a different music CD such as: 'Rank 1 - Airwave' (this song is playing on iTunes as I am writing this and it's catchy so hence the reference!)  Now, let's say there is a column in the table to indicate how many copies of each CD are available in the shop.  Now this shop is big; there are multiple tills and people can also purchase the CDs from the shop over the web.  In addition, there is someone reponsible for adding new rows as new CDs come into the Breako's Beats store. This means your database will have multiple users all looking at and changing the same table and potentially the same rows in the table at the same time. Yikes...

This means potential concurrency issues.  Now what sort of problems could we expect if we just built this system and never thought about concurrency?

Problem number 1: Dirty Reads

Trent Reznor - how many copies left?
A dirty read is when a transaction reads data from another transaction that has been written but not yet committed. This is problematic if the other transaction never commits the data he has changed.  Suppose a customer asked you to check how many copies of Trent Reznor's 'In Motion' was left and you said 3. But this was because the inventory user was in the middle of transaction which set this value from 0 to 3.  But, the inventory user decided not to commit his transaction for whatever reason.  So it's really 0 copies left but you saw 3.  Unless you check again you are going to end up thinking its 3.


Problem number 2: Non-repeatable reads

Eric Saade is Popular!
This is when a transaction reads the same rows of data more than once and gets different results for those rows. In this case the different answers are from data that has been committed by other transactions. So you could be checking the number of copies you have for: Eric Saade's 'Popular' and you get back different answers in the same transaction.  Why? Because this time the inventory guy added more copies and updated the row corresponding to Eric Saade's 'Popular' and committed his transaction while you were still in the middle of your transaction. As you can see, even though this may be dangerous it isn't as dangerous as dirty reads because at least when the change happened it's from a committed transaction not a momentary blip.

The only enhancement you could make here is to block the other user from making the changes until you're finished.  This would mean you'd lock the rows you're reading.  Because this would slow things down, it really depends on what your system needs to determine if this approach makes sense or not.

Problem number 3: Phantom reads

Andy Weatherall before the smoking ban!
This is similar to non-repeatable reads. This time someone rings your shop and asks you what you have from Andy Weatherall.  The first time you check you get:
You decide to check again in the same transaction but before you do that, the inventory guy got a shipment with more of Andy's stuff. He has added it to the system and committed his transaction.  When you re-read you get back the same as above but you also get back the new additions:
This problem is similar to non-repeatable reads but this time rows have been added not changed. The first time there was only two rows for Andy, the subsequent time there was five.

Again, this is probably not the riskiest thing in the world  - at least you're being told the truth! The solution would again involve blocking the second transaction until you have finished your transaction completely.  As was the case with non - repeatable reads, it really depends on your system requirements (and how probable this problems will happen) to decide if it is actually worth going to the hassle of locking rows.

Isolation levels

In databases, "isolation level" is a property which defines how much concurrent transactions can see or can't see of each other. Isolation is the 'I' in ubiquitous ACID database acronym coined by
Andreas Reuter and Theo Harder in 1983.  Isolation level is a configurable property in most databases.  So what can it be set to?  Well there are usually four settings, namely:

Serializable, Repeatable Reads, Read Committed, Read uncommitted

The table below shows the differences.

Isolation level Dirty reads Non-repeatable reads Phantoms
Read Uncommitted Y Y Y
Read Committed N Y Y
Repeatable Read N N Y
Serializable N N N

Y  means that the problem can happen, N means it can't

My recommendation is set to Read Committed.  The reason why is because serializable can be over protective and will impact performance because it incurs a lot of locking.  Repeatable reads only protect you when you re-read the same data in a transaction. This is something that is uncommon in most systems because too much re-reading usually results in bad performance.   You also need to consider if your transactions are generally long or short.  If you don't have long running transactions it's less likely you are going to have collisions and hence you may not have to be so strict in how isolated the transactions need to be.

Dirty reads can be really problematic.  They are not only misleading but produce unusual problems that are difficult to diagnose. This is because the incorrect data you got was never even persisted.

Therefore I would advise to set to at least read committed to eliminate dirty reads.  Only go for a more stronger setting if your system needs it and if the problems such as phantom reads are probable.

As stated, more protection means more locking. The more locking means the slower your system will go as users get blocked.  Details of the locking are in the table below.

Isolation level Write Lock Read Lock Range Lock
Read Uncommitted N N N
Read Committed Y N N
Repeatable Read Y Y N
Serializable Y Y Y

Y means locking is incurred. N means it is not.

Is that all I need to do?

You wish! Isolation levels for concurrent transactions deal only with concurrent transactions. It is much more likely you'll have concurrent usage of the same data even though you don't have concurrent transactions on the same data.  For example:
  • User 1 reads in the data at the beginning of his session
  • User 2 reads in the same data at the beginning of his session. User 2's session is independent of User 1's session.
  • User 2 then makes some changes and commits change. His transaction is finished.
  • User 1's data has become stale, before he even starts a transaction.  He begins his transaction after User 2 has committed User 2's transaction. But User 1's data is stale.  Surely he should have to respect the most up to date version of the data before making changes!
These scenarios are much more likely to happen in your system and warrant serious attention. They are dealt with by using locking policies such as optimistic locking and pessimistic locking. This involves how you architect your application layer, not just your database.




No comments:

Post a Comment