March 24, 2010

Can ORM make your application faster? Part 6: caching

This is the 6th part of sequence of posts describing optimization techniques used by ORM tools.

Optimization techniques

6. Caching

Caching is the most efficient optimization related to read operations:
  • In case with plain SQL, each read operation (if not batched) requires a roundtrip to database server to be completed. Normally this takes 0.1ms at least.
  • If data required for a particular read operation isn't located in cache on database server, it will require ~ 5...10ms (HDD seek time) at least to be processed. A lot depends on hardare here - e.g. if there are SSDs, this time is much shorter.
  • But in-memory cache may reply to such a request in 0.01ms with ease, if necessary data is available there.
Usually ORM utilizes 2 kinds of caches:
  • Session-level caches. Cache everything for the duration of session lifetime. Session in NHibernate and DataContext in LINQ to SQL or EF are examples of such caches. Normally these caches are used implicitly, do not require to be configured and don't provide any expiration policy (i.e. any data cached there is considered as actual at any further moment of time). Frequently the data from such caches is used directly - i.e. they contain references to actual entities used in the user code.
  • Global caches. Cache everything for duration of application's lifetime, configurable, have flexible expiration policy. Frequently this is an open API allowing to integrate with third-party caching framework (e.g. Velocity or memcached). Data from such caches is never used directly - instead, it is forwarded to session-level caches.
Caches handle the following kinds of requests:
  • Fetching entity by its identifier (key) - fetch request.
  • Fetching a collection of entities by owner's key (entity key) and collection field reference (e.g. PropertyInfo). E.g. in case with DataObjects.Net such collections are fields of EntitySet<T> type. This is collection fetch request.
  • Two above cases may actually be more complex: e.g. it might possible to fetch just a single field of an entity by its key (to handle lazy load request) or fetch just a part of collection (for example, just the count of items there). But all these variants don't change the genericity of common rule: any fetch request includes entity key and additional information identifying the data related to entity with this key, that must be retrieved.
  • Finally, there are query result requests, if ORM supports query result caching.
The main issue with caching (especially, with global cache) is possibility to get stale data, and thus, loose data consistency. Almost any database server is capable of providing a consistent and isolated view on data visible in each transaction - this feature is called transaction isolation. When isolation is provided, you may consider all the transactions there are executed sequentially, so you're always the only user interacting with database right now. This is quite convenient - especially when you don't know exactly what sequence of statements is really executed in your transaction (which is frequent in case with ORMs).

Transaction isolation is actually more complex - to fully understand this concept, you must know what are isolation levels, locks, MVCC, deadlocks, version conflicts, transaction reprocessing and so on. I used a simplified description here to give you an imagination of its extreme case (~ serializable isolation level).

One of requirements you must satisfy to allow database server to provide you with transaction isolation is: all the data you use in a particular transaction must be either explicitly fetched from the database, or be an externally defined parameter of operation this transaction performs. Obviously, utilization of any caching API implies this rule is violated: the data read from cache isn't fetched from the database directly; it isn't an operation parameter as well. So in fact, usage of caching API breakes consistency and isolation.

But on practice it's relatively easy to deal with these issues: you must just carefully decide what and when to cache. 

Practical example

Let's think how we could utilize caching at StackOverflow.com:
  • First page there contains a list of questions. Obviously, this list can be cached. Btw, we could try to cache the whole HTML of this list, but there is a highlighting based on user preferences. So ideally, we must cache query result here. 1...10 seconds expiration must be more then enough for this page.
  • Other pages showing lists of questions are good candidates for the same caching policy. So in fact, we must cache query result varying by query parameters, such as tags and options ("newest", "featured", etc).
  • "Tags", "Users", "Badges" - again, it's a good idea to cache the result of a query rendering a single list there for some tiny period.
  • A page showing answers to a particular question: I suspect it's a bad idea to cache anything related to list of answers there, since such pages are rarely viewed by multiple (hundreds of) users simultaneously.
  • But I'd cache objects related to user's profile (reputation, badge count, etc.); moreover, expiration period here must be a bit longer (minite?). This information is "attached" to almost any list item, and there are lots of such items on any page. So likely, it's a bad idea to fetch it on each request - even with join. On the other hand, there are about 150K users at all, so it's possible to manually "refresh" the whole cache in less then a second. If this happens nearly once per minute, we spend <2% of time on this. Actual policy here must be more flexible: e.g. I'd simply manually mark cache entry as dirty on each update of current user's profile to ensure the changes are shown almost immediately.
  • Finally, I'd ensure any data modifying actions deal with data fetched directly from the database (i.e. without any global cache utilization). "Add answer", "Add comment", "Vote" are examples of such actions. The consistency is essential here; moreover, data modifying operations always imply evaluation of some preconditions on SQL Server (e.g. check for presence of expected key = index seek), so reading the same data before subsequent write must not significantly increase the time required to perform the whole operation. Btw, you must notice such operations are normally executed via AJAX requests on StackOverflow - so it is really easy to apply a special caching policy for this special API.
  • I'd add an API allowing to update # of views of a particular question in bulks with delay, otherwise we'd get a write transaction on any question view. Such an API must simply gather such info during a certain period (or until certain limit is reached), and update all the affected counters in a single transaction afterward.
Such a caching policy, on one hand, ensures consistency isn't severely broken (e.g. a bit outdated user ranks and results are ok for us), and on the other, allows the application to render the most important pages much faster - in fact, all the essential lists are fetched from server just once per second; user info is updated nearly once per minute.

Btw, there is some info on actual caching policy used by StackOverflow.com.

Typical caching policies

I'll try to enumerate the most important cases, since a lot depends on a particular scenario here:
  • Static, or nearly static info. It is either never updated, or is updated so rarely that it is possible to send purge request to cache on its update and wait till its completion. If all these actions are performed during transaction updating such static info, we can easily ensure the consistency isn't broken (we must just ensure shared lock is acquired any time such info might be propagated to cache).  Otherwise there is a chance we might get stale info until it is expired, but frequently even this is acceptable (e.g. it's ok for user rank info @ StackOverflow; normally the same can be said about such info as user membership in groups and so on).
  • Rarely changing, but not really essential info. A good example is user reputation at StackOverflow: if we would cache it for a minute for some user, most likely no one would notice this. So the simplest caching policy with expiration can be used here, except for updates (they must deal with actual info).
  • Frequently changing, but frequently accessed info. This is the most complex scenario, everything depends on actual conditions here. The simplest case is page view counter-like info: ensure it is updated in separated transaction w/o any caching (ideally, for multiple counters in bulks) and allow UI to read its cached version expiring e.g. once per minute. The complex case is something like frequently mutating directory-like structure: you can't cache each folder individually, since they won't expire simultaneously, and thus you have a chance of getting absolutely inconsistent view (there can be loops, etc. - the graphs that are simply impossible when there is no cache). My advice here is to avoid any caching, if this is possible. If not, you must think about caching the whole graphs of such objects - next section describes this case.
  • Cached lists or graphs. The case with lists at StackOverflow is typical: we're going to cache first N pages of each of such list to provide them faster. Btw, it must be a good idea to use rendered content-level caching instead of entity-level caching here.
Caching APIs

Basically, there are just two cases:
  • Explicit API. Allows to get cached object by its key, purge the object by its key and so on. See e.g. Velocity API for details. Good, when fine-tuned caching control is preferable (may be 20% of cases when caching is necessary).
  • Implicit API. A set of runes defining default caching policy for entities. E.g. it might allow to state that ApplicationSettingsUser, Role and UserSettings types are always handled as "nearly static" objects transparently for us; the same approach must work for queries containing cache control directives. Good, when one of default policies is fully suitable (other 80% of cases).
Obviously, I'd prefer to have both APIs available ;)

Recommendations
  • Avoid using caching, if it isn't really necessary. 
  • Pay attention to everything else first - e.g. transaction boundaries, possibility to use bulk updates, etc. In particular, profile the application.
  • On the other hand, if there is a chance caching will be necessary, design for caching - e.g. if some entity contains large part of rarely and tiny part of frequently changing info, it's a good idea to split it into two entities.
  • Identify few places where load is maximal, and start employing caching just there. Move iteratively, until desirable performance is reached.
  • Care about consistency. Inconsistency-related bugs are quite difficult to identify. Avoid using caches at all in data modifying transactions.
Finally,

using explicit caching with plain SQL isn't difficult at all, but providing implicit caching API must be much more complex.

Return to the first post of this set (introduction and TOC is there).