March 30, 2010

DataObjects.Net v4.2 installer is updated

I finally decided to update the v4.2 installer to the latest nightly build - it contains a few more minor fixes (the issues are described in our support forum). It is uploading to our Download area right now.

This also perfectly demonstrates our new "continuous stability" policy - now we really can publish a bugfix release the next day.

Why the next day? Currently, each night is spent on thorough testing of the "stable" branch:
  • The default test sequence consists of 1020 storage-independent tests (Core, etc.) + 1550 storage-dependent ones on 10 different database engines (so effectively there are 1020 + 10 * 1550 tests) running on each commit. This sequence exposes about 95-99% of issues immediately.
  • The nightly test sequence runs all storage-dependent tests in 6 different configurations (overriding default mappings with 3 different inheritance mapping strategies and 2 different TypeId mapping policies, details are here), so effectively there are 1020 + 60 * 1550 ≈ 94K tests. And they really take several hours to run on our testing farm consisting of 4 dedicated, but moderate PCs. A joke around this: when a PC gets old, it is plugged into the Matrix as an Agent (our testing PCs are named AgentSmith, AgentJohnson, AgentThompson, etc.).
So each morning we know quite precisely whether our current build is really stable.

Viva Mercurial! Viva TeamCity!

March 29, 2010

Xtensive Spectator: our corporate Skype bot :)

We use Skype for many of our in-company conversations - mainly because of its group chats. Historically we have had a "Came in, came out" chat there ("Пришел, ушел" in Russian), and the rules for writing into this chat are quite simple:
  • When an employee enters our office, they write "++" there (or "+1" ;) )
  • When an employee leaves our office, they write "--" there
  • If someone starts to work @ home, they type "++ @ home", and vice versa
  • Generally, it's a bad idea to write anything else into this chat.
Profit:
  • All of us know when someone arrives at or leaves the office
  • All of us know who comes in too late (actually, this doesn't work as we expected)
  • Finally, it's possible to gather some statistics based on this info.
The last part (statistics) was implemented a few months ago: Marat Faskhiev has written a Skype bot capable of participating in our "Came in, came out" chat and providing textual reports in response to commands like "report" sent to it personally.

Screenshots

++ and -- messages:


Reports:

("month" is report showing hours we spent in March; only a part of employees is visible on screenshots)

Obviously, this is just an experiment for us - i.e. as far as I know, such a "big brother" isn't taken seriously by most of our employees; on the other hand, I believe it can eventually be useful. The statistics it provides give some insight into how much time people really spend on work, and which of them work harder.

And, what is more attractive, such a bot can be extended to do e.g. the following:
  • Participate in all important Skype chats as a corporate chat recorder
  • Provide a browser-based UI allowing you to search chat history, get permalinks to certain discussions (e.g. related to bugs) and, obviously, browse various statistics
  • Operate as an integration point: e.g. with RSS feeds (most likely such bots already exist), issue trackers, the build server, Twitter and so on.
So the question is: do you think this tiny application can evolve into a corporate tool for small companies like ours that use Skype intensively, and what features must be added to make it really useful?


Btw, would you like to see it as a DO4 sample application?

March 26, 2010

New repository layout and stable nightly builds

This is important if you're working with our Mercurial repository. The new layout is described in Readme.txt located in the repository root folder.

The most important change: the latest stable version is now in the stable branch, and the latest development version is in the default branch. So to get the latest stable code, type:
hg pull
hg up stable
And since this step is done, nightly builds are now always built from the stable branch - i.e. there are no risky changes like new features there, only bugfixes, new samples and, possibly, improved helpers (like DisconnectedResult, which will be improved and moved to Xtensive.Storage later).

So in short, from this moment it's much safer to use nightly builds: they are at least as stable as the latest release.

Discounts & subscriptions

I just updated "License" page in DataObjects.Net Wiki - now it reflects some changes we're planning to implement. There are:
  • Discounts: right now we offer 15% discount for new orders, and 10% - for subscription prolongations. The offer is intact till March 31, 2010.
  • Subscriptions: now you can pay for SMB and Enterprise licenses by ordering a subscription with 2-month billing period. So totally there are 6 payments, 20% of pay-at-once license cost each, thus total license cost is about 20%. But imagine: 99 USD per month - it's cheaper than VPS, and there is 15% discount! Conditions will never be more attractive. I wrote we'll adjust the prices in June, mainly to properly position the product relatively to its competitors. Check out e.g. this one, and think how much of such functional can be covered by DO4 with DisconnectedState.
  • Internal license is not available now.
P.S. First version of WCF sample is pushed to Google Code, but I'll be able to describe it only tomorrow. You'll find it in the nightly build as well (will be available ~ in 10 hours).

March 25, 2010

Upcoming changes

After a pretty long analysis of DataObjects.Net marketing & sales strategy, we decided to make a set of dramatic changes to it. Today I'll list just the most important decisions and intentions.

Most of the decisions described here will be implemented in summer (starting from June). But we may decide to implement a few of them ASAP (I'll explicitly mention which ones).

1. Licensing policy

Now it's obvious that our current licensing policy may lead to a fiasco:
  • Commercial licenses are indistinguishable by features. So the only reason to acquire the appropriate license type is a legal one - but it's desirable (for customers) to feel some real difference between differently priced licenses.
  • Moreover, the GPL version is indistinguishable from the commercial one. So people evaluating the product can do this indefinitely, and this factor alone significantly devalues the product in their minds.
  • Originally we provided the GPL license pursuing two goals: a) to simplify evaluation of the product when it was really crude (it is still not ideal, but now it is very close to being stable), and b) to attract more people by the "freeware" nature of this license. As far as we can judge, part b) didn't work: people who use mainly freeware products distinguish between partially commercial and completely free ones anyway; moreover, it seems the presence of a GPL license ultimately makes the product less attractive for people looking for commercial alternatives. So this license does nothing but blur the commercial positioning of the product, and that's bad.
Decisions:
  • We'll eliminate the Internal license. This may be done ASAP. The presence of this license adds unnecessary complexity to the relatively simple licensing model we have now.
  • The GPL license must be replaced by Community / Academic Editions. They'll be limited by features, although the restrictions there will be mainly annoying ones (limited usage of productivity & integration related features). In short, the restrictions there will be more relaxed in comparison to the Express Edition of DataObjects.Net 3.X.
  • Enterprise, SMB and Personal licenses must also differ by features. Access to source code, nightly builds & the repository, different response time limits for support requests and additional consulting hours must be the main differentiating factors here. The Enterprise version must further include some additional scalability-related features (e.g. something like "fine-grained cache control" or "integration with memcached").
  • This implies license keys and checks are coming back. Of course, they'll work only at compile time.
  • MVCs (most valuable customers) will be provided with most of the benefits of the Enterprise license independently of their actual license type.
  • A GPL version of DataObjects.Net will anyway be available for some period, but its code base will be updated with a significant delay; moreover, its feature set will be equal to the Community Edition.
To summarize the above: going forward we will explicitly position DataObjects.Net as a commercial product.

2. Pricing

Our current pricing policy is not aligned with competitors' - mainly, we're providing a noticeably cheaper option, taking into account that we sell only company-wide licenses. So this must be fixed in June.

But the worst thing is, again, the presence of the freeware "GPL edition". Having the same feature set, it significantly devalues the other options, so we should make this edition distinguishable by features.

And finally, we will:
  • add a subscription model for any of the licenses. Subscribers will pay monthly or quarterly; the yearly subscription price will be ~15-20% higher than the standard license cost, but each payment will be relatively small.
  • provide a 45-day money back guarantee.
  • make both of these options available till the end of this month.
3. Support

Since the product is young, we are providing help to everyone using it - and we'll continue to do so. But we must:
  • Slowly introduce some differentiation in e.g. the speed of free and paid support. We must finally make paid support valuable.
  • Provide something better than our support forum. As part of this, we must finally solve the issue with different logins and passwords.
  • Stop the policy of frequently updating installers along with nightly builds (allowing users of non-commercial versions to download the latest bugfixes). It completely devalues nightly builds.
4. v4.3, v2010

Most of the listed changes will be bound to the release of a new version of DataObjects.Net. We are already pretty tired of 4.X, so the new one will be DataObjects.Net 2010, which will be the successor of the planned v4.3.

5. Discount periods

There will be a few discount periods before the pricing and licensing model changes. The nearest one will start this week. So if you're thinking about purchasing a commercial license, stay tuned - this spring will likely be the best time to acquire one.

Nice Bloom filter application

Today I accidentally found a couple of interesting files in one of Google Chrome's folders:
  • Safe Browsing Bloom
  • Safe Browsing Bloom Filter 2
Conclusion: Chrome uses Bloom filters to make a preliminary decision on whether a particular web site is malicious or safe. Cool idea!

Let me explain how it works (a small code sketch follows the list):
  • The web site URL is converted to some canonical form (or a set of such forms - e.g. by sequentially stripping all the sub-domains; in this case the check is performed for each such URL).
  • N of its "long" hashes are computed (likely, these are 64-bit hashes).
  • The value of each hash is multiplied by a scale factor. That's how N addresses of bits in a bit array called a Bloom filter are computed.
  • The filter is designed so that if all these bits are set, the site is malicious with very high probability (in this case Chrome performs a precise verification - likely, by issuing a request to the corresponding Google service); otherwise it is guaranteed to be safe.
  • A more detailed description of Bloom filters is here.
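To make the mechanics above concrete, here is a minimal illustrative sketch in C# - the class, the salted FNV-1a hashing and the numbers are my own assumptions for this post, not Chrome's actual implementation:

using System;
using System.Collections;
using System.Collections.Generic;

// Minimal Bloom filter sketch: k hash functions address bits in a bit array.
// If any addressed bit is zero, the key is definitely not in the set;
// if all of them are set, it is "probably present" and a precise check is still needed.
public class BloomFilter
{
  private readonly BitArray bits;
  private readonly int hashCount;

  public BloomFilter(int bitCount, int hashCount)
  {
    bits = new BitArray(bitCount);
    this.hashCount = hashCount;
  }

  public void Add(string key)
  {
    foreach (var index in GetBitIndexes(key))
      bits[index] = true;
  }

  public bool ProbablyContains(string key)
  {
    foreach (var index in GetBitIndexes(key))
      if (!bits[index])
        return false; // guaranteed to be absent
    return true; // possibly present - run the precise (remote) check now
  }

  private IEnumerable<int> GetBitIndexes(string key)
  {
    // k different 64-bit hashes are derived here by salting one hash function;
    // a real implementation would use independent hash functions.
    for (int i = 0; i < hashCount; i++) {
      ulong hash = Fnv1a64(i + ":" + key);
      yield return (int) (hash % (ulong) bits.Length);
    }
  }

  private static ulong Fnv1a64(string value)
  {
    ulong hash = 14695981039346656037UL; // FNV-1a 64-bit offset basis
    foreach (char c in value) {
      hash ^= c;
      hash *= 1099511628211UL; // FNV-1a 64-bit prime
    }
    return hash;
  }
}

Each canonical form of the URL would be passed through ProbablyContains; only a "probably present" answer forces the precise remote verification.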
The benefits of this method:
  • The size of such a filter is considerably smaller than the size of any other structure providing precise "yes/no" answers to similar questions. For example, if it were a set (the data structure), its data length would be nearly equal to the total length of all the URLs of malicious sites it "remembers". But a Bloom filter producing false positive responses with a probability of 1% (note that it can't produce a false negative response) requires just 9.6 bits per URL of a malicious web site it "remembers", i.e. a little over one byte (the formula right after this list shows where this number comes from)! Taking into account that the size of Chrome's Bloom filters is about 18Mb and assuming they are really built for such a false positive response probability, this means they contain information about millions of malicious web sites!
  • The Bloom filter allows Chrome to use the precise verification service practically only when the user actually goes to a malicious web site. Isn't it wonderful? ;)
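For reference, the 9.6 bits figure follows from the standard Bloom filter sizing formula: with an optimal number of hash functions, bits per element = -ln(p) / (ln 2)^2; for p = 0.01 this gives 4.6 / 0.48 ≈ 9.6 bits, i.e. a little over one byte per "remembered" URL.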
P.S. I paid attention to these files mainly because our Xtensive.Indexing project contains an excellent implementation of Bloom filters, and Xtensive.Core contains an API allowing you to calculate 64-bit hashes. These hashers are used by our Bloom filters to obtain the necessary number of 64-bit hash values (or, more precisely, the necessary number of values of distinct 64-bit hash functions). Our hashing API provides 64-bit hashers for base CLR types and allows you to write them for your own types (you just implement a Hasher for each of your own types; this is pretty simple, because hashers for field values are already available).

March 24, 2010

Can ORM make your application faster? Part 6: caching

This is the 6th part of a sequence of posts describing optimization techniques used by ORM tools.

Optimization techniques

6. Caching

Caching is the most efficient optimization related to read operations:
  • In the case of plain SQL, each read operation (if not batched) requires a roundtrip to the database server to be completed. Normally this takes at least 0.1ms.
  • If the data required for a particular read operation isn't located in the database server's cache, processing it will require at least ~5...10ms (HDD seek time). A lot depends on hardware here - e.g. with SSDs this time is much shorter.
  • But an in-memory cache may easily reply to such a request in 0.01ms, if the necessary data is available there.
Usually an ORM utilizes 2 kinds of caches (a sketch of how they cooperate follows this list):
  • Session-level caches. They cache everything for the duration of the session lifetime. Session in NHibernate and DataContext in LINQ to SQL or EF are examples of such caches. Normally these caches are used implicitly, do not require any configuration and don't provide any expiration policy (i.e. any data cached there is considered up to date at any later moment). Frequently the data from such caches is used directly - i.e. they contain references to the actual entities used in the user code.
  • Global caches. They cache everything for the duration of the application's lifetime, are configurable and have a flexible expiration policy. Frequently this is an open API allowing integration with a third-party caching framework (e.g. Velocity or memcached). Data from such caches is never used directly - instead, it is forwarded to session-level caches.
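Here is a hypothetical sketch of how the two levels cooperate; the class and its members are invented for illustration (real ORMs keep one session-level cache per session and one global cache per domain, lumped together here purely for brevity):

using System;
using System.Collections.Generic;

// Illustrative two-level cache: a fetch consults the session-level cache first,
// then the expiring global cache, and only then the database; whatever comes
// from the global cache or the database ends up in the session-level cache.
public class TwoLevelCache<TKey, TState>
{
  private class Entry
  {
    public TState State;
    public DateTime Expires;
  }

  private readonly Dictionary<TKey, TState> sessionCache = new Dictionary<TKey, TState>();
  private readonly Dictionary<TKey, Entry> globalCache = new Dictionary<TKey, Entry>();
  private readonly Func<TKey, TState> fetchFromDatabase;
  private readonly TimeSpan expiration;

  public TwoLevelCache(Func<TKey, TState> fetchFromDatabase, TimeSpan expiration)
  {
    this.fetchFromDatabase = fetchFromDatabase;
    this.expiration = expiration;
  }

  public TState Fetch(TKey key)
  {
    // 1. Session-level cache: no expiration, data is used directly.
    TState state;
    if (sessionCache.TryGetValue(key, out state))
      return state;

    // 2. Global cache: shared, entries expire.
    Entry entry;
    if (globalCache.TryGetValue(key, out entry) && entry.Expires > DateTime.UtcNow)
      state = entry.State;
    else {
      // 3. Miss or stale entry: go to the database and refresh the global cache.
      state = fetchFromDatabase(key);
      globalCache[key] = new Entry { State = state, Expires = DateTime.UtcNow + expiration };
    }

    // Data from the global cache is always forwarded to the session-level cache.
    sessionCache[key] = state;
    return state;
  }
}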
Caches handle the following kinds of requests:
  • Fetching an entity by its identifier (key) - a fetch request.
  • Fetching a collection of entities by the owner's key (entity key) and a collection field reference (e.g. a PropertyInfo). E.g. in the case of DataObjects.Net such collections are fields of the EntitySet<T> type. This is a collection fetch request.
  • The two cases above may actually be more complex: e.g. it might be possible to fetch just a single field of an entity by its key (to handle a lazy load request) or just a part of a collection (for example, just the count of items there). But all these variants don't change the general rule: any fetch request includes an entity key and additional information identifying the data related to the entity with this key that must be retrieved.
  • Finally, there are query result requests, if the ORM supports query result caching.
The main issue with caching (especially with a global cache) is the possibility of getting stale data, and thus losing data consistency. Almost any database server is capable of providing a consistent and isolated view of the data visible in each transaction - this feature is called transaction isolation. When isolation is provided, you may consider all the transactions to be executed sequentially, as if you're always the only user interacting with the database. This is quite convenient - especially when you don't know exactly what sequence of statements is really executed in your transaction (which is frequent in the case of ORMs).

Transaction isolation is actually more complex - to fully understand this concept, you must know what isolation levels, locks, MVCC, deadlocks, version conflicts, transaction reprocessing and so on are. I used a simplified description here to give you an idea of its extreme case (~ serializable isolation level).

One of the requirements you must satisfy to allow the database server to provide you with transaction isolation is: all the data you use in a particular transaction must either be explicitly fetched from the database, or be an externally defined parameter of the operation this transaction performs. Obviously, utilization of any caching API implies this rule is violated: the data read from the cache isn't fetched from the database directly, and it isn't an operation parameter either. So in fact, usage of a caching API breaks consistency and isolation.

But in practice it's relatively easy to deal with these issues: you just must carefully decide what and when to cache.

Practical example

Let's think about how we could utilize caching at StackOverflow.com:
  • The first page there contains a list of questions. Obviously, this list can be cached. Btw, we could try to cache the whole HTML of this list, but there is highlighting based on user preferences. So ideally, we must cache the query result here. A 1...10 second expiration must be more than enough for this page.
  • Other pages showing lists of questions are good candidates for the same caching policy. So in fact, we must cache query results varying by query parameters, such as tags and options ("newest", "featured", etc.).
  • "Tags", "Users", "Badges" - again, it's a good idea to cache the result of the query rendering a single list there for some tiny period.
  • A page showing answers to a particular question: I suspect it's a bad idea to cache anything related to the list of answers there, since such pages are rarely viewed by multiple (hundreds of) users simultaneously.
  • But I'd cache objects related to the user's profile (reputation, badge count, etc.); moreover, the expiration period here can be a bit longer (a minute?). This information is "attached" to almost any list item, and there are lots of such items on any page. So likely, it's a bad idea to fetch it on each request - even with a join. On the other hand, there are only about 150K users in total, so it's possible to manually "refresh" the whole cache in less than a second. If this happens roughly once per minute, we spend <2% of the time on this. The actual policy here must be more flexible: e.g. I'd simply manually mark a cache entry as dirty on each update of the current user's profile to ensure the changes are shown almost immediately.
  • Finally, I'd ensure any data-modifying actions deal with data fetched directly from the database (i.e. without any global cache utilization). "Add answer", "Add comment", "Vote" are examples of such actions. Consistency is essential here; moreover, data-modifying operations always imply evaluation of some preconditions on SQL Server (e.g. a check for the presence of an expected key = an index seek), so reading the same data before the subsequent write must not significantly increase the time required to perform the whole operation. Btw, you may notice such operations are normally executed via AJAX requests on StackOverflow - so it is really easy to apply a special caching policy to this special API.
  • I'd add an API allowing the # of views of a particular question to be updated in bulk and with a delay, otherwise we'd get a write transaction on every question view. Such an API must simply gather this info during a certain period (or until a certain limit is reached), and update all the affected counters in a single transaction afterwards (a sketch of such an API follows below).
Such a caching policy, on the one hand, ensures consistency isn't severely broken (e.g. slightly outdated user ranks and results are OK for us), and on the other, allows the application to render the most important pages much faster - in fact, all the essential lists are fetched from the server just once per second, and user info is updated roughly once per minute.
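The delayed bulk view-counter updates mentioned in the list above could be sketched roughly like this; the class, the threshold and the flush delegate are all hypothetical illustrations, not an API of StackOverflow or any ORM:

using System;
using System.Collections.Generic;

// Illustrative sketch: view counts are accumulated in memory and flushed
// to the database in a single transaction once a pending-count threshold is reached.
public class ViewCounterBuffer
{
  private readonly object sync = new object();
  private readonly Dictionary<int, int> pendingViews = new Dictionary<int, int>();
  private readonly int flushThreshold;
  private readonly Action<IDictionary<int, int>> flushAction; // e.g. one UPDATE batch in one transaction

  public ViewCounterBuffer(int flushThreshold, Action<IDictionary<int, int>> flushAction)
  {
    this.flushThreshold = flushThreshold;
    this.flushAction = flushAction;
  }

  public void RegisterView(int questionId)
  {
    Dictionary<int, int> toFlush = null;
    lock (sync) {
      int count;
      pendingViews.TryGetValue(questionId, out count);
      pendingViews[questionId] = count + 1;
      if (pendingViews.Count >= flushThreshold) {
        toFlush = new Dictionary<int, int>(pendingViews);
        pendingViews.Clear();
      }
    }
    if (toFlush != null)
      flushAction(toFlush); // a single write transaction updating many counters
  }
}

A periodic timer-based flush could be added the same way; the point is that thousands of page views translate into one short write transaction.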

Btw, there is some info on the actual caching policy used by StackOverflow.com.

Typical caching policies

I'll try to enumerate the most important cases, since a lot depends on the particular scenario here:
  • Static, or nearly static info. It is either never updated, or is updated so rarely that it is possible to send a purge request to the cache on its update and wait for its completion. If all these actions are performed inside the transaction updating such static info, we can easily ensure consistency isn't broken (we just must ensure a shared lock is acquired any time such info might be propagated to the cache). Otherwise there is a chance we might get stale info until it expires, but frequently even this is acceptable (e.g. it's OK for user rank info @ StackOverflow; normally the same can be said about such info as user membership in groups and so on).
  • Rarely changing, but not really essential info. A good example is user reputation at StackOverflow: if we cached it for a minute for some user, most likely no one would notice. So the simplest caching policy with expiration can be used here, except for updates (they must deal with the actual info).
  • Frequently changing, but frequently accessed info. This is the most complex scenario; everything depends on the actual conditions here. The simplest case is page-view-counter-like info: ensure it is updated in a separate transaction w/o any caching (ideally, for multiple counters in bulk) and allow the UI to read its cached version expiring e.g. once per minute. The complex case is something like a frequently mutating directory-like structure: you can't cache each folder individually, since they won't expire simultaneously, and thus you have a chance of getting an absolutely inconsistent view (there can be loops, etc. - graphs that are simply impossible when there is no cache). My advice here is to avoid any caching, if this is possible. If not, you must think about caching the whole graphs of such objects - the next item describes this case.
  • Cached lists or graphs. The case with lists at StackOverflow is typical: we're going to cache the first N pages of each such list to serve them faster. Btw, it might be a good idea to use rendered content-level caching instead of entity-level caching here.
Caching APIs

Basically, there are just two cases:
  • Explicit API. It allows you to get a cached object by its key, purge an object by its key and so on. See e.g. the Velocity API for details. Good when fine-tuned caching control is preferable (maybe 20% of the cases where caching is necessary).
  • Implicit API. A set of rules defining the default caching policy for entities. E.g. it might allow you to state that the ApplicationSettings, User, Role and UserSettings types are always handled as "nearly static" objects transparently for us; the same approach must work for queries containing cache control directives. Good when one of the default policies is fully suitable (the other 80% of cases).
Obviously, I'd prefer to have both APIs available ;)
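Purely to illustrate the difference between the two styles, here is a hypothetical pair of usages; neither the attribute nor the ICache contract is a real Velocity or DataObjects.Net API:

using System;

// Implicit API: a declarative policy the (hypothetical) ORM applies transparently.
public enum CachePolicyKind { None, NearlyStatic, Expiring }

[AttributeUsage(AttributeTargets.Class)]
public sealed class CachePolicyAttribute : Attribute
{
  public CachePolicyKind Kind { get; set; }
  public double ExpirationSeconds { get; set; }
}

// "Nearly static" types would be cached transparently and purged on update.
[CachePolicy(Kind = CachePolicyKind.NearlyStatic, ExpirationSeconds = 600)]
public class Role
{
  public int Id { get; set; }
  public string Name { get; set; }
}

// Explicit API, by contrast, is imperative: the cache is consulted and purged
// by key directly in application code.
public interface ICache
{
  object Get(string key);
  void Put(string key, object value, TimeSpan expiration);
  void Remove(string key);
}

public static class ExplicitUsageExample
{
  public static Role GetRole(ICache cache, Func<int, Role> fetchFromDatabase, int id)
  {
    var role = (Role) cache.Get("Role/" + id);
    if (role == null) {
      role = fetchFromDatabase(id);
      cache.Put("Role/" + id, role, TimeSpan.FromMinutes(10));
    }
    return role;
  }
}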

Recommendations
  • Avoid caching if it isn't really necessary.
  • Pay attention to everything else first - e.g. transaction boundaries, the possibility of using bulk updates, etc. In particular, profile the application.
  • On the other hand, if there is a chance caching will be necessary, design for caching - e.g. if some entity contains a large portion of rarely changing and a tiny portion of frequently changing info, it's a good idea to split it into two entities.
  • Identify the few places where the load is highest, and start employing caching just there. Move iteratively, until the desirable performance is reached.
  • Care about consistency. Inconsistency-related bugs are quite difficult to identify. Avoid using caches at all in data-modifying transactions.
Finally,

using explicit caching with plain SQL isn't difficult at all, but providing an implicit caching API must be much more complex.

Return to the first post of this set (the introduction and TOC are there).

March 19, 2010

Can ORM make your application faster? Part 5: asynchronous batch execution

This is the 5th part of a sequence of posts describing optimization techniques used by ORM tools.

Optimization techniques

5. Asynchronous batch execution

Disclaimer: as far as I know, this optimization is not implemented in any ORM tool yet. So this post is nothing more than pure theory.

Some batches are intended to return a result to the user code immediately; but there are batches that are executed just for their side effects (i.e. database modifications). The only interesting result here is an error, and usually it does not matter whether you get it now or later in the same transaction (moreover, in many cases, e.g. with particular MVCC implementations, certain errors are really detected only on transaction commit).

Based on this, we can implement one more interesting optimization:
  • All the batches executed just for side effects are executed asynchronously, but certainly sequentially and synchronously on the underlying connection object. This can be achieved if all the underlying work is done in a background thread dedicated to this. Let's call the abstraction executing our batches AsyncProcessor (AP further on).
  • The batches returning some result are executed synchronously by the AP, and any errors it gets during their execution are thrown to the application code directly.
  • If some batch executed asynchronously by the AP fails, the AP enters an error state. Being in the error state, it re-throws the original exception on any subsequent attempt to use it. The error state is cleared each time a new transaction starts.
The result of this optimization: when mainly CRUD operations are performed, the thread preparing the data for them (i.e. the thread creating and modifying entities) doesn't waste its time waiting for database server replies. So the application and the database server operate in parallel in this case.
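A minimal sketch of such an AsyncProcessor, assuming one dedicated worker thread and the error-state behavior described above; the class is purely illustrative, not an existing DataObjects.Net type:

using System;
using System.Collections.Generic;
using System.Threading;

// Illustrative AsyncProcessor: side-effect-only batches are queued and executed
// sequentially by a dedicated background thread; batches whose result is needed
// right now are executed synchronously after the queue is drained. The first
// asynchronous failure puts the processor into an error state that is re-thrown
// on any further use until ClearErrorState() is called (a new transaction starts).
public class AsyncProcessor
{
  private readonly Queue<Action> queue = new Queue<Action>();
  private readonly object sync = new object();
  private bool executing;
  private Exception error;

  public AsyncProcessor()
  {
    new Thread(ProcessQueue) { IsBackground = true }.Start();
  }

  public void ExecuteAsync(Action batch) // batch executed just for side effects
  {
    lock (sync) {
      ThrowIfInErrorState();
      queue.Enqueue(batch);
      Monitor.PulseAll(sync);
    }
  }

  public T Execute<T>(Func<T> batch) // batch whose result is needed immediately
  {
    lock (sync) {
      while ((queue.Count > 0 || executing) && error == null)
        Monitor.Wait(sync); // wait until all pending asynchronous batches are flushed
      ThrowIfInErrorState();
      return batch();
    }
  }

  public void ClearErrorState()
  {
    lock (sync) error = null;
  }

  private void ThrowIfInErrorState()
  {
    if (error != null)
      throw new InvalidOperationException("A previously queued batch has failed.", error);
  }

  private void ProcessQueue()
  {
    while (true) {
      Action batch;
      lock (sync) {
        while (queue.Count == 0)
          Monitor.Wait(sync);
        batch = queue.Dequeue();
        executing = true;
      }
      try {
        batch(); // sequential, synchronous execution on the underlying connection
      }
      catch (Exception e) {
        lock (sync) { error = e; queue.Clear(); }
      }
      finally {
        lock (sync) { executing = false; Monitor.PulseAll(sync); }
      }
    }
  }
}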

OK, but how much can it speed up e.g. a bulk data import operation? My assumption is up to 2 times (of course, if CRUD sequence batching or generalized batching is implemented). A good confirmation of this is the CRUD test sequence result at ORMBattle.NET:
  • Compare the DataObjects.Net result vs SqlClient on CRUD tests. DataObjects.Net implements generalized batching, but does not implement asynchronous batch execution, and its result is nearly two times lower than the result of SqlClient. Note that the SqlClient test there is explicitly optimized to show the maximal possible performance - so it batches CRUD commands as well.
  • On the other hand, there is BLToolkit, which, although it does not provide automatic CRUD batching, offers an explicit API for this (the SqlQuery.Delete, Update and Insert methods accepting a batch size and a sequence of entities to process (IEnumerable<T>)). In the case of such an API, the ORM must do almost nothing except transforming and passing the data further (e.g. no topological sorting is necessary to determine a valid insertion order), so BLToolkit shows nearly the same result as SqlClient.
And obviously, this optimization won't have much effect on transactions doing mainly intensive reads.

Implementing this optimization using plain SQL isn't really difficult if generalized or CRUD batching is already implemented - note that it is a prerequisite here, and that's where the real problems are.

Return to the first post of this set (the introduction and TOC are there).

Plans for DataObjects.Net v4.3 and future goals

Near-time plans

Yesterday we discussed the features that must compose v4.3, and finally settled on the following set:
  • N-tier usability extensions, a sample in the Manual showing how to use DisconnectedState in a multi-tier application. Mainly, we're going to show how to deliver query results from remote domains, merge them into the current DisconnectedState, send queries remotely with DO4, send just operation logs and so on. We're going to add a set of helpers dramatically simplifying these tasks.
  • Full serialization support. Our serialization framework is currently tested mainly in "by reference" serialization mode. "By value" mode is also available, but there are almost no tests for it. So we're simply going to finish this task. Serialization is a really important feature: "by reference" mode helps to serialize query results to deliver them to a remote domain, and "by value" mode allows you to implement such operations as Cut, Copy, Paste, Save and Load with ease.
  • .NET 4.0 / VS.NET 2010 / PostSharp 2.0 - for obvious reasons. This also involves some changes in Xtensive.Core - e.g. the Tuple type will be seriously refactored (we're going to get rid of code generation there). Btw, .NET 3.5 will still be supported.
The first part of this plan (N-tier and serialization) is pretty short: I hope most of these features will be available in an intermediate (v4.2.X) release. But the second part is much more complex.

From this moment the planned release date is confidential ;) But you can track the progress. When nearly 60-70% of the work is done, we'll announce the release date.

Future goals

We postponed the following features, which were listed as candidates for inclusion in v4.3:
  • remote:// protocol. That's what must bring true N-tier support to the DO4 world. Likely one of the most anticipated features now.
  • Sync. One more alternative for N-tier applications.
  • Security extensions. Again, many people would like to see this. But since using some simplified security model (e.g. based on just users and roles - this can be implemented in a day or so) is almost always acceptable in the initial phases, we consider this less important than the two features above.
  • ActiveRecord pattern. There is a large category of customers that simply won't use the product if this isn't implemented. So possibly we'll bump this feature up.
  • Global cache with an open API, integration with Velocity and memcached. Obviously, this is one of the nicest optimizations we can implement. Btw, it's pretty simple. But since no one demands this from us right now, it is at the bottom of this list.
  • Other providers. Candidates are: MySQL, Firebird, Sybase, DB2 and our own file system provider. We're also thinking about implementing some exotic one that doesn't have good chances of being supported by other .NET ORMs, e.g. Greenplum or Berkeley DB (yes, we can - even though there is no SQL).
So these features will form a list of candidates for v4.4.

And now, the most important part:

This week it became completely clear that we must port DO4 to Silverlight: Microsoft has announced that Silverlight is becoming the primary development platform for Windows Phone 7. So now we have a very strong reason to port DO4 to Silverlight. Such features as sync, the remote:// protocol and an integrated storage provider can make DO4 absolutely unique there.

Btw, earlier we thought about Windows Mobile and .NET CF support, but there were too many cons - e.g. DO4 is a relatively large framework, while .NET CF is designed for relatively small apps; DO4 relies heavily on runtime code generation and other features that are simply unavailable there. So it was obvious it wouldn't be easy to get DO4 running there. But the case we have now is way better:
  • Silverlight offers most of the features we need. Minor differences related to security (access to public members only, etc.) won't be an issue.
  • Since the UI there is based on WPF, it's almost obvious the Windows Phone 7 hardware requirements are much higher: it will require more RAM and a faster CPU to run; I suspect 256MB of RAM will be the minimum now. So nearly 4MB of DO4 assemblies (~1Mb, if compressed) and the additional runtime expenses (it's really not a lightweight mapper) must not be an issue on mobile devices now.
So we're going to enter the market of development tools for both Silverlight and the mobile platform. I'm even thinking about starting the porting process right after the v4.3 release - but the final decision isn't made yet. In any case, we definitely won't postpone this for too long: it's better to be among the first vendors there.

Stay tuned ;)

March 17, 2010

Can ORM make your application faster? Part 4: future queries and generalized batching

This is the 4th part of a sequence of posts describing optimization techniques used by ORM tools.

Optimization techniques

3. Future queries

Future (or delayed) queries are an API allowing you to delay the execution of a particular query until the moment its result is actually necessary.
  • If there are several future queries scheduled at that moment, they're all executed together as a single batch.
  • If a regular query is about to be executed and there are scheduled future queries, they're also executed along with it (i.e. in a single batch).
So the main benefit of this optimization is, again, a reduction in the number of roundtrips to the database server, or, simply, reduced chattiness between the ORM and the DB.

This feature is pretty rare in ORM products, but e.g. NHibernate implements it. The future query API for DataObjects.Net is described in the appropriate section of its Manual (take a look if you're interested in an example with the underlying SQL).
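Just to illustrate the usage pattern, here is a rough sketch; the method names below are invented for this post (see the Manual for the actual DataObjects.Net API):

// Hypothetical future-query usage: nothing is sent to the server at this point.
var customerCount = session.ExecuteFutureScalar(() =>
  Query.All<Customer>().Count());
var bestOrders = session.ExecuteFuture(() =>
  Query.All<Order>().OrderByDescending(o => o.TotalPrice).Take(10));

// The first access to any of the results sends one batch executing both queries.
Console.WriteLine("Customers: {0}", customerCount.Value);
foreach (var order in bestOrders) // already materialized by the same batch
  Console.WriteLine(order.TotalPrice);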

Implementing future queries using plain SQL is pretty complex if you don't know the exact sequence of queries that are planned to be executed further (again, a pretty frequent case). The issues are nearly the same as with CRUD batching, but here you must additionally take care of the results as well.

4. Generalized batching

That's my favorite part: the "generalized batching" term itself is my own invention. The description is actually very simple: it is a combination of the two optimizations above (future queries + CRUD sequence batching). It's the case when the ORM is capable of combining batches from:
  • Delayed CRUD statements
  • Delayed future queries
  • And the query whose result must be provided immediately (i.e. requested by the application right now)
The goal is, again, to reduce chattiness. When this optimization is implemented, the estimated number of batches sent per transaction (or the number of roundtrips to the database server) is nearly equal to:
  • 1 for beginning the transaction (it can actually be joined with the subsequent command by the underlying provider)
  • N batches, where N is the number of queries in the transaction requiring an immediate result
  • Possibly, 1 for flushing the "tail" (the last unflushed batch)
  • 1 for committing the transaction.
Or, in short, Q + C, where Q is the number of queries in the transaction requiring an immediate result, and C is a constant. This is much better than C + (count of CRUD statements) + (count of queries) that you get in the normal case.
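For example, a transaction that creates 50 entities, runs 3 queries whose results are needed immediately and then commits normally costs about 1 + 50 + 3 + 1 = 55 roundtrips; with generalized batching the 50 CRUD statements and any delayed queries ride along with the 3 immediate queries, so the same transaction costs about 3 + C, i.e. 5-6 roundtrips.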

AFAIK, DataObjects.Net is currently the only ORM implementing this optimization. Recently I wrote a post where it was employed, but the case was trickier than it initially seemed. Anyway, the screenshot from that post showing an example of a batch containing both CRUD statements and a regular query is on the right side.

Implementing this optimization using plain SQL is hell - the picture on the right perfectly illustrates this. At the very least, you need your own single-point API passing all the queries and CRUD statements through it to make this work.

Return to the first post of this set (the introduction and TOC are there).

Migration to Mercurial is finished

Take a look at the latest changes in the DO4 repository. We've started to use branches & tags - this means we've finished our internal migration to Mercurial.

New agreements:
  • "default" branch is where all new features appear. So currently it is "pre-v4.3".
  • "v4.2.0" branch is the latest released version of DO4 (v4.2) with all the bugfixes. So to implement a bugfix, we modify this branch, and further merge the changes to "default" branch.
  • "stable" tag marks latest stable revision. We manually move it forward in the latest stable branch (currently - "v4.2.0"), if test results on build server are good.
So it must be clear how these agreements will work in future: on release of v4.3.0, "v4.3.0" branch will appear, "stable" tag will be applied to the root revision of this branch. After this moment "v4.2.0" branch will be updated only if we know there are some customers that can't migrate to v4.3.0, and they face an issue that must be solved ASAP.

Command line tips:
  • To clone (checkout) the repository, type: hg clone https://dataobjectsdotnet.googlecode.com/hg/ DataObjects.Net 
  • To pull the latest changes, type: hg pull
  • To update to the latest stable revision, type: hg up stable
  • To update to the latest revision in "v4.2.0" branch, type: hg up v4.2.0
  • To update to the latest development revision, type: hg up default
  • To update to the latest revision in your current branch, type: hg up
Another consequence of this step is that now you can push your own changes to this repository. If this idea comes to your mind some day, please contact us to get push permissions. Obviously, we'll review all such changes.

March 16, 2010

Can ORM make your application faster? Part 3: CRUD sequence batching

This is the 3rd part of a sequence of posts describing optimization techniques used by ORM tools.

Optimization techniques

2. CRUD sequence batching

This is one of the most widely adopted optimizations automatically employed by ORM tools. Instead of sending each INSERT, UPDATE or DELETE command separately, they're sent to the database server in batches (normally 20...30 commands in each batch).

Batches are either regular SQL commands, where individual statements are separated by a special separator character (normally ";"), or regular SQL command objects "glued" together with a provider-specific API.

As was mentioned, the main advantage of this optimization is that it happens completely automatically, so generally you don't have to care about it at all to get all of its benefits for free.

This optimization is currently available in most of the leading ORM products (although e.g. Entity Framework and LINQ to SQL do not implement it).

Implementing it using plain SQL is pretty complex if you don't know the exact sequence of CRUD commands that are planned to be executed further (a pretty frequent case - normally there are some "if"s), and thus you can't build such a batch from a simple template. I'll list just a few complexities here (a rough sketch follows the list):
  • You don't know the size of a batch in advance. But there are certain limits you must enforce, e.g. the query parameter count limit and the query text size limit.
  • If parameters are used in regular SQL batches, you must ensure they're uniquely named within the batch.
  • It's reasonable to send neither short nor large batches. Short batches increase chattiness, while large batches are executed much later than they could be, which is generally bad as well. So you must take care of splitting the CRUD sequence into such "average" batches.
  • Finally, you must know the moment until which you may delay the execution of the current batch. Normally the batch must be flushed if it might affect the result of an operation that is about to be executed now (e.g. a query). An ORM knows this moment precisely, because all queries go "through" it, and thus it is able to make the decision. But to achieve the same in a regular application, you must have a similar single-point query API (ideally, with an analyzer ;) ).
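To give an idea of what this splitting involves, here is a rough, deliberately simplified sketch on top of ADO.NET; the limits are illustrative assumptions, and the parameter renaming is intentionally naive (real code must handle prefix collisions and more):

using System.Collections.Generic;
using System.Data.Common;
using System.Text;

// Simplified sketch of manual CRUD batching: statements are appended to one
// command with uniquely renamed parameters until an (illustrative) limit on
// parameter count or text size is reached; then the batch is flushed.
public class CrudBatcher
{
  private const int MaxParameters = 2000;      // e.g. SQL Server allows 2100 per command
  private const int MaxTextLength = 64 * 1024; // illustrative text size limit

  private readonly DbConnection connection;
  private DbCommand command;
  private StringBuilder text = new StringBuilder();

  public CrudBatcher(DbConnection connection)
  {
    this.connection = connection;
    command = connection.CreateCommand();
  }

  public void Add(string statement, IDictionary<string, object> parameters)
  {
    if (command.Parameters.Count + parameters.Count > MaxParameters
      || text.Length + statement.Length > MaxTextLength)
      Flush();
    // Rename parameters so they stay unique within the batch, e.g. @price -> @price17.
    foreach (var pair in parameters) {
      string uniqueName = pair.Key + command.Parameters.Count;
      statement = statement.Replace(pair.Key, uniqueName);
      var parameter = command.CreateParameter();
      parameter.ParameterName = uniqueName;
      parameter.Value = pair.Value;
      command.Parameters.Add(parameter);
    }
    text.Append(statement).Append(";\n");
  }

  // Must also be called before any query that may depend on the pending changes.
  public void Flush()
  {
    if (text.Length == 0)
      return;
    command.CommandText = text.ToString();
    command.ExecuteNonQuery();
    command = connection.CreateCommand();
    text = new StringBuilder();
  }
}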
Return to the first post of this set (the introduction and TOC are there).

Can ORM make your application faster? Part 2: addressing SELECT N+1 problem

Optimization techniques

1. Addressing SELECT N+1 problem: prefetch API / projections in query language

You face the SELECT N+1 problem if you use code roughly like this (note: all the code below is based on DataObjects.Net APIs):
var bestOrders = (
  from order in Query.All<Order>()
  orderby order.TotalPrice descending
  select order)
  .Take(100);
foreach (var order in bestOrders) {
  Console.WriteLine("TotalPrice: {0}, Customer: {1}",
    order.TotalPrice, // Available
    order.Customer); // Not available, will lead to a query!
}

In the case of DataObjects.Net, there are two options allowing you to deal with it:

a) You can change the query itself to make the LINQ translator join the necessary relationship(s):
var bestOrders = (
  from order in Query.All<Order>()
  orderby order.TotalPrice descending
  select new {Order = order, Customer = order.Customer})
  .Take(100);
foreach (var item in bestOrders) {
  var order = item.Order;
  Console.WriteLine("TotalPrice: {0}, Customer: {1}",
    order.TotalPrice,
    order.Customer); // or item.Customer ;)
}

The underlying SQL query will contain an additional LEFT OUTER JOIN fetching the necessary relationship.

Note: DO4 will use a JOIN only if you fetch one-to-one relationships this way. For one-to-many relationships fetched in such projections (.Select(...)) it uses batches with future queries (described further) instead of JOINs. This is done to avoid excessive growth of the result set sent by the DB to the ORM when several relationships are fetched this way (the ORM gets ~ a Cartesian product of rows), but this might be changed in the future. Likely, we'll fetch the first one-to-many relationship using a JOIN in the future, and all the others in the same way as now.

b) Alternatively, you can use the prefetch API:
var bestOrders = (
  from order in Query.All<Order>()
  orderby order.TotalPrice descending
  select order)
  .Take(100)
  .Prefetch(order => order.Customer);
foreach (var order in bestOrders) {
  Console.WriteLine("TotalPrice: {0}, Customer: {1}",
    order.TotalPrice,
    order.Customer); // prefetched, no additional query here
}

The prefetch API operates differently in each ORM - e.g. it may rely on JOINs added to the original query, or may employ future queries and local collections in queries to do its job, but the idea behind it is to fetch the graph of objects that is expected to be processed further with a minimal count of queries (or, better, in minimal time).

A prefetch API is currently expected to be implemented in any serious ORM product.

Implementing the same using plain SQL is actually pretty normal for developers employing SQL. SQL's design requires an explicit description of what you're fetching (the only exception is SELECT *), so in most cases this is simply inevitable here.

So SQL pushes developers to explicitly optimize the code from this point of view. The cost is a tight dependency of the resulting code on the database schema - with all the consequences (more complex refactoring, etc.).

Return to the first post of this set (the introduction and TOC are there).

Can ORM make your application faster? Part 1: overview of optimization cases.

I decided to write a set of posts covering various optimization techniques used by ORM tools after writing a reply to this question at StackOverflow. I'd like to break the myth that ORM usage may (and, as many think, will) slow down your application.

In particular, I'm going to:
  • Describe the most interesting optimization techniques used by the ORM tools I know;
  • Estimate the complexity of implementing the same optimization in the "plain SQL" case.
This post provides an overview of the optimization cases ORM developers usually care about, so it defines the plan for my further writing.

Optimization cases

From the developer's point of view, these are the optimization cases he must deal with:

1. Reduce chattiness between the ORM and the DB. Low chattiness is important, since each roundtrip between the ORM and the database implies a network interaction, and thus its length varies between 0.1ms and 1ms at least, independently of query complexity. Note that maybe 90% of queries are normally fairly simple.

A particular case is the SELECT N+1 problem: if processing each row of some query result requires an additional query to be executed (so 1 + count(...) queries are executed in total), the developer must try to rewrite the code in such a way that a nearly constant count of queries is executed.

CRUD sequence batching and future queries are other examples of optimizations reducing chattiness (described below). A caching API is an extreme case of such optimizations, especially when the cache isn't safe from the point of view of transaction isolation (i.e. it is allowed to provide "stale" results).

2. Reduce the size of result sets. Obviously, getting more data than you really need is bad - the DB will spend unnecessary time on preparing it; moreover, it will be sent over the network, so such unnecessary activity will affect other running queries by "eating" a part of the available bandwidth.

APIs allowing you to limit the query result size (SELECT TOP N ... in SQL) and deliver subsequent parts of the result on demand (MARS in SQL Server 2005 and above) are intended to deal with this issue. Lazy loading is another well-known example of this optimization: fetching large items (or values - e.g. BLOBs) only on demand is perfect if you need them rarely.

3. Reduce query complexity. Usually the ORM is helpless here, so this is solely the developer's headache.

For example, an API allowing you to execute SQL commands directly is partially intended to handle this case.

4. Runtime optimizations. It's bad when the ORM itself isn't optimized well.

Any inefficiencies on the ORM or application side increase transaction length, and thus increase the database server load.

A particular example of a runtime optimization is support for compiled queries. Just imagine: translation of the simplest LINQ query like "fetch an entity by its key" in Entity Framework requires nearly 1ms. This time is nearly equal to 10 roundtrips to a local database server, i.e. to 10 similar queries, if they were compiled. So just this optimization may reduce the duration of a transaction by up to 10 times.
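For illustration, this is roughly how a compiled query looks in the LINQ to SQL / Entity Framework style, via CompiledQuery.Compile; the context and entity names below are invented, and other ORMs expose analogous APIs:

using System;
using System.Data.Linq;
using System.Linq;

static class OrderQueries
{
  // The query shape is translated once; each later call reuses the cached
  // translation instead of re-running the whole LINQ translation pipeline.
  public static readonly Func<MyDataContext, int, IQueryable<Order>> OrdersByCustomer =
    CompiledQuery.Compile((MyDataContext db, int customerId) =>
      db.Orders.Where(o => o.CustomerId == customerId));
}

// Usage: foreach (var order in OrderQueries.OrdersByCustomer(db, 42)) { ... }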

March 15, 2010

Are our developers experienced enough?

I'm absolutely sure you don't know that:



// That's our designer's reply to "Our designers have more than 8 years experience (each) in the field of web graphics." statement.

Joel about Twitter

Just read "Puppy!" post by Joel and found this great part about Twitter there:

"Although I appreciate that many people find Twitter to be valuable, I find it a truly awful way to exchange thoughts and ideas. It creates a mentally stunted world in which the most complicated thought you can think is one sentence long. It’s a cacophony of people shouting their thoughts into the abyss without listening to what anyone else is saying. Logging on gives you a page full of little hand grenades: impossible-to-understand, context-free sentences that take five minutes of research to unravel and which then turn out to be stupid, irrelevant, or pertaining to the television series Battlestar Galactica. I would write an essay describing why Twitter gives me a headache and makes me fear for the future of humanity, but it doesn’t deserve more than 140 characters of explanation, and I’ve already spent 820."

I dislike this flaming place too. And yes, although 140 characters is definitely more than a typical phrase of Ellochka the Cannibal from The Twelve Chairs (a woman with a vocabulary of only 30 words), it is far less than I'd like to have for expressing myself.

So... Twitter is good, but not for me. And I was glad to get one more confirmation that I'm not the only one who thinks so ;)

March 11, 2010

Minor website update

We've changed our web site a bit once more: the front page and the Products page were face-lifted; registration is now simplified as well.

And finally, we've added an annoying confirmation:

New sections in manual: DisconnectedState and Optimistic Concurrency Control

See:
I also added the Key and VersionInfo types to the persistent types diagram.

March 9, 2010

Offtopic for my blog: "Online web help: industry standards review"

That's my first attempt to write an article for a wide audience that touches on Help Server. If it is interesting to you, don't forget to digg it.

Link: http://digg.com/d31L0KF

March 5, 2010

Simple audit logging with DO4: code + some analysis of DO4 brains :)

I'm slowly progressing with adding more samples to the Manual project, and today I added a really nice (I think) example of DO4 usage. It's an IModule adding audit logging to the Domain.

Features involved:
  • Automatic IModule registration and typical usage pattern
  • Automatic registration of instances of open generics (there are AuditRecord<T> and SyncInfo<T>; the second one is used just to dump its actually registered instances)
  • Binding to Session events to get notified on Entity changes and transaction-related events
  • Using the IHasExtensions interface - likely, you don't know it is implemented on Domain, Session and Transaction. This sample shows when it is really helpful.
  • Gathering change info and creating audit records on transaction commit.
  • Some conditions that might be interesting to such services are tested there, e.g. whether there is a nested transaction (Transaction.IsNested), or whether a DisconnectedState is attached to the Session (Session.IsDisconnected).
  • Prefetch API usage. Look for "Prefetch" there. Btw, I use it with .Batch() in the TransactionCommitting(...) handler - to split the original sequence into relatively small parts (1000 items) and then prefetch & process each of them. If the Session uses a weak cache (now it can use a strong one as well), this helps to avoid possible cache misses after prefetch by significantly reducing the probability of GC there.
  • A new way of making DO4 use a dedicated KeyGenerator instance for the TransactionInfo.Id property - by default, DO4 uses one KeyGenerator per key type, so they are shared. Earlier an almost identical case was discussed here, and it was really ugly. Now it's much less ugly, but actually I still dislike the idea of declaring a special type for this (the alternative is to do the same by adding IoC configuration in App.config, but I finally preferred to define a named service in code using an attribute).
The test was inspired by the idea of showing how open generics registration works in DO4, and by this post. The audit logging used here is pretty simple, but I think it's good enough for a sample; if you're interested in a more complex one, try OperationCapturer - it is used by DisconnectedState and designed for exactly this purpose.

What I like in this sample:
  • To get all this stuff working in any Domain, you should just ensure the SimpleAuditModule and AuditRecord<T> types are registered there - that's it (so if they're in a single assembly, you must simply register it). Convention over configuration!
  • We implemented general-purpose logic (audit) and attached it to a Domain it knows nothing about. Unification and decoupling.
  • Doing the same with SQL would involve much more complex code - just imagine the hell with triggers on all the tables :) Some stuff, like the AuditRecord.EntityKey and AuditRecord.EntityAsString properties, is simply unimaginable there. So we use OOP features to make it simpler.
AFAIK, this is the first sample showing this kind of logic. I showed SyncInfo<T> there to indicate sync can be implemented this way. The same goes for security - just add a new assembly, and you've got it. Imagine this with any other ORM tool :)

Brief overview of file structure:
Sample output:
Automatically registered generic instances:
  AuditRecord<Animal>
  AuditRecord<AuditRecord>
  AuditRecord<Person>
  AuditRecord<TransactionInfo>
  SyncInfo<Animal>
  SyncInfo<AuditRecord>
  SyncInfo<Person>
  SyncInfo<TransactionInfo>

Audit log:

Transaction #1 (04.03.2010 22:23:47, 203ms)
  Created Cat Tom, Id: #1, Owner: none
          Current state: Cat Tom, Id: #1, Owner: Alex
  Created Dog Sharik, Id: #2, Owner: none
          Current state: Dog Sharik, Id: #2, Owner: none

Transaction #2 (04.03.2010 22:23:47, 15ms)
  Created Cat , Id: #3, Owner: none
          Current state: removed Cat, (3)

Transaction #3 (04.03.2010 22:23:47, 93ms)
  Changed Cat Musya, Id: #3, Owner: none
          Current state: removed Cat, (3)

Transaction #4 (04.03.2010 22:23:47, 31ms)
  Removed Cat, (3)
          Current state: removed Cat, (3)

Transaction #5 (04.03.2010 22:23:47, 78ms)
  Created Person Alex, Id: #4, Pets: 
          Current state: Person Alex, Id: #4, Pets: Tom

Transaction #6 (04.03.2010 22:23:47, 31ms)
  Changed Cat Tom, Id: #1, Owner: Alex
          Current state: Cat Tom, Id: #1, Owner: Alex
  Changed Person Alex, Id: #4, Pets: Tom
          Current state: Person Alex, Id: #4, Pets: Tom

Note: transaction durations are significantly exaggerated because of JITting here (the test doesn't repeat). To be sure, I tried repeating a part of it in a loop - 90% of transactions showed 0ms duration.

A bit of analysis

Here is a SQL Profiler screenshot showing the batch executed during the last updating transaction:


Can you find out why this batch was "flushed"?

Batch "flush" happens when someone (query) needs a result right now. If you take a look at red SQL there, you'll find the last statement there (SELECT) looks like EntitySet loading. Acutally, the batch was executed because Person.ToString() method invoked during building one of AuditRecords started to enumerate Person.Pets (EntitySet), which was not yet loaded...

"Wait, this is impossible - just before this moment we added an item to this collection (by setting paired association), so it must already be loaded!" - that's what come to my mind first :)

But after thinking a bit more I discovered this behavior is correct: we have one more optimization allowing an EntitySet not to load its state (actually, it isn't simply state loading anyway), if it is precisely known that the item we're adding can be added to it without any check (the simplest case: the item is a newly created entity). In this case it was precisely known, because the Person and Animal state was already fetched (the two preceding SELECT queries in the log did this).

So adding an item didn't lead to EntitySet loading; this happened only on the subsequent Person.ToString() call executed by the audit logger.

Isn't DO4 a smart guy? ;)

You should also note that .Prefetch() doesn't lead to any queries if everything it needs is already loaded. So in this case all my stuff with .Batch() / .Prefetch() did nothing (actually, it did - it performed all the checks). But if the transaction were long, maybe it could help ;) (actually, a very rare case, so it's better to remove it from TransactionCommitting(...) or use it only if the transaction's duration is relatively long ;) ).

Ok, I'm going to get some sleep now - hopefully, the article was helpful. Questions, etc. are welcome, but I'll be able to reply only in 6-8 hours :)

Have a good night / day!

March 4, 2010

Default logging mode in DO4 is changed

Until today's upcoming v4.2 update, DO4 was logging everything to Debug output (Debug -> Windows -> Debug output in VS.NET) if a debugger was attached to the process on its startup. I just disabled this - we've got some complaints that this affects performance + floods the Debug output.

So now logging is off by default. You must modify your App.config (or Web.config) to enable it:
<configuration>
  <configSections>
    <section name="Xtensive.Core.Diagnostics" type="Xtensive.Core.Diagnostics.Configuration.ConfigurationSection, Xtensive.Core" />
  </configSections>
  <Xtensive.Core.Diagnostics>
    <logs>
      <!-- Logging all Storage events to console, comprehensive format: -->
      <log name="Storage" provider="Console" format="Comprehensive" />
      <!-- Logging only some events from Storage to file, simple format:
      <log name="Storage" events="Warning,Error,FatalError" 
        provider="File" fileName="Storage.log" format="Simple" />
      -->
      <!-- Logging only some events from Storage to Debug output:
      <log name="Storage" events="Warning,Error,FatalError"
        provider="Debug" format="Simple" /> 
      -->
    </logs>
  </Xtensive.Core.Diagnostics>
</configuration>

The most comprehensive logging to Debug output can be turned on by this configuration:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="Xtensive.Core.Diagnostics" type="Xtensive.Core.Diagnostics.Configuration.ConfigurationSection, Xtensive.Core" />
  </configSections>
  <Xtensive.Core.Diagnostics>
    <logs>
      <log provider="Debug" format="Comprehensive" />
    </logs>
  </Xtensive.Core.Diagnostics>
</configuration>

Hopefully, this will resolve the issue. Refer to the description of the logging framework for further details.

P.S. DO v4.2 installers will be updated shortly.

Thoughts: what must be done ASAP to improve v4.2?

A few days have passed since the v4.2 release; so far I have made the following conclusions:

DisconnectedState needs a manual + a more advanced WPF sample. Probably, we must also add a few features to it to make it more usable (based on the advanced sample).

"Really good sample" means:
  • It must be an MDI application working with multiple Sessions and DisconnectedStates.
  • It must behave as expected from a typical business application.
  • It must browse large result sets utilizing LINQ in DO4 for paging / sorting / filtering.
  • The sample must be usable as a template for similar applications. I.e. we must not trade the practical usability of the code we use there for immediate clarity (= a certain amount of code duplication, etc.). So e.g. if necessary, we must add helper types for routine tasks.
  • Ideally, we must show the WPF client <-> middle-tier server case.
  • Possibly, we must use some well-known third-party grid with a rich feature set.
The new Legacy mode needs a manual + extensions allowing it to be used in a wider set of cases.

Currently I have in mind the following extensions we may implement:
  • Views support. Must be pretty easy - AFAIK we can already extract their structure info (columns + types), so we must ~ add a "read only" flag to some of the objects in the domain model, extend attributes & support this in the builder.
  • A few more upgrade modes / options: no validation at all, model caching (the Domain must start much faster in case there are 500+ types).
  • Ideally, we must add support for the ActiveRecord pattern (T4 templates generating our classes from the extracted database schema).
  • Possibly: support for DO v3.9 type identifier extraction / mapping to v3.9 schemas. Ideally with T4, but even a full description of the migration process would be a big plus.
What do you think - is this reasonable? Any comments \ ideas are welcome. The WPF case is the #1 priority for me now; the other issues will be handled by Denis Krjuchkov and Alexey Kochetov.

March 3, 2010

Link: "Volatile keyword in C# – memory model explained"

I just read this really good article explaining the .NET memory model. Highly recommended if you're dealing with multi-threaded code on .NET (which is pretty common now). I've seen a set of articles related to the .NET memory model before, but this one seems the clearest.

March 2, 2010

DataObjects.Net v4.2 is released!

DataObjects.Net v4.2 is our first official release since v4.0.5 (August 2009) - a bit more than 6 months have passed since that moment, and as you will find, we didn't waste the time during this period:

What's new
We've made a lot of other features and changes, but I hope the most important ones are listed above. If you're interested, see this and that lists - they contain all the issues we've implemented since the v4.0.5 release (198 in total).

Known issues

Installer
  • The Install.bat \ Uninstall.bat files must be executed with Administrator permissions on Windows Vista and above. Of course, this is important only if you use them. The standard installer already does this (you'll be asked to grant it Administrator permissions by a UAC prompt on its start).
Framework itself
  • Oracle support for 10g is temporarily broken in the current version, so you can use it only with Oracle 9i and 11g.
Manual
  • A set of sections must be added, including "Disconnected operations", "Full-text search", etc.
Reference
Sandbox projects
  • The OrderAccounting project is buggy. Frankly speaking, we were unable to fix all the issues there during the last few days. The cause is the DisconnectedState key remapping behavior on ApplyChanges, which is relatively new. All the issues there will be fixed shortly; for now we recommend you study the WPF sample - it fully relies on DisconnectedState as well now.
Nearest plans
  • Improve the Manual and samples further, releasing minor updates 1-2 times per month.
  • Migrate to .NET 4.0 and PostSharp 2.0. Or, more precisely, provide a .NET 4.0 version - so far we aren't going to fully abandon .NET 3.5.