September 30, 2009

Upcoming changes: local collections in LINQ queries

I'm continuing to demonstrate the magic we're working on now. Local collections are actually query parameters of IEnumerable<T> type.

Let's take a look at their usage in, likely, the simplest case:
var query = 
  from o in Query<Order>.All
  where
    (new {"Alex", "Alexey", "Dmitry"})
    .Contains(o.Customer.FirstName)
  select o;
As you suspect, such a query will be translated to SQL with IN operator.

But what about this one:
var query = 
  from o in Query<Order>.All
  where
    (from n in Enumerable.Range(1, 100000)
    select n.ToString())
    .Contains(o.Customer.FirstName)
  select o;
What SQL would you expect to see behind the scenes in this case? Will it work at all?

Ok, now I'm going to astonish you completely:
var bestOrdersQuery = 
  from c in Query<Order>.All
  let price = o.Details.Sum(o => o.UnitPrice * o.Quantity)
  order by -totalPrice
  select new {
    Order = o, 
    Price = price
  };

best10KOrders = bestOrdersQuery.Take(10000).ToList();

var query = 
  from bo in Query.Store(best10KOrders) // Note that this is List<anonymoustype>!
  from c in Query<Customer>.All
  where c.Orders.Contains(bo.Order)
  select new {Customer = c, Price = bo.Price} into pair
  group pair by pair.Customer into g
  select new { Customer = g.Key, PaidInBest10KOrders = g.Sum(i => i.Price) }
This query will work as well.

All I wrote implies our query engine is able to:
  • Store almost any collection into a temporary table before query execution
  • Maintain mapping for its items: you can use any properties of items in such collections inside LINQ queries
  • Use IN in SQL instead of temporary table when this is possible. This significantly depends on provider. E.g. SQL Server does not support tuples in IN, but PostgreSQL does, so in case with SQL Server we'll be able to represent only collections of primitive types or keys by this way. There are other limitations we're going to consider, e.g. maximal parameter count and maximal comparison operations count per query.
Consequently, it is possible to use items of such collections as any other type supported by our LINQ translator. E.g. you can join such a collection, group it by some property, calculate an aggregate and so on. The only exception is trying to get  types we can't materialize in final select clause:
var bestOrdersQuery = 
  from c in Query<Order>.All
  let price = o.Details.Sum(o => o.UnitPrice * o.Quantity)
  order by -totalPrice
  select new {
    Order = o, 
    Price = price
  };

best10KPairs = 
  bestOrdersQuery.Take(10000).ToList()
  .Select(i => new Pair<Order, double>(i.Order, i.Price).ToList();

var query = 
  from p in Query.Store(best10KPairs) // Note that we store Pair<Order, double> here!
  select p; // Will not work, since actually we don't know how to materialize Pair<Order, double>
In this case we simply don't know how to materialize Pair&;ltOrder, double>, because we never seen its construction. On the other hand, this query will work:
var query = 
  from p in Query.Store(best10KPairs) // Note that we store Pair<Order, double> here!
  select new Pair<Order, double>(p.First, p.Second);
Why I think such complex implementation of this feature is really necessary and attractive? 
  • Think about upcoming integration with full-text search engines. Results they return can be very large, and ideally we must be able to pass them to further processing on RDBMS in any reasonable case. I'm not sure if you know this or not, but v3.9 was able to process up to 1K results returned by Lucene.Net in case it was used per each query.
  • Prefetch is one more nice application of this feature, although IN optimization is more desirable here than a version with temporary table.
  • Finally, if there will be executable DML queries some day, this feature might help a lot here as well.
Final remarks:
  • Likely, initially there will be some minor lacks. E.g. IN optimization might not work in v4.1. I simply not sure if we'll be able to complete this.
  • But we'll resolve them in v4.1.1 ;)