SQL Server 3种联接操作

表联接(Table JOIN) 是结构化查询语句(SQL)中应用非常广泛的操作。SQL Server 引入了3种联接操作(英文介绍来自微软官方网站:http://msdn.microsoft.com/en-us/library/ms191426%28v=SQL.100%29.aspx):

1. Nested loops joins

If one join input is small (fewer than 10 rows) and the other join input is fairly large and indexed on its join columns, an index nested loops join is the fastest join operation because they require the least I/O and the fewest comparisons.

The nested loops join, also called nested iteration, uses one join input as the outer input table (shown as the top input in the graphical execution plan) and one as the inner (bottom) input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table.

In the simplest case, the search scans an entire table or index; this is called anaive nested loops join. If the search exploits an index, it is called anindex nested loops join. If the index is built as part of the query plan (and destroyed upon completion of the query), it is called atemporary index nested loops join. All these variants are considered by the query optimizer.

A nested loops join is particularly effective if the outer input is small and the inner input is preindexed and large. In many small transactions, such as those affecting only a small set of rows, index nested loops joins are superior to both merge joins and hash joins. In large queries, however, nested loops joins are often not the optimal choice.

两个联接表,其中一个数据量小于10条,另外一个表数据量很大,且联接字段进行了索引,这种情况下,嵌套迭代是效率非常高的联接方式。

2. Merge joins

If both join inputs are large and the two inputs are of similar sizes, a merge join with prior sorting and a hash join offer similar performance. However, hash join operations are often much faster if the two input sizes differ significantly from each other.

The merge join requires both inputs to be sorted on the merge columns, which are defined by the equality (ON) clauses of the join predicate. The query optimizer typically scans an index, if one exists on the proper set of columns, or it places a sort operator below the merge join. In rare cases, there may be multiple equality clauses, but the merge columns are taken from only some of the available equality clauses.

Because each input is sorted, the Merge Joinoperator gets a row from each input and compares them. For example, for inner join operations, the rows are returned if they are equal. If they are not equal, the lower-value row is discarded and another row is obtained from that input. This process repeats until all rows have been processed.

The merge join operation may be either a regular or a many-to-many operation. A many-to-many merge join uses a temporary table to store rows. If there are duplicate values from each input, one of the inputs will have to rewind to the start of the duplicates as each duplicate from the other input is processed.

If a residual predicate is present, all rows that satisfy the merge predicate evaluate the residual predicate, and only those rows that satisfy it are returned.

Merge join itself is very fast, but it can be an expensive choice if sort operations are required. However, if the data volume is large and the desired data can be obtained presorted from existing B-tree indexes, merge join is often the fastest available join algorithm.

合并联接需要联接的字段建有索引,其本身的效率非常高,但是排序的代价却非常高。

3. Hash joins

Hash joins can efficiently process large, unsorted, nonindexed inputs. They are useful for intermediate results in complex queries because:

  • Intermediate results are not indexed (unless explicitly saved to disk and then indexed) and often are not suitably sorted for the next operation in the query plan.
  • Query optimizers estimate only intermediate result sizes. Because estimates can be very inaccurate for complex queries, algorithms to process intermediate results not only must be efficient, but also must degrade gracefully if an intermediate result turns out to be much larger than anticipated.

The hash join allows reductions in the use of denormalization. Denormalization is typically used to achieve better performance by reducing join operations, in spite of the dangers of redundancy, such as inconsistent updates. Hash joins reduce the need to denormalize. Hash joins allow vertical partitioning (representing groups of columns from a single table in separate files or indexes) to become a viable option for physical database design.

The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.

Hash joins are used for many types of set-matching operations: inner join; left, right, and full outer join; left and right semi-join; intersection; union; and difference. Moreover, a variant of the hash join can do duplicate removal and grouping, such as SUM(salary) GROUP BY department. These modifications use only one input for both the build and probe roles.

哈西联接具有最大适应能力,当两个表大小一致时,可以达到合并联接的效率。

基于SmartERP表结构和数据(EIIS_Product: 5万条记录,EIIS_Category: 10,EIISPara_ProductUnit: 4,EIIS_WarehouseProductList: 7),进行了一些有益测试:

1. 一般思路,执行时间 586 毫秒:SELECT * FROM EIIS_WarehouseProductList AS WP INNER JOIN EIIS_Product AS P ON WP.ProductID=P.ProductID AND WP.WarehouseID=1 INNER JOIN EIIS_Category AS C ON P.CategoryID=C.CategoryID LEFT OUTER JOIN EIISPara_ProductUnit AS PU ON P.Unit=PU.Unit WHERE P.Control>=0 AND C.Control=0 AND (C.Kind BETWEEN 2 AND 5 OR C.Kind BETWEEN 10 AND 29);

2. 因为 EIIS_WarehouseProductList 现在的数据非常少,修改与产品表的联接类型,对应执行时间为:LOOP = 0, MERGE = 33, HASH = 46。但在实际项目过程中,库存产品的数目会增长,如果10%的产品有库存数据,显示20条查询结果时间为:LOOP = 90, MERGE = 56, HASH = 106,一般思路时间接近0。如果100%的产品都有库存记录,显示20条查询结果时间为: LOOP = 3610, MERGE = 516, HASH = 620,一般思路时间20。

3. 借鉴这些结论,改写如下:SELECT * FROM EIIS_WarehouseProductList AS WP INNER MERGE JOINEIIS_Product AS P ON WP.ProductID=P.ProductID AND WP.WarehouseID=1 INNER JOIN EIIS_Category AS C ON P.CategoryID=C.CategoryID LEFT OUTER LOOP JOINEIISPara_ProductUnit AS PU ON P.Unit=PU.Unit WHERE P.Control>=0 AND C.Control=0 AND (C.Kind BETWEEN 2 AND 5 OR C.Kind BETWEEN 10 AND 29);

经过深入测试,这些结论仍需要在实际项目中变化测试,以实际运行结果为准,以便找到最佳方案,切记切记!