The preceding steps introduced clustered and nonclustered indexes, highlighting the following aspects of each:
- There is always one entry in the index for each row in the table (we noted that an exception to this rule will be covered in a higher step). These entries are always in index key sequence.
- In a clustered index, the index entry is the actual row of the table.
- In a nonclustered index, the entry is separate from the data row; and consists of the index key columns and a bookmark value to map the index key columns to an actual row of the table.
The last half of the previous sentence is correct but incomplete. In this Step we examine the option to include additional columns to a non-clustered index, called included columns. In Step 6, which examines Bookmark operations, we will see that SQL Server might unilaterally add some columns to your index.
Included Columns
Columns that are in a nonclustered index, but are not part of the index key, are called included columns. These columns are not part of the key, and so do not impact the sequence of entries in the index. Also, as we will see, they cause less overhead than key columns.
When creating a nonclustered index, we specify the included columns separately from the key columns; as shown in Listing 5.1.
CREATE NONCLUSTERED INDEX FK_ProductID_ ModifiedDate
ON Sales.SalesOrderDetail (ProductID, ModifiedDate)
INCLUDE (OrderQty, UnitPrice, LineTotal)
Listing 5.1: Creating a nonclustered index with included columns
In this example, ProductID and ModifiedDate are the index key columns, and OrderQty,UnitPrice and LineTotal are the included columns.
Had we not specified the INCLUDE clause in the above SQL statement, the resulting index would have looked like this:
ProductID ModifiedDate Bookmark
Page n:
707 2004/07/25 =>
707 2004/07/26 =>
707 2004/07/26 =>
707 2004/07/26 =>
707 2004/07/27 =>
707 2004/07/27 =>
707 2004/07/27 =>
707 2004/07/28 =>
707 2004/07/28 =>
707 2004/07/28 =>
707 2004/07/28 =>
707 2004/07/28 =>
707 2004/07/28 =>
Page n+1:
707 2004/07/29 =>
707 2004/07/31 =>
707 2004/07/31 =>
707 2004/07/31 =>
708 2001/07/01 =>
708 2001/07/01 =>
708 2001/07/01 =>
708 2001/07/01 =>
708 2001/07/01 =>
708 2001/07/01 =>
708 2001/07/01 =>
708 2001/07/01 =>
708 2001/07/01 =>
708 2001/07/01 =>
However, having told SQL Server to include the OrderQty, UnitPrice and LineTotal columns, the index looks like this:
:- Search Key Columns -: :--- Included Columns ---: : Bookmark :
ProductID ModifiedDate OrderQty UnitPrice LineTotal
Page n-1:
707 2004/07/29 1 34.99 34.99 =>
707 2004/07/31 1 34.99 34.99 =>
707 2004/07/31 3 34.99 104.97 =>
707 2004/07/31 1 34.99 34.99 =>
708 2001/07/01 5 20.19 100.95 =>
Page n:
708 2001/07/01 1 20.19 20.19 =>
708 2001/07/01 1 20.19 20.19 =>
708 2001/07/01 2 20.19 40.38 =>
708 2001/07/01 1 20.19 20.19 =>
708 2001/07/01 2 20.19 40.38 =>
708 2001/12/01 7 20.19 141.33 =>
708 2001/12/01 1 20.19 20.19 =>
708 2002/01/01 1 20.19 20.19 =>
708 2002/01/01 1 20.19 20.19 =>
708 2002/01/01 1 20.19 20.19 =>
Page n+1:
708 2002/01/01 2 20.19 40.38 =>
708 2002/01/01 5 20.19 100.95 =>
708 2002/02/01 1 20.19 20.19 =>
708 2002/02/01 1 20.19 20.19 =>
708 2002/02/01 2 20.19 40.38 =>
Examining the contents of this index shown, it is apparent that the rows are ordered by the index key columns. The five rows for product 708 with a modified date of January 1, 2002 (highlighted in bold), for example, are contiguous in the index, as are the rows every other ProductID/ModifiedDate combination.
You might ask “Why even have included columns? Why not simply add OrderQty,UnitPrice and LineTotal to the index key?” There are several advantages in having these columns in the index but not in the index key, such as:
- Columns that are not part of the index key do not affect the location of the entry within the index. This, in turn, reduces the overhead of having them in the index. For instance, if the ProductID or ModifiedDate value in the row is modified, then that row’s entry must be relocated within the index. But, if the UnitPricevalue in the row is modified, the index entry still needs to be updated, but it does not need to be moved.
- The effort required to locate an entry(s) in the index is less.
- The size of the index will be slightly smaller.
- The data distribution statistics for the index will be easier to maintain.
Most of these advantages will be more meaningful in higher Steps, when we look at the internal structures of indexes and at some additional information that is maintained by SQL Server for optimizing query performance.
Deciding whether an index column is part of the index key, or just an included column, is not the most important indexing decision you will ever make. That said, columns that frequently appear in the SELECT list but not in the WHERE clause of a query are best placed in the included columns portion of the index.
On Becoming a Covering Index
In Step 4, we expressed agreement with the designers of the AdventureWorksdatabase regarding their decision to make SalesOrderID/SalesOrderDetailID the clustered index of the SalesOrderDetail table. Most queries against this table will request data ordered or grouped by sales order number. However, a number of queries, perhaps from the warehouse staff, will need information in product sequence. These are the queries that will benefit from the index shown in Listing 5.1.
To illustrate the potential benefit of having the included columns in that index, we will look at two queries against the SalesOrderDetailtable, each of which we will execute three times, as follows:
- Run 1: No nonclustered index
- Run 2: Using a nonclustered index that contains no included columns (only the two key columns)
- Run 3: Using the nonclustered index as defined in Listing 5.1
As we have done in previous Steps, we again use number of reads as the primary metric, but we also use SQL Server Management Studio’s “Display Actual Execution Plan” option to view the plan for each execution. This will give us an added metric: the percentage of the workload that was spent on non-read activity, such as matching up related data after they have been read into memory. This gives us a better understanding of the total cost of the query.
Testing the First Query: Activity totals by product
Our first query, shown in Listing 5.2, is one that provides activity totals by date for a specific product.
SELECT ProductID ,
ModifiedDate ,
SUM(OrderQty) AS 'No of Items' ,
AVG(UnitPrice) 'Avg Price' ,
SUM(LineTotal) 'Total Value'
FROM Sales.SalesOrderDetail
WHERE ProductID = 888
GROUP BY ProductID ,
ModifiedDate ;
Listing 5.2: The "Activity totals by product" query
Since indexes can impact the performance of a query, but not the results; executing this query against the three different indexing schemes always yields the following row set:
ProductID ModifiedDate No of Rows Avg Price Total Value
----------- ------------ ----------- -----------------------------
888 2003-07-01 16 602.346 9637.536000
888 2003-08-01 13 602.346 7830.498000
888 2003-09-01 19 602.346 11444.574000
888 2003-10-01 2 602.346 1204.692000
888 2003-11-01 17 602.346 10239.882000
888 2003-12-01 4 602.346 2409.384000
888 2004-05-01 10 602.346 6023.460000
888 2004-06-01 2 602.346 1204.692000
The eight rows of output are aggregated from the thirty nine ‘ProductID = 888’ rows in the table to give one output row for each date that had one-or-more ‘ProductID = 888’ sales.The basic scheme for conducting our test is shown in Listing 5.3. Before you run any queries, make sure you run SET STATISTICS IO ON.
IF EXISTS ( SELECT 1
FROM sys.indexes
WHERE name = 'FK_ProductID_ModifiedDate'
AND OBJECT_ID = OBJECT_ID('Sales.SalesOrderDetail') )
DROP INDEX Sales.SalesOrderDetail.FK_ProductID_ModifiedDate ;
GO
--RUN 1: Execute Listing 5.2 here (no non-clustered index)
CREATE NONCLUSTERED INDEX FK_ProductID_ModifiedDate
ON Sales.SalesOrderDetail (ProductID, ModifiedDate) ;
--RUN 2: Re-execute Listing 5.2 here (non-clustered index with no include)
IF EXISTS ( SELECT 1
FROM sys.indexes
WHERE name = 'FK_ProductID_ModifiedDate'
AND OBJECT_ID = OBJECT_ID('Sales.SalesOrderDetail') )
DROP INDEX Sales.SalesOrderDetail.FK_ProductID_ModifiedDate ;
GO
CREATE NONCLUSTERED INDEX FK_ProductID_ModifiedDate
ON Sales.SalesOrderDetail (ProductID, ModifiedDate)
INCLUDE (OrderQty, UnitPrice, LineTotal) ;
--RUN 3: Re-execute Listing 5.2 here (non-clustered index with include)
Listing 5.3: Testing the "Activity totals by product" query
The relative effort required to execute the query against each indexing scheme is shown in Table 5.1.
Run 1: No Nonclustered Index | Table 'SalesOrderDetail'. Scan count 1, logical reads 1238. Non read activity: 8%. |
Run 2: Index – No Included Columns | Table 'SalesOrderDetail'. Scan count 1, logical reads 131. Non read activity: 0%. |
Run 3: With Included Columns | Table 'SalesOrderDetail'. Scan count 1, logical reads 3. Non read activity: 1%. |
Table 5.1: Results of running the first query three times with different nonclustered indexes available
As you can see from these results:
- Run 1 required a complete scan of the SalesOrderDetail table; every row had to be read and examined to determine if it should participate in the result or not.
- Run 2 used the nonclustered index to quickly find the bookmarks for just the 39 requested rows, but it had to retrieve each of those rows individually from the table.
- Run 3 found everything that it needed in the nonclustered index, and in the most advantageous sequence – ModifiedDate withinProductID. It jumped quickly to the first requested entry, read 39 consecutive entries, doing the aggregate calculations for each entry as it was read, and was done.
Testing the Second Query: Activity totals based on Date
Our second query is identical to the first, except for a change in the WHERE clause. This time the warehouse is requesting information based on date, rather than on product. We must filter on the right-most search key column, ModifiedDate; rather than the left-most column,ProductID. The new query is shown in Listing 5.4.
SELECT ModifiedDate ,
ProductID ,
SUM(OrderQty) 'No of Items' ,
AVG(UnitPrice) 'Avg Price' ,
SUM(LineTotal) 'Total Value'
FROM Sales.SalesOrderDetail
WHERE ModifiedDate = '2003-10-01'
GROUP BY ModifiedDate ,
ProductID ;
Listing 5.4: The "Activity totals by date" query
The resulting row set, in part, is:
ProductID ModifiedDate No of Items Avg Price Total Value
----------- ------------ ----------- --------------------- ----------------
:
:
782 2003-10-01 62 1430.9937 86291.624000
783 2003-10-01 72 1427.9937 100061.564000
784 2003-10-01 52 1376.994 71603.688000
792 2003-10-01 12 1466.01 17592.120000
793 2003-10-01 46 1466.01 67436.460000
794 2003-10-01 37 1466.01 54242.370000
795 2003-10-01 22 1466.01 32252.220000
:
:
(164 row(s) affected)
The WHERE clause filtered the table down to 1492 qualifying rows; which, when grouped, produced the 164 rows of output.
To run the tests, follow the same scheme as described in Listing 5.3, but using the new query from Listing 5.4. The results are The relative effort required to execute the query against each indexing scheme is shown in Table 5.2.
Run 1: No Nonclustered Index | Table 'SalesOrderDetail'. Scan count 1, logical reads 1238. Non read activity: 10%. |
Run 2: With Index – No Included Columns | Table 'SalesOrderDetail'. Scan count 1, logical reads 1238. Non read activity: 10%. |
Run 3: With Included Columns | Table 'SalesOrderDetail'. Scan count 1, logical reads 761. Non read activity: 8%. |
Table 2: Results of running the second query three times with different nonclustered indexes available
Both the first and second test resulted in the same plan; a complete scan of theSalesOrderDetail table. For reasons that were covered in detail in Step 4, the WHERE clause was not sufficiently selective to benefit from a non-covering index. Also, the rows comprising any one group are scattered throughout the table. As the table was being read, each row had to be matched to its group; and operation that consumes processor time and memory.
The third test found everything that it needed in the nonclustered index; but, unlike the previous query, it did not find the rows located contiguously within the index. he rows that comprise each individual group are contiguous within the index; but the groups themselves are scattered over the length of the index. Therefore, SQL Server scanned the index.
Scanning the index instead of the table had two advantages:
- The index is smaller than the table, requiring less reads.
- The rows were already grouped, requiring less non read activity.
Conclusion
Included columns enable nonclustered indexes to become covering indexes for a variety of queries, improving the performance of those queries; sometimes quite dramatically. Included columns increase the size of an index, but add little else in terms of overhead. Any time you are creating a nonclustered index, especially on a foreign key column, ask yourself – “What additional columns should I include in this index?”
No comments:
Post a Comment