menu

Wednesday, June 1, 2011

Indexing Strategies

Introduction

When we run a query, the choice of whether to use an index or not is decided by the query optimizer. Using appropriate indexes on tables so that the query optimizer chooses the most efficient strategy can yield significant performance gains . We will try to examine a few scenarios to familiarize ourselves with the way indexes are used to retrieve data and also try to figure out which index clustered or non clustered is good on each occasion.
Let us create a sample table so that we can use it for our purpose.
CREATE TABLE MEMBERS
( member_number   INT NOT NULL,
  account_number   INT NOT NULL, 
  member_SSN      INT  NOT NULL,
  account_balance  MONEY NOT NULL,
  member_name     VARCHAR(30) NOT NULL,
  member_address  VARCHAR(60) NOT NULL )
We will use a number of scenarios here.

Retrieving a single row from the members table

Select account_balance from members where member_number = 9000
Let us say there is a clustered index on the member_number column.
SQL Server will first obtain the page number of the root page of the index from the sysindexes table. Then it will traverse through the index pages looking for a key value that is not greater than the current key value. In a clustered index the index consists of index key and a pointer which is the Page ID. So just one level above the leaf level it gets the page pointer to the data page holding the particular row and it scans the data page for the row containing the key.
Let us say there is a non clustered index on the member_number column.
In case of a non clustered index too the traversal works in a similar manner, the pointer to the key value we want to retrieve is found in the leaf level which is the Row ID of the data row. Using the Row ID sql server pulls the data row.
The difference here is, the non clustered index has one more logical read. The performance difference is very less between the two indexes here.

Retrieving a range of rows

Select account_balance from members where member_number between 9000 and 10000
In case of a clustered index on member_number column sql server will look again for the highest key value not greater than the lowest key we wish to retrieve. It will traverse through the index and will find the data page that contains the lowest key value which is 9000 in our case. It scans the data page for 9000 and when it finds it, it then retrieves the rows sequentially until it reaches 10000, as we know in a clustered index the data rows are in arranged in the same order as the clustered index key.
In case of a non clustered index on member_number column the traversal of the index happens in a similar way. But once the leaf level is reached and the key value is found, the index entry contains the Row ID of the data row to be found which in this case will be the Row ID for member_number 9000. Using this Row ID, sql server will pull the data page holding that row. In a similar way it has to pull all the other rows for member_numbers between 9001 and 10000. We have to note a difference here. The key values are in sequence in a non clustered index, but the data pages are not. Let us say we have 50 rows satisfying the query and approximately 10 rows are present in each data page. So sql server has to retrieve the 50 rows in 50/10 = 5 logical reads to the data pages using a clustered index and sql server will need 50 logical reads to the data pages using non clustered index. This is a significant performance difference.
So it is always better to create a clustered index when we are accessing a range of rows.

Covered queries

Select account_balance from members where member_number between 9000 and 10000
Same query as before , but this time let us say that the members table has a composite non clustered index on member_number and account_balance columns. The query is called a covered query since the items in the where clause and select clause belong to the index. The query optimizer will realize that this is a covered query. In this case the sql server does not have to go to data level to work on the query. It can find the results in the leaf level of the non clustered index. It traverses the index based on the member_number and when it reaches the leaf level , it finds the account_balance information next to it in the key itself with out having to access the data page through the pointer.
Let us say there are 500 rows that satisfy this query and there are 100 index entries per data page. So to get the results sql server has to do 5 logical reads of data pages. This is more efficient than having a clustered index on the member_number column since we do not have to access data pages at all.
So sometimes a covered non clustered index is more efficient than an equivalent clustered index.

Retrieving rows with clustered index and non clustered index on the table

Select account_balance from members where member_number between 9000 and 10000
Here we will examine a different scenario where there is a non clustered index on the member_number column and clustered index on the account_number column. The main thing to note here is that the members table will now has leaf level index entries containing the clustered index key as a pointer instead of Row ID when there is no clustered index on the table. So in order to access the data rows, sql server has to traverse through the clustered index. In our example in order access row with member_number 9000, sql server will traverse through the non clustered index and in the leaf level it will find the index entry for 9000 having a clustered index key which will be the account_number. Based on this account_number it traverses through the clustered index and finds the data row. Suppose we have 100 rows to fetch, we have to access the clustered index 100 times to fetch the data pages containing the rows.
This is very inefficient and probably the query optimizer will choose a table scan instead of an index scan.
In the same scenario let us change the query and execute the following query.
Select account_number from members where member_number between 9000 and 10000
So sql server will traverse through the non clustered index as before until the leaf level where it finds the clustered index key which is the account_number in this case. So sql server need not go fetch the data pages traversing through the clustered index again, since it found the member_number and account_number side by side in its index entries.
So the query optimizer will choose the index scan since it is very efficient in this case.

Retrieving rows with multiple non clustered indexes on the table

Select * from members where account_balance between 500 and 5000 and member_number between 9000 and 9500
Let us say there is a non clustered index on account_balance column and another non clustered index on the member_number column with out any clustered index on the table. In this case the query optimizer can use both the indexes. Since there is no clustered index,  the leaf level of the indexes has Row IDs as pointers to the rows. So the query optimizer examines both indexes and gets the sets of Row IDs that match the given selection.Then it gets the intersection of both the sets of Row IDs. So instead of pulling the individual data pages, the query optimizer first gets a result set of all the RowIDs that match both the criteria. Based on how many rows satisfy the condition it estimates whether to make use of the indexes or to do a table scan to retrieve the rows. Let us say there are 100 rows that satisfy this query and there are 10,000 rows spread over 500 data pages (20 rows per page) in the table, it may be efficient to use the index scan and do 100 logical reads to get the data pages than scanning all the 500 data pages.

Conclusion

The decision of the query optimizer depends on the cost it would incur or in other words how many logical or physical reads it has to do to retrieve the result of a query. Hopefully these different scenarios will help a little bit in better understanding the usage of indexes by query optimizer.
Any comments are welcome.

References

Microsoft SQL Server 2000 Performance Optimization and Tuning Hand Book by Ken England.

No comments:

Post a Comment