Indexes and Fragmentation in SQL Server 2000 Part 2
This is the second article in this series. Please read
Part 1 if you intend to carry out the investigation for yourself.
So what have we covered so far? Where are we at and what are we going to talk about next?
You should be at the point where you have the database we created in part 1 and have added lots and lots of random rows to it. I have been using 500k+ rows for my investigation but I think something around 100k will suffice. It's useful to have more because it makes it easier to identify performance issues (good and bad) when we are working with our database, however it does mean that you are waiting around more when running queries that must scan all the data.
In the last article we looked at SQL's basic storage unit, the page, and how data is stored in them. We are now going to expand on this and look deeper into some of the different types of pages and how SQL allocates data to them. Please remember that this series is more of an academic investigation rather than a quick fix or recommendation on how best to manage index fragmentation. You might find some of the information irrelevant but I hope it will prove useful for your development and understanding of this topic. So moving on...
SQL has a number of different types of pages which it uses to store system information and user data. The actual size of the pages will always remain the same however the structure is different as each is used in different ways.
SQL actually uses pages to store everything it needs to manage your database. Other Windows applications might use config files, registry settings, etc. to function but remember you may have to move your database to another server one day so all the configuration information, as well as the data, needs to reside within the pages of the database. Saying that SQL does use the other areas to store environmental information.
We will be looking at the following data page types:
Data Page
The main type of page used to store user data records.
PFS Page - Page Free Space
Used to store allocation and free space information about pages.
GAM and SGAM Pages
Used for allocating extents.
IAM Page - Index Allocation Map
Used to store information about where pages are stored for a particular table or index.
Index Page
Used to store leaf and non-leaf records for clustered and non-clustered indexes. * we'll talk about this page in the next article.
There are a number of other page types but those listed above are the important ones which we will now look at in more detail.
When you create a new (empty) database no 'data pages' will exist until you add your first record. We saw this in part 1 of this series. However when we inserted a new record we found that SQL created a new data page in location 34 (this may have been a different page for you). But why didn't SQL create and store this record in page 1 (i.e. at the beginning)? Well that's because the preceding pages were already 'allocated' (an important term as you will see) to something else.
As mentioned above SQL will use pages to store system-type information and don't forget in a new database SQL will automatically create system tables such as SYSOBJECTS, SYSINDEXES, etc. so these already exist before any user objects are created.
Lets run the DBCC PAGE command on the first few pages of our database...
DBCC TRACEON (3604) DBCC PAGE (myIndexes, 1, 0, 3)
This command has instructed SQL to show us the contents of the very first page in your data file. This is the File Header page. Let's analyze some of the information on this page.
First of all you will see at the top of your output 'PAGE (1:0)'. The 1 refers to the file id (as specified in the DBCC PAGE command). I'm sure you can see the relevance of this number if you were to add secondary data files to your database, you would have to change this number to 2 to view the contents of your first secondary data file.
It is important that you start to think about this topic in terms of data files as well as in database terms. Due to various limitations your data files are split internally into intervals, and multiple data files can make up your database. Just bear this in mind while reading the following text - pick up on when I use each term.
The 0 refers to the data page number (as specified the DBCC PAGE command). All pages contain this information. And just for your information, in case you haven't researched the DBCC PAGE command, the last parameter is a number representing the amount of information SQL displays about the page - 3 being detailed.
Looking down the through the output you will see information about the actual .mdf file - what it has been allocated to, how full it is, the filegroup it belongs to, the auto growth rate, etc. Page 0 in each data file (primary or secondary) will always contain the fileheader information. I doubt you will ever need to use this but now you know about it and can start to see how pages build up to make your database file.
Lets move on to the 2nd page in our data file. Open a new query page and run the following...
DBCC TRACEON (3604) DBCC PAGE (myIndexes, 1, 1, 3)
Compare how the contents of page 1 are different to page 0.
Page 1 is always a PFS (Page Free Space) page. An example of what you should see is as follows...
(1:12) - (1:13) = ALLOCATED 100_PCT_FULL IAM Page Mixed Ext
Here we can see that pages 12 to 13 are allocated, are 100% full and are allocated as IAM pages (mixed extent). We don't need to know what this means yet but you should understand that SQL uses the PFS page to work out how full our pages are, where new data can be inserted, when new data pages need to be created, etc. It is able to efficiently manage the allocation of data and quickly obtain free space information within its data file all from this simple 8k structure.
The PFS page is a byte-map, meaning that every byte in the page represents a page in the data file. Each byte can be set to a particular value that represents whether the corresponding page is empty or how full it is, if it is a mixed page, etc.
If you want to know how many pages are in your data file look at the higher number on the last row of the PFS page. Note this may differ from the value SYSINDEXES reports when you run the following...
SELECT * FROM SYSINDEXES WHERE name = 'myTable'
Look at how many [dpages] are used up by your data - you might see that this is different from the total number you saw in the previous DBCC PAGE command. SQL needs to store information about your data and you can see how many pages it uses to do this by looking at the [reserved] column. Adding both these columns together should match the value you saw in the previous DBCC PAGE command. If they don't then you'll probably have another PFS page somewhere. If you have added lots of records to your table then you might find the values don't add up as expected but it's useful to know about these to understand what we are trying to do. Hopefully by the end of this article you will know how to find all your pages.
Now you might be thinking what happens if I have millions of pages within my data file, how is our single 8k PFS page able to store this amount of information? Well SQL creates another PFS page after approximately 8000 pages and will keep doing so as the need arises. Remember the PFS page is a byte-map and so has 8000 bytes available to record information in. BOL state that approximately 8000 pages are recorded within the PFS page but I believe this value to be nearer 8088.
So if you were to find all the PFS pages in you data file(s) and count up the number of pages they deal with you would have the total for your database. Remember you would have to do this for all your data files though (transaction log is not classed as a data file).
Considering the management of page allocation and free space information is a vital core function of a database engine, I feel SQL's straight forward approach to these features is very well designed.
Ok now on to the next page...
DBCC TRACEON (3604)
DBCC PAGE (myIndexes, 1, 2, 3)
You are now viewing the 3rd page in your data file and this will always be used for the GAM (Global Allocation Map). The GAM is used for the allocation of extents. This page is full of bits (1 or 0) and each bit represents one extent - so it's a bit-map. Because the maximum size of a page is 8k this means that we can record allocation information for 64000 extents (8k = 8000 bytes = 64000 bits).
If the first bit in this page is a 0 then the first extent in the data file has been allocated, and likewise if it is a 1 then it has not been allocated. If the extent has not been allocated SQL knows it can use it to store newly inserted data. Like the PFS page another GAM will be created as the number of extents increase, so we'll get a GAM page for every 64000 extents (approximately 4GB). Quite straight forward so far.
View the next page...
DBCC TRACEON (3604)
DBCC PAGE (myIndexes, 1, 3, 3)
This is a SGAM (Shared Global Allocation Map) page and is used with the GAM to work out the type of an extent, and like the previous pages the first one will always appear in this location. There are 2 types of extents - a uniform extent stores pages to do with one particular object (i.e. a table or an index) and a mixed extent stores pages to do with multiple objects (up to 8 objects because an extent can hold up to 8 pages - note that a page cannot contain records from different tables or indexes). And again SQL will create another when it needs to. The SGAM also works in 64000 extent chunks.
SQL uses an algorithm that includes the bits from the GAM and SGAM to determine whether extents are allocated, the type of extent and free space information. It works like the PFS pages and is quite simple in design. For example SQL could quickly scan the GAM and SGAM for bit 1234 (i.e. extent 1234) and find that because the GAM = 1 (unallocated) and the SGAM = 0 it can be used to insert the new record a client machine has sent. There is a little more information in BOL if you want to know more - look up 'Managing Extent Allocations and Free Space'.
Data Pages
Running the DBCC PAGE command using a random page number try and find a page that contains some of your data. It shouldn't be too difficult to find one if you have a good number of records in your database.
When you find one look at some of the information contained in the header. m_objId refers to the object (table or index) and will match the id in SYSOBJECTS. You'll also notice that the m_type = 1 which indicates a data page.
Under the Allocation Status section you'll be able to see what GAM, SGAM and PFS the page is allocated to.
Records stored in data pages aren't necessarily stored in order. It is the slot offset that is sorted on insert or deletes - and this is a much faster way of managing the page, rather than reordering the records every time a change occurs. If you want to see this information run the DBCC PAGE command with the last parameter = 2.
OK so we have looked at how SQL stores data in pages (data pages), how it manages the allocation of pages and extents (GAM, SGAM + PFS pages), and how it manages free space information (SGAM + PFS pages). If you now imagine lots of new rows being inserted in to your SQL system you should have a good idea of SQL builds up the pages and records information about them.
Now try and imagine lots of reads occurring on your SQL system - how does SQL know which pages to search through and how can it do this efficiently? Obviously SQL doesn't scan 'every' page within your database each time a client queries the database. This is where IAM pages come in...
IAM Pages
If we ran the following query 'SELECT * FROM [myTable]' the system would only scan the pages belonging to [myTable]. It gets the list of pages from the IAM (Index Allocation Map). Don't get confused by the name - we still don't have any indexes configured on our table but SQL still needs some way of tracking the records associated with each table, and it manages this via a linked-list of IAM pages that store information about where an objects' extents and pages are located.
SQL creates an IAM page for each heap / clustered index, each non-clustered index and BLOB.
Let's have a look at how this is used...
SELECT * FROM SYSINDEXES WHERE NAME = 'myTable'
You'll see that one of the columns returned is called the [FirstIAM]. This points SQL to the first IAM page of your table. Now let's look at the contents of this page - you'll have to convert the number as we saw in part 1. We're only interested in the first half (assuming we are working with our primary data file). When you have this number you should know how to use it by now to view the page.
Looking through the contents of this page you will see there's the usual fields in the top sections (bpageno, m_objId, m_type, etc.). Then at the bottom you'll notice 2 sections:
nIAM: Single Page Allocations - this section details single pages that contain data from your table (these only appear in the first IAM page).
IAM: Extent Alloc Status Slot 1 - this section details extent ranges that contain data from your table.
It should be obvious to you which pages and extents are allocated to your table. Performing a DBCC PAGE on any of the pages listed will reveal some of your table rows.
So when we query our table SQL will look in SYSINDEXES for the FirstIAM page, then using the the pages and extents 'allocated' to our table, start scanning through them to find the user data. Remember we still don't have any indexes and are dealing with a heap.
The IAM is similar to the GAM in that it manages chunks of 64000 extents, and SQL will created more as needed. So unless your table is very big (more than 64000 extents = over 4GB) or has multiple data files you'll find that only one IAM page will exist for it.
Now to help you get a better feeling about all the different types of pages with your database I have included the following script that counts each page type in your data file. It will also give you the location of all the PFS and IAM pages in your data file to give you an idea of how these are spaced out. You can modify this to view the locations of other page types.
The script basically reads in all your pages in to a temporary table and then scans each for their type. You may have noticed in the output of the DBCC PAGE command that the page header section contains various fields - this is where the m_type field is stored which is used to determine the type of page it is.
Note that the script is currently configured to work on the primary data file and on the [myIndexes] database. You'll have to change these if you want to play about with it.
Here is a description of some of the page types you will see in your data file...
Page Header: m_type
0 = empty / unallocated page
1 = data page (user data pages and SQL reserved pages)
8 = gam (global allocation map)
9 = sgam (shared global allocation map)
10 = iam (index allocation map)
11 = page free space (PFS) page
15 = file header page
16 = differential bitmap page showing which extents have changed since the last full or differential backup
17 = a minimally-logged bitmap page showing which extents have changed since the last backup (for bulk-logged recovery mode)
As you will see when you run the script (which can take some time) there are more page types than we have discussed. I've never been able to find a comprehensive list, with detailed information, of what each type is used for so if anyone out there can provide one it would be very much appreciated (although you can usually work out what they are used for by looking inside them). You can find information on most of them by searching the Internet, and I recommend you spend some time doing this as it helps explain how other features work (eg. differential backups, etc.).
Please note the following script may return multiple PFS and IAM page results if your table does contain a lot of records. Only one of these will actually be a PFS page - I'm sure you'll be able to work out which.
-- Joe Doherty - SQLServerCentral.com - Index Fragmentation.
-- This script will scan all pages within your data file and then count and report on each type.
DECLARE @loop INT
DECLARE @mySQL VARCHAR(250)
DECLARE @noTypes INT
DECLARE @pageType INT
DECLARE @pageCount INT
DECLARE @pfsPage VARCHAR(50)
DECLARE @iamPage VARCHAR(50)
-- This trace must be switched on for SQL to report to the end user
DBCC TRACEON(3604)
-- Create a temporary table to store the contents of each page in
-- This can take up a significant amount of space if your data file is large
CREATE TABLE #myTemp (parentObject VARCHAR(200), object VARCHAR(200), field VARCHAR(200), value VARCHAR(200))
SET @loop = 0
-- Start to record every page
WHILE 1 = 1
BEGIN
-- The following statement runs DBCC PAGE. You should change the 2nd parameter to reflect that of your data file (i.e. a primary or secondary)
-- Remember you'll have to change the database if you're not using the test one we created in part 1
SET @mySQL = 'DBCC PAGE (myIndexes, 1, ' + CAST(@loop AS VARCHAR) + ', 0) WITH TABLERESULTS'
INSERT INTO #myTemp EXEC (@mySQL)
IF @@ROWCOUNT = 0 BREAK
PRINT 'Processed page ' + CAST(@loop AS VARCHAR)
SET @loop = @loop + 1
END
-- Create some indexes to help the following statements
CREATE CLUSTERED INDEX myIndex1 ON #myTemp (field)
CREATE INDEX myIndex2 ON #myTemp (value)
CREATE INDEX myIndex3 ON #myTemp (object)
-- Count the number of unique page types in your data file
SET @noTypes = (SELECT COUNT(DISTINCT value) FROM #myTemp WHERE [field] = 'm_type')
SELECT 'There are ' + CAST(@noTypes AS VARCHAR) + ' different types of pages within your data file.'
-- Configure a cursor to display each number of page types
DECLARE myCursor CURSOR FOR (SELECT DISTINCT value FROM #myTemp WHERE [field] = 'm_type')
OPEN myCursor
FETCH NEXT FROM myCursor INTO @pageType
WHILE @@FETCH_STATUS = 0
BEGIN
SET @pageCount = (SELECT COUNT(*) FROM #myTemp WHERE [field] = 'm_type' AND [value] = @pageType)
SELECT 'There is ' + CAST(@pageCount AS VARCHAR) + ' page(s) of type ' + CAST(@pageType AS VARCHAR)
FETCH NEXT FROM myCursor INTO @pageType
END
-- Show where PFS pages are
DECLARE myPFSCursor CURSOR FOR (SELECT [object] FROM #myTemp WHERE [field] = 'm_type' AND [value] = '11')
OPEN myPFSCursor
FETCH NEXT FROM myPFSCursor INTO @pfsPage
WHILE @@FETCH_STATUS = 0
BEGIN
SELECT 'There is a PFS page located at page '
SELECT [value] FROM #myTemp WHERE [object] = @pfsPage AND [field] = 'm_pageId'
FETCH NEXT FROM myPFSCursor INTO @pfsPage
END
-- Show where IAM pages are
DECLARE myIAMCursor CURSOR FOR (SELECT [object] FROM #myTemp WHERE [field] = 'm_type' AND [value] = '10')
OPEN myIAMCursor
FETCH NEXT FROM myIAMCursor INTO @iamPage
WHILE @@FETCH_STATUS = 0
BEGIN
SELECT 'There is a IAM page located at page '
SELECT [value] FROM #myTemp WHERE [object] = @iamPage AND [field] = 'm_pageId'
FETCH NEXT FROM myIAMCursor INTO @iamPage
END
CLOSE myIAMCursor
DEALLOCATE myIAMCursor
CLOSE myPFSCursor
DEALLOCATE myPFSCursor
CLOSE myCursor
DEALLOCATE myCursor
DROP TABLE #myTemp
Conclusion
After examining the results of the above script and having a dig around in the pages you will hopefully have a better feeling on how SQL builds your database and feel more confident in your knowledge of SQL's storage engine.
It's quite impressive that SQL is able to manage 4GB of data with a handful of 8k pages, meaning that 99.9% of your data file(s) is data (and their indexes).I think we have done enough ground work now to be able to move on to what this series is intended for - indexes and fragmentation. In the next article we will get straight to creating an index and see how this affects our pages. Then we'll run a script that causes fragmentation and monitor it as it grows. At that point we can start to look at performance degradation and the methods available to resolve it.
As always feedback is beneficial to everyone who reads this article. If you feel you have something to add please do.
Until next time...
Joe
No comments:
Post a Comment