Friday, April 27, 2007

How Online Indexing Works

The default behavior of either method of rebuilding an index is that SQL Server takes an exclusive lock on the index, so it is completely unavailable while the index is being rebuilt. If the index is clustered, the entire table is unavailable; if the index is non-clustered, a shared lock is held on the table, meaning no modifications can be made, but other processes can SELECT from the table (though obviously they cannot take advantage of the index being rebuilt). This is pretty miserable in large databases, since queries won't be able to take advantage of indexes, resulting in our arch-nemesis: table scans.

The online build works by maintaining two copies of the index simultaneously: the original (source) and the new one (target). The target is used only for writing any changes made while the rebuild is going on; all reading is done from the source. SQL Server row-level versioning is used, so anyone retrieving information from the index will be able to read consistent data.

Here are the steps involved in rebuilding a non-clustered index online:

  • A shared lock is taken on the index, which prevents any data modification queries, and an Intent-Shared lock is taken on the table.
  • The new index is created with the same structure as the original and marked as write-only.
  • The shared lock on the index is released, leaving only the Intent-Shared lock on the table.
  • A versioned scan is started on the original index, meaning modifications made during the scan are ignored by it. The scanned data is copied to the target.
  • All subsequent modifications write to both the source and the target; reads use only the source.
  • The scan of the source and copy to the target continue while normal operations are performed.
  • The scan completes.
  • A Schema-Modification lock (the strictest lock) is taken to make the source completely unavailable.
  • The source is dropped, metadata is updated, and the target is made read-write.
  • The Schema-Modification lock is released.
A clustered index rebuild works exactly like a non-clustered rebuild as long as there is no schema change (a change of index keys or uniqueness property).

For a build of a new clustered index, or a rebuild of a clustered index with a schema change, there are a few more differences. First, an intermediate mapping index is used to translate between the source and target physical structures. Additionally, all existing non-clustered indexes are rebuilt one at a time after the new base table has been built. Creating a clustered index on a heap with two non-clustered indexes involves the following steps:
  • Create a new write-only clustered index.
  • Create a new non-clustered index based on the new clustered index.
  • Create another new non-clustered index based on the new clustered index.
  • Drop the heap and the two original non-clustered indexes.
Online index rebuilding can be costly, as the server must maintain up to six structures at the same time. However, it is incredibly useful for removing fragmentation or re-establishing a fill factor when the data must be available 24/7 in high-availability systems.
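For reference, here's roughly what kicking off the online rebuild described above looks like in SQL Server 2005 (the table and index names here are made up for illustration; note that ONLINE = ON requires Enterprise Edition):

-- Rebuild one index online, re-establishing a fill factor at the same time
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
REBUILD WITH (ONLINE = ON, FILLFACTOR = 80);

-- Or rebuild every index on the table online
ALTER INDEX ALL ON dbo.Orders
REBUILD WITH (ONLINE = ON);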

Thursday, April 26, 2007

Werner Heisenberg Bug aka Observer Effect

Heisenberg Bug (c) Dan.Oscar.Mckinley. 2007

When dynamically creating a DropDownExtender control from the AjaxControlToolkit, I was attempting to set the TargetControlID to the ID of a LinkButton.

OnPreRender:
DropDownExtender.TargetControlID = dd.Parent.Controls(2).ID

However, when this ran it told me that TargetControlID cannot be an empty string. Hmm, weird. I only have some idea of how the unique IDs are generated, and I don't know exactly how or when in the page lifecycle, but I'm pretty sure it should be available by the PreRender phase.

I set a breakpoint on this line, and in the Immediate window I checked the ClientID:
?dd.Parent.Controls(2).ClientID
"GResults1_dg_ctl02_ctl13"
?dd.Parent.Controls(2).ID
"ctl13"


What the hell is going on? The ID is set; I don't know why TargetControlID is an empty string. I continue running, and the page loads fine. I refresh and it errors again. I step through the code, check the IDs, and it works again.

After a little research (aka not checking on Google and going straight to a Sr. Developer) I found out that ASP.NET will only build up a control's ID when the ClientID property is referenced. This makes a lot of sense, since you don't want to waste time generating ClientIDs for every control: only the ones you are planning on referencing on the client side.

I changed the code to reference the ClientID, and now the ID is correctly created (although in this case I'm setting it myself):
OnPreRender:
dd.Parent.Controls(2).ID = dd.Parent.ClientID + "_lb"
DropDownExtender.TargetControlID = dd.Parent.Controls(2).ID

The interesting part here is that I couldn't actually see the bug, since my commands in the Immediate window were affecting the objects. Observing the ClientID caused the ID to be generated, making it seem like the ID should not be Nothing at this point in the code. Maybe the underlying problem is that people shouldn't be programming serious apps if they don't understand what is going on under the hood. Note to self: learn more stuff.

Longhorn Beta 3 Evaluation

Windows Server Code Name "Longhorn" Beta 3 Evaluation

IIS7, RODC, PowerShell, Server Manager, NAP, Failover Clustering, etc.

Why Indexed Views are Cool

One of the most important benefits of Indexed Views (aka materialized views) is the ability to materialize summary aggregates of large tables. Normal views are only saved queries and do not store the results. Every time the view is referenced, the aggregation to produce the grouped results must be recomputed.

However, when you create an index on the view, the aggregated data is stored in the leaf level of the index. Aggregate and reporting queries can then be processed using the indexed view without having to scan the underlying large tables. Yeah, read that again, it's pretty damn cool.

The first index you build on a view must be a clustered index, and because the clustered index contains all the data at its leaf level, this index actually does materialize the view. The view's data is physically stored at the leaf level of the clustered index.

Because of their special nature, indexed views (and indexed computed columns) may only contain deterministic functions (a function that returns the same result every time it is called with the same set of input values). Expressions or functions that return float or real values are not acceptable, since these values are imprecise: they can be computed differently on different system architectures.
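As a rough sketch of what this looks like (the dbo.SalesDetail table here is hypothetical), SQL Server 2005 requires WITH SCHEMABINDING on the view and a COUNT_BIG(*) column in any aggregated view before it can be indexed:

CREATE VIEW dbo.vProductSales
WITH SCHEMABINDING
AS
SELECT ProductID,
       SUM(Quantity) AS TotalQty,
       COUNT_BIG(*)  AS RowCnt   -- required in any indexed view that uses GROUP BY
FROM dbo.SalesDetail
GROUP BY ProductID;
GO
-- The first index must be unique and clustered; this is what materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_vProductSales
ON dbo.vProductSales (ProductID);

Once the clustered index exists, a reporting query grouping sales by product can be answered from the small materialized result instead of scanning dbo.SalesDetail.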

Monday, April 23, 2007

B-Trees: Difference Between Clustered and Non-Clustered Indices

B-Trees
First let's be clear on what a B-Tree is, and why they are important in database systems. A B-Tree index provides fast access to data by searching on a key value of the index. B-Trees cluster records with similar keys. The B stands for balanced (not binary!), and balancing the tree is a core feature of the B-Tree's usefulness. The trees are managed, and branches are grafted as necessary, so navigating down the tree to find a value and locate a specific record takes only a few page accesses. Because the trees are balanced, finding any record requires about the same number of resources, and retrieval speed is consistent because the index has the same depth throughout.

In any index, whether clustered or non-clustered, the leaf level contains every key value (or combination of values for composite indices) in key sequence. The biggest difference between a clustered and a non-clustered index is what else is in the leaf. Below I'll explain the difference in detail:

Clustered Index
The leaf level of a clustered index contains the data pages, not just the index keys. Read that last line again. All columns of every row are in the leaf level; the data itself is part of the index. A clustered index keeps the data in a table ordered around the key. The data pages in the table are kept in a doubly linked list called a page chain (in a heap, pages are not linked together). Therefore the order of pages in the page chain, and the order of rows on the data pages, is the order of the index key or keys. When the key is found in a clustered index, the data has been found, not simply pointed to.

Since the actual page chain for the data pages can only be ordered one way, a table can have only one clustered index. It is a common misconception that clustered indexes store the data in sorted order on the disk. Sorted order simply means that the data page chain is logically in order: if SQL Server follows the page chain, it can access each row in clustered index key order. New pages can be added simply by adjusting the links in the page chain.

Non-Clustered Index
In a non-clustered index, the leaf level does not contain all the data. In addition to the key values, each index row in the leaf level contains a bookmark that tells you where to find the data row corresponding to the key in the index. If the table is a heap (no clustered index), the non-clustered index's bookmarks are row identifiers (RIDs), which are actual row locators in the form File#.Page#.Slot#. If the table has a clustered index, the bookmark is instead the clustered index key for that row.

The presence or absence of a non-clustered index does not affect how the data pages are organized; therefore you are not restricted to having only one non-clustered index per table. When you search for data using a non-clustered index, the index is traversed, and then SQL Server retrieves the record or records pointed to by the leaf-level index rows.
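To make this concrete, here's a minimal sketch (the table and index names are hypothetical): one clustered index ordering the table itself, with multiple non-clustered indexes layered on top of it:

-- The clustered index: the table's rows themselves are stored in OrderID order
CREATE TABLE dbo.Orders (
    OrderID    int      NOT NULL,
    CustomerID int      NOT NULL,
    OrderDate  datetime NOT NULL
);
CREATE UNIQUE CLUSTERED INDEX IX_Orders_OrderID
ON dbo.Orders (OrderID);

-- Non-clustered indexes: leaf rows hold only the key plus a bookmark
-- (here, the clustering key OrderID) pointing back to the data row
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate  ON dbo.Orders (OrderDate);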

Sunday, April 22, 2007

Locking in SQL Server 2005

Depending on the Transaction Isolation Level set for your transaction, your queries will request various locks in order to ensure the correct consistency behaviors. I'd like to go over some of the different types of locks that may be issued as well as explain some of their differences.

Shared Locks

Shared locks are acquired automatically when data is read. Shared locks can be held on a table, page, index key, or individual row. Many processes can hold shared locks on the same data, but no process can acquire an exclusive lock on data that has a shared lock on it (unless it is the only process holding a shared lock). Normally shared locks are released as soon as the data is read; however, this can be changed via query hints or by the isolation level.

Exclusive Locks
Exclusive locks are acquired automatically when data is modified with an insert, update, or delete operation. Only one process at a time can hold an exclusive lock on a particular data resource, and no other locks of any kind can be acquired once an exclusive lock is held. Exclusive locks are held until the end of the transaction: this means that changed data is not available to any other process until the current transaction commits or rolls back.

Update Locks
Update locks are acquired when the server executes a data modification operation but first needs to search the table to find the resource that will be modified. An update lock is not sufficient to allow you to change the data; all modifications require that the data resource being modified have an exclusive lock. An update lock acts as a serialization gate to queue future requests for the exclusive lock (many processes can hold shared locks on a resource, but only one process can hold an update lock). As long as a process holds an update lock on a resource, no other process can acquire an update lock or an exclusive lock for that resource; instead, it must wait. The process holding the update lock can convert it into an exclusive lock on the resource, because the update lock prevents other processes from acquiring incompatible locks. Update locks are also known as "intent-to-update" locks. The reason this is important is that serializing access to the exclusive lock lets you avoid conversion deadlocks. Update locks are held until the end of the transaction or until they are converted into an exclusive lock.
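You can also request an update lock explicitly with a locking hint, which is a common way to avoid conversion deadlocks in read-then-update code. A minimal sketch (dbo.Orders is a hypothetical table):

BEGIN TRANSACTION;

-- Take a U lock up front instead of an S lock, so two sessions
-- can't both read the row and then deadlock converting S to X
SELECT OrderDate
FROM dbo.Orders WITH (UPDLOCK)
WHERE OrderID = 42;

-- The U lock converts to an X lock here
UPDATE dbo.Orders
SET OrderDate = GETDATE()
WHERE OrderID = 42;

COMMIT TRANSACTION;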

Intent Locks
Intent locks are not a separate mode of locking; they are a qualifier to the modes mentioned above: you can have intent shared locks, intent exclusive locks, and intent update locks. Because SQL Server can acquire locks at different levels of granularity, a mechanism is required to indicate that a component of a resource is already locked.

Special Lock Modes (Schema stability locks, schema modification locks, bulk update locks)
When queries are compiled, schema stability locks prevent other processes from acquiring schema modification locks, which are taken when a table's structure is being modified. Bulk update locks are acquired during bulk operations such as BULK INSERT or when the TABLOCK hint is used. Requesting this special bulk update table lock does not necessarily mean it will be granted; if other processes already hold locks on the table, or if the table has any indexes, a bulk update lock cannot be granted. Unlike exclusive locks, bulk update locks do not conflict with each other, so if multiple connections have requested and received a bulk update lock, they can perform parallel loads into the same table.

Conversion Locks
Conversion locks cannot be requested directly; they are the result of a conversion from one mode to another. In SQL Server 2005, these consist of SIX, SIU, and UIX. The most common is SIX, which occurs when a transaction holds a shared (S) lock on a resource and later needs an IX lock; the lock mode is then indicated as SIX.

Key Locks
For certain isolation levels (Read Committed, Repeatable Read, or Snapshot), SQL Server tries to lock the actual index keys accessed while processing the query. With a table that has a clustered index, the data rows are the leaf level of the index, and you will see key locks acquired. If the table is a heap, you might see key locks for the non-clustered indexes and row locks for the actual data.

Key Range Locks
Additional lock modes, called key-range locks, are taken only in the Serializable isolation level for locking ranges of data. There are nine types of key-range locks, and each has a two-part name: the first part indicates the type of lock on the range of data between adjacent index keys, and the second part indicates the type of lock on the key itself. Most of them are very rare and/or transient. The key-range lock types are listed here (a sketch of observing them follows the list):

  • RangeS-S – Shared lock on the range between keys; shared lock on the key at the end of the range
  • RangeS-U – Shared lock on the range between keys; update lock on the key at the end of the range
  • RangeIn-Null – Exclusive lock to prevent inserts on the range between keys; no lock on the keys themselves
  • RangeX-X – Exclusive lock on the range between keys; exclusive lock on the key at the end of the range
  • RangeIn-S – Conversion: S + RangeIn-Null
  • RangeIn-U – Conversion: U + RangeIn-Null
  • RangeIn-X – Conversion: X + RangeIn-Null
  • RangeX-S – Conversion: RangeIn-Null + RangeS-S
  • RangeX-U – Conversion: RangeIn-Null + RangeS-U
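Here's a rough way to see key-range locks for yourself (the table name is hypothetical; sys.dm_tran_locks is the SQL Server 2005 DMV that exposes current lock requests):

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

-- Read a range; under Serializable this takes RangeS-S locks on the keys
SELECT OrderID
FROM dbo.Orders
WHERE OrderID BETWEEN 10 AND 20;

-- Inspect the locks held by this session
SELECT resource_type, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID;

ROLLBACK TRANSACTION;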

Saturday, April 21, 2007

Transaction Isolation Levels in SQL Server 2005

If the effects of your transactions are important to you, it's important to understand what problems can arise from concurrency and what steps must be taken to avoid these consistency problems. Below I've listed the various transaction isolation levels available to you in SQL Server 2005.


Uncommitted Read

In Uncommitted Read isolation, all the dependency/consistency behaviors except lost updates can occur. Queries will read uncommitted data, and both non-repeatable reads and phantoms are possible. Uncommitted Read is implemented by allowing read operations to take no locks; since no locks are requested, a read won't be blocked by conflicting locks acquired by other processes. Using uncommitted reads, you trade off strongly consistent data for higher concurrency.

Read Committed (locking / pessimistic)
Read Committed isolation ensures that an operation never reads data that another process has changed but not yet committed. With Read Committed (locking), if another transaction is updating data and consequently has exclusive locks on data rows, your transaction must wait for those locks to be released before you can use that data (whether you are reading or writing it). Your transaction must also acquire shared locks on the data that will be visited, but these locks can be released as soon as the data is read rather than waiting until the end of the transaction.

Read Committed (snapshot / optimistic)
Read Committed (snapshot) also ensures that an operation never reads uncommitted data, but not by forcing other processes to wait: every time a row is updated, SQL Server generates a version of the changed row with its previous committed values. The data being changed is still locked, but other processes can see the previous versions of the data as it was before the update operation began.
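Read Committed (snapshot) is enabled at the database level rather than per session; roughly (shown here against pubs, the sample database used in the example below):

-- Switch the locking flavor of Read Committed to the snapshot flavor
-- (the change waits until no other users are in the database)
ALTER DATABASE pubs SET READ_COMMITTED_SNAPSHOT ON;

After this, any session running at READ COMMITTED reads row versions instead of blocking on writers.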

Repeatable Read
This isolation level adds to the properties of Read Committed by ensuring that if a transaction revisits data or a query is reissued, the data will not have changed. The cost of this extra safeguard is that all shared locks in a transaction must be held until completion (either COMMIT or ROLLBACK) of the transaction.

Snapshot
Snapshot is an optimistic isolation level. Like Read Committed (snapshot), it allows processes to read older versions of committed data if the current version is locked; the difference between Snapshot and Read Committed (snapshot) has to do with how old the older versions are allowed to be. Although the behaviors prevented by Snapshot isolation are the same as those prevented by Serializable, Snapshot is not truly a serializable isolation level. With Snapshot isolation it is possible to have two transactions executing simultaneously that give us a result that is not possible in any serial execution.
Ex/

-- Transaction 1:
USE pubs
DECLARE @price money
BEGIN TRANSACTION

-- Transaction 2 (concurrently):
USE pubs
DECLARE @price money
BEGIN TRANSACTION

-- Transaction 1:
SELECT @price = price
FROM titles
WHERE title_id = 1

-- Transaction 2:
SELECT @price = price
FROM titles
WHERE title_id = 2

-- Transaction 1:
UPDATE titles
SET price = @price
WHERE title_id = 2

-- Transaction 2:
UPDATE titles
SET price = @price
WHERE title_id = 1

-- Transaction 1:
COMMIT TRANSACTION

-- Transaction 2:
COMMIT TRANSACTION


There is no serial execution of these two transactions where the two titles won't end up with the same price; under Snapshot isolation, however, both transactions read the old committed values and the two prices are simply swapped.

Serializable
The Serializable isolation level adds to the properties of Repeatable Read by ensuring that if a query is reissued, rows will not have been added in the interim, i.e., phantoms will not appear. Serializable is the strongest of the pessimistic isolation levels because it prevents all the possible undesirable behaviors. The cost of this added safeguard is similar to that of Repeatable Read: all the shared locks in the transaction must be held until completion of the transaction. In addition, enforcing the Serializable isolation level requires that you lock not only data that has been read, but also data that does not exist. The Serializable name comes from the fact that running multiple Serializable transactions at the same time is the equivalent of running them one at a time (serially).
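For completeness, here's roughly how a session picks a level, and how Snapshot must first be allowed at the database level before sessions can request it:

-- Snapshot isolation must first be enabled for the database
ALTER DATABASE pubs SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Then any session can opt in per transaction
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT price FROM titles WHERE title_id = 1;
COMMIT TRANSACTION;

-- The other levels are set the same way, e.g.:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;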

Friday, April 20, 2007

Difference Between Pessimistic Concurrency and Optimistic Concurrency

In pessimistic concurrency, the server acquires locks to block access to data that another process is using. Pessimistic concurrency avoids conflicts by acquiring locks on data that is being read, so no other process can modify that data; it also acquires locks on data being modified, so no other process can access that data for either reading or modifying. Readers block writers and writers block readers.

In optimistic concurrency, the server uses row versioning to allow readers to see the state of the data before the modifications occur. Older versions of data rows are saved, so a process reading data can see the data as it was when the process started reading and not be affected by any changes being made to that data. A process that modifies data is unaffected by processes reading the data because the reader is accessing a saved version of the data rows: readers do not block writers and writers do not block readers.
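A rough two-session sketch of the difference (the dbo.Orders table is hypothetical; run each session in its own connection):

-- Session 1: modify a row but don't commit yet
BEGIN TRANSACTION;
UPDATE dbo.Orders SET OrderDate = GETDATE() WHERE OrderID = 42;

-- Session 2 under pessimistic (locking) READ COMMITTED:
-- this SELECT blocks until session 1 commits or rolls back
SELECT OrderDate FROM dbo.Orders WHERE OrderID = 42;

-- Session 2 under optimistic concurrency (READ_COMMITTED_SNAPSHOT ON):
-- the same SELECT returns immediately with the last committed OrderDate
SELECT OrderDate FROM dbo.Orders WHERE OrderID = 42;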

Thursday, April 19, 2007

Consistency Problems / Dependency Behaviors in Database Systems

Since database systems must support concurrency (allowing multiple people/connections/users access to the data at the same time), certain problems arise when transactions are run in parallel instead of serially. Each of the following is a dependency/consistency problem, and it is key to understand each of them in order to fully understand transaction isolation and the various locking/versioning methods used to prevent these potential problems.

Lost Updates:
This behavior occurs when two processes read the same data, both manipulate the data, changing its value, and then both try to update the original data to the new value. The second process might completely overwrite the first update. This behavior should be avoided in almost all cases.
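A sketch of the classic lost update (the dbo.Accounts table is hypothetical; two sessions run the same read-modify-write pattern):

-- Session 1 and session 2 both run this read, and both see 100
DECLARE @balance money;
SELECT @balance = Balance FROM dbo.Accounts WHERE AccountID = 1;

-- Both then write a value computed from their stale read.
-- Session 1:
UPDATE dbo.Accounts SET Balance = @balance + 50 WHERE AccountID = 1;  -- 150
-- Session 2 (overwrites session 1's change; the +50 is lost):
UPDATE dbo.Accounts SET Balance = @balance - 30 WHERE AccountID = 1;  -- 70, not 120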

Dirty Reads:
This behavior occurs when a process reads uncommitted data. If one process has changed data but not yet committed the change, another process reading the data will read it in an inconsistent state. The process updating the data has no control over whether another process can read its data before it's committed; it's up to the process reading the data to decide whether or not it wants to read data before it is committed.

Non-repeatable Reads:
Also known as inconsistent analysis, a read is a non-repeatable read if a process might get different values when reading the same resource in two separate reads within the same transaction. This can happen when another process changes the data in between the reads that the first process is doing.

Phantoms:
This behavior occurs when membership in a set changes. A phantom occurs if two SELECT operations using the same predicate (such as count(members) > 10) in the same transaction return a different number of rows.

Wednesday, April 18, 2007

ACID Properties

Transaction processing guarantees the consistency and recoverability of a database, ensuring that all transactions are performed as a single unit of work even in the presence of logical or physical failure. Such transactions are said to have ACID properties; here are the common definitions of these four properties.

Atomicity:

Atomicity means that each transaction is treated as all or nothing: it either commits or aborts. If it commits, the effects remain; if it aborts, all the effects are undone.
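A minimal T-SQL sketch of atomicity (the dbo.Accounts table is hypothetical): both updates take effect together or not at all:

BEGIN TRY
    BEGIN TRANSACTION;
    UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountID = 1;
    UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountID = 2;
    COMMIT TRANSACTION;      -- both changes become permanent together
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;  -- any failure undoes both changes
END CATCH;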

Consistency:
The consistency property ensures that a transaction won’t allow the system to arrive at an incorrect logical state – the data must always be logically correct. Constraints and rules are honored even in the event of system failure.

Isolation:
Isolation separates concurrent transactions from the updates of other incomplete transactions.

Durability:
After a transaction commits the effects of the transaction must persist even if system failure occurs. If a system failure occurs while a transaction is in progress the transaction is completely undone, leaving no partial effects on the data.

SQL Server & Database Systems

I haven't posted to my blog for a while, but I'd like to get started again. There has yet to be any focus to this blog, so I'd like to make it a bit more academic and start focusing on computer science issues, database systems, and computer system design. Hopefully this will help document some of my knowledge and help someone solve problems similar to what I encounter.

I'd like to first start talking about SQL Server 2005 and go through some database systems knowledge. This will help me solidify my knowledge in these categories and hopefully open up some topics for discussion. Enjoy.