Thursday, May 31, 2007

Generating a Numbers table

A numbers table can be really useful for lots of reasons, but I'm not going to go over them here; I'm just going to document my journey trying to create one.
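
None of the scripts below actually create the target table, so here's a minimal sketch of the numbers_tbl I'm assuming (the table name and single column are inferred from the queries that follow; adjust to taste):

CREATE TABLE dbo.numbers_tbl
(
    n INT NOT NULL PRIMARY KEY
);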

Unfortunately, I'm a .NET programmer, so I normally think of a procedural way to do things first. When I wanted to create a numbers table I started off with a looping solution. It's not pretty, but it works, and you only have to run it once, so who cares. So I started off with this:

SET NOCOUNT ON;
BEGIN TRAN;

DECLARE @LoopCounter INT;
SET @LoopCounter = 1;

WHILE @LoopCounter <= 10000
BEGIN
    INSERT INTO numbers_tbl VALUES(@LoopCounter);
    SET @LoopCounter = @LoopCounter + 1;
END

COMMIT WORK;
GO


This took 29 seconds on our development DB server. Not bad for a one-time cost, but there must be a different way, so I tried a recursive query using a common table expression. It looked like a pretty sweet query:

DECLARE @n AS BIGINT;
SET @n = 1000000;

WITH Nums AS
(
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1
    FROM Nums
    WHERE n < @n
)
INSERT INTO numbers_tbl
SELECT n
FROM Nums
OPTION(MAXRECURSION 0);
GO

>Table 'numbers_tbl'. Scan count 0, logical reads 1009505
>Table 'Worktable'. Scan count 2, logical reads 6000001


It turned out this took 45 seconds, so it ended up being a bit slower than my procedural version. This approach is kinda lame anyway; really, all we need to do is generate the first 1,000 rows and then cross join that set with itself to generate the million rows required. Those rows won't carry the right numbers per se, but we can use the ROW_NUMBER function to number each row, yielding a numbers table from 1 to a million.

DECLARE @n AS BIGINT;
SET @n = 1000000;

WITH
Base AS
(
    -- recursively generate the first 1,000 rows (the square root of the 1,000,000 we need)
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM Base WHERE n < 1000
),
Expand AS
(
    SELECT 1 AS c
    FROM Base AS B1, Base AS B2
),
Nums AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY c) AS n
    FROM Expand
)
INSERT INTO numbers_tbl
SELECT n FROM Nums
WHERE n <= @n
OPTION(MAXRECURSION 0);


Ah, success: this took 18 seconds, with about 9K reads as opposed to 7M. I'm guessing that as the number of entries in the numbers table goes up, this query will scale much better.

This is good, but it's not perfect (not perfect enough for Voodoo-Itzik-black-SQL-magic practitioners, anyway). The problem with the previous solution is that we're still stuck doing recursive selects for the first 1,000 rows. Why start off with the square root? Why not add an additional cross join, allowing us to start off at sqrt(sqrt(@n))? In fact, let's just start off with 2 and keep cross joining the results together (squaring the row count on each join). With 5 joins we can generate about 4.2B rows, and we use the minimal number of initial selects.

In this case, we start with a CTE that has only 2 rows, and each following CTE squares the row count by cross-joining two instances of the previous CTE. This gives 2^(2^N) rows at level N, and we can use the same ROW_NUMBER trick to generate the actual numbers to insert into our table.

DECLARE @n AS BIGINT;
SET @n = 1000000;
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A, L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A, L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A, L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A, L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A, L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY c) AS n FROM L5)

INSERT INTO numbers_tbl
SELECT n FROM Nums
WHERE n <= @n;
GO



This query ran in only 2 seconds. If you want more than 4B rows you can add an additional level (L6) for 2^64 rows; chances are no machine can even store that much, so I've left off L6.

Now you can have the largest numbers table at your company.

Thursday, May 24, 2007

When to not not use Cursors Part 2

If you're one of our new hires or interns at Capital IQ, stop reading: never use cursors. Ever.

Well, my last attempt to prove that cursors can sometimes be useful was derailed by my co-worker Mike Forman.

This time I have a better (read: valid) example.

I have the following table (the script to create and populate it is at the end of the post), which shows a bunch of employees at an unnamed company and the number of bugs they've fixed each day. For each of these employees I want to determine the running total of bugs fixed as of each day. I can use the results of this query to create a pretty graph of bugs fixed over time per developer.


Now, a cursor-based solution scans each piece of data once, which guarantees O(n) performance, meaning that as the number of entries in our table increases we can still produce the desired results in time proportional to the number of rows. A set-based solution suffers from O(n^2) performance (assuming there is no index on empid, bugsFixed). Even with an index, the scan results in roughly developers * (days + days^2)/2 rows scanned, which basically simplifies to O(n^2).
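
For reference, here's a sketch of the kind of set-based solution I mean (a correlated subquery running total against the BugCounts table created below). Each row re-reads all of the earlier rows for the same employee, which is exactly where the quadratic cost comes from:

-- set-based running total: the subquery rescans earlier rows for every outer row
SELECT empid, workDay, bugsFixed,
       (SELECT SUM(B2.bugsFixed)
        FROM tempdb.dbo.BugCounts AS B2
        WHERE B2.empid = B1.empid
          AND B2.workDay <= B1.workDay) AS runbugsFixed
FROM tempdb.dbo.BugCounts AS B1
ORDER BY empid, workDay;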

Since cursors involve some overhead, the cursor solution will lose to the set-based solution for a small number of items; however, due to the performance limitations described above, the cursor solution is the only scalable choice for this problem.

Now watch Mike beat this using the OVER Clause.....


Code to Create this table:

--CREATE TABLE
IF OBJECT_ID('tempdb.dbo.BugCounts') IS NOT NULL
    DROP TABLE tempdb.dbo.BugCounts;
GO

CREATE TABLE tempdb.dbo.BugCounts
(
    empid INT NOT NULL,
    workDay SMALLDATETIME NOT NULL,
    bugsFixed INT NOT NULL,
    PRIMARY KEY(empid, workDay)
);

--POPULATE TABLE: create 10K data points
DECLARE @newn AS INT;

DECLARE C CURSOR FAST_FORWARD FOR
    SELECT TOP 10000 n
    FROM numbers_tbl;

OPEN C;
FETCH NEXT FROM C INTO @newn;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- derive the day offset from n so that (empid, workDay) stays unique for the PRIMARY KEY;
    -- bugsFixed is still random
    INSERT INTO tempdb.dbo.BugCounts
    VALUES(@newn % 25, DATEADD(day, -(@newn / 25), GETDATE()), CAST(RAND() * 100 AS INTEGER));
    FETCH NEXT FROM C INTO @newn;
END
CLOSE C;
DEALLOCATE C;



Cursor Solution:

DECLARE @Result TABLE
(
    empid INT,
    workDay SMALLDATETIME,
    bugsFixed INT,
    runbugsFixed INT
);

DECLARE
    @empid AS INT, @prvempid AS INT, @workDay AS SMALLDATETIME,
    @bugsFixed AS INT, @runbugsFixed AS INT;

DECLARE C CURSOR FAST_FORWARD FOR
    SELECT empid, workDay, bugsFixed
    FROM tempdb.dbo.BugCounts
    ORDER BY empid, workDay;

OPEN C;

FETCH NEXT FROM C INTO @empid, @workDay, @bugsFixed;
SELECT @prvempid = @empid, @runbugsFixed = 0;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- reset the running total when we move on to the next employee
    IF @empid <> @prvempid
        SELECT @prvempid = @empid, @runbugsFixed = 0;

    SET @runbugsFixed = @runbugsFixed + @bugsFixed;

    INSERT INTO @Result
    VALUES(@empid, @workDay, @bugsFixed, @runbugsFixed);

    FETCH NEXT FROM C INTO @empid, @workDay, @bugsFixed;
END

CLOSE C;
DEALLOCATE C;

SELECT *
FROM @Result
ORDER BY empid, workDay;

Improving SQL Server Performance

True or False? SQL Server produces execution plans that minimize overall resource use to improve system wide performance.

False. SQL Server minimizes the time it takes to return results to the client; if multiple CPUs are available, it will run parts of the query in parallel even though a single-CPU plan would minimize overall resource use.

When dealing with database performance problems, many database professionals will look at a variety of metrics: queue sizes, cache hit ratios, etc. However, to the people using your database the only metric that matters is how long it takes for the database system to return results. System performance metrics are very different from the user's perceived performance, and at the end of the day only the latter matters.

The query optimizer builds execution plans with this strategy in mind, and Solid Quality Learning's query tuning methodology revolves around the same concept. Taking a top-down approach lets you spend your time fixing the worst bottlenecks.

Methodology:

  1. Analyze waits at the instance level
  2. Correlate waits with queues
  3. Determine a course of action
  4. Drill down to the database/file level
  5. Drill down to the process level
  6. Tune indexes and queries

So go: SELECT * FROM sys.dm_os_wait_stats
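
If you want something a little more digestible than SELECT *, here's a rough sketch of a starting query: the top waits by total wait time, with a few obviously idle/background wait types filtered out (the exclusion list below is just an example, not exhaustive):

SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH')
ORDER BY wait_time_ms DESC;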




More true/false cause I was bored:

Queries that never change, and that query data in tables that never changes, always produce the same result. False. Not all queries have a single distinct correct logical result, and SQL Server execution plans may change based on external factors; as operations are reordered, a different (but equally correct) result can be returned.

WHERE clauses are always evaluated after the join clauses. False. WHERE clauses are only logically evaluated after the joins; the query optimizer is free to adjust the physical evaluation of the query by filtering before the joins for increased efficiency.

Monday, May 21, 2007

When to not not use Cursors

Cursors are usually bad to use; almost always there is a more efficient way to solve your problem with a set-based solution. However, there are some cases where cursors let you create a solution that is dramatically easier to implement than the set-based equivalent. In general, you should only resort to cursors when a difficult set-based solution becomes trivial with a cursor.

A classic example: you have a set of classrooms of various sizes and a set of classes of various sizes, and you want to assign each class to a classroom while wasting the minimum amount of space.


We'll simplify this problem and leave dates and times out of it :). This problem is very difficult to solve with a set-based solution (it is possible; google Itzik). However, it's pretty trivial to solve with a cursor-based solution. Pseudocode below:

  1. Declare 2 cursors: one over the list of classrooms (let's call it RoomsCursor) sorted by increasing capacity (number of seats), and another over the list of classes (ClassesCursor) sorted by increasing number of students.
  2. Fetch the first (smallest, since you sorted ascending) class from ClassesCursor.
  3. While the fetch returned a class that still needs a classroom:
    1. Fetch the smallest unused classroom from RoomsCursor. If the room is too small, keep fetching the next smallest room until you either find one that fits or run out of rooms.
    2. If you found a big enough room, schedule the class in it and fetch the next class.
    3. If you ran out of rooms, there is no solution.
  4. Return the scheduled classes.
In this case we scan both the classrooms and the classes in order; we never back up either cursor. We schedule classes by matching classes to classrooms until we either run out of classes that need classrooms or run out of rooms to accommodate classes. The only time there is an error is when no solution exists.

This solution runs in O(N) time since we are simply stepping through the cursors; in the worst case we look at each class and each classroom once.

This will set up the problem:

USE tempdb;
GO
IF OBJECT_ID('dbo.Classes') IS NOT NULL
DROP TABLE dbo.Classes;
GO
IF OBJECT_ID('dbo.Classrooms') IS NOT NULL
DROP TABLE dbo.Classrooms;
GO

CREATE TABLE dbo.Classrooms
(
classroomid VARCHAR(10) NOT NULL PRIMARY KEY,
classSize INT NOT NULL
);

INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('C001', 2000);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('B101', 1500);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('B102', 100);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R103', 40);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R104', 40);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('B201', 1000);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R202', 100);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R203', 50);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('B301', 600);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R302', 55);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R303', 55);

CREATE TABLE dbo.Classes
(
classid INT NOT NULL PRIMARY KEY,
eventdesc VARCHAR(25) NOT NULL,
attendees INT NOT NULL
);

INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(1, 'Mikes Adv T-SQL Seminar', 193);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(2, 'CIQ .NET Pages', 51);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(3, 'How to Break the DB', 232);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(4, 'XAML ROCKS', 89);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(5, 'CIQ Security Issues', 897);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(6, 'Data Modeling 101', 46);
GO

CREATE INDEX idx_att_eid_edesc
ON dbo.Classes(attendees, classid, eventdesc);
CREATE INDEX idx_classSize_rid
ON dbo.Classrooms(classSize, classroomid);
GO
Cursor Solution:

DECLARE
@classroomid AS VARCHAR(10), @classSize AS INT,
@classid AS INT, @attendees AS INT;

DECLARE @Result TABLE(classroomid VARCHAR(10), classid INT);

DECLARE CClassrooms CURSOR FAST_FORWARD FOR
SELECT classroomid, classSize FROM dbo.Classrooms
ORDER BY classSize, classroomid;
DECLARE CClasses CURSOR FAST_FORWARD FOR
SELECT classid, attendees FROM dbo.Classes
ORDER BY attendees, classid;

OPEN CClassrooms;
OPEN CClasses;

FETCH NEXT FROM CClasses INTO @classid, @attendees;
WHILE @@FETCH_STATUS = 0
BEGIN
FETCH NEXT FROM CClassrooms INTO @classroomid, @classSize;

WHILE @@FETCH_STATUS = 0 AND @classSize < @attendees
FETCH NEXT FROM CClassrooms INTO @classroomid, @classSize;

IF @@FETCH_STATUS = 0
INSERT INTO @Result(classroomid, classid) VALUES(@classroomid, @classid);
ELSE
BEGIN
RAISERROR('Not enough Classrooms for Classes.', 16, 1);
BREAK;
END

FETCH NEXT FROM CClasses INTO @classid, @attendees;
END

CLOSE CClassrooms;
CLOSE CClasses;

DEALLOCATE CClassrooms;
DEALLOCATE CClasses;

SELECT classroomid, classid FROM @Result;
GO

Thursday, May 10, 2007

What I Learned at Devscovery

  • We love XAML
  • We love LINQ
  • ASP.NET AJAX is sooo 2 weeks ago
  • If you aren't using Silverlight, you're not cutting edge
  • Jeffrey Richter knows more than you do about programming
  • If your website/presentation/resume doesn't have graphics with reflections, then it's not cool
  • Everything/every technology should be asynchronous
  • Jeff Prosise loves Microsoft technology more than you do
  • People named Jeff are smart
  • WCF makes me fall asleep...or I passed out from XML spec overload
  • Read Ingo Rammer's blog: http://blogs.thinktecture.com/
  • Make a macro for !do
  • OutOfMemoryException occurs when we can't allocate space in the large object heap
  • !dumpheap is the most important command, except that if you forget -stat you might have just given yourself a 10-minute coffee break
  • !dumpheap -min 1mb...figure out why we get OutOfMemoryException
  • Memory resurrection is a 'dark corner' of the garbage collector...aka necromancy
  • The garbage collector is magical
  • No one knows what an async pinned handle is
  • !traverseheap dumps the heap to XML - read it in with CLR Profiler
  • 'Don't try to outthink Microsoft'
  • Study the Rotor code
  • Never call GC.Collect()...it is self-tuning
  • Finalization is evil, unless you want performance consultants at your office
  • There are a ton of unannounced features in Orcas
  • Buy .NET Memory Profiler

Wednesday, May 09, 2007

Tips to Prevent SQL Injection

The following are examples of steps you can take to get some level of protection from SQL injection; however, any situation where you are generating dynamic SQL can leave you vulnerable to clever hackers.

To reduce the surface area for attack, do not enable functionality that isn't required, such as the SQL Server Agent service or xp_cmdshell (which allows arbitrary commands to be run on the server).
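
A quick sketch of turning xp_cmdshell off through the server configuration options (it is off by default in SQL Server 2005):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 0;
RECONFIGURE;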

Always grant minimal permissions to the executing user in order to limit their options and reduce your exposure. In SQL Server 2005 you can impersonate users, so the new credentials even apply to code invoked dynamically at the server. This feature opens up a whole new set of security concerns: dynamic SQL can now run under impersonated credentials without requiring direct permissions from the user executing the stored procedure.

Inspect user input thoroughly. If free-form characters are allowed, use pattern matching to check whether SQL injection constructs (such as a single quote, two dashes, sp_, xp_, UNION, etc.) exist in the input.

Always limit the lengths of inputs when possible; this helps reduce the hacker's ability to damage your system. Email address fields shouldn't be thousands of characters long.

Use stored procedures! Stored procedures encapsulate access to the database, type-check their inputs, and let you grant only the specific permissions needed to execute them.

In general, dynamic SQL is always dangerous since the user's input can end up being executed. If possible it's safer to use static code, as long as attention is paid to security issues. There are several tricks you can use to avoid dynamic SQL, such as using functions to parse input and invoke static code. Static code will also give you a performance edge, since ad hoc dynamic SQL tends to generate a new execution plan for each distinct input while static code can reuse plans.
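
If you do need dynamic SQL, keep it parameterized. Here's a sketch using sp_executesql; the table and columns (dbo.Users, email, userid, name) are hypothetical:

-- the input value never becomes part of the SQL text, so it can't change the query's shape
DECLARE @email NVARCHAR(256), @sql NVARCHAR(500);
SET @email = N'someone@example.com';  -- value supplied by the application
SET @sql = N'SELECT userid, name FROM dbo.Users WHERE email = @email;';
EXEC sp_executesql @sql, N'@email NVARCHAR(256)', @email = @email;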

If you need to accept quotes in your input (such as text inputs for a blog or something), a reasonably safe way to prevent SQL injection is to simply replace CHAR(39) with CHAR(39)+CHAR(39); this makes it much harder for the hacker to escape the string. Dynamic SQL can be very powerful, but misuse and/or abuse can produce inefficient code that may open your database to attack.
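
A minimal sketch of the CHAR(39) doubling trick (the table dbo.Users and the column lastname are made up for the example):

DECLARE @lastname NVARCHAR(100), @sql NVARCHAR(400);
SET @lastname = N'O''Brien''; DROP TABLE dbo.Users; --';   -- hostile input
SET @lastname = REPLACE(@lastname, CHAR(39), CHAR(39) + CHAR(39));
SET @sql = N'SELECT userid FROM dbo.Users WHERE lastname = N''' + @lastname + N''';';
EXEC sp_executesql @sql;   -- the embedded quotes can no longer break out of the string literal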

Tuesday, May 08, 2007

Correctly Setting Up tempdb

SQL Server stores data in tempdb for many activities that happen behind the scenes, such as spooling data for queries, sorting, and row versioning, as well as holding temporary tables and table variables. Since this data is physically materialized on disk, tempdb becomes an obvious bottleneck and deserves special consideration.

Any heavily used system should have tempdb on its own disk array, separate from where the user databases are located. Obviously you'll want to use as many spindles as possible, striped in RAID 10 (RAID 1 can be used for the log).

Every time SQL Server is restarted, tempdb is recreated and its size reverts to the effective defined size, which defaults to 8MB. Since this is too small for most workloads, it will grow 10% at a time, creating small growth increments that tend to fragment the file system. Because processes have to wait around while the file grows, it is suggested that you set the size appropriately for your database and workload.

To determine an appropriate size, observe the size of tempdb while the system is under load. You can then alter the SIZE parameter so that the effective defined size is more appropriate. Once this is set you won't have to worry about autogrowth until tempdb actually fills up, which ideally would only happen during irregular activity.
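
A sketch of presizing tempdb, assuming the default logical file names tempdev and templog; the sizes below are placeholders, not recommendations - pick them from what you observed under load:

ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 2048MB, FILEGROWTH = 256MB);
ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, SIZE = 512MB, FILEGROWTH = 128MB);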

Monday, May 07, 2007

Temporary Tables, Table Variables, Table Expressions - Part 3/4

Table Expressions
In addition to physical temp tables we also have logical temporary tables, which are merely virtual materializations of interim sets (as opposed to the physical materialization used by temporary tables and table variables). Table expressions, which include derived tables, common table expressions (CTEs), views, and inline table-valued UDFs, give you this capability. This article discusses situations where you may or may not want to use table expressions.

In general, table expressions should be used when you need the temporary object only for simplification. They are also a good fit when you only need to access the temporary data once, or a limited number of times, and do not need to index the interim results. When you actually run a query with a table expression, the query optimizer merges the underlying query with the outer one and queries the underlying tables directly.

Other than simplification, there are some cases where you'll want to use table expressions to improve performance - in these cases the optimizer might generate a better plan for your query than it would for the alternatives.

In terms of scope and visibility, derived tables and CTEs are available only to the current statement, while views and inline UDFs are available globally.
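
A minimal sketch of a CTE used purely for simplification; the table dbo.Orders and its columns are made up for the example:

-- the interim set OrderTotals is never materialized: the optimizer expands it
-- into the outer query and reads dbo.Orders directly
WITH OrderTotals AS
(
    SELECT custid, SUM(qty) AS totalqty
    FROM dbo.Orders
    GROUP BY custid
)
SELECT custid, totalqty
FROM OrderTotals
WHERE totalqty > 100;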

Temporary Tables (#Tables)
Table Variables (@Tables)
Table Expressions (CTEs)

Sunday, May 06, 2007

Temporary Tables, Table Variables, Table Expressions - Part 2/4


Table Variables

Many would agree that table variables are among the least understood T-SQL elements. There are many myths associated with their use, such as table variables being purely memory-resident (having no physical representation) and therefore always preferable to temporary tables. I'd like to explain why these myths are unfounded and describe some situations where you might or might not want to use table variables.

Limitations
Unlike permanent and temporary tables, table variables come with certain limitations (see the sketch after this list).

  • You cannot create explicit indexes on table variables (no CREATE INDEX), only PRIMARY KEY and UNIQUE constraints, which create unique indexes behind the scenes; in fact, you cannot create non-unique indexes at all.
  • You cannot change the table definition once it is declared.
  • You cannot issue a SELECT INTO statement against a table variable (however, you can use INSERT ... EXEC).
  • You cannot qualify a column name with a table variable name.
  • In queries that modify table variables, parallel plans will not be used.
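
A small sketch of what the declaration rules allow: constraints are fine inline, but there is no ALTER and no CREATE INDEX against @T afterwards.

DECLARE @T TABLE
(
    id  INT NOT NULL PRIMARY KEY,   -- the constraint creates a unique index implicitly
    val VARCHAR(50) NOT NULL
);
INSERT INTO @T(id, val) VALUES(1, 'a');   -- INSERT (and INSERT ... EXEC) work; SELECT INTO @T does not
SELECT id, val FROM @T;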

tempdb
Despite popular belief, table variables do have physical representation in tempdb, similar to temporary tables.

Scope and Visibility
The scope of a table variable is very well defined, and it is the same as for any other variable: the current level, and within the current batch only. You cannot access a table variable from inner levels of the call stack, nor even from other batches at the same level. This limited scope will likely be an important deciding factor when choosing between temporary tables and table variables.

Transaction Context
Unlike a temporary table, a table variable is not part of the outer transaction; rather, its transaction scope is limited to the statement level in order to support statement rollback. When you modify a table variable and the statement is aborted, that particular statement's changes are undone. However, if the outer transaction is rolled back after the statement has finished, the changes to the table variable are not undone. Table variables are pretty unique in this respect and we can use this property to our advantage. Because table variables require less locking and logging, there are obvious performance benefits.
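
A quick sketch of that statement-level transaction scope:

DECLARE @T TABLE (id INT NOT NULL);
BEGIN TRAN;
    INSERT INTO @T(id) VALUES(1);
ROLLBACK TRAN;
SELECT id FROM @T;   -- still returns the row: the rollback did not touch the table variable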

Statistics
The main factor in choosing whether or not to use table variables is that the query optimizer does not create distribution statistics or maintain accurate cardinality information for them. Queries against a table variable may therefore get an inefficient plan, which is obviously a big problem when you work with larger tables. The upside is that you avoid the overhead of calculating those statistics and of the recompilations they trigger.

In the next article I'll discuss table expressions, how they work and when you'll want to use these instead of temporary tables and table variables.

Temporary Tables (#Tables)
Table Variables (@Tables)
Table Expressions (CTEs)


Saturday, May 05, 2007

Temporary Tables, Table Variables, Table Expressions - Part 1/4

Temporary Tables, Table Variables, Common Table Expressions - Part 1.

In this series I'm going to clear up some of the misconceptions and confusion surrounding the temporary data structures that are commonly used to materialize data temporarily. Temporary tables and table variables are often 'abused' due to a lack of knowledge about efficient set-based programming (myself included). Hopefully I can provide a better understanding of how these temporary structures behave and in which circumstances you should use each.

Local Temporary Tables
Temporary tables are manipulated in exactly the same way as permanent tables; however, temp tables are created in tempdb regardless of your session's database context (unless they are small enough and SQL Server has enough free memory, in which case their pages stay in cache - but don't count on it).

Remember that tempdb's recovery model is SIMPLE and cannot be changed; this means that all bulk operations involving temporary tables are always minimally logged - there is no recovery process for tempdb. One reason to use temporary tables is to take load off the user database when you need persistent temporary data; we can take advantage of the fact that tempdb is treated differently than user databases.

Scope and Visibility
A temporary table is owned by the creating session and is visible only to it. The scope of a temp table is limited to the session, so other sessions may create temp tables using the same name. Because of this, SQL Server generates its own internal names for the entries in the system tables (i.e. sys.objects), and these names won't correlate directly with the name you assigned.

Within the session, the temp table is visible to the creating level in the call stack as well as the inner levels, but not the outer levels. If you create a temp table in the outermost level, it's available everywhere within the session, across batches and within the inner levels; as long as you don't close the connection, you'll have access to the temp table. This can be really useful when you want to pass information to inner levels that don't have input parameters, such as triggers. When the creating level goes out of scope, the temporary table is automatically destroyed. The scope and visibility of temporary tables are much different from those of table variables and common table expressions, and will likely influence your choice of one object over the other.
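
A rough sketch of that session scope (the names are made up): the table created at the outer level is visible to an inner level - here a dynamic batch - on the same session.

CREATE TABLE #Scratch (n INT NOT NULL);
INSERT INTO #Scratch(n) VALUES(42);
EXEC('SELECT n FROM #Scratch;');   -- inner level, same session: works
-- another session selecting from #Scratch would get "Invalid object name '#Scratch'"
DROP TABLE #Scratch;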

Transaction Context
Temporary tables are likely to be used in transactions, and they obviously behave differently than permanent tables in terms of logging and locking. Remember again that tempdb has no recovery process, so there is minimal logging - only enough to ensure that transactions can be rolled back (but not rolled forward). And unlike permanent tables, temporary tables can only be accessed by the creating session, so there is substantially less locking involved.

Statistics
Unlike with table variables, the query optimizer creates and maintains distribution statistics for temporary tables in order to keep track of their cardinality, just as it does for permanent tables. This information is used to estimate selectivity and choose optimized plans. To keep the statistics accurate, SQL Server must recompile the relevant plans when the recompilation threshold is reached (determining the recompilation threshold for temp tables is a whole different article). This property will likely affect your choice of temporary data structure: if you are planning to scan all of your data anyway, you might not want to accept the overhead involved in keeping these statistics.

Temporary Tables or Table Variables?
To determine which structure to use, you must answer the following questions: 1. Does the optimizer need distribution statistics or accurate cardinality estimates to generate an efficient plan, and if so, what is the cost of an inefficient plan when statistics are not available? 2. What is the cost of the recompilations if you do use temporary tables?

If the table is tiny (only a couple of pages), the alternatives are to either 1) use a table variable, resulting in complete scans and few or no recompilations, or 2) use a temporary table, resulting in index seeks and more recompilations. The advantage of seeks over scans may be outweighed by the cost of the recompiles, or vice versa. I'll talk about table variables more in my next article.

Part 1: Temporary Tables (#Tables)
Part 2: Table Variables (@Tables)
Part 3: Table Expressions (CTEs)

Thursday, May 03, 2007

Prevent Data Loss when Changing Datatypes

Understanding data types in SQL Server is required for anyone who is concerned with their database's functionality and/or performance (pick your metric). Understanding the differences and properties of these datatypes is useful for anyone who works with the database: DBA, data modeler, or developer. The time it takes to learn about the datatypes in depth (down to the internals) is time well spent.

Choosing SMALLDATETIME over DATETIME can introduce error because values are rounded to the nearest minute, while choosing DATETIME over SMALLDATETIME doubles the size requirement for that column from 4 to 8 bytes (if you think 4 bytes per row is trivial, you've never worked on a large db :p).

Likewise, schema changes to production databases must be well thought out, and the datatypes involved must be understood, in order to prevent data loss.

A simple way to prevent data loss is to test your proposed schema change on a small temp table first. Turning on the STATISTICS IO option for the session lets you see the I/O generated by the change. If no I/O is reported, the change didn't touch the base data and the operation will be fast; if your change doesn't require physical access to the base data, you can be reassured that you aren't losing any of your valuable (?) data.
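
A sketch of that test (the table and column names are made up): widening a VARCHAR should be a metadata-only change, so no reads against the base data should be reported for the ALTER.

SET STATISTICS IO ON;

CREATE TABLE #TypeChangeTest (val VARCHAR(10) NOT NULL);
INSERT INTO #TypeChangeTest(val) VALUES('12345');

ALTER TABLE #TypeChangeTest ALTER COLUMN val VARCHAR(20) NOT NULL;   -- expect no reads reported

DROP TABLE #TypeChangeTest;
SET STATISTICS IO OFF;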