Tuesday, April 2, 2013

Task 21. Data Processing Techniques

Ever wondered how to process millions of rows daily?
SQL Server 2012 comes with new T-SQL functions and index enhancements that help solve high-volume data processing with techniques that are kind to the server.

The main problems relate to:
1. High Volume Data (size is relative)
2. Business Requirements (complex business logic, duration, data availability, scalability, configurability, rerunnability)

DBAs get nervous when it comes to disk space, memory usage and CPU bottlenecks. They also need to care about availability and response times, risk reduction and avoidance, and governance.

You need to know the details of the SQL Server environment: the recovery model, current utilisation, and the hardware environment (number of processors, memory, network speed, disk configuration).
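As a starting point, a couple of quick checks - a sketch using the standard sys.databases and sys.dm_os_sys_info views (the second query requires VIEW SERVER STATE permission):

SELECT name, recovery_model_desc          -- recovery model per database
FROM sys.databases;

SELECT cpu_count,                         -- logical processors
       physical_memory_kb / 1024 AS physical_memory_mb
FROM sys.dm_os_sys_info;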

Useful techniques:
1. Reliable pipe (batch processing using T-SQL)
2. Fast pipe (SSIS, BCP, BULK INSERT)
3. Divide and conquer (Balanced Data Distributor with SSIS, partitioning)

The idea of batching in T-SQL is to break the problem into smaller pieces (a minimal sketch follows this list):
1. Table lock duration is reduced
2. Fewer rows to roll back
3. Less TempDB consumption
4. Control over error handling
5. Customised visibility of progress
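A minimal sketch of the pattern, assuming a hypothetical dbo.DailySpeed table with a CaptureDateID column; each iteration keeps the transaction - and therefore lock duration, rollback size and log usage - small:

DECLARE @BatchSize INT = 50000,
        @Rows      INT = 1;

WHILE @Rows > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.DailySpeed                 -- hypothetical table name
    WHERE CaptureDateID < 20120101;     -- hypothetical retention cut-off

    SET @Rows = @@ROWCOUNT;             -- the loop stops after the last partial batch

    CHECKPOINT;                         -- under SIMPLE recovery, lets log space be reused between batches
END

The same loop shape works for batched updates or inserts.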


Batching SQL is great for:
1. Transferring data within the same instance
2. Low-priority data transfers with a long duration (see the sketch after this list)
3. Batching scripts can be converted to stored procedures - ideal for replication
4. Scenarios where TempDB size is limited
5. Keeping the transaction log generally small under the SIMPLE or BULK_LOGGED recovery model
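For the transfer case, a sketch of a key-range batched copy; the table and column names (dbo.DailySpeed_Source, dbo.DailySpeed_Archive, DailySpeedID) are hypothetical, and the WAITFOR throttles the loop so it stays low priority:

DECLARE @BatchSize INT = 100000,
        @LastID    INT = 0,
        @MaxID     INT;

SELECT @MaxID = MAX(DailySpeedID) FROM dbo.DailySpeed_Source;

WHILE @LastID < @MaxID
BEGIN
    INSERT INTO dbo.DailySpeed_Archive (DailySpeedID, LocationID, CaptureDateID, DownloadKbps, UploadKbps)
    SELECT DailySpeedID, LocationID, CaptureDateID, DownloadKbps, UploadKbps
    FROM dbo.DailySpeed_Source
    WHERE DailySpeedID >  @LastID
      AND DailySpeedID <= @LastID + @BatchSize;

    SET @LastID = @LastID + @BatchSize;

    WAITFOR DELAY '00:00:01';           -- pause between batches to stay kind to the server
END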

"Break large delete operations into chunks" is a useful article on this pattern for daily operational data processing.

There are some tricky scenarios that cannot be handled with batching, such as getting a distinct list of values across a number of columns from 100 million rows, or joining 100 million rows to another 100 million rows on nvarchar(500) columns.

SQL Server 2012 introduces columnstore indexes, which group and store the data for each column and then join all the columns to complete the whole index. They enable faster performance for common data warehousing queries such as filtering, aggregating, grouping and star-join queries.

CREATE NONCLUSTERED COLUMNSTORE INDEX [ISCS_DailySpeed_ColumnStore]
ON [dbo].[DailySpeed_ColumnStore]
(
    [LocationID],
    [CaptureDateID],
    [DownloadKbps],
    [UploadKbps]
)
WITH (DROP_EXISTING = OFF)
ON [PRIMARY]
GO
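An example of the kind of query this index helps - aggregating and grouping over the indexed columns only (the table and columns come from the CREATE INDEX statement above):

SELECT LocationID,
       AVG(DownloadKbps) AS AvgDownloadKbps,
       AVG(UploadKbps)   AS AvgUploadKbps
FROM dbo.DailySpeed_ColumnStore
GROUP BY LocationID;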

SSIS - Data Flow Task
Minimally Logged Inserts
1. The Transaction Log keeps track of extent allocations and metadata changes only (see the sketch below)
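Outside SSIS, the same minimally logged behaviour can be sketched with BULK INSERT; the target heap table dbo.DailySpeed_Staging and the file path are hypothetical, and the database needs to be in the SIMPLE or BULK_LOGGED recovery model:

BULK INSERT dbo.DailySpeed_Staging      -- hypothetical heap target table
FROM 'D:\Loads\DailySpeed.csv'          -- hypothetical file path
WITH (
    TABLOCK,                            -- required for minimal logging into a heap
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 100000            -- commits in batches so the active log stays small under SIMPLE recovery
);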