Saturday, June 27, 2009

Split Content of a Table Evenly on Files with Same Number of Rows

This is the second part of series of 2 posts dealing with data export to flat files. In the first part I showed an approach to solve this situation: How could the content of a source table be split evenly on a given number of files?
In this post, I will show how you could follow a similar approach when trying to split the content of a table into a number of files, but where each file has to have the same number of rows. This is actually a real problem someone posted in the SSIS forum:
I have this table, with about 8 million rows in SQL Server, and I would like to split it into four different text files, with about 2 million rows per file. I'm thinking conditional split, but I do not know how I can base the export on how many rows that is processed. Can anyone help me...?

Since in this scenario the data is already in a table, we will follow a similar approach than in the previous post. This time we will use the ROW_NUMBER() T-SQL function to assign enumerate all rows in the source table. From books on line describes ROW_NUMBER() function as:
“Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.”

Then, if we want to place 1000 rows on each batch/file, the source query will look like:
With CTE_DimCustomer_MaximumRowsPerPartition
(GroupRowNumber
,[CustomerKey]
,[GeographyKey]
--
)
As
(
Select ((ROW_NUMBER() over(order by CustomerKey)-1)/1000)+1,
[CustomerKey]
,[GeographyKey]
--
from dbo.DimCustomer
)
Select * from CTE_DimCustomer_MaximumRowsPerPartition
And as in the previous post, we use the power of SSIS expressions to parameterize this query. The expression references a second variable called ‘MaximumRowsPerFile’. Is this variable the one you would change at run time to accommodate a different number of rows per file.

Now, we still have an issue to solve. We have to tell the package how many files are going to be generated, and since the number of files would depend on the number for rows per file we choose, we need to calculate it upfront. The way I did I was using an execute sql task to:

-- Query via expression
-- Find out how many group of rows (files) are
-- required based on the maximum numbers of
-- rows each group (file) should have

SELECT ceiling(cast(Count(*) as decimal)/1000) As NumberOfFiles
FROM [dbo].[DimCustomer]AdventureWorksDW



This is how the control flow and the variables inside of the package look like:




You can download the sample package here (You would need SSIS 2005 and AdwentureWorksDW DB):

Split Content of a Table Evenly on a Given Number of Files

Exporting data to flat files using SSIS is certainly straight forward. But from time to time we may find more challenging requirements. For instance, take these 2 scenarios:

1. How could the content of a source table be split evenly on a given number of files?

Or perhaps,

2. How could SSIS be used to export the content of a table to a number of files where the number of rows per file cannot exceed certain number?

In this post we will see how you could use the power of SQL language, and SSIS of course, to help solving the scenario 1 described above. I will try to address the scenario 2 in a later post.
In this example, we will use AdventureWorksDW sample DB, that you can download it from CodePlex.

We will use, NTILE() function, a so called anlytic function introduced in SQL Server 2005. Looking into books online, one can read:

“Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.”

So, it sounds like we could use this function to add a new column to break the source record set into the number of groups we need. Let’s start with a query like this:
USE [AdventureWorksDW]
GO
With CTE_DimCustomerNtile (GroupRowNumb
,[CustomerKey]
,[GeographyKey]
--
)
As
(Select Ntile(5) over(order by CustomerKey)
,[CustomerKey]
,[GeographyKey]
--
from dbo.DimCustomer
)
Select * from CTE_DimCustomerNtile

The query above adds an extra column to the data set that indicates the group number to which each row belongs to, in this example that is a column with values from 1 to 5. This is because we use ‘5’ as argument of the NTILE() function.

But, you don’t always want to split the data into 5 groups, right? well, that is the part that we will solve with SSIS. In SSIS we can place the query in a variable and use an expression. By doing so, we open the door to a number of posibilities to have a different number of groups (or files to be generated) each time the package runs. In this example, we will have a second variable ‘TotalNumberOfFiles’ that will dictate how many files are going to be generated.

Here is a screenshot of the variables and the control flow inside of the package I created:




Now, few key things:
• The source query is executed once per file to be generated
• The source query sets the where clause dynamically (via expression) to pull rows from a single group on each iteration
• The flat file connection manager used by the destination also uses an expression to create the files with different names. The convention I used is: DestinationFile_X_Of_Y; where ‘X’ is the current iteration/group and ‘Y’ is the total number of files to be generated.
• You can use the /SET option of DTExec or a package configuration to change the value of ‘TotalNumberOfFiles’ variable at run time so the total number of files is dynamic – This is not included in this example, so that is your homework.
• The source table does not have to be a SQL Server table, the only requisite is that the source data live in a DB where the NTile function is available (and that is the case of most DB vendors)

Here are some caveats:
• While the query only retrieves the rows within a single group on each iteration, the DB engine still has to deal with the full rowset each time. If you are dealing with very large tables, you should spend some time looking at the query execution plan, and perhaps getting some input from your DBA. Remember this query is executed once per file!
• This approach forces you to have the source data in a table, which may not be always the case. If that is you, then you could probably create a table to stage the data first, or perhaps use an alternative approach (see below)
• Since we used a for loop container, then package only writes into one file at the time. A more efficient approach would be to write into multiple files at the same time. Unfortunately, that is not trivial in SSIS, given the total number of files is known only at run time.

So, is this the only way? Certainly, it is not. I found other blog post that tackles the issue in different ways:
Jamie Thomson has example using conditional splits and script components


I hope you find this example helpful, and if you think this can be done in a different way, please let me know.

You can download the sample package (created in SSIS 2005) here:



Stay tune for the second part where we look at the scenario 2 described at the begining of the post.

Friday, May 08, 2009

SSIS: ForEach Loop - Looping through multiple file types

It has been a while since my last post, and that is because things at home have been keeping busy after Diego, my 3rd child, was born few weeks ago. The good thing is that I finally manage to squeeze some time to write this quick post.

The ForEachLoop container is very functional when trying to process multiple files of the same type. You can easily a ForEach file enumerator and use a wildcard to, let’s say, process all files with a TXT extension by entering “*.exe” in “Files” property of the task like this:



But what if you need to specify more than one file extension? Something like: “*.txt, *.csv” will not work inside of the ForEach container. However, you could still accomplish this goal by looping through all files ( Files=*.* ) and then using a precedent constraint to evaluate whether the file has to be processed or not – Similar to this approach -.

Here is how you configure the precedence constraint:








Here is the expression I used in the precedence constraint:
UPPER(SUBSTRING(@FileName,FINDSTRING(@[FileName],".",1)+1,LEN(@FileName)))=="TXT" || UPPER(SUBSTRING(@FileName,FINDSTRING(@[FileName],".",1)+1,LEN(@FileName)))=="CSV"

Few things to note:
1. In this example, the @FileName Variable only stores the file name (not the path). If you store the path, then you have to modify the expression little bit to account for the path part.
2. The sequence container is empty. Its only function is to serve as placeholder to create the precedence constraint.
3. The ForEach loop container is looping through all files, so make sure you test this approach before implementing it against folders with too many files. Honestly, I don’t think there would be significant performance degradation.

Please let me know if you find this sample helpful, or feel free to share an alternative approach.

Wednesday, January 07, 2009

Don't miss this out: SQL Server 2008 event in Charlotte

A fellow MVP and co-worker John Welch and I have been working with our local PASS chapter and other SQL MPVs from the region to host a SQL Server 2008 Firestarter event here in Charlotte on January 15.
This is a half day event with 6 technical sessions focused on SQL Server 2008 new features, loaded with great speakers and lot of prizes. You can read more and get registered here:

http://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032400468&EventCategory=1&culture=en-US&CountryCode=US


I hope I can see you there!

Wednesday, November 19, 2008

SSIS articles, videos and samples

Douglas Laudenschlager from the SSIS documentation team has a great post on his blog with links to a number of great SSIS articles, videos and sample code; that I think it is worth to check out.

You can find his post here.

I have updated my SSIS Help post to include a link to his blog.

Enjoy it!

Sunday, October 19, 2008

SSIS articles in MSDN

Several weeks ago I wrote an article about SSIS configurations that got published in MSDN: Understanding Integration Services Package Configurations. The article covers all available types of package configurations, its differences and some other facts that would help you on making the best choice when adding configurations to you packages. Mine is just one in a set of articles that a group of SQL Server MVPs put together to concur with the launch of SQL Server 2008.

You can find all articles here: http://msdn.microsoft.com/en-us/library/cc872864.aspx

I hope you find them helpful!

Tuesday, October 14, 2008

SSIS: Compress Files Using Execute Process Task

From time to time people ask how you could compress or extract compressed files in SSIS. While SSIS does not have any built-in taks to do that, you can certainly use execute process task to run a command line using the utility file archiver software of your choice. In the example below, I am using a 7-zip command line to compress a file. 7-zip is open source software (yes, it is free!); but you could use any other tool that provides support for command line.


In this example, a file c:\temp\test.xlsx is added to a file c:\temp\test.zip


Notice that the excutable path/file provided in Executable property must exist in any server runing the package (no just in your development environment). Also, remember that most properties of the task can be affected via expressions, that would come handy for changing the path and or file names at run time.

The argument property must be changed to match the proper syntax, depending on your tool of choice.

Wednesday, October 08, 2008

Speaking at SQL Saturday in Greenville, SC

The SQL Saturday in Greenville, SC is just around the corner, and if you are nearby, I would like to invite you to participate. This is a day long training event packed with excellent content and speakers (plus me :) ), and the best of all, it is absolutely free.

I will be delivering a presentation in business dimensional modeling, so I hope to see you there!

Event Info:
Date: Oct 11, 2008
Location: ECPI Greenville Campus, 1001 Keys Drive #100, Greenville, SC 29615
http://www.sqlsaturday.com/

Saturday, June 07, 2008

SQL Server 2008 RC0 Available for Download

UPDATE: SQL Server 2008 RC0 download is available to the general public now:

http://www.microsoft.com/downloads/details.aspx?FamilyId=35F53843-03F7-4ED5-8142-24A4C024CA05&displaylang=en

Yes, I just read a post in Andrew Fryer blog about that: RC0 for SQL Server 2008 is out! But for now it is available only for MSDN and Technet subscribers and other select group of users. If you have an active subscription, you may want to give it a try while is still fresh. For those without subscription you can either get one, or be little more patient until RC0 is available to the general public.

BTW, did you notice the new SQL Server logo? I find it really nice!

Monday, June 02, 2008

SSIS Suggestions: Lookup Enhancements

Given that the SSIS f2008 is pretty much a closed deal from the stand point of adding new features, and knowing that the SSIS team is already looking at the road ahead, I would like to present a list of improvements that is the product of my interaction with the user’s community and personal experience using SSIS and a competing ETL tool.

I will start with the lookup transformation, because in my opinion, it is one of the most popular/helpful/great components than a developer has in the toolbox when using an ETL tool, and SSIS is not the exception. While the SSIS 2008 version is bringing major improvements in this transformation, it is also leaving room for some more:

Un-sensitive Lookup Transformation: It would be ideal to have a property to tell the transformation whether the join should be case sensitive or not. That would save the use of extra transformations/steps to change the column values participating in the join to upper or lower case. This has already been requested at the connect site:

Case insensitive LOOKUPs and SSIS - SQL Server 2008 - Add Case INsensitive search ability to lookup component

Ability to use greater than and less than operators in join clause of lookup transformations with full or partial cache mode. Currently, the only way to use these operators in the join clause is by enabling memory restriction (non-cache mode) in the advanced properties which degrades performance to the extent of making this approach unsuitable even for medium size workloads. Since the only related item in SQL Server Connect site I could find was already closed with a “won’t fix” resolution: Range Lookup in SSIS

I went ahead and opened a new one:
SSIS lookup: Allow less than and greater than join operators with Full/partial cache support

Better handling of duplicate lookup values. Currently, if duplicate values are found while building the lookup cache, only a warning message is generated in the execution log of the package (if enabled). Having the ability to tell the lookup transformation whether to fail or not the execution when duplicate values are found would be desired. If the lookup is configured not to fail when duplicates are present, it would be great if you could control which of the duplicate values should be used, e.g.: First value, last value, any value. A similar request already posted in connect site:

SSIS Lookup: duplicate key values should raise an error

and a new one that I opened to include the extra options not mentioned in the previous one: SSIS Lookup: Duplicate values handling

Ability to append data to Cache file. In SQL Server 2008 you have the ability of building a cache file to be used by a lookup transformation and even reuse-it on subsequent executions of the package. Unfortunately, you cannot append data once the file has been created. The the only option would be to recreate the file. BTW, John Welch has posted a work around on his blog that you may want to check out. Suggestion posted in Connect site: SSIS: Dynamic lookup cache


If you think these enhancements are worth it (or not), I would like to ask you to click in the links above to rate them and add your comments. If you think I missed an important one, just go ahead and open a new suggestion in the connect site and if you would like so, I will include a link to it on this post.


Saturday, May 17, 2008

T-SQL: LEAD() and LAG() functions

…not for now, but if you want to see a workaround keep reading.

You may already know that there are justa few functions in SQL Server 2005 and 2008 that support the OVER clause, and sadly Lead() and Lag() are some of the missing ones.

If you have always written queries in SQL Server, then you may have not missed such type of functions. But, for me it is a big deal as I have used them extensively in the past (when working on non-SQL Server databases obviously) to profile data, or to test the results of the ETL process.

Today, I had to write some queries to test that a data transformation processppp;l; that someone else wrote, was calculating the end date of each row correctly in a type 2 dimension. When you model a dimension as slowly changing dimension type 2, the current row must be expired when a new version arrives. In my scenario, the logic dictates that the end dates should be calculated using the start date of the next version (subtracting a day), and that implies comparing the values of 2 columns from adjacent rows.

So, how do you compare column values from adjacent rows? There are several ways, but you would always end up with self joins and/or sub queries. The way I solved this issue today was via self join embedded in a common table expression and using the Row_Number() as part of the join clause:

First, let's create some sample data:

create table LEADTEST (ID int, START_DATE datetime, END_DATE datetime)

insert into LEADTEST values (100, '2008-05-20', NULL )

insert into LEADTEST values (100, '2008-04-10', '2008-05-19')
insert into LEADTEST values (100, '2008-01-05', '2008-04-09')
insert into LEADTEST values (200, '2008-03-14', NULL)
insert into LEADTEST values (300, '2008-06-18', NULL)
insert into LEADTEST values (300, '2008-05-11', '2008-04-02')
insert into LEADTEST values (300, '2008-04-03', '2008-04-02')
insert into LEADTEST values (300, '2008-03-07', '2008-04-02')

insert into LEADTEST values (400, '2008-05-20', NULL )

Now, I need a query where I can see, in the same row, the END_DATE of that row and the START_DATE of the next one (order by START_DATE ASC). The trick is to use the Row_Number function that partitions the data set by ID, and then use that resulting column in a self-join:

WITH MyCTE (ID, START_DATE, END_DATE, RowVersion)
AS(
SELECT

ID
,START_DATE
,END_DATE
,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY START_DATE ASC) RowVersion
FROM LeadTest
)
SELECT BASE.ID
,BASE.START_DATE
,BASE.END_DATE

,LEAD.START_DATE LEAD_START_DATE
,DATEADD(dd,-1,LEAD.START_DATE) EXPECTED_END_DATE
FROM MyCTE BASE
LEFT JOIN MyCTE LEAD ON BASE.ID = LEAD.ID
AND BASE.RowVersion = LEAD.RowVersion-1


To simulate the effect of the LEAD() or LAG() function you just need to change the last line to add or subtract the number of preceding/succeeding rows you want to access (this example uses LEAD.RowVersion-1).
The figure below shows the resulting data. See how the LEAD_START_DATE shows the START_DATE value of the following row.


BTW, there are already several requests open at the SQL Server connect site to enhance the OVER clause and to add more of these window functions like LEAD and LAG in SQL Server. If you think that is worth it, click in the links below and add your vote and comment:

ORDER BY for aggregates:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254387

LAG and LEAD functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254388

TOP OVER:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254390

Vector expressions for calculations based on OVER clause:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254391

ROWS and RANGE window sub-clauses:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254392

DISTINCT clause for aggregate functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254393

FIRST_VALUE, LAST_VALUE functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254395

Progressive ordered calculations:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254397

Thursday, April 24, 2008

Blog Traffic Statistics

In a recent conversation with a few fellow SSIS bloggers, we said it would be cool if we could share some traffic stats of our blogs, and perhaps compare what are our most visited posts. I thought that would be interesting, but I had a problem: my blog is hosted by Blogger, and as far as I knew, Blogger did not offer any sort of stats. Then, I decided to do a quick search, and Google Analytics came across. After getting some feedback from my friend Phil Brammer about the service, I decided to give it a try, and in less than 10 minutes I have created my Analytics account and my blog traffic being tracked. Today I reviewed the activity and I was amazed for the friendlier graphs and reports I got [1]. Here are few screen shots after 2 days of tracking:





However, it is worthy to point out that Microsoft is working in a service offer called Microsoft adCenter Analytics Registration that, to the best of my understanding, could be a major competitor of Google Analytics. If you want to give a try to the Miccrosoft offer, you have to fill out a registration and wait for a response (you can do it here), as it is till in Beta. I filled out mine and am waiting, I will report back when I get an answer.

*Disclaimer: I am a rookie blogger, have no idea about web traffic metrics or so, and get amazed very easy :)