JR Kincaid: 2013

Saturday, December 28, 2013

Why SSIS is a mystic art for many .NET Devs

First a bit about my background might help. I come from a developer background. Specifically, I work as a senior web development on large ASP.NET web sites. I have worked in .NET since just before the release 2.0 (8 years) I have a Information Technology Degree with a Database focus. I largely fell into web development due to its popularity and stayed due to my investment in the platform. SSIS is such a disappointment because it doesn't fit in my world. I instead tack it onto my world with its own set of weirdness sort of like any other framework or language like XML, Regular Expressions, etc

CLR dependency without exposure of capabilities outside of a SSIS package

The SSIS platform was built for BI folks with zero development experience. There is zero carry over with .NET to SSIS. The components from what I have been able to gleam are not reusable by a vanilla .NET application. Want to parse a CSV and show the success and failure rows in a UI? Don't want to introduce a bunch of junk: SSIS package, a table to stage success rows, a table to stage failure rows. Your out of luck. I think, if makers of the black boxes that are SSIS control flow and data flows components would expose the underlying glue as classes that .NET applications could use, we would have one standard mechanism for reading from different sources across the platform. We would also not have to introduce a third type system here is how I am counting: your programming language's type system, SQL Server, SSIS DT_* types. Please correct me if I am wrong. I love to hear a bout it.

SSIS uses Windows Workflow well kinda

If Microsoft had such an investment in Windows Workflow (WF), why did they make BIDS now SSDT. It would of been a good story to make SSIS Catalog on SQL Server as one way to host WF with an SSIS being a implementation of WF Framework. Open source it to show how it can be done. Also if you want to use say a Flat file data source in your WF package but outside of SSIS, why not?

Conclusion

I know this is kind of a rant. I know there are more low level ways to do most of what BIDS now SSDT does in .NET. I think the correlation between the two technologies could be cleaner and more transparent. I get BI Developers just want to shell in occasionally to .NET with a Script task. I think it would increase SSIS usage if there was a clear platform independence for some features of SSIS. Need to shred a Excel document in SSIS or your backend .NET application. Let's have a way to pull away the GUI and interact with the Excel document shredder directly and consistently in both applications.

Wednesday, September 25, 2013

Designing an OLTP database by Understanding OLAP Dimension Attribute Types

Introduction

First I may be throwing around some acronyms that are foreign for new comers or for folks that do not focus heavily on database design. Lets go ahead and get those defined to get everyone on the same page.

ETL (Extract Transform Load)- This is a process by which data is ingested from some source system to some staging or target database.This can happen over an Service Oriented Architectrue (SOA) or use flat files such as XML, CSV, Excel spreadsheet, etc.
OLTP (Online Transaction Processing database- If you are programming against a database for an application and do not know it type, you are using an OLTP. Database normalization is usually important. The primary use is for data storage for operations based systems such as Content Management Systems, Entity Resource Planning, User facing systems etc.
OLAP (Online Analytical Processing) database- Used for reporting, business intelligence and decision support systems (DSS). De-normilazation is usually preferred as data accumulates instead of changes to existing data. More on this later.

This article is great not only for people with downstream OLAP database but for learning an advance database concept for any OLTP that stores historical data. That is any system that serves as a system of record.

Using Type 2 Attribute Changes in OLTP

In OLAP database there are two types of what they term as 'slow changes' to any data attribute. Type 1 involves retroactively changing data that effect how historical data is reported retroactively. Imagine you change the State in a Home Address record for a given user. If you store one Home Address for a user there is no way to account for that change over time if the User Table has a foreign key field for Home Address without further data modeling.

Junction Tables jumble

We could use a junction table with a surrogate key. So for our ongoing example, the UserAddress Table could would have three fields at a minimum: UserId, AddressId, and UserAddressId. The first two fields would be a candidate key with foreign key relationships back to the User and Address tables respectively. We now have to decide whether we allowing only Type 1 attribute changes to the Address table or certain fields in the Address table or if we are allowing Type 2 changes. That is maybe its ok if the zip changes to use the same record (thus allow type 1 change) but changing any other attribute like street, state or city requires a new record. In the case where you ONLY allowing Type 1 changes set a unique key constraint on UserId and AddressId. You could also have a reason field in your UI that allows the user to let the back end know whether it was a correction or a change of address.

Danger: Avoid not using a surrogate key. As this will limit options for when you decide to go from allowing Type 1 changes to allowing Type 2 changes.

If you are allowing Type 2 changes than you have to support the ability for multiple records for any given UserId and AddressId combination. At a minimum you need to decide if there is a way of categorizing or expiring historical records.

Expiring Type 2 Design Options

IsCurrent flag- Add an additional field to the junction table that marks the current address record for the user. This requires you manage this flag when a new record is added for a given canidate key combination.
Integer Sequence field- This gives a chronological order to changes. Requires looking for the max old value before inserting a new record. I recommend looking at other event based system concepts that you could bake into your system early.
Date based Sequence field- Using Date and potentially time to represent a sequence.
Address Field FK- in the User table such as CurrentHomeAddressId.

Does Downstream Matter

I think it is important to note that regardless of downstream concerns you should only implement a Type 2 change if it is important for the direct users of your database. There is always a significant cost to this feature. The real shame is you either decide to care about retaining this level of detail or you have a watermark when it is decided it is of value. Depending on database backup retention policy; you could come up with a facsimile of the past to a certain point of course. Often times you lose a certain amount of granularity depending on what transactional data you capture. When an OLAP database is designed with any number of OLTP sources they will design a way of doing ETL that will monitor your system for changes regardless of whether you do Type 1 or 2 changes. I always recommend having standard timestamp fields at a minimum for when a record was created and last time it was modified. With this minimum consideration, Even if you only allow Type 1 changes an ETL process can still do Type 2 changes in the OLAP database regardless of your choice. This will likely give them more granularity than the timestamp they can get from a backup, and of course the OLAP folks are at the mercy of the DBAs at that point.

Thursday, July 11, 2013

What is in a Database Field Name?

I dislike Database design that use the table name as a prefix for all fields in the table. Example:CREATE TABLE MyTable(MyTableID int Primary Key,MyTableModified DATE,MyTableDescription Varchar(500),MyTableCreatedBy int...)

As opposed to:CREATE TABLE MyTable(Id int PrimaryKeyModified DATE,Description Varchar(500),CreatedBy int...)

FKs are fine but I would recommend being creative with Names. CreatedBy is a FK to the User table. But notice I didn't call it UserId. Why? What happens when you start tracking UpdatedBy?

Well what about recognizing tables in a query with say 10+ joins. Use great names for Aliases instead. You are using aliases in this case? :) I have seen people error on one of two sides. SQL Server 2008 + or newer people tend to lean towards verbose or even avoiding aliases due to intellisense. That's if they haven't done something to break SQL Server Management Studio intellisense. Or people tend to abbreviate table names. Which is fine if you are consistent with Alias names. The great thing about leaning on Aliasing heavier is it is less permanent than having to rename every field because you decide a table needs to renamed. Most of time you will not rename fields or tables. That is why naming fields and tables well is such a big deal.