Saturday, December 28, 2013
Why SSIS is a mystic art for many .NET Devs
First a bit about my background might help. I come from a developer background. Specifically, I work as a senior web development on large ASP.NET web sites. I have worked in .NET since just before the release 2.0 (8 years) I have a Information Technology Degree with a Database focus. I largely fell into web development due to its popularity and stayed due to my investment in the platform. SSIS is such a disappointment because it doesn't fit in my world. I instead tack it onto my world with its own set of weirdness sort of like any other framework or language like XML, Regular Expressions, etc
Wednesday, September 25, 2013
Designing an OLTP database by Understanding OLAP Dimension Attribute Types
Introduction
First I may be throwing around some acronyms that are foreign for new comers or for folks that do not focus heavily on database design. Lets go ahead and get those defined to get everyone on the same page.- ETL (Extract Transform Load)- This is a process by which data is ingested from some source system to some staging or target database.This can happen over an Service Oriented Architectrue (SOA) or use flat files such as XML, CSV, Excel spreadsheet, etc.
- OLTP (Online Transaction Processing database- If you are programming against a database for an application and do not know it type, you are using an OLTP. Database normalization is usually important. The primary use is for data storage for operations based systems such as Content Management Systems, Entity Resource Planning, User facing systems etc.
- OLAP (Online Analytical Processing) database- Used for reporting, business intelligence and decision support systems (DSS). De-normilazation is usually preferred as data accumulates instead of changes to existing data. More on this later.
Using Type 2 Attribute Changes in OLTP
In OLAP database there are two types of what they term as 'slow changes' to any data attribute. Type 1 involves retroactively changing data that effect how historical data is reported retroactively. Imagine you change the State in a Home Address record for a given user. If you store one Home Address for a user there is no way to account for that change over time if the User Table has a foreign key field for Home Address without further data modeling.Junction Tables jumble
We could use a junction table with a surrogate key. So for our ongoing example, the UserAddress Table could would have three fields at a minimum: UserId, AddressId, and UserAddressId. The first two fields would be a candidate key with foreign key relationships back to the User and Address tables respectively. We now have to decide whether we allowing only Type 1 attribute changes to the Address table or certain fields in the Address table or if we are allowing Type 2 changes. That is maybe its ok if the zip changes to use the same record (thus allow type 1 change) but changing any other attribute like street, state or city requires a new record. In the case where you ONLY allowing Type 1 changes set a unique key constraint on UserId and AddressId. You could also have a reason field in your UI that allows the user to let the back end know whether it was a correction or a change of address.Danger: Avoid not using a surrogate key. As this will limit options for when you decide to go from allowing Type 1 changes to allowing Type 2 changes.
If you are allowing Type 2 changes than you have to support the ability for multiple records for any given UserId and AddressId combination. At a minimum you need to decide if there is a way of categorizing or expiring historical records.
Expiring Type 2 Design Options
- IsCurrent flag- Add an additional field to the junction table that marks the current address record for the user. This requires you manage this flag when a new record is added for a given canidate key combination.
- Integer Sequence field- This gives a chronological order to changes. Requires looking for the max old value before inserting a new record. I recommend looking at other event based system concepts that you could bake into your system early.
- Date based Sequence field- Using Date and potentially time to represent a sequence.
- Address Field FK- in the User table such as CurrentHomeAddressId.
Does Downstream Matter
I think it is important to note that regardless of downstream concerns you should only implement a Type 2 change if it is important for the direct users of your database. There is always a significant cost to this feature. The real shame is you either decide to care about retaining this level of detail or you have a watermark when it is decided it is of value. Depending on database backup retention policy; you could come up with a facsimile of the past to a certain point of course. Often times you lose a certain amount of granularity depending on what transactional data you capture. When an OLAP database is designed with any number of OLTP sources they will design a way of doing ETL that will monitor your system for changes regardless of whether you do Type 1 or 2 changes. I always recommend having standard timestamp fields at a minimum for when a record was created and last time it was modified. With this minimum consideration, Even if you only allow Type 1 changes an ETL process can still do Type 2 changes in the OLAP database regardless of your choice. This will likely give them more granularity than the timestamp they can get from a backup, and of course the OLAP folks are at the mercy of the DBAs at that point.Thursday, July 11, 2013
What is in a Database Field Name?
I dislike Database design that use the table name as a
prefix for all fields in the table. Example:CREATE TABLE
MyTable(MyTableID int Primary Key,MyTableModified
DATE,MyTableDescription Varchar(500),MyTableCreatedBy
int...)
As opposed to:CREATE TABLE MyTable(Id int PrimaryKeyModified DATE,Description Varchar(500),CreatedBy int...)
FKs are fine but I would recommend being creative with Names. CreatedBy is a FK to the User table. But notice I didn't call it UserId. Why? What happens when you start tracking UpdatedBy?
Well what about recognizing tables in a query with say 10+ joins. Use great names for Aliases instead. You are using aliases in this case? :) I have seen people error on one of two sides. SQL Server 2008 + or newer people tend to lean towards verbose or even avoiding aliases due to intellisense. That's if they haven't done something to break SQL Server Management Studio intellisense. Or people tend to abbreviate table names. Which is fine if you are consistent with Alias names. The great thing about leaning on Aliasing heavier is it is less permanent than having to rename every field because you decide a table needs to renamed. Most of time you will not rename fields or tables. That is why naming fields and tables well is such a big deal.
As opposed to:CREATE TABLE MyTable(Id int PrimaryKeyModified DATE,Description Varchar(500),CreatedBy int...)
FKs are fine but I would recommend being creative with Names. CreatedBy is a FK to the User table. But notice I didn't call it UserId. Why? What happens when you start tracking UpdatedBy?
Well what about recognizing tables in a query with say 10+ joins. Use great names for Aliases instead. You are using aliases in this case? :) I have seen people error on one of two sides. SQL Server 2008 + or newer people tend to lean towards verbose or even avoiding aliases due to intellisense. That's if they haven't done something to break SQL Server Management Studio intellisense. Or people tend to abbreviate table names. Which is fine if you are consistent with Alias names. The great thing about leaning on Aliasing heavier is it is less permanent than having to rename every field because you decide a table needs to renamed. Most of time you will not rename fields or tables. That is why naming fields and tables well is such a big deal.
Subscribe to:
Posts (Atom)