Sunday 31 August 2014

Business Time Temporal Tables

When we last left off, I told you about all the goodness that is System Time Temproal Tables and how they exist to deal with tracking changes over time, but not burden your application with this complexity -- but then left with the annoying fact that it did in fact not address the very first situation that brought this all up - getting a raise after the payroll run - but before the start of the next month, meaning reports run on the current month would still be incorrect as they would reflect details only intended to be relevant for the next month.

Really You could not tell?

Well yes and no... no, because it would be possible for people to run queries which target the time-frame before the payroll run was done -- but realistically in typical situations like this there are going to be many many of these types of "reports" or "procedures" that have the base assumption that if you run a query in a month boundary it should be correct - or simply, if I run query now agnostic of time, it should be correct for now -- now should be implicit when I don't specify time.

It would not be fun being the one who would have to go to all the other teams and tell them that they need to update their products to bind queries to specific dates -- moreover, then deal with complexity of the fact that payroll day will move days from month to month because of those pesky weekends, implying that we would have to create some abstraction everyone could use to find out when payroll happened that month -- its all getting very messy and exhausting ... and this is just a silly fictitious example.

There, must be a better way...

and there is.. enter Business Time Temporal Tables, which is conceptually similar to the System Time Temporal Tables we have spoken about before, but this time DB2 is handing back a bit of control to us, and equally we are taking on a bit more responsibility for this concept of history change.

The extra control that we are given is that in business time, we are given the responsibility for identifying the time at which we want the effect of a change of data to take effect. We are able to post-date our change.

Enough waffle.. lets see it in action

First we need to create our table... but this time, since the dates that we are going to be passing in are in the control of the application, there is no good reason to have a duplicate version history table.. so not too surprisingly when we go the BUSINESS_TIME route, there is less framework we need to put in place

CREATE TABLE tb_Salary ( 
  id              INT NOT NULL, -- employee id
  salary          INT NOT NULL,  -- the data  
  bus_start       DATE NOT NULL, -- create space for our start time
  bus_end         DATE NOT NULL, -- create space for our end time
  PERIOD BUSINESS_TIME(bus_start, bus_end), -- bind the above to the concept of the business time period
  PRIMARY KEY( BUSINESS_TIME WITHOUT OVERLAPS) -- make sure all time frames are unique
)

So what we have again is our table with our raw data, employee number and salary, and then two dates that represent our period. Like when we look at the System Time tables, there is no hidden magic in the scheme regarding these times, but rather we need to explicitly tell DB2 that these are the columns that should be use for the period range with the PERIOD BUSINESS_TIME clause, and then for a bit of data sanity, we constrain our table to guarantee to have a unique time range for each row.

Adding some data into our table is as expected, where we must explicitly define our time ranges

insert into tb_Salary 
    values(1, 100, '2014-01-01', '2015-01-01');

select * from tb_Salary 
    where id = 1 
    order by BUS_START asc;

ID          SALARY      BUS_START  BUS_END
----------- ----------- ---------- ----------
          1         100 01/01/2014 01/01/2015

  1 record(s) selected.

Lets now give our hard working employer a raise, but only make it effective at the beginning of the next month, assuming we are in the 8th month now.

--give a raise ( a normal update )
update tb_Salary
    for portion of BUSINESS_TIME from '2014-09-01' to '2015-01-01'
    set salary = 150
    where id = 1;


select * from tb_Salary 
    where id = 1 
    order by BUS_START asc;

ID          SALARY      BUS_START  BUS_END
----------- ----------- ---------- ----------
          1         100 01/01/2014 09/01/2014
          1         150 09/01/2014 01/01/2015

  2 record(s) selected.

Nice .. so now we see that two rows exist since there are now - from a business perspective - two salaries, depending on the time frame that you are interested in. DB2 also takes care of the "heaving lifting" of splitting the one row into two, and appropriately adjusting the date ranges to be correct within the constraint we added to our table that all date ranges must be unique.

On the topic of row splits.. its feels pretty intuitive that a row split would happen for an update... but what about a delete... what if our employee decides to take an unpaid sabbatical for a month... could we simply remove that period without having to concern ourselves with the complexity of the number of rows that might need to be affected... let see

--takes a sabbatical ( removing from the table and getting a row split )
delete from tb_Salary
    for portion of BUSINESS_TIME 
    from '2014-10-01' to '2014-11-01'
    where id = 1;
 
select * from tb_Salary 
    where id = 1 
    order by BUS_START asc;

ID          SALARY      BUS_START  BUS_END
----------- ----------- ---------- ----------
          1         100 01/01/2014 09/01/2014
          1         150 09/01/2014 10/01/2014
          1         150 11/01/2014 01/01/2015

  3 record(s) selected.

ohh its almost like someone else is doing all the hard work for us here :)

Now HR is able to run queries on the salaries for a specific payroll not not be affected by raises for the next period

select * from tb_Salary 
    for BUSINESS_TIME as OF '2014-01-01 
    where id = 1';

ID          SALARY      BUS_START  BUS_END
----------- ----------- ---------- ----------
          1         100 01/01/2014 09/01/2014

  1 record(s) selected.
  
  --( during sabbatical )
 select * from tb_Salary
    for BUSINESS_TIME as OF '2014-10-01'
    where id = 1;
    
 ID          SALARY      BUS_START  BUS_END
----------- ----------- ---------- ----------

  0 record(s) selected.
 

Or now even to be able other interesting questions like, what the salaries have been for a given period

select * from tb_Salary 
    for BUSINESS_TIME FROM '2014-01-01' TO '2015-01-01' 
    where id = 1;

ID          SALARY      BUS_START  BUS_END
----------- ----------- ---------- ----------
          1         150 09/01/2014 10/01/2014
          1         100 01/01/2014 09/01/2014
          1         150 11/01/2014 01/01/2015

  3 record(s) selected.

hmmm but wait... now we have business context history that we did not have with System Time Temporal Tables.. but we have lost all the goodness of transactional history for any changes.. so while we can see from a business perspective that someone was paid a certain value in a given date range, we don't have the same degree of auditing control that we had with System Time temporal tables in that we would be able to tell exactly when that salary edit took place ( i.e. was that salary change effectively part of backpay).

I am one of those people who tend to not understand the concept of having a cake and not being able to eat it.

Where is my cake?

The good news is the edible cake is in our next article, where we "introduce" Bitemproal tables... I say "introduce" because -- well you already know it all, but you just have to learn the new fancy term for it ;)

Sunday 20 July 2014

Temporal's Infinity

Here is some interesting trivia, lets say you have created a system temporal table, and added in some data, you will see your start and end date automatically updated.
Now clearly the start time, will be the instant the data was added, but the end date, well that is conceptually debatable, since we don't have an end date yet, we will only have that value at some point in the future... or perhaps never. So we need something to represent infinity.

However, in reality, we have little choice, since our schema is set in place, we have to pick something to represent infinity, I guess the two options would have been the largest possible date you can represent, or NULL.

Some what surprisingly, DB2 chose neither of these.. check it out :

tb_YourTable
------------
MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------
          1 2014-07-19-12.14.16.100732000000 9999-12-30-00.00.00.000000000000

I know what you are about to say, "yes they did choose one, they chose the largest representable date", but no, look more carefully.
No I am not being pedantic about the hours mins and seconds, although that would also count.. there is something more blatant.

How many days are there in December? Yes, 31 hmmm interesting they chose 30 - but why, an oversight they just never bother to fix?

Thankfully not, it appears to have more intelligence than that, and it also explains why the date is also at the beginning of the day, not midnight.

Some languages can only represent years up to 9999, and then add into that the issues of timezones, it would theoretically be possible for the value that represents INFINITY to be read into a structure, adjusted for time zones, and then wrap beyond the end... in effect breaking the temporal table.
By giving yourself a full day, you eliminate this issue.

When will this information be useful to you ?... umm don't know, perhaps a trivia game, but hey nice to know anyway :)

Saturday 19 July 2014

System Time Temporal Tables

Following on from the temporal tables overview lets first look at the concept of System Time Temporal tables.

So we remember that temporal tables are all about keeping the history of changes to data, and what is nice about System Time Temporal tables, is that this is entirely a DB stack concept, it is ( or at least can be ) entirely transparent to the application stack.

But I am getting ahead of myself, lets go a bit slower.

If data is going to have a lifetime, then by definition we need a start time and end time, but what values we choose for those times is still up for debate. The first and most obvious choice is to use the time at which we insert data as the start time, and the time at which we change that data to be its end point. That choice of using the times at which the data was touched is what we mean when we talk about System Temporal Tables.

At this point, it does not take a rocket scientist to spot that DB2 can and should be able to help us out a lot here, since it will have very intimate knowledge of when data is changed. And so it is. At an application level, we can do all our CRUD operations as usual oblivious to the temporal nature of this data, and DB2 will automatically make copies of our data when it has changed, and set the start timestamp and end timestamp accordingly.

But to get this right, we will first have to give DB2 the hook points it needs to do the heavy lifting for us, we will need four things

  1. A column for the start timestamp 
  2. A column for the end timestamp 
  3. A placeholder Timestamp variable for DB2 to use when updating rows. 
  4. A table to record history in Which we are be in a position to supply to DB2 when we create our table.

CREATE TABLE tb_YourTable
(
  -- your data
  my_data INT NOT NULL,  
  -- A place for DB2 to add a start date
  sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
  -- A place for DB2 to add an end date
  sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
  -- A place for DB2 to save a transaction time when a row is updated, hidden 'cause we dont care about this
  trans_start TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID IMPLICITLY HIDDEN,
  -- Bind our timestams to DB2 concept of Sytem_Time
  PERIOD SYSTEM_TIME(sys_start,sys_end)
);
-- make a duplicate table for our update history
CREATE TABLE tb_YourTableHistory LIKE tb_YourTable;
-- tell DB2 to bind our table to the history tables
ALTER TABLE tb_YourTable ADD VERSIONING USE HISTORY TABLE tb_YourTableHistory;
So what are the interesting thing that we can pick up from this syntax above : 

First there is no magic behind-the-scenes schema being created, we are in control of all of this, for both the table we are interested in and and the table used to keep history - nice!

There is just some extra syntax that gives DB2 the knowledge to bind the concept of temporal tables to the schema we have created. This is how we give DB2 permission to use our hook points ie the columns that we have created that it can use for the SYSTEM_TIME and the history table we have created which it can use to populate our history.

Clearly we do have to play by some rules regarding the schema, first our TIMESTAMP columns must be TIMESTAMP(12) and our history table must have exactly the same schema as our temporal table ( I like how easy LIKE make this, very DRY )

So now we have a temporal table set up, what happens when we add content to it ?

Well lets insert some data, just like we would if we did not know about this whole temporal thing..

insert into tb_YourTable(my_data) values(1)
tb_YourTable
------------
MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------
          1 2014-07-19-12.14.16.100732000000 9999-12-30-00.00.00.000000000000

Not too surprisingly, we have data ... but equally unsurprising nothing in our history table....

tb_YourTableHistory
-------------------
MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------

right now lets do an update to this table

update tb_YourTable set my_data=2
 tb_YourTable
 ------------
 MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------
          2 2014-07-19-12.17.21.539897000000 9999-12-30-00.00.00.000000000000

We now see that our update took place as expected, well almost... notice that the start time was also updated.

Also if we look at our history table :

 tb_YourTableHisory
 ------------------
  MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------
          1 2014-07-19-12.14.16.100732000000 2014-07-19-12.17.21.539897000000

We can see that it now has some data in it, where the histories start time is exactly the same as our tables original start value and the histories end time is the same as our updates new start value. ie the full time that the old data was valid for.

okay, well that worked as expected.... what about a delete ?

delete tb_YourTable where my_data=2
tb_YourTable
------------
MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------

tb_YourTableHistory
-------------------
MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------
          1 2014-07-19-12.14.16.100732000000 2014-07-19-12.17.21.539897000000
          2 2014-07-19-12.17.21.539897000000 2014-07-19-12.34.11.079309000000

Okay, so "magically" getting our history into our history table is all well and good, but if we don't read it, then who cares... so lets try and query some data from our history, by running a query on our now empty table.

select * from tb_YourTable for system_time as of '2014-07-19-12.14.17'
tb_YourTable
------------
MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------
          1 2014-07-19-12.14.16.100732000000 2014-07-19-12.17.21.539897000000  

Notice how we use the "for system_time as of" expression to pull data out of the history table while referencing our data table ( which right now is empty ).

Also notice how the SYS_END time is not some value trying to identify infinity, so we know that the value we have pulled was only true for this time period. There are a few more ways in which we can pull data from our history ( or current table for that matter )

select ... FOR SYSTEM_TIME FROM ... TO ...
select ... FOR SYSTEM_TIME BETWEEN ... AND ...

eg:

select * from tb_YourTable for system_time between '2014-07-19-12.14.15' and '2014-07-20-00.00.00'
tb_YourTable
------------
MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------
          1 2014-07-19-12.14.16.100732000000 2014-07-19-12.17.21.539897000000
          2 2014-07-19-12.17.21.539897000000 2014-07-19-12.34.11.079309000000

sweet...

hmmm but did I not lead this article off with the promise that the application stack would be able to take advantage of all this temporal goodness without changes.. and so far, to get any real value.. we have to learn how to query in a new way ... hmmm, did I lie ?

No wait there is more :) With the use of a nice register "Current Temporal System_Time"

set current temporal system_Time = '2014-07-19-12.14.16.100732000000'

Then run a temporal agnostic query

select * from tb_YourTable
tb_YourTable
------------
MY_DATA     SYS_START                        SYS_END
----------- -------------------------------- --------------------------------
          1 2014-07-19-12.14.16.100732000000 2014-07-19-12.17.21.539897000000

ohhh did we just do a time shift from under our application nose .. nice

To turn it off, we simply set the time to NULL



set current temporal system_time = NULL
All well and good, but if you cast you mind back to my overview, I spoke about the salary scenario, and if you are astute, you would know that this current solution would not actually solve that problem.. for that we need to look at the concept of Business Time Temporal Tables... but that is another post for the future.

Friday 11 July 2014

Temporal Tables - Overview

The first feature I have looked at in DB2 is the concept of temporal tables, which as its name implies ( albeit with one of those words that makes something sound more complex that it needs to be ) is related to time.

So, what are temporal tables all about, simply the addition of a lifetime concept to data. Each piece of data ( at a row level ) is given a lifetime, i.e. a fixed start date and end date.

While time is clearly the key data behind this feature, I think the core concept is actually better described as tracking changes, and the offering of first order principle support in DB2 to do this.

In some respects you could argue this is like adding version control to your data.

Why do I want to add more time information to my data ?

Yup, sounds like a fair question, since if you poke around I am sure you will find there is already time information for your data, however that is always just the time the data was added / updated etc. What we are talking about here is lifetime of data, not just a single point in time.

Imagine that your boss agrees to give you a raise next month (+1 Boss), further assume that the payroll for this month has already been run ( so only a few days left in the month ). HR might reasonably assume they can just go and update your salary information now with the new value, since "clearly" the next time the value is going to be used is for the next payroll run.

Meanwhile, the auditors come in unannounced and validates the last payroll... needless to say ( but I will anyway ) things will not balance, best they can tell is that you were underpaid -- you of course take the moral high ground and explain to them that it is just a failing of the DB as it does not have temporal concept of lifetime added to your new salary, as that update should only take affect in next months payroll run, if they are trying to evaluate this months payroll run, they needs to use your previous salary information.


What is the solution ?

Clearly there are two immediate solutions to this category of problem
  1. Teach HR to only update the data at exactly the right time. 
  2. Don't make the salary update destructive, keep a history of all changes for given timeframes.
Option 1, is just littered with issues
  1. Some poor HR person needs to be at work midnight on the last day of each the month 
  2. Even if you do find a way to update at exactly the right time ( with a scheduled task so HR can continue to have a family life ), you have lost data. The previous salary information is lost.
Option 2, is clearly far superior, but comes with the potential challenges of having to adjust your entire application stack to take into account the concept of life time, so that queries into the DB know which version data to look at.

This is exactly where temporal tables come in, they offer you built in support creating this change history, with the added benefit of being able to do this with between zero to minimal changes to your existing product application stack above the DB.

What support does DB2 give me ?

There are 3 things that DB2 gives you ( I would argue only 2, but we can get into that later ), each of these I will touch on in future posts ( linked below as and when )
  1. System Time Temporal Tables
  2. Business Time Temporal Tables
  3. Bitemporal Tables

Thursday 10 July 2014

Dipping my toes into DB2

With a new job on the horizon, I will soon be stepping out of the Microsoft bubble, as my current job is a very strong Microsoft shop, and the new one.. not so much :)

One of the new DB technologies that I am going to have to look into is IBM's DB2.

This will be my bread crumb trail of what I learn, and with any luck, what I learn and write will be close enough to correct to help you too ( where "you" might just be my future self )

SELECT *
FROM tb_unknown
WHERE knowledge = 'new';