Nowadays, even when you have implemented a catalog that scans your data, a search on the word ‘customer’ returns far too many results, most of them far from relevant. Scrolling through the first page of results leaves you wondering which data would actually meet your needs. Is the data properly described, do you know who to go to with questions, and are you even allowed to use privacy-sensitive data for your use case?
Documenting and maintaining these basic attributes of your data is challenging and time-consuming in itself. And at what granularity do you need to do this? If you listen to legal and compliance, every single attribute of every table of every system must be properly protected. This seems like an impossible task. And it is!
How can you practically approach this?
Start organising your data into datasets!
Classifying and defining all your data attributes is a cumbersome and time-consuming task. At some point you should do it for at least the data you share. But waiting for all this to be done is not an option. Business must go on. So how can we overcome the big pile of work in the meantime? The answer is to start organising your data into datasets.
What do we mean by dataset?
There are varying opinions on what a dataset may be. At D8A Directors we consider a dataset to be a collection of physical data objects that need to be managed (governed) together. This can be a single table or a selection of tables that make up the customer master data, or the collection of tables that hold your inbound shipments. It may also be used for unstructured data such as videos or images that were collected during a particular event.
Governance is then applied to all the data in a dataset together. When you combine data that needs to be managed together into datasets, you don’t need to manage every individual attribute. The question then is: which governance requirements should you consider when combining data?
Dimensions to separate data into datasets
To be honest, the dimensions to take into account need to be determined in your organisation by the Data Governance department in their policies. The legal grounds that allow you to process data vary per organisation, so there’s no one single set of dimensions. However, to get you started, here’s an example of good practice dimensions:
- Source system
- Data product
- Geographical Region due to regional legislation
- Responsible data owner (single ownership for accountability)
- Purpose limitations
- Retention term
- Business & privacy sensitivity classification
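To illustrate how such dimensions can drive the grouping, here is a minimal Python sketch. It assumes that every physical table has already been tagged with a value for a few of the dimensions above; tables that share the same value on every dimension end up in the same dataset. The table names, owners and classification values are hypothetical, and your own policies may prescribe other dimensions.

```python
from collections import defaultdict
from typing import NamedTuple


class TableInfo(NamedTuple):
    """Governance-relevant facts about one physical table (all values hypothetical)."""
    name: str
    source_system: str
    region: str
    owner: str
    privacy_class: str
    retention_years: int


def group_into_datasets(tables):
    """Group tables that share the same value on every governance dimension."""
    datasets = defaultdict(list)
    for t in tables:
        # The dataset key is the combination of all dimensions except the table name.
        key = (t.source_system, t.region, t.owner, t.privacy_class, t.retention_years)
        datasets[key].append(t.name)
    return dict(datasets)


tables = [
    TableInfo("crm.customer", "CRM", "EU", "Head of Sales", "personal", 7),
    TableInfo("crm.customer_address", "CRM", "EU", "Head of Sales", "personal", 7),
    TableInfo("wms.inbound_shipment", "WMS", "EU", "Head of Logistics", "internal", 5),
]

datasets = group_into_datasets(tables)
for key, names in datasets.items():
    print(key, "->", names)
```

In this sketch the two CRM tables form one dataset and the shipment table forms another, because they differ in source system, owner, classification and retention term.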
The goal of a dataset is to be able to group physical data for purposes of:
- Logically organising data
- Data Ownership and Data Governance
- Request and approval against purpose limitations
- Granting and Managing Security / Data Access Rights
- Managing Data Retention and Disposal
- Classifying Data for Privacy
- Classifying Data for Geographical Regions
- Making Reuse of Data easier
- Archiving Data to cheaper Storage tiers
Let’s make it practical
Decomposing a dataset distinguishes the conceptual data model, the data objects and their attributes (from a logical model). These should then be linked to the actual physical data.
The example below shows how the data privacy classification of attributes affects datasets. Classifying all attributes is a lot of work, while you can already manage your data to a sufficient extent when it is classified at dataset level.
The dataset is an abstraction at a logical level that makes it easier to govern a collection of physical data. With the right information at dataset level, such as business owner, purpose limitation, business sensitivity, privacy sensitivity and retention term, you can start managing access to this data.
Note that the dataset is a logical grouping of data. This does not necessarily mean you must physically separate different datasets to comply with governance requirements. An example of when you must physically separate the data is a regional requirement to store data within country borders.
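As a rough illustration of managing access with dataset-level information only, the Python sketch below records the metadata mentioned above on a dataset and evaluates an access request against it. The field names, the classification values and the decision rules (reject when the purpose is not covered, route to the business owner when the data is privacy sensitive) are assumptions for the sake of the example, not a prescribed policy.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Dataset-level governance metadata (field names are illustrative)."""
    name: str
    business_owner: str
    allowed_purposes: set
    privacy_sensitivity: str  # assumed values: "none", "personal", "sensitive_personal"
    retention_years: int
    tables: list = field(default_factory=list)


def evaluate_access_request(dataset, requested_purpose):
    """Decide on a request using dataset-level metadata only (hypothetical rules)."""
    if requested_purpose not in dataset.allowed_purposes:
        return "rejected: purpose not covered by purpose limitation"
    if dataset.privacy_sensitivity != "none":
        return f"needs approval from {dataset.business_owner}"
    return "approved"


customer_master = Dataset(
    name="customer_master",
    business_owner="Head of Sales",
    allowed_purposes={"order fulfilment", "customer service"},
    privacy_sensitivity="personal",
    retention_years=7,
    tables=["crm.customer", "crm.customer_address"],
)

print(evaluate_access_request(customer_master, "customer service"))
print(evaluate_access_request(customer_master, "marketing analytics"))
```

Note that no individual attribute is classified here: the decision rests entirely on what is known about the dataset as a whole.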
Data products & datasets
Taking it one step further: how do datasets relate to data products?
As per the HR example, you can consider this to be a single data product. A data product can expose access to different datasets that the data product holds.
To optimise storage, you could expose datasets as (virtual) views on the actual data.
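The idea of exposing datasets as views can be sketched with Python's built-in sqlite3 module. The employee table, its columns and the view names are hypothetical; SQLite is used here only to keep the example runnable, and it has no access rights of its own, so in practice you would grant access per view in a database that supports it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One physical table that holds all the data of the data product.
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        department  TEXT,
        salary      REAL
    );
    INSERT INTO employee VALUES (1, 'Ada', 'Finance', 5200.0);
    INSERT INTO employee VALUES (2, 'Lin', 'Sales',   4100.0);

    -- Dataset 1: organisational data, without salary.
    CREATE VIEW ds_employee_directory AS
        SELECT employee_id, name, department FROM employee;

    -- Dataset 2: remuneration data, without names.
    CREATE VIEW ds_employee_salary AS
        SELECT employee_id, salary FROM employee;
""")

# Each view exposes only the columns that belong to its dataset.
directory_rows = conn.execute(
    "SELECT * FROM ds_employee_directory ORDER BY employee_id"
).fetchall()
salary_cursor = conn.execute("SELECT * FROM ds_employee_salary")
salary_columns = [column[0] for column in salary_cursor.description]

print(directory_rows)
print(salary_columns)
```

Both datasets read from the same stored rows, so nothing is copied, while each view can carry its own owner, purpose limitation and sensitivity classification.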