D8A directors

Giving you and your data direction

Category: Metadata

Posted on 31/05/2023

How to make data governance practical using datasets

Nowadays, even when you have a catalog implemented that scans your data, the number of search results for the word ‘customer’ is far too high and far from relevant. Scrolling through the first page of results makes you wonder which of them would satisfy your data needs. Is the data properly described, do you know who to go to with questions, and are you even allowed to use privacy-sensitive data for your use case?

Documenting and maintaining these basic attributes of your data is already challenging and time-consuming in itself. And at what granularity do you need to do this? If you listen to legal and compliance, every single attribute of every table of every system must be properly protected. This seems like an impossible task. And it is!

How can you practically approach this?

Start organising your data into datasets!

Classifying and defining all your data attributes is a cumbersome and time-consuming task. At some point you should do it, at least for the data you share. But waiting for all this to be done is not an option: business must go on. So how do we overcome the big pile of work in the meantime? The answer is to start organising your data into datasets.

What do we mean by dataset?

There are varying opinions on what a dataset may be. At D8A Directors we consider a dataset to be a collection of physical data objects that need to be managed (governed) together. This can be a single table or a selection of tables that make up the customer master data, or the collection of tables that hold your inbound shipments. It may also be used for unstructured data like videos or images that were collected during a particular event.

Governance is then applied to all the data in these datasets together. When you combine data that needs to be managed together into datasets, you don’t need to manage every individual attribute. The question then is: what are the governance requirements you should think about when combining data?

Dimensions to separate data into datasets

To be honest, the dimensions to take into account need to be determined within your organisation by the Data Governance department in its policies. The legal grounds that allow you to process data vary per organisation, so there is no single set of dimensions. However, to get you started, here is an example of good-practice dimensions:

  • Source system
  • Data product
  • Geographical Region due to regional legislation
  • Responsible data owner (single ownership for accountability)
  • Purpose limitations
  • Retention term
  • Business & privacy sensitivity classification
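To make this concrete, these dimensions can be registered as a small metadata record per dataset. The sketch below is illustrative only; the field names and example values are invented for this article, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """Governance attributes registered once per dataset (illustrative)."""
    name: str
    source_system: str
    data_product: str
    region: str                      # geographical region, e.g. "EU" or "DE"
    owner: str                       # single responsible data owner
    purposes: set[str] = field(default_factory=set)  # purpose limitations
    retention_years: int = 7         # retention term
    sensitivity: str = "internal"    # business & privacy classification

# One record governs the whole dataset; no per-attribute administration needed.
german_employees = DatasetRecord(
    name="German Employees",
    source_system="HR-SAP",
    data_product="HR",
    region="DE",
    owner="HR Director",
    purposes={"payroll", "workforce planning"},
    sensitivity="strictly confidential",
)
print(german_employees.owner)
```

With one such record per dataset, the governance questions (who owns this, how long may we keep it, what may it be used for) are answered once for the whole collection of tables.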

The goal of a dataset is to be able to group physical data for purposes of:

  • Logically organizing data
  • Data Ownership and Data Governance
  • Request and approval against purpose limitations
  • Granting and Managing Security / Data Access Rights
  • Managing Data Retention and Disposal
  • Classifying Data for Privacy
  • Classifying Data for Geographical Regions
  • To make Reuse of data easier
  • Archiving Data to cheaper Storage tiers
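As a sketch of the “request and approval against purpose limitations” point above, an access request can be evaluated against the purposes registered for a dataset as a whole, instead of attribute by attribute. The registry layout and dataset names below are hypothetical:

```python
# Governance attributes registered once per dataset (hypothetical values).
DATASETS = {
    "german_employees": {
        "owner": "HR Director",
        "purposes": {"payroll", "workforce planning"},
        "sensitivity": "strictly confidential",
    },
    "inbound_shipments": {
        "owner": "Logistics Lead",
        "purposes": {"supply chain analytics", "invoicing"},
        "sensitivity": "internal",
    },
}

def may_access(dataset: str, requested_purpose: str) -> bool:
    """Approve a request only if its purpose is registered for the dataset."""
    meta = DATASETS.get(dataset)
    return meta is not None and requested_purpose in meta["purposes"]

print(may_access("german_employees", "payroll"))    # within purpose limitation
print(may_access("german_employees", "marketing"))  # purpose not registered
```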

Let’s make it practical

Decomposing datasets distinguishes conceptual data models, data objects and the attributes (from a logical model). These should then be linked to the actual physical data.

[Figure: From conceptual data entities (understood by business users) to data objects and data attributes (used by technology users), grouped into datasets. Conceptual entities are the data representations understood by business users and relate to business objects affected by business processes. Data objects represent a detailing of data concepts to the level of tables and attributes. Data attributes describe the data that is captured, like the columns of a table. Datasets group data objects on sensitivity, retention, ownership, regulatory region, etc. to enable governance and access management.]

An example is shown below of how data privacy classification of attributes affects datasets. Classifying all attributes is a lot of work, while you can already manage your data to a sufficient extent when it is classified at dataset level:

[Figure: A conceptual HR model (Person, Employee, Employee contract, Performance evaluation, Absence, Organisational unit, Legal entity) is detailed into a logical model whose attributes (e.g. Employee address: street, city, postal code, country) carry privacy classifications ranging from public and internal to confidential and strictly confidential. Rather than classifying every attribute, the classifications are applied to datasets such as ‘German Employees’ and ‘Dutch Employees’ that group the underlying data objects.]

The dataset is an abstraction at a logical level that makes it easier to govern a collection of physical data. With the right information at dataset level, such as business owner, purpose limitation, business sensitivity, privacy sensitivity and retention term, you can start managing access to this data.

Note that the dataset is a logical grouping of data. This does not necessarily mean you must physically separate different datasets to comply with governance requirements. An example of when you must physically separate the data is a regional requirement to store data within country borders.

Data products & datasets

Taking it one step further: how do datasets relate to data products?

Take the HR example above: you can consider this to be a single data product. A data product can expose access to the different datasets that it holds.

To optimize storage, you could expose datasets as (virtual) views on the actual data.
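A minimal sketch of that idea using SQLite from the Python standard library: the employee data is stored once, and each dataset is exposed as a virtual view selecting only its slice. The table, columns and names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Anna", "DE"), (2, "Bram", "NL"), (3, "Clara", "DE")],
)

# Each dataset is a virtual view on the same physical table: no data is
# duplicated, and access can be granted per dataset rather than per column.
conn.execute(
    "CREATE VIEW german_employees AS SELECT * FROM employee WHERE country = 'DE'"
)
conn.execute(
    "CREATE VIEW dutch_employees AS SELECT * FROM employee WHERE country = 'NL'"
)

print(conn.execute("SELECT name FROM german_employees").fetchall())
```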

Posted on 15/03/2022, updated 28/11/2022

Launching synthetic data within your company? Understand results and possibilities!

The concept of synthetic data generation is the following: take an original dataset, based on actual events, and create a new, artificial dataset with similar statistical properties from it. These similar properties allow for the same statistical conclusions as if the original dataset had been used.

Generating synthetic data increases the amount of available data by adding slightly modified copies of existing data or newly created synthetic records derived from existing data. It creates new, representative data that could plausibly have been drawn from the original dataset.

Synthetic data is created through the use of generative models: unsupervised machine learning that automatically discovers and learns the regularities and patterns in the original data.
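As a deliberately simplified illustration of the fit-and-sample idea (real solutions use far richer generative models than a single fitted distribution), the sketch below estimates the parameters of an original numeric column and samples a new, artificial one with similar statistical properties:

```python
import random
import statistics

random.seed(42)

# Toy "original" dataset: ages of 1,000 employees (normally distributed).
# In practice this would be real business data.
original = [random.gauss(40, 9) for _ in range(1000)]

# Fit a trivial generative model: estimate the distribution's parameters...
mu = statistics.mean(original)
sigma = statistics.stdev(original)

# ...and sample a brand-new, artificial dataset from the fitted distribution.
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

# No synthetic value is a copy of an original record, yet the statistical
# properties are close, so similar conclusions can be drawn from either set.
print(round(statistics.mean(original), 1), round(statistics.mean(synthetic), 1))
```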

Large volume synthetic data as twin of original data

Why is synthetic data important now?

With the rise of Artificial Intelligence (AI) and Machine Learning, the need for large and rich (test & training) datasets increases rapidly. This is because AI and Machine Learning models are trained on incredible amounts of data, which are often difficult to obtain or generate without synthetic data. In most sectors, large datasets are not yet available at scale; think of health data, autonomous vehicle sensor data, image recognition data and financial services data. By generating synthetic data, more and more data will become available. At the same time, consistency and availability of large datasets are a solid foundation for a mature Development/Test/Acceptance/Production (DTAP) process, which is becoming a standard approach for data products & outputs.

Existing initiatives on federated AI (where data availability is increased by keeping the data within the source and sending the AI model to the source to run the algorithms there) have proven to be complex due to differences in the quality of these data sources. In other words, data synthetization achieves more reliability and consistency than federated AI.

An additional benefit of generating synthetic data is compliance with privacy legislation. Synthesized data is less easily (though not impossibly) traceable to an identified or identifiable person. This increases the opportunities to use data: enabling transfers to cross-border cloud servers, extending data sharing with trusted 3rd parties and selling data to customers & partners.

Relevant considerations

Privacy

Synthetisation increases data privacy, but it is not in itself an assurance of compliance with privacy regulations.

A good synthetisation solution will:

  • include multiple data transformation techniques (e.g., data aggregation);
  • remove potential sensitive data;
  • include ‘noise’ (randomization to datasets);
  • perform manual stress-testing.
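Two of these techniques, aggregation and noise, can be sketched in a few lines. This is a toy illustration of the idea, not a production privacy mechanism:

```python
import random

random.seed(7)

# Original sensitive values: salaries of individual employees (invented).
salaries = [41000, 52000, 47500, 60300, 39800, 55100, 48200, 62000]

# Aggregation: publish only a group-level statistic, never individual rows.
avg_salary = sum(salaries) / len(salaries)

# Noise: perturb the published statistic so the underlying values cannot
# be recovered exactly from it.
noisy_avg = avg_salary + random.uniform(-500, 500)

print(round(avg_salary), round(noisy_avg))
```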

Companies must realize that even with these techniques, additional measures such as anonymization can still be relevant.

Outliers

Outliers may be missing: synthetic data mimics real-world data, but it is not an exact replica of it. As a result, synthetic data may not cover some of the outliers that the original data has. Yet outliers are important in training & test data.

Quality

Quality of synthetic data depends on the quality of the data source. This should be taken into account when working with synthetic data.

Black-box

Although data synthetization is taking center stage in current hype cycles, for most companies it is still in the pioneering phase. This means that at this stage the full effect of unsupervised data generation is unclear. In other words, it is data generated by machine learning for machine learning: a potential double black box. Companies need to build evaluation systems for the quality of synthetic datasets, and as the use of synthetic data methods increases, assessment of the quality of their output will be required. A trusted synthetization solution must always include:

  • good information on the origin of the set;
  • its potential purposes for usage and its requirements for usage;
  • a data quality indication and a data diversity indication;
  • a description of (potential) bias;
  • risk descriptions, including mitigating measures based on a risk evaluation framework.

Synthetic data is a new phenomenon for most digital companies. Understanding its potential and risks will allow you to keep up with the latest developments and stay ahead of your competition, or even your clients!

Posted on 25/01/2022, updated 30/03/2023

Do you know the physical location of your data?

“Where is your data stored?” is not asked often enough. An interesting topic, brought up by an owner of a Dutch cloud company during a radio program.

It surprised me, as a Data Professional who has lost count of the number of times I have asked this question. Why is this still not a common concern?

When I ask the question, the reactions vary. Often the answer is a simple “I don’t know.” When I do find the right specialist to tell me the details, the conversation goes something like this:

“Where is the data stored?”
“In the cloud.”
“Yes, but where are those servers located?”
“In Europe.”
“Yes, but in what country in Europe?”
“In The Netherlands.”
“Yes, but do you know in which city?”
“Yes, in Amsterdam.”

The tedious process goes to show that it’s a question that is not asked often enough. But it matters. 

It’s interesting: it seems we don’t realise often enough that although data is digital, it always has a physical aspect to it. Much like us, data has to live somewhere. In turn, cyber security is not just about the digital question of who can access your data, but also the physical one. What physical measures are in place to ensure that no one breaches your data centres? Do you know who can access the building where your data lives? However, security is but one concern when it comes to the physical location of data.

Laws and regulations mandate where data may be stored and processed, and processing includes ‘viewing’ data. The GDPR, for instance, restricts storing and processing personal data outside Europe unless specific safeguards are in place. That makes some international cloud providers automatically a non-option when they don’t have data centres in Europe. If your data centres are located in a different country, you are automatically dealing with cross-border transfers. Be vocal about this towards your cloud provider: if you don’t decide where your data is physically stored, they will. Then it’s out of your control, and most likely not in line with legal requirements.

Did you know that there are countries that demand their data is stored and processed locally only?

Location may also influence the reliability of your data. If your data centres are located in an area where power outages are common, you will be dealing with limited availability. And if it is a politically unstable region, your data centre may be put out of the running altogether. Next to that, the further away the physical location, the higher the risk of limited connectivity: both the quality of the network and the distance can greatly impact it. All these aspects can influence your ability to deliver the data-driven digital products and services your customers are paying for. In other words, thinking about the location of your data is business critical.

So my takeaway for you: start asking the question. Do you know where your data is stored?

Posted on 23/11/2021, updated 24/11/2021

Take the confusion out of your metadata discussions

Business metadata

All good data starts with business metadata. Business metadata is the information we need to build a dataset. Someone in the business approved the collection and processing of the data in the first place; he or she also provides requirements and descriptions of what is needed. The challenge is that this information is often not managed well enough over time, which causes the quality of business metadata to decrease, and with it the quality of the data itself. That affects your AI solutions, BI reporting and integration with platforms.

Become aware of the necessity and value of business metadata: it enables support on data requests and makes data findable and understandable!

When we know what business stakeholders want, we can design and implement it in physical form, described by technical metadata. We can now build the solution, or buy it off the shelf, and map it to the business metadata.

Operational metadata

Now that we know what data we need and what it means, and have a place to store and process it, we can start doing business. Doing business generates operational metadata. Operational metadata is very valuable for monitoring our data processes: we get insights into what data is processed, how often, and at what speed and frequency. This is great input for analysing the performance of our IT landscape and seeing where improvements can be made. Furthermore, we monitor access to systems and data. Taking it a step further, we can even start analysing patterns and possibly spot odd behaviour as a signal of threats to our data.
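Capturing this kind of execution metadata can start small: wrap each data process so that record counts and timings are recorded automatically. A minimal sketch, with an in-memory list standing in for a real metadata store:

```python
import time
from functools import wraps

execution_log = []  # in practice this would feed a metadata store

def capture_execution_metadata(func):
    """Record timing and record counts for each run of a data process."""
    @wraps(func)
    def wrapper(records):
        start = time.time()
        result = func(records)
        execution_log.append({
            "process": func.__name__,
            "records_in": len(records),
            "records_out": len(result),
            "seconds": time.time() - start,
        })
        return result
    return wrapper

@capture_execution_metadata
def deduplicate_customers(records):
    # A stand-in data process: drop duplicates while preserving order.
    return list(dict.fromkeys(records))

deduplicate_customers(["anna", "bram", "anna"])
print(execution_log[0]["records_in"], execution_log[0]["records_out"])
```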

Step into the driving seat by capturing and analysing your operational metadata, and become proactive in controlling your IT landscape!

Social Media metadata

Finally, we take social metadata as an inspiration. This is where the value of your data becomes even more tangible. Value is determined by the benefit the user experiences, so the way users use the data is an indicator of its value. Thus, if we start measuring which data is used often and by many users, that data must be important and valuable. Invest in improving the quality of that data to increase the value created. Behaviour is also a good indicator to measure: how much time is spent on content, and which content is skipped quickly? Apparently that content doesn’t match what the user is looking for.

Measure social metadata to analyse what data is used often by many. It is likely to be more valuable than other data.
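As a sketch of that measurement, the number of distinct users reading each dataset can be derived from plain access logs. The log entries below are invented for the example:

```python
from collections import Counter

# Access log entries: (user, dataset) pairs, in practice from audit logging.
access_log = [
    ("alice", "customer_master"), ("bob", "customer_master"),
    ("carol", "customer_master"), ("bob", "inbound_shipments"),
    ("alice", "customer_master"), ("dave", "perf_reviews"),
]

# Deduplicate to (user, dataset) pairs, then count distinct users per dataset.
# Datasets read by many distinct users are likely the most valuable ones.
users_per_dataset = Counter(dataset for _, dataset in set(access_log))
print(users_per_dataset.most_common(1))
```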

Business metadata

Governance metadata
All metadata required to correctly control the data like retention, purpose, classifications and responsibilities.
– Data ownership & responsibilities
– Data retention
– Data sensitivity classifications
– Purpose limitations

Descriptive metadata
All metadata that helps understand and use and find the data.
– Business terms, data descriptions, definitions and business tags
– Data quality and descriptions of (incidental) events to the data
– Business data models & bus. lineage

Administrative metadata
All metadata that allows for tracking authorisations on data.
– Metadata versioning & creation
– Access requests, approval & permissions

Technical metadata

Structural metadata
All metadata that relates to the structure of the data itself required to properly process it.
– Data types
– Schemas
– Data Models
– Design lineage

Preservation metadata
All metadata that is required for assurance of the storage & integrity of the data.
– Data storage characteristics
– Technical environment

Connectivity metadata
All metadata that is necessary for exchanging data, like APIs and topics.
– Configurations & system names
– Data scheduling

Operational metadata

Execution metadata
All metadata generated and captured in execution of data processes.
– Data process statistics (record counts, start & end times, error logs, functions applied)
– Runtime lineage & ETL/ actions on data

Monitoring metadata
All metadata that keeps track of the data processing performance & reliability.
– Data processing runtime, performance &  exceptions
– Storage usage

Controlling (logging) metadata
All metadata required for security monitoring & proof of operational compliance.
– Data access & frequency, audit logs
– Irregular access patterns

Social metadata

User metadata
All metadata generated by users of data, such as:
– User provided content
– User tags (groups)
– Ratings & reviews

Behavior metadata
All metadata that can be derived from observing user behaviour, such as:
– Time content viewed
– Number of users/ views/ likes/ shares


© Coöperatie D8A Directors U.A.
Chamber of Commerce 85811815
VAT number NL863751143B01