D8A directors

Giving you and your data direction


Category: Metadata

Posted on 15/03/2022, updated 28/11/2022

Launching synthetic data within your company? Understand results and possibilities!

The concept of synthetic data generation is the following: take an original dataset that is based on actual events, and create a new, artificial dataset with similar statistical properties from it. These similar properties allow for the same statistical conclusions as if the original dataset had been used.

Generating synthetic data increases the amount of data by adding slightly modified copies of existing data, or newly created synthetic data derived from existing data. It creates new and representative data that could plausibly have been drawn from the original dataset.

Synthetic data is created through the use of generative models. This is unsupervised machine learning that automatically discovers and learns the regularities and patterns in the original data.
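To make the idea concrete, here is a minimal sketch in Python. It assumes a very simple generative model (a two-column normal distribution), which is my own choice for illustration and not what a production synthetization tool would use. The age and income figures are made up.

```python
import math
import random
import statistics

def fit_gaussian_2d(rows):
    """Learn means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(model, n, rng):
    """Draw n artificial rows from the fitted two-column normal distribution."""
    rho = model["rho"]
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mix z1 and z2 so that the second column keeps the learned correlation.
        y = model["my"] + model["sy"] * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        out.append((x, y))
    return out

rng = random.Random(42)

# Made-up "original" data: an age and a loosely related income figure.
original = []
for _ in range(500):
    age = rng.uniform(20, 65)
    original.append((age, 1000 + 50 * age + rng.gauss(0, 300)))

model = fit_gaussian_2d(original)
synthetic = sample_synthetic(model, 5000, rng)

# Fitting the same model to the synthetic rows should give similar statistics.
synthetic_model = fit_gaussian_2d(synthetic)
print(model["rho"], synthetic_model["rho"])
```

None of the synthetic rows is a copy of an original row, yet the means, spreads and correlation stay close to those of the original. That is the "similar statistical properties" part in practice.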

Large volume synthetic data as twin of original data

Why is synthetic data important now?

With the rise of Artificial Intelligence (AI) and Machine Learning, the need for large and rich (test & training) data sets is increasing rapidly. This is because AI and Machine Learning models are trained on very large amounts of data, which are often difficult to obtain or generate without synthetic data. In most sectors, large datasets are not yet available at scale: think of health data, autonomous vehicle sensor data, image recognition data and financial services data. By generating synthetic data, more and more data will become available. At the same time, consistent and available large data sets are a solid foundation for a mature Development/Test/Acceptance/Production (DTAP) process, which is becoming a standard approach for data products & outputs.

Existing initiatives on federated AI (where data availability is increased by keeping the data at the source and sending the AI model to the source to run the algorithms there) have proven to be complex due to differences between these data sources, including differences in their quality. In other words, data synthetization can achieve more reliability and consistency than federated AI.

An additional benefit of generating synthetic data is compliance with privacy legislation. Synthesized data is harder (but not impossible) to relate directly to an identified or identifiable person. This increases the opportunities to use data: it enables data transfers to cross-border cloud servers, extended data sharing with trusted third parties, and selling data to customers & partners.

Relevant considerations

Privacy

Synthetization increases data privacy, but it does not guarantee compliance with privacy regulations.

A good synthetization solution will:

  • include multiple data transformation techniques (e.g., data aggregation);
  • remove potential sensitive data;
  • include ‘noise’ (randomization to datasets);
  • perform manual stress-testing.

Companies must realize that even with these techniques, additional measures such as anonymization can still be relevant.
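As an illustration of the first three techniques in the list above, here is a minimal Python sketch. The field names, the 10-year age bands and the noise level are assumptions made for this example, not settings from a real product.

```python
import random

SENSITIVE_FIELDS = {"name", "email"}  # hypothetical column names

def protect(record, rng, noise_sd=500.0):
    """Apply three simple transformations to one record (a dict)."""
    # 1. Remove potentially sensitive fields.
    out = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
    # 2. Aggregate: replace the exact age with a 10-year band.
    low = (out.pop("age") // 10) * 10
    out["age_band"] = f"{low}-{low + 9}"
    # 3. Add noise: perturb the salary with zero-mean Gaussian noise.
    out["salary"] = round(out["salary"] + rng.gauss(0, noise_sd))
    return out

rng = random.Random(7)
record = {"name": "A. Example", "email": "a@example.org", "age": 37, "salary": 52000}
protected = protect(record, rng)
print(protected)
```

Note that this kind of transformation alone is not proof of compliance. Manual stress-testing and a legal assessment remain necessary.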

Outliers

Outliers may be missing. Synthetic data mimics the real-world data; it is not an exact replica of it. So, synthetic data may not cover some outliers that the original data has. Yet outliers are important for training & test data.
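One way to check for this is to measure how many of the original outliers still fall within the range of the synthetic data. The sketch below is a simple illustration in Python; the 1.5 × IQR rule for what counts as an outlier and the sample values are assumptions for the example.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Lower and upper fence; values outside them are treated as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_coverage(original, synthetic):
    """Share of original outliers that lie inside the synthetic data's range."""
    low, high = iqr_fences(original)
    outliers = [v for v in original if v < low or v > high]
    if not outliers:
        return 1.0  # nothing to cover
    s_min, s_max = min(synthetic), max(synthetic)
    covered = [v for v in outliers if s_min <= v <= s_max]
    return len(covered) / len(outliers)

# Made-up values: the original contains one extreme value (95).
original = [10, 11, 12, 12, 13, 13, 14, 15, 16, 95]
synthetic = [10.5, 11.2, 12.1, 12.8, 13.3, 13.9, 14.6, 15.4]

coverage = outlier_coverage(original, synthetic)
print(coverage)
```

A low coverage value is a signal that models trained or tested on the synthetic set have never seen the extreme cases.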

Quality

Quality of synthetic data depends on the quality of the data source. This should be taken into account when working with synthetic data.

Black-box

Although data synthetization is taking center stage in current hype cycles, for most companies it is still in the pioneering phase. This means that at this stage the full effect of unsupervised data generation is unclear. In other words, it is data generated by machine learning for machine learning: a potential double black box. Companies need to build evaluation systems for the quality of synthetic datasets. As the use of synthetic data methods increases, assessment of the quality of their output will be required. A trusted synthetization solution must always include good information on the origin of the set, its potential purposes, its requirements for usage, a data quality indication, a data diversity indication, a description of (potential) bias, and risk descriptions including mitigating measures based on a risk evaluation framework.
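The information items listed above can be captured in a simple record that travels with every synthetic dataset. Below is a minimal sketch in Python. The class name, field names and example values are hypothetical and only mirror the items mentioned in the text.

```python
from dataclasses import dataclass, field, fields

@dataclass
class SyntheticDatasetSheet:
    """Documentation that accompanies one synthetic dataset (illustrative only)."""
    origin: str = ""
    intended_purposes: list = field(default_factory=list)
    usage_requirements: list = field(default_factory=list)
    quality_indication: str = ""
    diversity_indication: str = ""
    bias_description: str = ""
    risks_and_mitigations: dict = field(default_factory=dict)

    def missing_items(self):
        """Names of the documentation items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

sheet = SyntheticDatasetSheet(
    origin="Synthesized from the 2021 claims dataset (hypothetical example)",
    intended_purposes=["model training", "test data for acceptance environment"],
    quality_indication="Column means within 2% of the source",
)
print(sheet.missing_items())
```

A check like `missing_items()` could be used as a gate: a synthetic dataset is only released for use once every item has been filled in.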

Synthetic data is a new phenomenon for most digital companies. Understanding its potential and risks will allow you to keep up with the latest developments and stay ahead of your competition, or even your clients!

Posted on 25/01/2022

Do you know the physical location of your data?

“Where is your data stored?” is a question that is not asked often enough. This interesting topic was brought up by the owner of a Dutch cloud company during a radio program.

It surprised me, as a Data Professional who has lost count of the number of times I have asked this question. Why is this still not a common concern?

When I ask the question, the reactions vary. Oftentimes the answer is a simple “I don’t know”. When I do find the right specialist to tell me these details, the conversation goes something like this:

“Where is the data stored?”
“In the cloud.”
“Yes, but where are those servers located?”
“In Europe.”
“Yes, what country in Europe?”
“In The Netherlands.”
“Yes, do you know in which city?”
“Yes, in Amsterdam.”

The tedious process goes to show that it’s a question that is not asked often enough. But it matters. 

It’s interesting: it seems we don’t realise often enough that although data is digital, it always has a physical aspect to it. Much like us, the data has to live somewhere. In turn, cyber security is not just about the digital security of who can access your data, but also about the physical security. What physical measures are in place to ensure that no one breaches your data centres? Do you know who can access the building where your data lives? However, security is but one concern when it comes to the physical location of data.

Laws and regulations determine where data may be stored and processed, and processing includes ‘viewing’ data. GDPR, for instance, restricts transfers of personal data to countries outside the European Economic Area unless specific safeguards are in place. That means some international cloud providers become a difficult option when they don’t have data centres in Europe. If your data centres are located in a different country, you’re automatically dealing with cross-border transfers. Be vocal about this towards your cloud provider. If you don’t decide where your data is physically stored, they will. Then it’s out of your control, and quite possibly not in line with legal requirements.

Did you know that there are countries that demand their data is stored and processed locally only?
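Asking the question becomes easier once the answers are written down. The sketch below, in Python, assumes a hand-maintained inventory of where each dataset lives and a list of approved countries. The dataset names, providers and the policy itself are all hypothetical.

```python
# Hypothetical policy: countries where we have approved storage.
ALLOWED_COUNTRIES = {"NL", "DE", "IE"}

# Hypothetical inventory: where each dataset physically lives.
data_locations = [
    {"dataset": "customers", "provider": "cloud-a", "country": "NL", "city": "Amsterdam"},
    {"dataset": "web_logs", "provider": "cloud-b", "country": "US", "city": "Ashburn"},
    {"dataset": "invoices", "provider": "cloud-a", "country": None, "city": None},
]

def review_locations(inventory, allowed):
    """Return (dataset, finding) pairs for every location that needs attention."""
    findings = []
    for item in inventory:
        if item["country"] is None:
            findings.append((item["dataset"], "location unknown: ask the provider"))
        elif item["country"] not in allowed:
            findings.append(
                (item["dataset"], f"stored in {item['country']}: check transfer rules")
            )
    return findings

findings = review_locations(data_locations, ALLOWED_COUNTRIES)
for dataset, finding in findings:
    print(dataset, "->", finding)
```

The "location unknown" finding is the important one: it is exactly the “I don’t know” answer from the conversation above, made visible.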

Location may also influence the reliability of your data. If your data centres are located in an area where power outages are common, you will be dealing with limited availability. Or if it is a politically unstable region, your data centre may be put out of the running altogether. Next to that, the further away the physical location, the higher the risk of limited connectivity. Both the quality of the network and the distance can greatly impact connectivity. All these aspects can influence your ability to deliver the data-driven digital products and services your customers are paying for. In other words, thinking about the location of your data is business critical.

So my takeaway for you: start asking the question. Do you know where your data is stored?

Posted on 23/11/2021, updated 24/11/2021

Take the confusion out of your metadata discussions

Business metadata

All good data starts with business metadata. This business metadata is the information we need to build a dataset. There is someone in the business who approved the collection and processing of data in the first place. He/she also provides requirements and descriptions of what is needed. The challenge is that this information is often not managed well enough over time, which causes the quality of business metadata to decrease, and with it the quality of the data itself. That in turn affects your AI solutions, BI reporting and integration with platforms.

Become aware of the necessity and value of business metadata: it supports data requests and makes data findable and understandable!

When we know what business stakeholders want, we can design this and implement it in physical form through technical metadata. We can now build the solution or buy it off the shelf and map it to the business metadata.

Operational metadata

Now that we know what data we need and what it means, and have a place to store and process it, we can start doing business. Doing business will generate operational metadata. Operational metadata is very valuable in monitoring our data processes. We get insight into what data is processed, how often and how fast. This is great input for analysing the performance of our IT landscape and seeing where improvements can be made. Further, we monitor the access to systems and data. When we take it a step further, we can even start analysing patterns and possibly spot odd behaviour as a signal of threats to our data.

Step into the driving seat by capturing and analysing your operational metadata, and become pro-active in controlling your IT landscape!
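Capturing operational metadata does not have to be complicated. As a minimal sketch in Python, the wrapper below records start and end times, record counts and an error log for one data process. The function name and the metadata keys are assumptions for this example.

```python
import time
from datetime import datetime, timezone

def run_with_metadata(process_name, records, transform):
    """Run `transform` on each record and return (results, execution metadata)."""
    started = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    results, errors = [], []
    for i, rec in enumerate(records):
        try:
            results.append(transform(rec))
        except Exception as exc:
            # Keep processing, but log which record failed and why.
            errors.append({"record_index": i, "error": repr(exc)})
    metadata = {
        "process": process_name,
        "started_at": started.isoformat(),
        "ended_at": datetime.now(timezone.utc).isoformat(),
        "duration_seconds": time.perf_counter() - t0,
        "records_in": len(records),
        "records_out": len(results),
        "error_log": errors,
    }
    return results, metadata

# Example: converting text amounts to integers; "n/a" cannot be converted.
results, meta = run_with_metadata("parse_amounts", ["10", "20", "n/a", "40"], int)
print(meta["records_in"], meta["records_out"], len(meta["error_log"]))
```

Stored over time, records like `meta` are what make it possible to analyse how often a process runs, how long it takes and where it fails.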

Social metadata

Finally, we take social metadata as an inspiration. This is where the value of your data becomes even more tangible. Value is determined by the benefit the user experiences. The way the user uses the data is then an indicator of value. Thus, if we start measuring what data is used often by many users, this data must be important and valuable. Invest in improving the quality of that data to improve the value created. Behaviour is also a good indicator to measure: how much time is spent on content, and which content is skipped quickly? Apparently the skipped content doesn’t match what the user is looking for.

Measure social metadata to analyse what data is used often by many. It is likely to be more valuable than other data.
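As a small illustration of such a measurement, the Python sketch below summarises a usage log per dataset: the number of views, the number of unique users and the average viewing time. The log format and the names in it are made up for this example.

```python
from collections import defaultdict

# Hypothetical usage log: (user, dataset, seconds spent viewing).
usage_log = [
    ("ann", "sales_2021", 120), ("bob", "sales_2021", 300),
    ("cem", "sales_2021", 45), ("ann", "hr_headcount", 5),
    ("bob", "hr_headcount", 3), ("ann", "sales_2021", 200),
]

def summarize_usage(log):
    """Rank datasets by how many different users view them, then by views."""
    views = defaultdict(int)
    users = defaultdict(set)
    seconds = defaultdict(int)
    for user, dataset, secs in log:
        views[dataset] += 1
        users[dataset].add(user)
        seconds[dataset] += secs
    summary = [
        {"dataset": d, "views": views[d], "unique_users": len(users[d]),
         "avg_seconds": seconds[d] / views[d]}
        for d in views
    ]
    return sorted(summary, key=lambda s: (s["unique_users"], s["views"]), reverse=True)

ranking = summarize_usage(usage_log)
for row in ranking:
    print(row)
```

In this made-up log, one dataset is viewed often and at length, while the other is opened and skipped within seconds. The second pattern is the "content skipped quickly" signal described above.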

Business metadata

Governance metadata
All metadata required to correctly control the data like retention, purpose, classifications and responsibilities.
– Data ownership & responsibilities
– Data retention
– Data sensitivity classifications
– Purpose limitations

Descriptive metadata
All metadata that helps to understand, use and find the data.
– Business terms, data descriptions, definitions and business tags
– Data quality and descriptions of (incidental) events to the data
– Business data models & business lineage

Administrative metadata
All metadata that allows for tracking authorisations on data.
– Metadata versioning & creation
– Access requests, approval & permissions

Technical metadata

Structural metadata
All metadata that relates to the structure of the data itself required to properly process it.
– Data types
– Schemas
– Data Models
– Design lineage

Preservation metadata
All metadata that is required for assurance of the storage & integrity of the data.
– Data storage characteristics
– Technical environment

Connectivity metadata
All metadata that is necessary for exchanging data, like APIs and Topics.
– Configurations & system names
– Data scheduling

Operational metadata

Execution metadata
All metadata generated and captured in execution of data processes.
– Data process statistics (record counts, start & end times, error logs, functions applied)
– Runtime lineage & ETL/ actions on data

Monitoring metadata
All metadata that keeps track of the data processing performance & reliability.
– Data processing runtime, performance & exceptions
– Storage usage

Controlling (logging) metadata
All metadata required for security monitoring & proof of operational compliance.
– Data access & frequency, audit logs
– Irregular access patterns

Social metadata

User metadata
All metadata generated by users of data to
– User provided content
– User tags (groups)
– Ratings & reviews

Behavior metadata
All metadata that can be derived from observation to
– Time content viewed
– Number of users/ views/ likes/ shares


© Coöperatie D8A Directors U.A.
Chamber of Commerce 85811815
VAT number NL863751143B01
