Remi Verhoeven (Senior Manager at KPMG) on: the character traits, technical skills and soft skills required to become a successful data professional. Listen to hear about the ways data professionals can improve their skills to increase their chances in the recruitment process.
The concept of synthetic data generation is the following: take an original dataset, which is based on actual events, and create a new, artificial dataset with similar statistical properties from it. These similar properties allow for the same statistical conclusions as if the original dataset had been used.
Generating synthetic data increases the amount of data by adding slightly modified copies of existing data or newly created synthetic records derived from it. It creates new, representative data that can be processed into output that plausibly could have been drawn from the original dataset.
Synthetic data is created through the use of generative models: unsupervised machine learning models that automatically discover and learn the regularities and patterns in the original data.
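As a simple illustration (not a production approach), the sketch below fits a multivariate normal distribution to a numeric dataset and samples new, artificial records with similar statistical properties; real synthetization solutions typically use richer generative models such as GANs or variational autoencoders. The input file and column contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical original dataset with numeric columns (e.g. age, income, claim_amount).
original = pd.read_csv("original_dataset.csv")  # assumption: numeric columns only

# Learn simple statistical properties of the original data:
# per-column means and the covariance between columns.
mean = original.mean().to_numpy()
cov = original.cov().to_numpy()

# Sample a new, artificial dataset with similar statistical properties.
rng = np.random.default_rng(seed=42)
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=len(original)),
    columns=original.columns,
)

# The synthetic data should support the same high-level statistical conclusions.
print(original.describe())
print(synthetic.describe())
```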
Large-volume synthetic data as a twin of the original data
Why is synthetic data important now?
With the rise of Artificial Intelligence (AI) and Machine Learning, the need for large and rich (test & training) data sets is increasing rapidly. This is because AI and Machine Learning models are trained with an incredible amount of data, which is often difficult to obtain or generate without synthetic data. In most sectors, large datasets are not yet available at scale; think of health data, autonomous vehicle sensor data, image recognition data and financial services data. By generating synthetic data, more and more data becomes available. At the same time, consistency and availability of large data sets are a solid foundation for a mature Development/Test/Acceptance/Production (DTAP) process, which is becoming a standard approach for data products & outputs.
Existing initiatives on federated AI (where data availability is increased by keeping the data at the source and sending the AI model to the source to run there) have proven to be complex due to differences between these data sources and their quality. In other words, data synthetization achieves more reliability and consistency than federated AI.
An additional benefit of generating synthetic data is compliance with privacy legislation. Synthesized data is less easily (though not impossibly) traced back directly to an identified or identifiable person. This increases the opportunities to use data: enabling data transfers to cross-border cloud servers, extending data sharing with trusted 3rd parties and selling data to customers & partners.
Relevant considerations
Privacy
Synthetization increases data privacy but is not in itself an assurance of compliance with privacy regulations.
A good synthetization solution will:
include multiple data transformation techniques (e.g., data aggregation);
remove potentially sensitive data;
include ‘noise’ (randomization added to datasets);
perform manual stress-testing.
Companies must realize that even with these techniques, additional measures such as anonymization can still be relevant.
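As a minimal illustration of two of these techniques, aggregation and noise, the sketch below coarsens a quasi-identifier and adds random noise to a numeric column. The column names are hypothetical, and a real solution would combine several such transformations with stress-testing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical dataset with a quasi-identifier (age) and a sensitive numeric value (salary).
df = pd.DataFrame({"age": [23, 37, 41, 58], "salary": [32000, 54000, 61000, 78000]})

# Data aggregation: replace exact ages by 10-year bands to reduce identifiability.
df["age_band"] = (df["age"] // 10) * 10

# Noise: add randomization to the numeric column while keeping its overall statistics similar.
df["salary_noisy"] = df["salary"] + rng.normal(
    loc=0, scale=0.05 * df["salary"].std(), size=len(df)
)

# Remove the original, potentially sensitive columns before sharing.
shared = df[["age_band", "salary_noisy"]]
print(shared)
```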
Outliers
Outliers may be missing: synthetic data mimics real-world data, but it is not an exact replica of it. As a result, synthetic data may not cover some of the outliers that the original data has. Yet outliers are important for training & test data.
Quality
Quality of synthetic data depends on the quality of the data source. This should be taken into account when working with synthetic data.
Black-box
Although data synthetization is taking center stage in current hype cycles, for most companies it is still in the pioneering phase. This means that at this stage the full effect of unsupervised data generation is unclear. In other words, it is data generated by machine learning for machine learning: a potential double black box. Companies need to build evaluation systems for the quality of synthetic datasets. As the use of synthetic data methods increases, assessing the quality of their output will be required. A trusted synthetization solution must always include good information on the origin of the set, its potential purposes and requirements for usage, a data quality indication, a data diversity indication, a description of (potential) bias, and risk descriptions including mitigating measures based on a risk evaluation framework.
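One simple building block for such an evaluation system, shown below as a sketch, is to compare the distribution of each column in the original and synthetic datasets, for example with a Kolmogorov-Smirnov test. A full quality framework would add diversity, bias and outlier checks on top of this.

```python
import pandas as pd
from scipy.stats import ks_2samp

def compare_columns(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Compare the distribution of each shared numeric column in the two datasets."""
    rows = []
    for col in original.columns.intersection(synthetic.columns):
        stat, p_value = ks_2samp(original[col], synthetic[col])
        rows.append({"column": col, "ks_statistic": stat, "p_value": p_value})
    return pd.DataFrame(rows)

# Hypothetical usage: low KS statistics (and high p-values) indicate similar distributions.
# report = compare_columns(original, synthetic)
# print(report.sort_values("ks_statistic", ascending=False))
```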
Synthetic data is a new phenomenon for most digital companies. Understanding its potential and risks will allow you to keep up with the latest developments and stay ahead of your competition, or even your clients!
“Where is your data stored?” is not asked often enough. An interesting topic, brought up by an owner of a Dutch cloud company during a radio program.
It surprised me, as a Data Professional who has lost count of the number of times I have asked this question. Why is this still not a common concern?
When I ask the question, the reactions vary. Often the answer is a simple “I don’t know.” When I do find the right specialist to tell me these details, the conversation goes something like this:
“Where is the data stored?” “In the cloud.” “Yes, but where are those servers located?” “In Europe.” “Yes, what country in Europe?” “In The Netherlands.” “Yes, do you know in which city?” “Yes, in Amsterdam.”
The tedious process goes to show that it’s a question that is not asked often enough. But it matters.
It’s interesting: we don’t seem to realise often enough that although data is digital, it always has a physical aspect to it. Much like us, the data has to live somewhere. In turn, cyber security is not just about the digital question of who can access your data, but also the physical one. What physical measures are in place to ensure that no one breaches your data centres? Do you know who can access the building where your data lives? However, security is but one concern when it comes to the physical location of data.
Laws and regulations mandate where data may be stored and processed. And processing includes ‘viewing’ data. The GDPR, for instance, restricts transfers of personal data outside of Europe. That means that some international cloud providers are automatically a non-option when they don’t have data centres in Europe. If your data centres are located in a different country, you’re automatically dealing with cross-country transfers. Be vocal about this towards your cloud provider. If you don’t decide where your data is physically stored, they will. Then it’s out of your control. And most likely not in line with legal requirements.
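If, for example, your data lives in AWS S3 (an assumption made only for this sketch), you can ask the provider’s API where each bucket is physically located and flag anything outside Europe.

```python
import boto3

# Assumption: AWS credentials are configured and the data lives in S3 buckets.
s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    # get_bucket_location returns None for the default us-east-1 region.
    region = s3.get_bucket_location(Bucket=name)["LocationConstraint"] or "us-east-1"
    flag = "OK" if region.startswith("eu-") else "OUTSIDE EUROPE"
    print(f"{name}: {region} ({flag})")
```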
Did you know that there are countries that demand their data is stored and processed locally only?
Location may also influence the reliability of your data. If your data centres are located in an area where power outages are common, you will be dealing with limited availability. Or if it is a politically unstable region, your data centre may be put out of the running altogether. Next to that, the further away the physical location, the higher the risk of limited connectivity. Both the quality of the network and the distance can greatly impact connectivity. All these aspects can influence your ability to deliver the data-driven digital products and services your customers are paying for. In other words, thinking about the location of your data is business critical.
So my takeaway for you: start asking the question. Do you know where your data is stored?
Geneviève Meerburg (Director SME Services at van Spaendonck) on: implementing a data strategy within her organisation. Geneviève shares how the importance and value of data organically grew, leading to a concrete need for a data strategy. Listen to hear how van Spaendonck approached truly living through the principles set out in the data strategy and how it helped create new services for their clients.
Date with D8A
Data strategy for small and medium-sized enterprises
All good data starts with business metadata. This business metadata is the information we need to build a dataset. There is someone in the business who approved the collection and processing of the data in the first place. He or she also provides requirements and descriptions of what is needed. The challenge is that this information is often not managed well enough over time, which causes business metadata quality to decrease, and with it the quality of the data itself. That affects your AI solutions, BI reporting and integration with platforms.
Become aware of the necessity and value of business metadata to enable support on data requests and to make data findable and understandable!
When we know what business stakeholders want, we can design and implement this in physical form through technical metadata. We can now build the solution or buy it off the shelf and map it to the business metadata.
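As a simple illustration of how business metadata can be captured and mapped to its technical counterpart, the sketch below uses hypothetical field names; in practice this would live in a data catalogue rather than in code.

```python
from dataclasses import dataclass

@dataclass
class BusinessMetadata:
    """What the business approved and requires (hypothetical fields)."""
    term: str             # business term, e.g. "Customer"
    definition: str       # business definition
    owner: str            # who approved collection & processing
    purpose: str          # purpose limitation
    retention_years: int  # how long the data may be kept

@dataclass
class TechnicalMetadata:
    """How the requirement is implemented in a physical system."""
    system: str           # e.g. the CRM database
    table: str
    column: str
    data_type: str

# Map the business metadata to its technical implementation.
customer = BusinessMetadata("Customer", "A party that buys our services",
                            "Sales director", "Service delivery", 7)
customer_table = TechnicalMetadata("crm_db", "customers", "customer_id", "UUID")
catalogue = {customer.term: (customer, customer_table)}
print(catalogue["Customer"])
```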
Operational metadata
Now that we know what data we need and what it means, and have a place to store and process the data, we can start doing business. Doing business generates operational metadata. Operational metadata is very valuable for monitoring our data processes. We get insights into what data is processed, how often, and at what speed and frequency. This is great input for analysing the performance of our IT landscape and seeing where improvements can be made. Furthermore, we monitor access to systems and data. When we take it a step further, we can even start analysing patterns and possibly spot odd behaviour as signals of threats to our data.
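A minimal sketch of capturing such execution metadata around a data process is shown below; the function and field names are hypothetical.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)

def run_with_operational_metadata(process_name, process, records):
    """Run a data process and capture execution metadata: start/end time, record counts, errors."""
    metadata = {"process": process_name, "start": time.time(),
                "records_in": len(records), "errors": 0}
    results = []
    for record in records:
        try:
            results.append(process(record))
        except Exception:
            metadata["errors"] += 1
    metadata["end"] = time.time()
    metadata["records_out"] = len(results)
    metadata["runtime_seconds"] = metadata["end"] - metadata["start"]
    logging.info("operational metadata: %s", metadata)
    return results

# Hypothetical usage: monitor how often, how fast and how reliably data is processed.
run_with_operational_metadata("uppercase_names", str.upper, ["alice", "bob"])
```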
Step into the driving seat by capturing and analysing your operational metadata and become proactive in controlling your IT landscape!
Social Media metadata
Finally, we take social metadata as an inspiration. This is where the value of your data becomes even more tangible. Value is determined by the benefit the user experiences, so the way users use the data is an indicator of its value. Thus if we start measuring which data is used often and by many users, that data must be important and valuable. Invest in improving the quality of that data to improve the value created. Behaviour is also a good indicator to measure: how much time is spent on content, and which content is skipped quickly? Content that is skipped apparently doesn’t match what the user is looking for.
Measure social metadata to analyse what data is used often by many. It is likely to be more valuable than other data.
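A minimal sketch of deriving such social metadata from access logs follows; the log format and dataset names are hypothetical.

```python
from collections import Counter

# Hypothetical access log: which user consulted which dataset.
access_log = [
    {"user": "anna", "dataset": "customer_orders"},
    {"user": "bram", "dataset": "customer_orders"},
    {"user": "anna", "dataset": "marketing_leads"},
    {"user": "carla", "dataset": "customer_orders"},
]

# Data used often and by many distinct users is likely to be more valuable.
usage = Counter(entry["dataset"] for entry in access_log)
distinct_users = {d: len({e["user"] for e in access_log if e["dataset"] == d}) for d in usage}

for dataset, count in usage.most_common():
    print(f"{dataset}: {count} accesses by {distinct_users[dataset]} users")
```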
Business metadata
Governance metadata
All metadata required to correctly control the data, such as retention, purpose, classifications and responsibilities.
– Data ownership & responsibilities
– Data retention
– Data sensitivity classifications
– Purpose limitations

Descriptive metadata
All metadata that helps to understand, use and find the data.
– Business terms, data descriptions, definitions and business tags
– Data quality and descriptions of (incidental) events to the data
– Business data models & business lineage

Administrative metadata
All metadata that allows for tracking authorisations on data.
– Metadata versioning & creation
– Access requests, approval & permissions
Technical metadata
Structural metadata
All metadata that relates to the structure of the data itself, required to properly process it.
– Data types
– Schemas
– Data models
– Design lineage

Preservation metadata
All metadata that is required for assurance of the storage & integrity of the data.
– Data storage characteristics
– Technical environment

Connectivity metadata
All metadata that is necessary for exchanging data, like APIs and topics.
– Configurations & system names
– Data scheduling
Operational metadata
Execution metadata
All metadata generated and captured in the execution of data processes.
– Data process statistics (record counts, start & end times, error logs, functions applied)
– Runtime lineage & ETL / actions on data

Monitoring metadata
All metadata that keeps track of data processing performance & reliability.
– Data processing runtime, performance & exceptions
– Storage usage

Controlling (logging) metadata
All metadata required for security monitoring & proof of operational compliance.
– Data access & frequency, audit logs
– Irregular access patterns
Social metadata
User metadata
All metadata generated by users of the data, such as:
– User-provided content
– User tags (groups)
– Ratings & reviews

Behavior metadata
All metadata that can be derived from observing how users interact with the data, such as:
– Time content was viewed
– Number of users / views / likes / shares
Bart Rentenaar (Enterprise Data Lead at Athora) on: implementing data innovation within his organisation. Bart shares examples of use cases that inspired him to get started with data innovations, the framework he employs to structure initiatives, examples of data innovations he implemented and the team that made that possible. Listen for tips when starting out with the implementation of data innovations.
Arjan Pepping (Corporate Data Manager at MN) on: creating awareness around trusted data and the role of data in control for a pension provider. Listen for the golden tip on implementing data awareness.
Marinka Voorhout (Director at Philips) on: how data quality in design is becoming a prerequisite for innovations on data. Listen for practical tips and ideas to take data quality into account in user interfaces.
In this trailer of the Date with D8A podcast series, Simone from D8A explains the idea behind the D8A initiative and the what, why and how of the Date with D8A podcast.
It is time to increase acknowledgement of the importance of a chief data officer.
As companies move towards working data-driven, monetizing data in new and enhanced services and products is essential. Traditionally heavily regulated industries, e.g. financial services and healthcare, first focused on bringing their data in control. Their efforts concentrated mainly on data quality management, data privacy, data governance and E2E trusted data lineage. These efforts are often led — or owned — by a Chief Data Officer (CDO).
In this article, we advocate to shift or extend this focus of the chief data officer towards data in control AND data in use.
CDOs define and communicate the company’s vision on data management and data use. Through this vision, the CDO gives direction and guidance, advocates for change and sets priorities for running projects. Most companies’ CDOs have to some extent achieved this for data management. The extended focus of the Chief Data Officer, which we advocate for in this article, contains standard processes for the design, prototyping, development, productizing and use of data & insights products & services. Furthermore, it is the CDO who defines a standard set of technology to support these processes and create these solutions. Where needed, this is based on the data management foundation as implemented by the CDO in previous years.
The Chief Data Officer ideally combines business expertise, a technology background and analytics/BI, extended with commercial sense, an understanding of production processes and knowledge of relevant 3rd-party partners to cooperate with. Organisations without an ‘extended CDO’ will experience difficulties and potential delays in reaching their data-driven goals in line with new developments in the market. Without strategic guidance and steering, there is an increased risk that departments and units will define their own standard processes, sets of technology and data-driven products and services. This makes it harder to leverage pre-existing data foundations and cross-unit collaboration to enable effective market penetration. Teams will struggle to escalate and address growing concerns as sufficient C-level representation is missing.
Concluding, companies benefit from a Chief Data Officer with a focus on data in control and data in use. Top-down ownership and alignment of data initiatives, standardisation of processes and data tooling and a clear escalation path for growing concerns are necessary to succeed as a data-driven company.