The concept of synthetic data generation is the following: take an original dataset based on actual events and create a new, artificial dataset from it with similar statistical properties. These similar properties allow for the same statistical conclusions as if the original dataset had been used.
Generating synthetic data increases the amount of available data, either by adding slightly modified copies of existing records or by creating entirely new records derived from the existing data. The result is new, representative data that could plausibly have been drawn from the original dataset.
Synthetic data is created with generative models: unsupervised machine learning models that automatically discover and learn the regularities and patterns in the original data.
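To make this concrete, here is a minimal sketch of the idea using a simple generative model (a Gaussian mixture from scikit-learn). The file name original.csv is hypothetical, and the data is assumed to be numeric; production-grade synthetization solutions typically use richer models such as GANs or variational autoencoders.

```python
# Minimal sketch: fit a simple generative model to an original dataset
# and sample a synthetic "twin" with similar statistical properties.
import pandas as pd
from sklearn.mixture import GaussianMixture

original = pd.read_csv("original.csv")  # hypothetical file; numeric columns assumed

# Unsupervised learning of the regularities/patterns in the original data.
model = GaussianMixture(n_components=10, random_state=0)
model.fit(original.values)

# Sample as many artificial rows as the original has.
synthetic_values, _ = model.sample(len(original))
synthetic = pd.DataFrame(synthetic_values, columns=original.columns)

# The synthetic set should support similar statistical conclusions.
print(original.mean(), synthetic.mean(), sep="\n")
```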
Large-volume synthetic data as a twin of the original data
Why is synthetic data important now?
With the rise of Artificial Intelligence (AI) and Machine Learning, the need for large and rich (test and training) datasets is increasing rapidly. AI and Machine Learning models are trained on enormous amounts of data, which are often difficult to obtain or generate without synthetic data. In most sectors, large datasets are not yet available at scale; think of health data, autonomous vehicle sensor data, image recognition data and financial services data. By generating synthetic data, more and more data will become available. At the same time, the consistency and availability of large datasets are a solid foundation for a mature Development/Test/Acceptance/Production (DTAP) process, which is becoming a standard approach for data products and outputs.
Existing initiatives on federated AI (where data availability is increased by keeping the data at its source and sending the AI model to the source to run the algorithms there) have proven to be complex due to differences between these data sources and their quality. In other words, data synthetization achieves more reliability and consistency than federated AI.
An additional benefit of generating synthetic data is compliance with privacy legislation. Synthesized data is less (though not zero) directly traceable to an identified or identifiable person. This increases the opportunities to use data: enabling data transfers to cross-border cloud servers, extending data sharing with trusted third parties and selling data to customers and partners.
Relevant considerations
Privacy
Synthetization increases data privacy but is not in itself an assurance of compliance with privacy regulations.
A good synthetization solution will (a brief sketch of two of these techniques follows the list):
include multiple data transformation techniques (e.g., data aggregation);
remove potentially sensitive data;
add ‘noise’ (randomization) to datasets;
perform manual stress-testing.
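As an illustration of the first three techniques, consider the sketch below: aggregation, removal of a sensitive field and noise injection on a tabular dataset. The column names ("age", "postcode", "income") are hypothetical examples, not prescribed by any particular tool.

```python
# Sketch of transformation techniques from the list above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Aggregation: replace exact ages with coarse 10-year bands.
    out["age"] = (out["age"] // 10) * 10
    # Removal: drop a potentially sensitive direct identifier.
    out = out.drop(columns=["postcode"])
    # Noise: add small random perturbations to numeric values.
    out["income"] = out["income"] + rng.normal(0, out["income"].std() * 0.05, len(out))
    return out
```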
Companies must realize that even with these techniques, additional measures such as anonymization can still be relevant.
Outliers
Outliers may be missing: synthetic data mimics real-world data, but it is not an exact replica of it. Synthetic data may therefore not cover some of the outliers that the original data contains. Yet outliers are important in training and test data.
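A quick, illustrative way to check this is to compare the tails of both datasets. The sketch below assumes original and synthetic are pandas DataFrames with matching numeric columns.

```python
# Do the tails of the synthetic data cover the extremes of the original?
def tail_coverage(original, synthetic, q=0.99):
    for col in original.columns:
        lo_ok = synthetic[col].min() <= original[col].quantile(1 - q)
        hi_ok = synthetic[col].max() >= original[col].quantile(q)
        print(f"{col}: lower tail covered={lo_ok}, upper tail covered={hi_ok}")
```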
Quality
The quality of synthetic data depends on the quality of its source data: biases and errors in the original propagate into the synthetic set. This should be taken into account when working with synthetic data.
Black-box
Although data synthetization is taking center stage in current hype cycles, for most companies it is still in the pioneering phase. This means that at this stage the full effect of unsupervised data generation is unclear: it is data generated by machine learning for machine learning, a potential double black box. Companies need to build evaluation systems for the quality of synthetic datasets; as the use of synthetic data methods increases, assessment of the quality of their output will be required. A trusted synthetization solution must always include:
good information on the origin of the dataset;
its potential purposes and requirements for usage;
a data quality indication;
a data diversity indication;
a description of (potential) bias;
risk descriptions, including mitigating measures based on a risk evaluation framework.
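One possible building block for such an evaluation system is a per-column statistical comparison of the original and synthetic datasets. The sketch below uses a two-sample Kolmogorov-Smirnov test from scipy; it is one of many conceivable quality indicators, not a complete evaluation framework.

```python
# Compare each column's distribution in the original vs. the synthetic set.
# A small p-value flags columns whose distributions have drifted.
from scipy.stats import ks_2samp

def distribution_report(original, synthetic, alpha=0.05):
    report = {}
    for col in original.columns:
        stat, p_value = ks_2samp(original[col], synthetic[col])
        report[col] = {"ks_stat": stat, "p_value": p_value, "similar": p_value > alpha}
    return report
```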
Synthetic data is a new phenomenon for most digital companies. Understanding its potential and risks will allow you to keep up with the latest developments and stay ahead of your competition, or even your clients!
In the digital world, there are two main flavours: those with extensive data and those that require extensive data.
Find your data entrepreneurship
(In this article, we leave out data-native Big Tech companies.)
Those with extensive data are, in fact, the (international) corporations with trusted brands, mature system landscapes and long-lasting relationships with customers and partners. They can build upon large quantities of (historical) data, consistently used for existing processes and products. These corporations could do much more with their data, maneuvering (the Gambit) real value out of it.
Most corporations have already invested in structural advantages for a competitive data edge: a supporting platform infrastructure, data quality monitoring, established data science teams and a data steward / data scientist attitude. For a maximal return on those investments, companies need to go the extra mile.
A strategy for data
The most common pitfall of a data strategy is that it becomes an overview of big words only, with a (too!) high focus on technology and analytics. Yet technology should be an enabler, and analytics is just a manifestation. Don’t gamble with data (products): a good data strategy starts with a clear vision, related to market, technology and (regulatory) developments. Include a target operating model to achieve the strategy. But most of all, include a view on the value of data. Determine the use-case types that will create the most value. Large corporations have an unparalleled knowledge of their industry and markets and are uniquely positioned to oversee this. Of course, there are value cases for efficiency gains and productivity improvements. But limiting yourself to these obvious values tends to close doors on new opportunities. Companies must have a clear ambition pathway to data-driven revenue. This new revenue can include rewiring customer interaction, creating a completely new product or business, and stepping into new markets.
In practice, data-driven revenues prove to be more difficult than imagined. The effort of introducing new products in new markets, combined with uncertain results, makes companies hesitant. Without a solid and funded ambition and a defined risk appetite, this can result in only minimal innovations, such as adding data features (apps!). Compared to data-native companies, this minimal innovation sometimes seems small potatoes. A clear data strategy gives companies mature guidance on innovation KPIs, investments, risks, and market opportunities. The data strategy will help to build success and develop new services, products and even ventures.
Data equals assets
In general, there are two flavours when it comes to data within companies: companies have less data than they realize, or companies have more data than they realize and under-utilize it due to insufficient awareness of its value. Understanding the value of your data is based on five pillars:
History
Historical data cannot be easily replicated: years of data about customers, production, operations, financial performance, sales, maintenance, and IP are enormously valuable. Such historical data is beneficial for increasing operational efficiency, building new data products and growing customer intimacy. Although Big Tech companies have been around for some years already, they cannot compete with dedicated historical datasets. If the (meta)data is of good quality, the value increases even more. Mapping where this data resides gives an up-to-date overview of relevant data throughout the system landscape.
Privacy
Corporations are highly aware of the relevance of privacy regulations and have adopted data privacy measures and controls into their data operations. This way, the data that is available complies with (global) data privacy legislation.
Integration
Being part of a traditional chain with external suppliers and receivers (e.g., supplying materials to a manufacturer who sells them to a retailer) allows a company to leverage its data into multiple views on, for example, sourcing and warehouse management. Established corporations are uniquely situated to build data chains. Having a trusted brand creates traction for cooperation and partnerships to capture, integrate, store, refine and offer data and insights to existing and new markets.
“Understanding the value of data requires real entrepreneurship”
Extension
Large corporations can enhance existing and new products with data, e.g., through sensor data. Big Tech companies are currently doing this mostly for software products; more traditional companies are particularly capable of doing it for hardware products. This way of thinking is still very much underdeveloped, because it is difficult to introduce a new product or, even harder, to enter a new market with a new product. Yet it is also the ultimate opportunity! Build data entrepreneurship by starting small while understanding the full potential of data. An example of a small start is identifying whether a data model can be IP, e.g., when it is part of a larger hardware product. In real life, starting small often means focusing on a solution that is close to home, e.g., joining multiple datasets into one and/or building a dashboard that can be offered to customers as an extended service. These are often chosen for feasibility reasons. From a data product perspective, don’t consider such an approach small; consider it not even a start. Companies that do not progress beyond these products should at least run a simultaneous experimental track, building and failing new products and services to learn what works and what doesn’t. Understanding the value of data requires entrepreneurship (see also the example of Rolls-Royce here).
Data entrepreneurship
Large and established corporations are the epitome of entrepreneurship; it is at their very core. Yet often not enough for data. Data can be so alien to them that experimenting for value is hesitant or does not happen at all. And this is where start-up companies are not lacking. They might not have the large historical datasets, trusted data chains or easy connections with available hardware products, but they do have the entrepreneurial spirit and are highly aware of the value of data. And they have the capability to experiment and become successful with new products.
Becoming data entrepreneurial means knowing which data you have, understanding its (potential) value and daring to look beyond the obvious.