Synthetic data is on the rise, particularly structured synthetic data generated by artificial intelligence (AI). From climate research to suicide prevention, from self-driving cars to big-data privacy solutions, synthetic data is finding its way into more and more areas. For data protection projects, test runs with structured synthetic data sets are available free of charge thanks to the Austrian start-up Mostly.ai.

Synthetic data is used in particular where there is not enough (variant) data or where existing original data cannot be used directly for reasons of data protection. For example, in the development of self-driving cars, synthetic data generation is used to try out more variants of similar situations virtually.

In this way, a handful of existing recordings of children suddenly running onto the road can be expanded into thousands of synthetic recordings in which children of different appearances run into the road from different directions, at different times of day, in different weather, and on differently designed road surfaces, without putting real children at risk.

Software for autonomous vehicles can then be virtually confronted with all kinds of synthetically generated situations and prove itself. The US company Parallel Domain was founded in 2017 to create virtual worlds from real street maps. It now fills these worlds with a variety of lighting and weather conditions, as well as synthetic vehicles and people who sometimes behave in surprising ways. Customers include Continental, Google, and Toyota's Woven Planet, which acquired Lyft's self-driving division. Well-known examples of synthetic data for more diversity are AI-generated images of human faces or cats.

When used for privacy purposes, the goal of data synthesis is to anonymize existing data completely and irreversibly without losing the usefulness of the statistical information contained in the real data. However, anonymization through synthetic data only works if essential protective measures are taken. The most obvious example: the AI trained on the real data must not replicate the real data (and metadata) too exactly, otherwise one could simply copy the database.

The original data (including metadata) usually also has to be reduced slightly: distinctive outliers must be removed before the AI is trained on the data, a measure known in English as rare category protection. There are simply not that many Germans who are multiple Formula 1 world champions and have had a serious skiing accident; the risk of recreating Michael S. in the synthetic data would be too great. Further protective measures must be taken when creating databases from synthetic data – the progress that AI experts have made in de-anonymization is remarkable. The details, however, are beyond the scope of this article.
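To illustrate the idea of rare category protection, the following minimal Python sketch filters out records whose categorical value occurs too rarely before any model is trained on them. The data, field names, and threshold are hypothetical; real products apply far more sophisticated safeguards.

```python
from collections import Counter

def rare_category_protection(rows, key, min_count=5):
    """Drop rows whose value for `key` occurs fewer than `min_count`
    times, so that unique outliers cannot resurface in synthetic data."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] >= min_count]

# Hypothetical toy data set: one record is a unique, re-identifiable outlier.
records = [{"profession": "teacher"} for _ in range(10)]
records.append({"profession": "f1_world_champion"})

safe = rare_category_protection(records, "profession")
```

After filtering, `safe` contains only the ten common records; the single outlier is gone before training begins.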



Alexandra Ebert, chair of the IEEE Synthetic Data Industry Connections

Such synthetic data, based on real personal data, counts legally as anonymized data; technically, it may not be personal data at all. There are as yet no standards for synthetic data and its use for data protection purposes. The IEEE Standards Association, an IT industry organization, has a working group doing preliminary work toward standardization. The group is called Synthetic Data Industry Connections and is organized by the Austrian Alexandra Ebert, whose day job is Chief Trust Officer at Mostly.ai, a company based in Vienna and New York City. The company generates synthetic data for customers such as Nvidia, Telefónica, insurance companies, banks, and the city of Vienna, with particular attention to data protection.

In March, Ebert was a guest on episode 58 of the c't data protection podcast "Auslegungssache" to talk about synthetic data. "Synthetic data works in such a way that, in contrast to traditional anonymization, you don't tinker with the original data set and try to delete, change or falsify something; instead, you use the original data set only to train an artificial intelligence. Put simply, this AI then has the task of finding out how the producers of the data behave. What are the statistics, the patterns, the time dependencies," she explained in the episode.
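The two-step process Ebert describes, first learning the statistics of the original data and only then generating entirely new records, can be sketched in a few lines of Python. This is a deliberately naive per-column model with made-up example data; real generators such as Mostly.ai's learn joint patterns and time dependencies with deep generative models.

```python
import random
from collections import Counter

def fit(rows):
    """Learn each column's value distribution from the original data
    (naive: columns are treated as independent of one another)."""
    model = {}
    for col in rows[0]:
        counts = Counter(r[col] for r in rows)
        total = sum(counts.values())
        model[col] = (list(counts), [c / total for c in counts.values()])
    return model

def sample(model, n, seed=0):
    """In a completely separate step, draw n synthetic rows from the
    learned distributions -- no row is copied from the original data."""
    rng = random.Random(seed)
    return [{col: rng.choices(vals, weights=probs)[0]
             for col, (vals, probs) in model.items()}
            for _ in range(n)]

# Hypothetical original data: two retirees at the ATM, one student online.
original = [
    {"segment": "retiree", "channel": "atm"},
    {"segment": "retiree", "channel": "atm"},
    {"segment": "student", "channel": "online"},
]
synthetic = sample(fit(original), 1000)
```

The synthetic rows reproduce the original proportions (roughly two-thirds retirees) without any one-to-one link to a real record.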

Traditional anonymization uses destructive methods: it starts from the original data sets and deletes parts of them. Often not much is left, which limits the usefulness of the data. "Training something like AI (on traditionally anonymized data) is no longer possible in a meaningful way," Ebert stated. At the same time, the risk of re-identification remains, because traditional anonymization no longer works for behavioral big data such as bank transactions or health records. AI is simply too good at re-identification.

Examples are easy to find in the health sector. Artificial insemination would benefit from better assessments of the quality of early-stage embryos (blastocysts). The fertility center at the Kepler University Hospital in Linz, Upper Austria, is researching a suitable AI together with the Software Competence Center Hagenberg. Because not many images of blastocysts are available, an AI (specifically, generative adversarial networks) has created further variants. Similarly, BMW uses an AI for quality assurance in production; it was trained on hundreds of thousands of synthetic images generated at the push of a button.

The US Department of Veterans Affairs has launched "Mission Daybreak," a challenge awarding 20 million US dollars to find ways to reduce suicide rates among ex-service members. In the first round of the competition, 20 projects were selected, which now have access to synthetic data on veterans and their health. The winners are to be announced in the near future, and it will then become clear whether and how they used the synthetic data.

For the financial sector, Ebert describes the example of a bank's transaction data in the c't podcast. This shows, for instance, how often retirees go to the ATM or how often students shop at Amazon. "All of this is learned at a very granular level (by an AI); and then, in a completely separate step, the algorithm is used to generate new synthetic data," Ebert said. "I then have synthetic consumers and their synthetic financial transactions. There is no 1:1 relationship between a real (human) and any synthetic individual." Nevertheless, the same statistics can be found in the data set as in the original data. The patterns that are valuable for the bank are retained, but without any personal reference relevant to data protection.
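Whether the "same statistics" really survive synthesis can be checked by comparing aggregates between the real and synthetic data. The following sketch uses invented example data and a hypothetical `group_mean` helper; real evaluations compare many more statistics, correlations, and distributions.

```python
from statistics import mean

def group_mean(rows, group_key, value_key):
    """Average value per group, e.g. mean ATM visits per customer segment."""
    groups = {}
    for r in rows:
        groups.setdefault(r[group_key], []).append(r[value_key])
    return {g: mean(vs) for g, vs in groups.items()}

# Invented real data: retirees visit the ATM more often than students.
real = [
    {"segment": "retiree", "atm_visits": 8},
    {"segment": "retiree", "atm_visits": 6},
    {"segment": "student", "atm_visits": 2},
    {"segment": "student", "atm_visits": 4},
]
# Hypothetical output of a synthesizer: different individuals,
# but the per-segment averages match the real data.
synthetic = [
    {"segment": "retiree", "atm_visits": 7},
    {"segment": "retiree", "atm_visits": 7},
    {"segment": "student", "atm_visits": 3},
    {"segment": "student", "atm_visits": 3},
]
```

Here `group_mean(real, ...)` and `group_mean(synthetic, ...)` agree, even though no synthetic row equals any real one.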

In other words, the stories told by the synthetic data are very similar to the stories in the original data, but the characters involved are different. Properly synthesized, the result is not a simple remix of real data but newly created data sets. The industry promises that over 90 percent of the information contained in a data set is preserved through synthesis; with traditional anonymization, even done correctly, often only a single-digit percentage remains.

The synthesized data can be shared with third parties or published as open data. And, of course, a company can use its own synthetic data where it is not allowed to evaluate the original data because that data was collected for other purposes (legal keyword: purpose limitation).

To make it easier for companies and researchers to start working with synthetic data for privacy purposes, Mostly.ai provides a free generator for experiments. With this test data generator, each user can upload their own data and generate up to 100,000 rows of synthetic data per day.


(ds)

