Domain: Internet & Media
To develop a system that creates synthetic data, based on real data, that can be shared safely with third parties.
VRT NWS is the news service of VRT, the Flemish public broadcaster. VRT NWS is active in television, radio and online.
VRT monitors the behaviour of users on its websites. This data is used for analysis, optimisation, recommendations, churn analysis and much more. The technology for these data-driven activities is constantly evolving. Typically, a third party that develops a system during a project has access to anonymised data: the user identifiers are replaced by hashes, but the behavioural data remain unchanged. Because the behaviour itself is left intact, it is theoretically possible to trace the behavioural data back to the user. An example is the Netflix Prize, where the datasets had to be taken offline after a lawsuit.
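To make the weakness of hash-based anonymisation concrete, here is a minimal sketch in Python. The event fields, the salt and the function name `pseudonymise` are illustrative assumptions, not VRT's actual schema or process.

```python
import hashlib


def pseudonymise(events, salt):
    """Replace each user_id with a salted SHA-256 hash; behaviour fields stay untouched."""
    shared = []
    for event in events:
        digest = hashlib.sha256((salt + event["user_id"]).encode("utf-8")).hexdigest()
        shared.append({**event, "user_id": digest})
    return shared


# Hypothetical click events (illustrative only, not real VRT data).
events = [
    {"user_id": "alice", "url": "/nws/politics/1", "ts": "2020-01-01T08:00"},
    {"user_id": "alice", "url": "/nws/sport/7", "ts": "2020-01-01T08:05"},
    {"user_id": "bob", "url": "/nws/sport/7", "ts": "2020-01-01T09:00"},
]
shared = pseudonymise(events, salt="secret-salt")

# The same user always maps to the same hash, so the complete behavioural
# trail of one person is still linked together in the shared dataset.
trail = [e["url"] for e in shared if e["user_id"] == shared[0]["user_id"]]
```

The identifier is hidden, but `trail` still holds the exact sequence of pages one person visited, and it is this trail that can act as a fingerprint.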
Problems of this kind can be avoided by creating synthetic data. This data has the same statistical properties as the real data, but cannot be traced back to a user, as there are no real users behind it. Third parties can use the synthetic data to develop systems.
When the system moves to production, VRT retrains it on real data, so the third party never comes into contact with the real data.
The challenge here is to develop a system that creates synthetic data, based on real data, that can be shared safely with third parties.
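As a rough illustration of what "synthetic data based on real data" can mean, the sketch below fits a first-order Markov chain to page-category sessions and samples new sessions from it. This is a deliberately naive stand-in technique chosen for brevity, not the method the challenge prescribes; the session data and all names (`fit_transitions`, `sample_session`) are hypothetical. A model this simple offers no formal privacy guarantee and captures only pairwise transitions.

```python
import random
from collections import Counter, defaultdict

START, END = "<start>", "<end>"


def fit_transitions(sessions):
    """Count how often each page category follows another across all sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        path = [START, *session, END]
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    return counts


def sample_session(counts, rng, max_len=50):
    """Generate one synthetic session by walking the fitted transition counts."""
    session, state = [], START
    while len(session) < max_len:
        options = counts[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        if state == END:
            break
        session.append(state)
    return session


# Hypothetical "real" sessions (illustrative only).
real_sessions = [
    ["home", "politics", "politics"],
    ["home", "sport"],
    ["sport", "sport", "home"],
]
counts = fit_transitions(real_sessions)
rng = random.Random(42)  # fixed seed so the sketch is reproducible
synthetic_sessions = [sample_session(counts, rng) for _ in range(5)]
```

No synthetic session corresponds to a real user, yet transition frequencies such as "home is followed by sport" approximate those of the input.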
The challenge has the following sample datasets available for download
The synthetic data should have the same statistical properties as the real data. The successful candidate will prove this by performing relevant tests on both the synthetic and the real data:
- statistical analysis
- churn prediction
The tests should give similar results for both sets: the same mean, the same distributions, and the same evaluation scores for the recommender and for churn prediction.
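The statistical part of this comparison could look like the following sketch, which compares the mean of one numeric feature and computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions). The feature "session length in minutes", the generated stand-in samples and the function names are assumptions for illustration; no acceptance threshold is implied by the challenge text.

```python
import random
import statistics
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def compare(real, synthetic):
    """Summarise how closely a synthetic feature matches the real one."""
    return {
        "mean_real": statistics.mean(real),
        "mean_synthetic": statistics.mean(synthetic),
        "ks": ks_statistic(real, synthetic),
    }


rng = random.Random(0)
# Stand-in samples for a hypothetical feature (session length in minutes).
real = [rng.expovariate(1 / 5) for _ in range(2000)]
good_synthetic = [rng.expovariate(1 / 5) for _ in range(2000)]  # same distribution
bad_synthetic = [rng.uniform(0, 10) for _ in range(2000)]       # same mean, other shape

report_good = compare(real, good_synthetic)
report_bad = compare(real, bad_synthetic)
```

Both synthetic samples have roughly the same mean as the real one, but only the first also matches its distribution, which shows up as a smaller `ks` value. This is why the mean alone is not sufficient evidence.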
The system should be provided in two ways:
- Demonstrator for evaluation
- Source code