Austrian synthetic data startup MOSTLY AI has launched a $100,000 prize challenge to create the best synthetic data set from a real data set.
The challenge is open to everyone and anyone. Applicants will be judged on how anonymised the data is, how accurately it reflects the original data, usability, and compute efficiency. The goal is to spark much-needed innovation in the synthetic data space — and the winner's code will be open-sourced afterwards for public use.
MOSTLY's own privacy-preserving synthetic data platform mimics real data without exposing sensitive information, with high-fidelity outputs that are recognised as some of the most accurate in the market, making them suitable for advanced AI and machine learning applications.
MOSTLY's platform enables organisations to safely unlock access to their sensitive data assets and realise the full potential of this data to drive AI innovations and, in doing so, address the problems with historical data anonymisation.
The company is one of Austria's best-funded startups, having raised $25 million in 2022. It supports global clients including Citi Bank, the U.S. Department of Homeland Security, and Erste Group, and it recently open-sourced its core tech product to advance understanding and innovation in the space.
I spoke with MOSTLY AI's Chief AI & Data Democratisation Officer, Alexandra Ebert, to learn more.
According to Ebert, the company wanted to do something bold: "something that hasn't really been done in the past 20 years, at least not at this scale."
"The last time something similar happened was the Netflix Prize, which offered a $1 million reward. While we're not Netflix (yet!), the idea is similar: to spark innovation using synthetic data."
The need for better synthetic data
With AI coming under much more pressure from data privacy advocates, big corporations and startups are pivoting to synthetic data to train and inform their AI models (Nvidia just acquired a synthetic data startup for $320 million). Governments are catching on too: synthetic data is mentioned, for example, in the UK Government's AI Opportunities Action Plan.
According to Ebert, synthetic data has immense potential, not just for businesses but for society at large.
"It can help accelerate healthcare research, climate insights, and open up innovation for startups and smaller players by giving them access to granular, relevant, privacy-safe data.
"The goal is to inspire many more competitions in the future, where synthetic data can play a central role in making meaningful datasets more accessible. It's a push away from the unrealistic 'toy datasets' we see on platforms like Kaggle, toward something much closer to real-world complexity and value."
What kind of data are participants working with?
The competition uses real-world data that is publicly available but not widely known — so it's more realistic than typical Kaggle datasets, but still accessible.
According to Ebert, "We've lightly masked the datasets by replacing some column names with fun placeholders like 'cat' and 'dolphin' to prevent reverse engineering."
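For readers curious what this kind of light masking looks like in practice, here is a minimal Python sketch. It is not MOSTLY AI's actual procedure: the function name `mask_column_names`, the sample records, and the original column names are all hypothetical, and only the placeholder names 'cat' and 'dolphin' come from Ebert's description.

```python
def mask_column_names(rows, columns_to_mask, placeholders):
    """Rename selected columns with neutral placeholder names.

    rows: list of dicts, one dict per record (hypothetical layout).
    columns_to_mask: names of the columns to hide.
    placeholders: replacement names, at least one per masked column.
    """
    if len(columns_to_mask) > len(placeholders):
        raise ValueError("not enough placeholder names")
    mapping = dict(zip(columns_to_mask, placeholders))
    # Only the keys (column names) change; the values are left untouched.
    masked = [{mapping.get(k, k): v for k, v in row.items()} for row in rows]
    return masked, mapping


# Hypothetical records, not taken from the competition data.
records = [
    {"age": 34, "postcode": "1010", "income": 52000},
    {"age": 58, "postcode": "8020", "income": 61000},
]
masked, mapping = mask_column_names(
    records, ["postcode", "income"], ["cat", "dolphin"]
)
```

The values stay exactly as they were, so the statistical structure participants need is preserved, while the renamed columns make it harder to guess which public dataset the records came from.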
There are two independent challenges:
- The FLAT DATA Challenge uses static data (think, for example, customer records, where entries don't change much).
- The SEQUENTIAL DATA Challenge uses sequential data (like financial transactions or mobile location patterns), which is significantly more complex.
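The difference between the two tracks is easier to see in data form. The Python sketch below is purely illustrative: the field names (`customer_id`, `step`, `amount`) and the helper `group_sequences` are assumptions made for this example, not the competition's schema.

```python
from collections import defaultdict

# Flat data: one self-contained row per subject.
flat_rows = [
    {"customer_id": 1, "age": 34, "segment": "retail"},
    {"customer_id": 2, "age": 58, "segment": "business"},
]

# Sequential data: several ordered events per subject.
events = [
    {"customer_id": 1, "step": 2, "amount": 40.0},
    {"customer_id": 2, "step": 1, "amount": 15.5},
    {"customer_id": 1, "step": 1, "amount": 12.0},
]


def group_sequences(events, key="customer_id", order="step"):
    """Group events by subject and sort each group into sequence order."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event[key]].append(event)
    return {k: sorted(v, key=lambda e: e[order]) for k, v in grouped.items()}


sequences = group_sequences(events)
```

A synthetic version of the flat table only has to reproduce relationships between columns within a row. A synthetic version of the event data also has to reproduce how each subject's events follow one another, and how many events each subject has, which is one reason the sequential track is harder.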
So far the competition has seen particular interest from students and people in the early stages of their computer science careers, especially from regions like the Global South.
While the $100k prize pool may not attract top-tier data scientists from Meta or AWS, it's a big draw for emerging talent globally.
Ebert explained:
"We only have two key eligibility rules: participants must have a GitHub account created before the competition launch (to avoid people gaming the system with multiple accounts), and their submissions must meet minimum privacy and accuracy thresholds to be considered for the leaderboard."
MOSTLY has already seen strong submissions for the static data challenge, and while the sequential one is more technically demanding, "it's wide open — there's still $50k up for grabs in each track. So we're encouraging as many people as possible to get involved."
What are the judges looking for in submissions?
Besides privacy and accuracy, the top five submissions in each challenge will also be evaluated on creativity, ease of use, and generalisability.
Ebert added:
"We're not just looking for solutions that overfit the dataset — we want ideas that could be useful across domains and inspire broader applications of synthetic data."