This week SiloGen, the large language model (LLM) arm of Silo AI, one of Europe's largest private AI labs, launches a consortium together with TurkuNLP, a research group at the University of Turku, to develop a family of open LLMs, including the world's largest open source LLM.
The initiative comes with compute access totalling approximately 15 million GPU hours and is dedicated to ensuring that the data used in these models accurately represents European languages, while also covering the English-speaking world.
SiloGen has been operational since late 2022 and is currently working on its technology at full speed with clients like Allianz, Sandvik and Tietoevry. Its core focus is to improve downstream and domain-specific applications and to ensure companies can utilise trustworthy models for private, confidential and proprietary data.
The consortium brings together extensive data resources covering all European languages and code, including High Performance Language Technologies (HPLT) data and other collected and curated data. It also has access to compute, including software infrastructure to train LLMs and access to LUMI, the third largest supercomputer in the world and the largest in Europe.
Having built LLMs on LUMI for over a year, the team has developed a distinctive software layer for effectively and efficiently training LLMs on the AMD-based hardware.
According to Sampo Pyysalo, research fellow at the University of Turku and HPLT principal investigator:
"LLMs are rapidly reshaping how we access information and interact with technology. As their impact grows, it is increasingly important to ensure that the models are developed in a transparent and reproducible manner and made openly available to ensure accountability and equal access to the technology.
From a European perspective, it is also critical that models are designed from the outset to prioritise multilingualism and an equitable approach to all languages.
The High Performance Language Technologies (HPLT) project is addressing these goals through the creation of open European data resources and language models and is delighted to partner in this consortium with SiloGen and Silo AI, an industry leader with shared goals."
Peter Sarlin, CEO and co-founder of Silo AI, shared:
"This initiative helps to ensure that underlying models are based on data and information representing the citizens and organisations of the region and overall compliance with regulation, data privacy and other vital concerns.
And eventually we need sovereignty on how downstream applications and value creation happen.
This requires trusted and secure approaches to independent base models that enable fine-tuning for domain-specific needs. This way we can ensure digital sovereignty, while advancing technological development."
Beyond open base models, SiloGen is also expanding its LLM development platform to meet the need for more accurate, trustworthy and robust downstream applications. The platform includes tooling for synthetic data generation, human feedback, and quality testing.
It also comes with a long track record in natural language processing (NLP), vision and perception, as exemplified by projects with Allianz, Honda, Rolls-Royce, Sandvik, Tietoevry and Finnish public service media company Yle.
Merja Ylä-Anttila, CEO of Yle, comments:
"For Yle it is of utmost importance that in the years to come we will have readily available access to language models that are based on our languages and that truly reflect our local culture.
We are more than happy to be part of the exploration on how public service media companies around Europe can participate in the development of trustworthy AI technologies, including language models, that take the rich diversity of languages and cultures into consideration."
Lead image: Google DeepMind.