CLARIN’s repository contains datasets, models and software. Most are products of the language technology program, but the repository also holds various other data submitted by institutions and individuals. On malbankinn.is, the website of The Icelandic Language Bank (CLARIN-IS B-centre), most of the repository’s contents are listed in a structured way, giving a good overview of what is available.
The CLARIN B-Centre in Iceland, hosted by the Árni Magnússon Institute for Icelandic Studies, has changed its name and is now called The Icelandic Language Bank. To mark the occasion, a new website, https://malbankinn.is, was launched today, providing secure and easy access to Icelandic language resources.
Everyone is welcome to download materials from the bank, but the main target groups are researchers and students in the humanities and social sciences who study Icelandic language and society, as well as developers who wish to access datasets, models, and tools related to language technology. The data continue to be hosted on CLARIN’s repository.
The Icelandic Gigaword Corpus (IGC) has now been expanded with data from 2022 and 2023. This additional data can be downloaded from the CLARIN repository and searched on the Corpora Website of the Árni Magnússon Institute. In addition, the Corpora Website has been updated and some minor flaws have been fixed.
The first edition of The Icelandic Gigaword Corpus was published in 2018, and new editions appeared every year for the first five years. Each time, new data was added and tagging methods were improved. The first edition contained about 1,259 million running words, while the second edition contained 2,439 million running words. It was not considered necessary to republish the corpus in its entirety this time, as the methods for tagging and processing texts have not changed since the last edition was issued. Instead, an addendum with data from 2022 and 2023 was published, containing around 162 million running words. On the Corpora Website, users can search a new version of the corpus in which the new data has been added to the 2022 edition.
The Árni Magnússon Institute's language processing website is up again, enhanced and improved. There you can use the following tools, both by pasting text into a form and by using an API: