Mobilizing historical biodiversity data with AI: new dataset from Uppsala University

A new dataset has recently been published through GBIF Sweden, based on digitized material from the Kullenberg Excursion Journals (1948–1996), held at the Museum of Evolution, Uppsala University.

The dataset represents an important step in unlocking biodiversity information stored in natural history archives. These journals contain tens of thousands of insect occurrence records, along with rich contextual information such as sampling details and observational notes. Making such data accessible in structured, interoperable formats has traditionally required extensive manual effort, limiting the scale at which archival data can be mobilized.

This work is part of an ongoing effort led by Britt Andermann and colleagues, in collaboration with the Biodiversity Data Lab and the Montejo-Kovacevich Lab, to explore how AI methods can accelerate the transformation of analog archival sources into analysis-ready biodiversity data. By combining high-resolution digitization with automated approaches to taxonomic name recognition and data structuring, the project demonstrates how large volumes of historical records can be processed more efficiently and published as FAIR data.
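To make the idea of "data structuring" concrete, here is a minimal sketch of how one transcribed journal line might be mapped onto a few Darwin Core terms (`scientificName`, `eventDate`, `locality`). It is not the project's actual pipeline: the regular expressions are a deliberately naive stand-in for real taxonomic name recognition, the function name `to_darwin_core` is hypothetical, and the sample line is invented for illustration rather than taken from the journals.

```python
import re

# Naive pattern for a Latin binomial: capitalized genus plus lowercase epithet.
# Assumption: a real workflow would use a dedicated name-recognition tool instead.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

# ISO-style date; assumes dates were already normalized during transcription.
ISO_DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def to_darwin_core(transcribed_line: str, locality: str) -> dict:
    """Map one transcribed line to a small set of Darwin Core terms."""
    name = BINOMIAL.search(transcribed_line)
    date = ISO_DATE.search(transcribed_line)
    return {
        "scientificName": name.group(1) if name else None,
        "eventDate": date.group(1) if date else None,
        "locality": locality,
    }


# Invented example line, not an actual journal entry.
record = to_darwin_core("1952-06-14 Bombus terrestris on clover", "Uppsala")
print(record)
```

A pattern this simple will also match ordinary capitalized phrases, which is one reason real name recognition relies on taxonomic reference lists and human review.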

The dataset published through GBIF Sweden is an initial outcome of this work. It will be expanded as the project progresses and further digitization and data extraction are carried out.

At the recent SBDI Days, Britt Andermann presented this work, highlighting how AI-assisted workflows can reduce the manual burden of transcription while opening up new possibilities for mobilizing biodiversity data at scale. Looking ahead, developments in areas such as handwriting recognition and large language models may enable even more automated extraction of information from handwritten sources, including marginal notes and annotations.

Beyond occurrence data, these archival materials also contain valuable taxonomic information, such as species descriptions and expert determinations. Future work aims to explore how such information can be extracted and linked to existing biodiversity knowledge systems, contributing to a more complete and connected picture of biodiversity across time.

This initiative illustrates the potential of combining digitization, AI, and research infrastructure to recover and reuse historical biodiversity data, helping to fill important temporal and taxonomic gaps.

https://www.gbif.org/dataset/459b5943-9370-42f2-8cd5-9c08f8aa5da4
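
For readers who would like to explore the records programmatically, the sketch below builds a request URL for the public GBIF occurrence search API, filtered to this dataset. The dataset key is taken from the link above. The helper name `occurrence_search_url` and the chosen page size are illustrative assumptions.

```python
from urllib.parse import urlencode

# Dataset key copied from the GBIF dataset URL above.
DATASET_KEY = "459b5943-9370-42f2-8cd5-9c08f8aa5da4"


def occurrence_search_url(dataset_key: str, limit: int = 20, offset: int = 0) -> str:
    """Build a GBIF occurrence search URL restricted to one dataset."""
    params = {"datasetKey": dataset_key, "limit": limit, "offset": offset}
    return "https://api.gbif.org/v1/occurrence/search?" + urlencode(params)


url = occurrence_search_url(DATASET_KEY, limit=5)
print(url)
```

Fetching this URL (for example with `urllib.request.urlopen`) returns JSON with the matching records in a `results` list. Increasing `offset` steps through further pages.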