Project overview
The digital revolution has transformed biodiversity research, with millions of plant photos and associated metadata becoming available from diverse sources. These include scanned botanical literature, digitized herbarium specimens, and citizen science platforms such as iNaturalist. Many of these resources are accompanied by species identifications, geographical locations, and collection dates.
Together, they represent a vast yet underutilized resource not only for species-distribution mapping but also for extracting plant traits critical for ecological and evolutionary research.
This project aims to leverage these rich digital archives using cutting-edge artificial intelligence (AI) techniques, focusing on computer vision and large language models (LLMs). It will develop state-of-the-art AI models capable of analysing plant images to extract essential biological traits such as leaf size, flower symmetry, and flower colour. These traits provide critical insights into species' ecological strategies, evolutionary adaptations, and responses to environmental change. By analysing the visual features in these images, the project seeks to automate trait extraction, making it faster, scalable, and applicable across large datasets.
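To make the trait-extraction idea concrete, the toy sketch below reduces one trait, flower colour, to nearest-reference-colour matching on a pre-segmented flower region. The reference palette, function name, and synthetic patch are all illustrative assumptions, not part of the project's actual pipeline, which would use learned segmentation and trained models.

```python
import numpy as np

# Illustrative RGB prototypes (assumed for this sketch, not project-defined)
REFERENCE_COLOURS = {
    "white":  (240, 240, 240),
    "yellow": (230, 200, 40),
    "red":    (200, 40, 40),
    "purple": (130, 60, 160),
}

def dominant_colour(flower_pixels: np.ndarray) -> str:
    """Return the reference colour nearest the mean of an (..., 3) RGB array."""
    mean = flower_pixels.reshape(-1, 3).mean(axis=0)
    dists = {name: np.linalg.norm(mean - np.array(rgb))
             for name, rgb in REFERENCE_COLOURS.items()}
    return min(dists, key=dists.get)

# A synthetic patch of yellowish pixels classifies as "yellow"
patch = np.full((16, 16, 3), (225, 195, 50), dtype=float)
print(dominant_colour(patch))  # -> yellow
```

A real system would replace both the segmentation assumption and the fixed palette with learned components, but the input/output contract (image region in, categorical trait out) is the same.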
Digitized herbarium specimens offer unique opportunities, as they combine high-resolution images with textual information, such as handwritten labels containing collection dates, locations, and descriptions of plant characteristics. Extracting and organizing this data remains a significant challenge due to the variability in handwriting styles, languages, and formats. To address this, the project will integrate AI-driven methods for natural language processing.
Herbarium labels will be automatically digitized, enabling the extraction of both structured and unstructured information. Specifically, open-source LLMs will process textual data extracted from herbarium labels, providing context to the associated plant images. These models will enable the identification of key details, such as ecological conditions or plant-specific traits, which may not be directly apparent from the image alone. By combining visual and textual data, the project will create a comprehensive framework for understanding plant species distributions and their traits, maximizing the utility of existing biodiversity resources.
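One common pattern for this kind of label processing is to prompt an LLM for JSON and parse the reply into a structured record. The sketch below shows only that surrounding scaffolding; the prompt wording, field names, and the sample label and reply are hypothetical, and the LLM call itself is left out since the project has not committed to a specific model or API.

```python
import json

# Hypothetical prompt template; the field names are illustrative only.
PROMPT_TEMPLATE = (
    "Extract the collector, collection date, locality, and any habitat "
    "notes from this herbarium label. Respond with JSON only.\n\n"
    "Label text:\n{label_text}"
)

def build_prompt(label_text: str) -> str:
    """Fill the template with OCR-transcribed label text for the LLM."""
    return PROMPT_TEMPLATE.format(label_text=label_text)

def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply, tolerating surrounding prose."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start : end + 1])

# A plausible (invented) reply for a label reading
# "Leg. A. Smith, 12 Jun 1897, riverbank near Reading, damp soil":
reply = ('{"collector": "A. Smith", "date": "1897-06-12", '
         '"locality": "riverbank near Reading", "habitat": "damp soil"}')
record = parse_response(reply)
print(record["collector"])  # -> A. Smith
```

The resulting dictionary can then be linked to the specimen image, which is how the textual context described above becomes machine-readable alongside the visual data.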
The project involves addressing several challenges. For instance, trait extraction from images requires handling variability in image quality, lighting conditions, and perspectives. Other challenges include handling diverse image sources, incomplete metadata, and harmonizing visual and textual data. To overcome these challenges, the student will explore self-supervised learning and contrastive pretraining techniques, enabling models to learn robust feature representations from unlabelled data. Trained on large paired datasets, such models can associate images with text, learning meaningful connections between visual concepts and natural-language descriptions. For textual data, issues such as incomplete metadata, domain-specific terminology, and variation in language will be mitigated using LLMs fine-tuned on botanical literature and herbarium-specific corpora.
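The contrastive image–text objective mentioned above can be sketched in a few lines. A minimal NumPy version of a symmetric InfoNCE-style loss (the kind used in CLIP-like models) is shown below on random embeddings; the embedding dimensions, batch size, and temperature value are arbitrary choices for illustration, not project settings.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; the loss
    pulls each image toward its own caption and away from the others.
    """
    # L2-normalise so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (batch, batch)
    labels = np.arange(len(logits))              # diagonal = true pairs

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # cross-entropy in both directions (image->text and text->image)
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
# Perfectly aligned pairs give a lower loss than mismatched pairs
aligned = contrastive_loss(emb, emb)
shuffled = contrastive_loss(emb, emb[::-1].copy())
print(aligned < shuffled)  # -> True
```

In practice the embeddings would come from trainable image and text encoders and the loss would be minimised by gradient descent, but the batch-level structure of the objective is exactly this.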
The outcomes of this project will have broad applications. Automating trait extraction and digitizing herbarium records will provide valuable data for ecological modelling, conservation planning, and studies on plant evolution and functional biology. The methodologies developed here could also extend to other biodiversity domains, contributing to global efforts to digitize and analyse natural history collections. This project combines AI innovation with biodiversity science, offering exciting opportunities for PhD students passionate about advancing conservation and understanding the natural world.
Training Opportunities
A comprehensive training programme will be provided, covering applied AI and biodiversity science as well as transferable professional and research skills. The project includes a placement of between 3 and 18 months with an AI-INTERVENE project partner. The student will present at national and international conferences, gaining experience at the forefront of the discipline and excellent future employment prospects.
Student profile
The ideal student should have a background in computer science, data science, ecology, or a related field, with strong skills in machine learning and computer vision. Experience in Python and/or AI frameworks (e.g., TensorFlow, PyTorch) is desirable.
Familiarity with image processing, natural language processing (NLP), and herbarium datasets would be beneficial. Knowledge of biodiversity science, plant ecology, or remote sensing is a plus but not mandatory. The student should be passionate about applying AI to solve real-world problems in conservation and biodiversity research and possess strong problem-solving and analytical skills.
Lead supervisor
University of Reading
Co-supervisor
Natural History Museum