SeasonFM: A multimodal foundation model for Earth’s seasonal rhythms

Changing phenology
Data revolution
Ongoing project
Authors

Yiluan Song

Song Lab members

Published

April 21, 2026

Keywords

global change biology, environmental data science, phenology, machine learning, foundation model

Abstract

Earth’s biological clock is losing synchrony: spring arrives earlier, but not all organisms keep pace. Every day, billions of observations accumulate across satellites, herbarium collections, and smartphones, yet these data streams remain siloed and incompatible, preventing the integrated understanding we urgently need. We propose SeasonFM, a multimodal foundation model. By learning any-to-any translation across images, text, and tabular observations, SeasonFM synthesizes Earth’s diverse seasonal data into a unified biological clock, enabling researchers to reconstruct centuries of seasonal change, mobilize previously untapped data sources, and forecast the consequences of disrupted synchrony for biodiversity and human well-being.


Multimodal data on seasonality

Earth’s seasonality is observed through an extraordinary diversity of data sources, from satellites orbiting hundreds of kilometers above to citizen scientists photographing flowers on their morning walks. These sources span images (satellite imagery, PhenoCam canopy photos, digitized herbarium specimens, iNaturalist field photos, dashcam footage), text (field notes, photo captions, historical literature, social media posts), and tabular records (phenophase status, image labels, time series of vegetation indices). Each source captures a different facet of seasonal change, at different scales and levels of detail. SeasonFM is designed to learn from these data streams simultaneously and to learn the relationships between them.

Massive observational data on seasonality exist in multiple modalities: images, text, and tabular records. Lines indicate connections between datasets. Eve Carter contributed to the preparation of this figure.

Our fully-annotated training data

Rigorous evaluation of a multimodal model requires ground truth data where every modality is simultaneously observed, a combination rarely available in phenology research. Therefore, we are generating a high-quality fully-annotated dataset including field images, satellite imagery, field notes, and detailed phenophase status labels for 25 trees observed weekly. This effort provides the clean, well-controlled data needed to validate SeasonFM’s approach for cross-modal predictions before scaling.

A snapshot of our fully-annotated dataset being generated in Spring 2026, prepared by Brianna Shepherd. Satellite imagery (Sentinel-2) was retrieved using Google Earth Engine.
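To make the "every modality simultaneously observed" requirement concrete, the sketch below shows one way a weekly observation record could be represented and checked for completeness. The class name, field names, file paths, and phenophase keys are hypothetical illustrations, not the schema of our actual dataset.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class TreeObservation:
    """One weekly visit to one tree (hypothetical schema for illustration)."""
    tree_id: str
    observed_on: date
    field_image_path: Optional[str] = None      # field photo of the tree
    satellite_patch_path: Optional[str] = None  # e.g. a Sentinel-2 patch
    field_note: Optional[str] = None            # free-text observer note
    # Phenophase name -> observed status (keys here are assumed examples)
    phenophase_status: Dict[str, bool] = field(default_factory=dict)

    def is_fully_annotated(self) -> bool:
        """True only when all four modalities are present for this visit."""
        return bool(
            self.field_image_path
            and self.satellite_patch_path
            and self.field_note
            and self.phenophase_status
        )


complete = TreeObservation(
    tree_id="tree-01",
    observed_on=date(2026, 4, 14),
    field_image_path="images/tree-01/2026-04-14.jpg",
    satellite_patch_path="s2/tree-01/2026-04-14.tif",
    field_note="Flower buds swelling; first leaves emerging.",
    phenophase_status={"open_flowers": False, "breaking_leaf_buds": True},
)
missing_note = TreeObservation(
    tree_id="tree-02",
    observed_on=date(2026, 4, 14),
    field_image_path="images/tree-02/2026-04-14.jpg",
    satellite_patch_path="s2/tree-02/2026-04-14.tif",
    phenophase_status={"open_flowers": True},
)
```

In this sketch, only records for which `is_fully_annotated()` returns `True` would count as ground truth for evaluating cross-modal predictions.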

Proposed model architecture

SeasonFM will take images, text, and tabular observations as inputs, each processed by a dedicated encoder: a Vision Transformer for images, a BERT-like transformer for text, and an MLP for tabular data. The resulting token sequences are concatenated and passed through a deep fusion transformer, which learns to integrate information across all modalities through full self-attention. The model is trained with contrastive, predictive, and generative loss objectives, and produces a fused embedding that captures high-dimensional seasonal information. A custom circular loss function ensures the model correctly represents the cyclical nature of calendar time.

Model architecture diagram proposed by Yiluan Song. Xianrui Ha contributed to the preparation of this figure.
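The exact form of the circular loss is not specified here. As a minimal sketch of the idea, one common choice is to map day of year onto the unit circle and penalize the angular difference with 1 - cos. The function names and the 365.25-day year length below are assumptions made for illustration, not the SeasonFM implementation.

```python
import math
from typing import Tuple

DAYS_PER_YEAR = 365.25  # assumption: mean year length used to wrap the calendar


def encode_day_of_year(doy: float) -> Tuple[float, float]:
    """Map a day of year onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * doy / DAYS_PER_YEAR
    return math.sin(angle), math.cos(angle)


def circular_loss(pred_doy: float, true_doy: float) -> float:
    """1 - cos(angular difference): 0 for a match, 2 for half a year apart."""
    delta = 2.0 * math.pi * (pred_doy - true_doy) / DAYS_PER_YEAR
    return 1.0 - math.cos(delta)


def circular_error_days(pred_doy: float, true_doy: float) -> float:
    """Shortest distance in days going either way around the calendar."""
    diff = abs(pred_doy - true_doy) % DAYS_PER_YEAR
    return min(diff, DAYS_PER_YEAR - diff)
```

With this formulation, a prediction of day 364 for a true date of day 2 is treated as 3.25 days off rather than 362, so late-December and early-January observations are recognized as neighbors. A plain squared error on day of year would penalize that prediction as heavily as a true half-year miss.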

Proof of concept

As a first test of SeasonFM’s core idea, we trained a simplified model using a frozen CLIP backbone and a shallow fusion transformer on a small tri-modal dataset of plant images from the USA-NPN Plant Phenophases Gallery, text descriptions, and binary phenophase status records. The model produced clear seasonal clustering in both modality-specific and fused latent spaces, confirming that meaningful seasonal representations emerge from multimodal training. Notably, image and text data contain richer detail than tabular data capture, such as whether flowers are male or female, which may differ in seasonality. The simplified model reached 87.0% accuracy in predicting the status of “open flowers” from images alone and 95.7% from images and text combined.

Inputs, learned latent spaces, and out-of-sample model performance from a simplified version of SeasonFM, trained on a small dataset. Training data were retrieved from USA-NPN and labeled by Brianna Shepherd.

Use case: forecasting pollen concentration

Pollen seasons are intensifying under climate change, yet forecasting pollen concentrations remains challenging, partly because plant phenology responds to environmental cues in complex ways. SeasonFM will integrate multiple data sources, including satellite imagery, field images, and phenophase status records, to produce seasonality embeddings that capture the phenological development of wind-pollinated plant species. Combined with meteorological data, these embeddings can drive more accurate and timely forecasts of airborne pollen concentrations, benefiting people who suffer from seasonal allergies.

SeasonFM will generate a fused embedding that represents plant seasonality from satellite imagery (Sentinel-2, retrieved using Google Earth Engine), field images (to be collected through the Nature’s Notebook citizen science app), and phenophase status records (available through Nature’s Notebook and iNaturalist). Combining this embedding with real-time meteorological data (from Open-Meteo), we will predict pollen concentration at hourly resolution and validate the predictions against data from our pollen sampling network. Andrew Burgess collected the pollen concentration data using Pollen Sense APS400 and contributed to the preparation of this figure.
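As a rough sketch of how the inputs to the pollen forecast could be assembled, the code below builds a request URL for hourly weather from the Open-Meteo forecast endpoint and concatenates one hour of weather with a seasonality embedding. The hourly variable names follow our reading of the Open-Meteo documentation and should be checked against it. The helper names, coordinates, and embedding values are hypothetical, and the downstream forecasting model is deliberately left out.

```python
from typing import Dict, List, Sequence
from urllib.parse import urlencode

OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"

# Assumed hourly variables; verify names against the Open-Meteo docs.
WEATHER_VARIABLES = [
    "temperature_2m",
    "relative_humidity_2m",
    "precipitation",
    "wind_speed_10m",
]


def build_open_meteo_url(
    latitude: float, longitude: float, variables: Sequence[str]
) -> str:
    """Build (but do not send) a request URL for hourly weather variables."""
    query = urlencode(
        {
            "latitude": latitude,
            "longitude": longitude,
            "hourly": ",".join(variables),
        },
        safe=",",  # keep the comma-separated variable list readable
    )
    return f"{OPEN_METEO_URL}?{query}"


def assemble_features(
    embedding: Sequence[float],
    weather_row: Dict[str, float],
    variables: Sequence[str],
) -> List[float]:
    """Concatenate a seasonality embedding with one hour of weather values."""
    return list(embedding) + [weather_row[name] for name in variables]


# Hypothetical example: a 3-dimensional embedding and one hour of weather.
url = build_open_meteo_url(40.44, -79.99, WEATHER_VARIABLES)
features = assemble_features(
    [0.12, -0.40, 0.88],
    {
        "temperature_2m": 14.5,
        "relative_humidity_2m": 62.0,
        "precipitation": 0.0,
        "wind_speed_10m": 11.2,
    },
    WEATHER_VARIABLES,
)
```

Each resulting feature vector would be paired with the pollen concentration measured in the same hour, so that any regression model could be fit and validated against the sampling network.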

Use case: reconstructing Earth’s seasonality

SeasonFM will be used to reconstruct Earth’s seasonal rhythms across centuries by unlocking the full potential of historical data sources. Herbarium specimens have been used to study long-term phenological change, but typically by extracting single metrics, such as the date of first flower. This discards the rich visual information about the degree and progression of seasonal change captured in the image itself. By learning relationships between herbarium images, handwritten labels, and structured records, SeasonFM can extract nuanced seasonal information, illuminate historical trends and patterns, and allow us to reimagine what Earth’s spring once looked like.

Two herbarium specimens of Lindera benzoin (spicebush) collected from Fox Chapel, outside of Pittsburgh, 117 years apart. On Apr 28, 1900 (left), the species was in full flower with no leaves yet. On Apr 27, 2017 (right), the same species was well past flowering and had nearly full-sized leaves, demonstrating a drastic change in seasonality. Images provided by Dr. Mason Heberling.