1. Introduction

The Open Plant Image Archive (OPIA) is a database that catalogs benchmark plant image datasets and documents plant phenotypic traits using high-quality images. It covers staple crops and other plant species across multiple growth and development stages and tissue types, and includes a broad range of benchmark datasets designed for plant phenotyping applications.

Image-based phenotype data are important for smart agriculture. OPIA addresses this need by offering curated image datasets integrated from various sensors and experimental acquisition platforms. These datasets are particularly useful for computer vision phenotyping tasks.

OPIA has an extensible design, so new datasets can be incorporated as they become available. As a centralized repository for plant phenotypic data, OPIA facilitates collaboration, knowledge sharing, and the development of new methods for sustainable and efficient agriculture.

2. Data Curation

2.1. Pipeline of data curation

We have developed a standardized process for curating plant image datasets that gathers high-quality, metadata-rich information from benchmark dataset literature and public websites. The process has six steps: literature search, filtering, meta-information collection, dataset curation, statistical analysis, and dataset archiving. We first search the literature to identify relevant publications and websites. After filtering the results, we collect detailed meta-information about each dataset, including species, growth stages, and tissue types. We then download the datasets and curate them, removing inconsistencies and errors. Statistical analysis is performed to summarize the curated data and identify trends. Finally, we archive each curated dataset so that it remains available to the community.


Figure 1: The OPIA curation pipeline

2.2. Literature retrieval

We searched PubMed, Google Scholar, and other academic search engines using keywords such as “image phenotype datasets” and “plant species”. When a publication included a publicly available plant image phenotype dataset, we archived that dataset in OPIA.

2.3. Curation models

Through literature integration and careful curation, we record essential information about each dataset. When an image dataset covers multiple species for a research task, we split it by species. For each species, we record the dataset description, detailed information, image acquisition device, resolution, sensor type, image format, sampling location, dataset storage size, computer vision task, sampling platform, number of images, potential applications, image annotations, dataset source, and other pertinent details.

In datasets that include trait values, we also record information such as strain names, spatiotemporal classifications of traits, and trait values. This comprehensive documentation ensures a thorough understanding of the dataset's characteristics and facilitates its effective utilization in various research and application scenarios.

Table 1: The curation model for dataset basic information

Data type | Description | Examples
--- | --- | ---
Dataset | Dataset abbreviation | CosegPPB, DeepPheno2017, Durum, GrainSpaceM, etc.
Species | Staple crops and other model plants | Wheat, Rice, Maize, Soybean, Buckwheat, Arabidopsis, etc.
Tissues | Biological tissue/organ of plants | Shoot, Root, Leaf, Grain, Ear, etc.
Computer vision tasks | Basic tasks in the computer vision field | Image classification, Image detection, etc.
Data description | Biological application of the dataset | Classification of leaf diseases, Fine-grained recognition, Identification and localization of root tips, Rice density estimation, etc.
Image count | The total number of images in the dataset | 109, 236, 361, 1034, 2696, 3355, 5090, 16000, 75000, etc.
Data details | A brief introduction to the dataset | The dataset images were acquired from three kinds of device prototypes: P600, G600 and M600 …, etc.
Labeled instances | The total number of labeled instances in the dataset | 275187, 70870, 30729, 77910, etc.
Dataset size | The storage size of the dataset | 3.6G, 1.7G, 2.4G, 583M, 15G, 6.5G, etc.
Potential application | The practical application scenario of the dataset | Plant counting, Tassel counting, Ear counting, Plant counting, locating and sizing, Rice density estimation, Wheat head localization, Classification of leaf diseases, etc.
Annotation types | Annotation based on image information | Semantic segmentation, Image-level annotations, Bounding boxes, Point annotations of root tips, etc.
PMID | Unique identifier of the publication | 34778804, 33313541, 37040495, etc.
Data_url | The uniform resource locator (URL) of the dataset | https://git.io/mtc-uav, https://sites.google.com/site/poppinace/, etc.

Table 2: The curation model for detailed information on each image

Data type | Description | Examples
--- | --- | ---
Acquisition equipment | Image capture device information | DJI Mavic 2 Pro, high-resolution CCD digital camera (E450 Olympus), DJI Phantom 4 Advanced DJI GS Pro, etc.
Sampling geo | Geographic location of image sampling | Nanchang, Jiangxi, China; USA; Mexico; Republic of Sudan; Norway; Belgium; etc.
Sensor | Sensor type used for image sampling | RGB, NIR, IR, CF, etc.
Platform | Image acquisition experiment platform | UAV, Ground-based field robot, Controlled environment stationary platform, Manned mobile platform; Handheld visible light camera in the field; Rail-based field automated gantry, etc.
Resolution | Image resolution | 5472×3648, 3648×2736, 6000×4000, 3820×2767, etc.
Image format | The file format used to store, edit, and display digital images | JPG, PNG, TIF, JPEG, etc.
Mode | The color mode of an image, also called color space | RGB, CMYK, grayscale, RGBA, etc.
Image size | The storage space occupied by the image, in bytes | 875983, 5157226, 10715136, 4033734, 4190557
Label category | For an image classification task, the category each image belongs to (empty for image detection tasks) | AP, NOR, Corn_(maize)___Cercospora_leaf_spot Gray_leaf_spot, null, etc.

Table 3: The definition of plant traits

Data type | Description | Examples
--- | --- | ---
Species | Staple crops and other model plants | Rice, etc.
Trait_name | Detailed description of the trait | The height of whole plant, The total projected area of whole plant, etc.
Abbr | Trait abbreviation | PlantH, PlantTPA, PlantHWr, etc.
Spatio-temporal type | Whether the trait is spatial or temporal | Spatial, temporal
Attribute_category | The traits are classified according to plant, germplasm, stem, and phenology | Plant related traits, panicle related traits, culm related traits, etc.
Definition | The definition of each trait | The straight-line distance from the base of the plant to the highest point of the plant, etc.
Method | Specific calculation method for the trait | Plant height at each time point was extracted, and the maximum plant height was defined as plant height in the whole time series
Survey Strain_tag | The tag corresponding to the image name | R0001, R0002, R0003, etc.
Strain Name_cn | The Chinese name of the plant strain | 龙粳31号, 中嘉早17, 绥粳18, etc.
Strain Breeding_area | The breeding area of the plant strain | Heilongjiang, Zhejiang, Guangdong, Jiangsu, etc.
Strain Subspecies | A taxonomic category that ranks below species, usually a fairly permanent geographically isolated race | Japonica, Indica, etc.

2.4. Trait terms normalization and association

To standardize the representation of plant traits, we use a reference ontology, the Plant Phenotype Traits Ontology (PPTO, https://ngdc.cncb.ac.cn/gwas/browse/ontology), which was established in the context of GWAS (genome-wide association studies). We search PPTO for each recorded trait term and compare the term and its definition with the trait terms in PPTO. Each trait entity is then mapped to its matching PPTO term, which gives every trait a unified representation and makes trait information easier to retrieve and use.

3. Search

OPIA supports searching plant image datasets by species name, plant tissue, and computer vision task. In addition, strain tags are used to filter the plant phenotypic trait data that correspond to an image dataset, so that users can retrieve the phenotype information associated with specific image datasets.
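
The search and filtering behavior described above can be illustrated with a minimal Python sketch. It is not OPIA's implementation: the `DatasetRecord` layout, the function names, and the example rows are all hypothetical, and matching is assumed to be a case-insensitive exact match on each supplied criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    # Hypothetical record layout; field names loosely follow Table 1.
    name: str
    species: str
    tissue: str
    task: str


def search_datasets(records, species=None, tissue=None, task=None):
    """Return the records that match every criterion that was supplied."""
    def matches(value, wanted):
        # A criterion left as None is ignored; comparison is case-insensitive.
        return wanted is None or value.lower() == wanted.lower()

    return [
        r for r in records
        if matches(r.species, species)
        and matches(r.tissue, tissue)
        and matches(r.task, task)
    ]


def traits_for_strains(trait_rows, strain_tags):
    """Keep only the trait rows whose strain tag belongs to the image dataset."""
    wanted = set(strain_tags)
    return [row for row in trait_rows if row["strain_tag"] in wanted]


# Made-up example data, not actual OPIA entries.
RECORDS = [
    DatasetRecord("demo_rice_leaf", "Rice", "Leaf", "Image classification"),
    DatasetRecord("demo_rice_root", "Rice", "Root", "Image detection"),
    DatasetRecord("demo_wheat_ear", "Wheat", "Ear", "Image detection"),
]
TRAITS = [
    {"strain_tag": "R0001", "trait": "PlantH", "value": 101.5},
    {"strain_tag": "R0002", "trait": "PlantH", "value": 96.0},
    {"strain_tag": "R0003", "trait": "PlantH", "value": 88.2},
]

rice_detection = search_datasets(RECORDS, species="rice", task="Image detection")
print([r.name for r in rice_detection])
print(traits_for_strains(TRAITS, ["R0001", "R0003"]))
```

Leaving a criterion unset returns every record for that field, which mirrors a search form in which unused filters are simply ignored.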

4. i-traits

Image-based traits (i-traits) are characteristics of an object or organism that are measured using image analysis techniques. These traits are typically quantifiable. Image-based trait analysis involves capturing images of the object or organism and using specialized software or algorithms to extract measurements from these images (see references 1-3 in Section 8.5).

The “i-traits” page presents each stored image dataset together with its associated trait values. The page has four sections: strain screening, dataset information, image information, and phenotype values. The strain screening section lets users filter and select strains of interest. The dataset information section gives general details about the dataset, such as its origin and species. The image information section shows details about the images in the dataset, including acquisition parameters, image resolution, and sensor type. The phenotype value section presents the trait values associated with the plant images, along with the position of each strain's phenotype value within the population, so that users can see how a strain compares with the rest of the studied population.

5. Downloads

The Downloads page provides access to all archived plant image datasets and their associated phenotypic values. Users can run fuzzy searches on species names, dataset names, organizations, and other relevant keywords to find datasets of interest. Each dataset file and trait value file can be downloaded via FTP. For users who want image data for a particular computer vision task, we also provide evaluation scores (see Section 6) as a reference to help them select datasets that fit their requirements.

6. Dataset evaluation

To assess the quality of each image dataset, we assign Evaluation Scores based on three factors: image quantity, image quality, and the balance of categories in image classification tasks. These scores are determined by normalizing and aggregating the ten calculation indicators listed in the table below. A higher score indicates better overall performance of the dataset.

Factors considered for Evaluation Scores:

1. Image Quantity: The total number of images available in the dataset.
2. Image Quality: The quality and clarity of the images, ensuring they are suitable for analysis.
3. Balance of Categories: The distribution and balance of different categories or classes within the dataset for image classification tasks.

Calculation indicators used for determining the Evaluation Scores:

Data type | Description | Examples
--- | --- | ---
Images | The total number of images in the dataset. | 109, 236, 361, 1034, 2696, 3355, etc.
Labels | The total number of annotated instances in the dataset. | 275187, 70870, 30729, 77910, etc.
StorageSize | The storage size of the dataset. | 3.6G, 1.7G, 2.4G, 583M, 15G, 6.5G, etc.
Tasks | Quantifies the potential applications of the image dataset. | Classification of leaf diseases, Fine-grained recognition, Identification and localization of root tips, Rice density estimation, etc.
Devices | Quantifies the device prototypes in the dataset. | DJI Mavic 2 Pro, high-resolution CCD digital camera (E450 Olympus), DJI Phantom 4 Advanced DJI GS Pro, etc.
SamplingSites | Quantifies the geographic locations of image dataset sampling (from several experimental sites). | Nanchang, Jiangxi, China; USA; Mexico; Republic of Sudan; Norway; Belgium; etc.
Resolution | The average of all image resolutions in the dataset. | 5472×3648, 3648×2736, 6000×4000, 3820×2767, etc.
WithGT | Whether the dataset contains label information. | 1 (with ground truth), 0 (without ground truth)
WithTrait | Whether the dataset contains phenotypic traits analysed by image processing. | 1 (with phenotype values), 0 (without phenotype values)
Gini | Class balance, measured by the Gini index, which evaluates how evenly images are distributed across the classes of an image classification dataset. | 0.779, 0.33, 0.987, etc.

By considering these factors and calculation indicators, we generate Evaluation Scores that provide an objective measure of the dataset's performance. This enables users to compare and select datasets that align with their specific needs and requirements for image classification tasks.
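
The text above says that indicators are normalized and aggregated but does not give the formulas, so the following Python sketch is only one plausible reading. It assumes the Gini indicator is the Gini impurity of the class distribution (1 minus the sum of squared class proportions), that each indicator is min-max normalized across datasets, and that the score is an unweighted mean. The function names and the example numbers are hypothetical.

```python
def gini_impurity(class_counts):
    """Gini impurity of a class distribution: 1 - sum(p_i ** 2).

    0.0 means every image is in one class; values near 1 mean many
    evenly populated classes. Assumed definition, not confirmed by OPIA.
    """
    total = sum(class_counts)
    if total <= 0:
        raise ValueError("class_counts must contain at least one image")
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def evaluation_scores(indicators):
    """Average the normalized indicators for each dataset.

    `indicators` maps an indicator name to one value per dataset,
    with every list in the same dataset order.
    """
    columns = [min_max_normalize(values) for values in indicators.values()]
    n_datasets = len(columns[0])
    if any(len(column) != n_datasets for column in columns):
        raise ValueError("every indicator needs one value per dataset")
    return [
        sum(column[i] for column in columns) / len(columns)
        for i in range(n_datasets)
    ]


# Made-up indicator values for three hypothetical datasets.
example = {"Images": [100, 300, 200], "WithGT": [0, 1, 1]}
print(gini_impurity([50, 50]))
print(evaluation_scores(example))
```

An equal-weight mean is the simplest aggregation; a weighted sum would fit the same structure if some indicators should count more than others.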

7. Image preprocessing tools

The Tools page provides users with five fundamental image preprocessing tools: image cropping, image resizing, image flipping, image denoising, and image dataset partitioning. These tools support batch image processing, allowing users to perform operations on multiple images simultaneously. Each time a user utilizes an image processing tool, the operation is recorded as a task for future reference.

7.1. Image cropping

Users can upload a compressed file containing one or multiple images. On the page, they input four parameters: the desired height and width of the cropped image, as well as the x and y coordinates of the clipping center. The tool supports popular image formats such as PNG, JPEG, JPG, and TIFF. Users can upload files up to 300MB in size.
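
As a rough illustration of how the four cropping parameters relate to a crop rectangle, the Python sketch below converts a center point plus a width and height into pixel bounds. It is an assumption-laden sketch rather than the tool's actual code: the function name is hypothetical, and clamping the rectangle to the image bounds is an assumed behavior that the text above does not specify.

```python
def crop_box(image_width, image_height, center_x, center_y, crop_width, crop_height):
    """Return (left, top, right, bottom) for a crop centered on a point.

    The tuple uses the left/upper/right/lower order that Pillow's
    Image.crop accepts. Parts of the rectangle that fall outside the
    image are clamped to the image bounds (an assumption).
    """
    if crop_width <= 0 or crop_height <= 0:
        raise ValueError("crop width and height must be positive")

    left = center_x - crop_width // 2
    top = center_y - crop_height // 2
    right = left + crop_width
    bottom = top + crop_height

    # Clamp the rectangle so it never extends past the image.
    left, top = max(0, left), max(0, top)
    right, bottom = min(image_width, right), min(image_height, bottom)

    if right <= left or bottom <= top:
        raise ValueError("crop rectangle lies entirely outside the image")
    return (left, top, right, bottom)


# A 1000x800 crop around the center of a 5472x3648 image.
print(crop_box(5472, 3648, 2736, 1824, 1000, 800))
```

With a box in this form, the same rectangle can be applied to every image in a batch that shares the same resolution.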

7.2. Image resizing

Users can upload compressed files containing single or multiple images. On the page, they specify the desired height and width of the resized image. To utilize the results of previous image processing tasks, users can simply enter the historical task ID without re-uploading the image. Similar to image cropping, the tool supports common image formats and allows for files up to 300MB in size.

7.3. Image flipping

The image-flipping tool enables users to rotate images counterclockwise without changing their size. Users can input parameters such as the rotation angle and flipping direction. The tool has default values for parameters that are not entered. If users wish to utilize the results of previous image processing tasks, they can simply enter the historical task ID without the need to upload the images again. The tool supports popular image formats including PNG, JPEG, JPG, and TIFF. Users can upload files up to 300MB in size.

7.4. Image denoising

The image denoising tool supports general image formats such as PNG, JPEG, JPG, and TIFF. Users can upload files up to 300MB in size. This tool does not require any input parameters from the user. It applies a Gaussian filter to the image, calculates the absolute difference between the filtered image and the original image, and extracts the image boundary to determine the noise ratio. Based on the noise ratio, the program selects appropriate denoising methods.
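
The first two steps described above (Gaussian filtering, then the absolute difference from the original) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the tool's algorithm: it uses a fixed 3×3 Gaussian approximation, works on a grayscale grid given as nested lists, and defines the noise ratio as the fraction of pixels whose difference exceeds a hypothetical threshold of 20. The boundary-extraction step and the choice of denoising method are not specified in the text and are not reproduced here.

```python
# 3x3 approximation of a Gaussian kernel; the weights sum to 16.
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def gaussian_blur(gray):
    """Blur a grayscale image (list of rows) with the 3x3 kernel.

    Pixels outside the image are replaced by the nearest edge pixel.
    """
    height, width = len(gray), len(gray[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    acc += KERNEL[dy + 1][dx + 1] * gray[yy][xx]
            blurred[y][x] = acc / 16.0
    return blurred


def noise_ratio(gray, threshold=20.0):
    """Fraction of pixels that differ from their blurred value by more than threshold."""
    blurred = gaussian_blur(gray)
    height, width = len(gray), len(gray[0])
    noisy = sum(
        1
        for y in range(height)
        for x in range(width)
        if abs(gray[y][x] - blurred[y][x]) > threshold
    )
    return noisy / (height * width)


# A flat 5x5 image with a single bright outlier in the middle.
image = [[100] * 5 for _ in range(5)]
image[2][2] = 255
print(noise_ratio(image))
```

A ratio computed this way could then be compared against cut-off values to pick a denoising method, although the cut-offs and methods OPIA uses are not stated.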

7.5. Image partitioning

To perform image partitioning, users need to provide a compressed package containing multiple image files. Input parameters include the proportion of the training set and the computer vision task type of the dataset. If the uploaded image dataset is intended for classification tasks, the folder name of each image corresponds to the category it belongs to. The program will automatically divide the dataset into separate training set and testing set folders based on the original category labels and the specified partition ratio. Image partitioning allows users to create a well-balanced dataset for training and evaluation purposes, ensuring that the model is tested on unseen data. This process aids in assessing the model's performance and generalization capabilities.
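
For classification datasets, the per-category split described above amounts to a stratified train/test split. The Python sketch below shows one way to do this, assuming the category is the name of the folder that directly contains each image. The function names, the fixed random seed, and the example paths are hypothetical, and the rounding rule for the training share is an assumption.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def label_from_path(path):
    """Use the name of the containing folder as the category label."""
    return PurePosixPath(path).parent.name


def stratified_split(paths, train_fraction, seed=0):
    """Split image paths into train and test lists, category by category.

    Each category contributes roughly `train_fraction` of its images to
    the training set, so class proportions are preserved in both sets.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")

    by_label = defaultdict(list)
    for path in paths:
        by_label[label_from_path(path)].append(path)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        members = sorted(by_label[label])
        rng.shuffle(members)
        n_train = round(len(members) * train_fraction)
        train += [(p, label) for p in members[:n_train]]
        test += [(p, label) for p in members[n_train:]]
    return train, test


# Made-up folder layout: two categories with ten images each.
paths = [f"healthy/img_{i:02d}.jpg" for i in range(10)]
paths += [f"rust/img_{i:02d}.jpg" for i in range(10)]
train_set, test_set = stratified_split(paths, train_fraction=0.8)
print(len(train_set), len(test_set))
```

Splitting within each category, rather than over the pooled images, keeps rare categories represented in both the training and testing folders.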

8. Support

8.1. Funding Support

  • Strategic Priority Research Program of Chinese Academy of Sciences (XDA08020102)
  • National Natural Science Foundation of China (32000475)
  • Youth Innovation Promotion Association of CAS (2017141)

8.2. Comments & Collaborations

We look forward to worldwide comments, suggestions and guidance from colleagues and peers with common research interests. We also invite the scientific community to submit their datasets to OPIA and to build collaborations in improving the functionalities of OPIA.

8.3. Feedback

We would love to hear from you with any questions or comments. Our contact information is listed below.

Telephone: +86 (10) 8409-7620
Fax: +86 (10) 8409-7298
Email: opia@big.ac.cn
Postal Address:
National Genomics Data Center, Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS)
NO 1 Beichen West Road, Chaoyang District, Beijing 100101, China

8.4. Licenses

OPIA is free for academic use only. For any commercial use, please contact us for commercial licensing terms.

8.5. References

  • Wang, W., Guo, W., Le, L., Yu, J., Wu, Y., Li, D., Wang, Y., Wang, H., Lu, X., Qiao, H. et al. (2023) Integration of high-throughput phenotyping, GWAS, and predictive models reveals the genetic architecture of plant height in maize. Mol. Plant, 16, 354-373.
  • Tang, Z., Chen, Z., Gao, Y., Xue, R., Geng, Z., Bu, Q., Wang, Y., Chen, X., Jiang, Y., Chen, F. et al. (2023) A Strategy for the Acquisition and Analysis of Image-Based Phenome in Rice during the Whole Growth Period. Plant Phenomics, 5, 0058.
  • Zhang, G., Zhou, J., Peng, Y., Tan, Z., Zhang, Y., Zhao, H., Liu, D., Liu, X., Li, L., Yu, L. et al. (2023) High-throughput phenotyping-based quantitative trait loci mapping reveals the genetic architecture of the salt stress tolerance of Brassica napus. Plant Cell Environ., 46, 549-566.