The Open Plant Image Archive (OPIA) is a database created to catalog benchmark datasets and document plant phenotypic traits using high-quality images. OPIA encompasses a wide array of plant image datasets, covering staple crops and other plant species across diverse growth and development stages, multiple tissue types, and a broad spectrum of benchmark datasets designed for plant phenotyping applications.
Image-based phenotype data is of great importance to smart agriculture. OPIA addresses this need by offering carefully curated image datasets integrated from various sensors and experimental acquisition platforms. These datasets are particularly valuable for computer vision phenotyping tasks, enabling researchers and practitioners to extract meaningful insights and drive advances in the field.
With its extensible design, OPIA provides a robust foundation for expanding and incorporating new datasets, ensuring that it remains at the forefront of plant image archiving. By serving as a centralized repository for plant phenotypic data, OPIA facilitates collaboration, knowledge sharing, and the development of cutting-edge methodologies in the pursuit of sustainable and efficient agricultural practices.
We have developed a standardized process for curating plant image datasets, extracting high-quality, multi-faceted metadata from benchmark dataset literature and public websites. The process consists of six steps: literature search, filtering, metainformation collection, dataset curation, statistical analysis, and dataset archiving. We begin with a comprehensive literature search to identify relevant publications and websites. After filtering the literature, we collect detailed metainformation about each dataset, including species, growth stages, and tissue types. Next, we download the datasets and curate them meticulously, removing inconsistencies and errors. Statistical analysis is then performed to extract insights and identify trends within the data. Finally, we archive the curated dataset, ensuring its availability and facilitating collaboration. This standardized process supports innovative research in plant science and enables advances in sustainable agriculture.
Figure 1. The pipeline of OPIA.
Curation
We conducted literature searches in PubMed, Google Scholar, and other academic search engines, using keywords such as image phenotype datasets and plant species names. Whenever the literature included a publicly available plant image phenotype dataset, we archived it in OPIA.
Through literature integration and meticulous curation, we log essential information related to the dataset. In cases where the image dataset encompasses multiple species for a research task, we split the dataset by species. For each species, we record dataset description, detailed information, image acquisition device, resolution, sensor type, image format, sampling location, dataset storage size, computer vision task, sampling platform, number of images, potential applications, image annotations, dataset source, and other pertinent details.
In datasets that include trait values, we also record information such as strain names, spatiotemporal classifications of traits, and trait values. This comprehensive documentation ensures a thorough understanding of the dataset's characteristics and facilitates its effective utilization in various research and application scenarios.
Field | Description | Examples |
---|---|---|
Dataset | Dataset abbreviation | CosegPPB, DeepPheno2017, Durum, GrainSpaceM, etc. |
Species | Staple crops and other model plants | Wheat, Rice, Maize, Soybean, Buckwheat, Arabidopsis, etc. |
Tissues | Biological tissue/organ of plants | Shoot, Root, Leaf, Grain, Ear, etc. |
Computer vision tasks | Basic tasks in the computer vision field | Image classification, Image detection, etc. |
Data description | Dataset biological application | Classification of leaf diseases, Fine-grained recognition, Identification and localization of root tips, Rice density estimation, etc. |
Image count | The total number of images in the dataset | 109, 236, 361, 1034, 2696, 3355, 5090, 16000, 75000, etc. |
Data details | A brief introduction to the dataset | The dataset images were acquired from three kinds of device prototypes: P600, G600 and M600 …, etc. |
Labeled instances | The total number of labeled instances in the dataset | 275187, 70870, 30729, 77910, etc. |
Dataset size | The storage size of the dataset | 3.6G, 1.7G, 2.4G, 583M, 15G, 6.5G, etc. |
Potential application | The application scenario of the dataset in practice | Plant counting; Tassel counting; Ear counting; Plant counting, locating and sizing; Rice density estimation; Wheat head localization; Classification of leaf diseases; etc. |
Annotation types | Annotation based on image information | Semantic segmentation, Image-level annotations, Bounding boxes, Point annotations of root tips, etc. |
PMID | Unique identification of the publication information | 34778804, 33313541, 37040495, etc. |
Data_url | The uniform resource locator (URL) of the dataset | https://git.io/mtc-uav, https://sites.google.com/site/poppinace/, etc. |
Field | Description | Examples |
---|---|---|
Acquisition equipment | Capture image device information | DJI Mavic 2 Pro, high-resolution CCD digital camera (E450 Olympus), DJI Phantom 4 Advanced DJI GS Pro, etc. |
Sampling geo | Geographic location of image sampling | Nanchang, Jiangxi, China; USA; Mexico; Republic of Sudan; Norway; Belgium; etc. |
Sensor | Sensor type of image sampling | RGB, NIR, IR, CF, etc. |
Platform | Image acquisition experiment platform | UAV, Ground-based field robot, Controlled environment stationary platform, Manned mobile platform; Handheld visible light camera in the field; Rail-based field automated gantry, etc. |
Resolution | Image resolution size | 5472×3648, 3648×2736, 6000×4000, 3820×2767, etc. |
Image format | A file format used to store, edit, and display digital images. | JPG, PNG, TIF, JPEG, etc. |
Mode | Image mode refers to the color mode of a picture, also called color space. | RGB, CMYK, grayscale, RGBA, etc. |
Image size | The size of the storage space occupied by the image in bytes. | 875983, 5157226, 10715136, 4033734, 4190557 |
Label category | For an image classification task, the category each image belongs to. (For image detection tasks, this field is empty.) | AP, NOR, Corn_(maize)___Cercospora_leaf_spot Gray_leaf_spot, null, etc. |
Field | Description | Examples |
---|---|---|
Species | Staple crops and other model plants | Rice, etc. |
Trait_name | Detailed description of traits | The height of whole plant, The total projected area of whole plant, etc. |
Abbr | Trait abbreviation | PlantH, PlantTPA, PlantHWr, etc. |
Spatio-temporal type | Whether the trait is measured across space or time | Spatial, Temporal |
Attribute_category | The traits are classified according to plant, germplasm, stem, and phenology | Plant related traits, panicle related traits, culm related traits, etc. |
Definition | The definition of each trait | The straight-line distance from the base of the plant to the highest point of the plant, etc. |
Method | Specific calculation methods for traits | Plant height at each time point was extracted, and the maximum plant height was defined as plant height in the whole time series |
Survey Strain_tag | The tag corresponding to the image name | R0001, R0002, R0003, etc. |
Strain Name_cn | The Chinese name of plant strain | 龙粳31号, 中嘉早17, 绥粳18, etc. |
Strain Breeding_area | Plant strain breeding area | Heilongjiang, Zhejiang, Guangdong, Jiangsu, etc. |
Strain Subspecies | A taxonomic category that ranks below species, usually a fairly permanent geographically isolated race | Japonica, Indica, etc. |
To achieve a standardized representation of plant traits, we utilize a reference ontology called the Plant Phenotype Traits Ontology (PPTO, https://ngdc.cncb.ac.cn/gwas/browse/ontology), which was established by the GWAS (Genome-Wide Association Studies) resource. Trait entities are mapped to PPTO to unify their representation: we search for each recorded trait term within PPTO and compare the term and its definition against the corresponding PPTO trait entry to assign a unified representation. This unified representation facilitates easy retrieval and utilization of trait information.
OPIA provides efficient search capabilities for plant image datasets based on specific species names, plant tissues, and computer vision tasks. Additionally, we employ filtering techniques to extract plant phenotypic trait data that corresponds to the image dataset, using strain tags. This allows for targeted retrieval of relevant plant phenotype information associated with specific image datasets.
Image-based traits (i-traits) refer to characteristics or features of an object or organism that are assessed or measured using image analysis techniques. These traits are typically quantifiable and can provide valuable information about the object or organism being studied. Image-based trait analysis involves capturing images of the object or organism and using specialized software or algorithms to extract relevant information or measurements from these images (See reference 1-3).
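As a toy illustration of how an i-trait might be extracted, the sketch below thresholds a grayscale image into a plant mask and measures two pixel-level quantities resembling traits recorded in OPIA (projected area and plant height). The fixed threshold and the NumPy-array image representation are assumptions for illustration, not the methods used by the archived datasets.

```python
import numpy as np

def extract_itraits(gray_image, threshold=128):
    """Toy i-trait extraction: segment the plant with a fixed
    intensity threshold, then measure the resulting mask.

    Returns (projected_area_px, plant_height_px); both are in
    pixel units and would need calibration in a real pipeline.
    """
    mask = gray_image > threshold            # crude foreground segmentation
    area = int(mask.sum())                   # total projected area in pixels
    rows = np.flatnonzero(mask.any(axis=1))  # image rows containing plant pixels
    height = int(rows[-1] - rows[0] + 1) if rows.size else 0
    return area, height
```

In practice, dedicated segmentation models and per-pixel calibration replace the fixed threshold, but the pattern of "segment, then measure" is the same.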
On the “i-traits” page, we present the stored image dataset along with its associated trait value data. The trait page consists of four main sections: strain screening, dataset information display, image information display, and phenotype value display. The strain screening section allows users to filter and select specific strains of interest. The dataset information display provides general details about the dataset, such as its origin, species, and other relevant information. The image information display showcases specific details about the images within the dataset, including acquisition parameters, image resolution, and sensor type. The phenotype value display module presents the actual trait values associated with the plant images. Additionally, it provides information on the position of the strain's phenotype value in the population, offering valuable context and insights. By presenting this comprehensive information on the trait page, users can easily access and analyze the trait values, enabling a deeper understanding of the phenotypic characteristics and their distribution within the studied population.
The Download page provides access to all archived plant image datasets and their associated phenotypic values. Users can perform fuzzy searches using species names, dataset names, organizations, and other relevant keywords to find specific datasets of interest. Each dataset file and trait value file can be downloaded via FTP. Additionally, for users who wish to download image data for a particular computer vision task, we provide Evaluation Scores as a reference. These scores assist users in selecting datasets that align with their specific requirements and evaluation criteria. By offering convenient download options and supplementary Evaluation Scores, we aim to enhance accessibility and usability for various research and application purposes.
To assess the quality and performance of each image dataset, we assign Evaluation Scores based on three factors: image quantity, image quality, and the balance of categories in image classification tasks. These scores are determined by normalizing and aggregating ten calculation indicators as outlined in the table below. A higher score indicates a better overall performance of the dataset. Factors considered for Evaluation Scores:
1. Image Quantity: The total number of images available in the dataset.
2. Image Quality: The quality and clarity of the images, ensuring they are suitable for analysis.
3. Balance of Categories: The distribution and balance of different categories or classes within the dataset for image classification tasks.
Calculation indicators used for determining the Evaluation Scores:
Field | Description | Examples |
---|---|---|
Images | The total number of images in the dataset. | 109, 236, 361, 1034, 2696, 3355, etc. |
Labels | The total number of annotated instances in the dataset. | 275187, 70870, 30729, 77910, etc. |
StorageSize | The storage size of the dataset. | 3.6G, 1.7G, 2.4G, 583M, 15G, 6.5G, etc. |
Tasks | Quantifying the potential applications of the image dataset. | Classification of leaf diseases, Fine-grained recognition, Identification and localization of root tips, Rice density estimation, etc. |
Devices | Quantifying the device prototypes in the dataset. | DJI Mavic 2 Pro, high-resolution CCD digital camera (E450 Olympus), DJI Phantom 4 Advanced DJI GS Pro, etc. |
SamplingSites | Quantifying the geographic locations of image dataset sampling (from several experimental sites). | Nanchang, Jiangxi, China; USA; Mexico; Republic of Sudan; Norway; Belgium; etc. |
Resolution | The average of all image resolutions in the dataset. | 5472×3648, 3648×2736, 6000×4000, 3820×2767, etc. |
WithGT | Whether the dataset contains label information. | 1 (with ground truth), 0 (without ground truth) |
WithTrait | Whether the dataset contains phenotypic traits analyzed by image processing. | 1 (with phenotype values), 0 (without phenotype values) |
Gini | Class balance, as measured by the Gini index, serves as a means to evaluate the distribution equilibrium within an image classification dataset. | 0.779, 0.33, 0.987, etc. |
By considering these factors and calculation indicators, we generate Evaluation Scores that provide an objective measure of the dataset's performance. This enables users to compare and select datasets that align with their specific needs and requirements for image classification tasks.
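To make the class-balance indicator concrete, here is a minimal sketch of a Gini-style impurity computed from per-class image counts, together with a simple min-max normalization step of the kind used before aggregation. The exact formula and weighting used by OPIA are not specified on this page, so treat both functions as illustrative assumptions.

```python
def gini_balance(class_counts):
    """Gini impurity of the class distribution: 0.0 for a single
    class, approaching 1.0 for many evenly sized classes."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def min_max_normalize(values):
    """Scale raw indicator values into [0, 1] before aggregation."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

With four equally sized classes, `gini_balance([25, 25, 25, 25])` gives 0.75; a sum (or weighted sum) of the ten normalized indicators would then yield the final score.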
The Tools page provides users with five fundamental image preprocessing tools: image cropping, image resizing, image flipping, image denoising, and image dataset partitioning. These tools support batch image processing, allowing users to perform operations on multiple images simultaneously. Each time a user utilizes an image processing tool, the operation is recorded as a task for future reference.
Users can upload a compressed file containing one or multiple images. On the page, they input four parameters: the desired height and width of the cropped image, and the x and y coordinates of the cropping center. The tool supports popular image formats such as PNG, JPEG, JPG, and TIFF. Users can upload files up to 300MB in size.
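The cropping operation can be sketched as follows, using a NumPy array as a stand-in for a decoded image. The parameter names mirror the page's four inputs, but the clamping behavior at image borders is an assumption rather than the tool's documented behavior.

```python
import numpy as np

def center_crop(img, width, height, cx, cy):
    """Crop a width x height window centered at (cx, cy),
    clamped so the window stays inside the image."""
    h, w = img.shape[:2]
    x0 = min(max(cx - width // 2, 0), max(w - width, 0))
    y0 = min(max(cy - height // 2, 0), max(h - height, 0))
    return img[y0:y0 + height, x0:x0 + width]
```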
Users can upload compressed files containing single or multiple images. On the page, they specify the desired height and width of the resized image. To utilize the results of previous image processing tasks, users can simply enter the historical task ID without re-uploading the image. Similar to image cropping, the tool supports common image formats and allows for files up to 300MB in size.
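A nearest-neighbor version of the resize step might look like the sketch below; the interpolation method actually used by the tool is not stated, so nearest-neighbor sampling here is an assumption.

```python
import numpy as np

def resize_nearest(img, new_height, new_width):
    """Resize by nearest-neighbor sampling: map each output
    pixel back to the closest source pixel."""
    h, w = img.shape[:2]
    rows = (np.arange(new_height) * h) // new_height
    cols = (np.arange(new_width) * w) // new_width
    return img[rows][:, cols]
```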
The image-flipping tool enables users to rotate images counterclockwise without changing their size. Users can input parameters such as the rotation angle and flipping direction. The tool has default values for parameters that are not entered. If users wish to utilize the results of previous image processing tasks, they can simply enter the historical task ID without the need to upload the images again. The tool supports popular image formats including PNG, JPEG, JPG, and TIFF. Users can upload files up to 300MB in size.
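For the flipping tool, a sketch restricted to size-preserving operations (counterclockwise quarter turns plus horizontal or vertical flips) could look like this; arbitrary-angle rotation and the tool's default parameter values are not reproduced here.

```python
import numpy as np

def flip_rotate(img, quarter_turns=0, direction=None):
    """Rotate counterclockwise by multiples of 90 degrees and/or
    flip. direction: None, 'horizontal', or 'vertical'.

    Note: for non-square images a quarter turn swaps height
    and width, though no pixels are lost."""
    out = np.rot90(img, k=quarter_turns)
    if direction == "horizontal":
        out = np.fliplr(out)
    elif direction == "vertical":
        out = np.flipud(out)
    return out
```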
The image denoising tool supports general image formats such as PNG, JPEG, JPG, and TIFF. Users can upload files up to 300MB in size. This tool does not require any input parameters from the user. It applies a Gaussian filter to the image, calculates the absolute difference between the filtered image and the original image, and extracts the image boundary to determine the noise ratio. Based on the noise ratio, the program selects appropriate denoising methods.
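The noise-ratio logic described above might be sketched like this. A 3x3 mean filter stands in for the Gaussian filter so the example stays dependency-free, and the difference threshold is an illustrative assumption rather than the tool's actual value.

```python
import numpy as np

def estimate_noise_ratio(img, threshold=20):
    """Smooth with a 3x3 mean filter (a stand-in for a Gaussian
    filter), take the absolute difference from the original, and
    report the fraction of pixels whose difference exceeds the
    threshold."""
    h, w = img.shape[:2]
    padded = np.pad(img.astype(float), 1, mode="edge")
    # Sum the nine shifted copies, then average: a 3x3 box blur.
    smooth = sum(
        padded[dy:dy + h, dx:dx + w]
        for dy in range(3) for dx in range(3)
    ) / 9.0
    diff = np.abs(img.astype(float) - smooth)
    return float((diff > threshold).mean())
```

A high ratio would suggest impulse-style noise (e.g. median filtering is appropriate), a low one mild sensor noise (light smoothing); the actual method-selection rules are not documented here.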
To perform image partitioning, users need to provide a compressed package containing multiple image files. Input parameters include the proportion of the training set and the computer vision task type of the dataset. If the uploaded image dataset is intended for classification tasks, the folder name of each image corresponds to the category it belongs to. The program will automatically divide the dataset into separate training set and testing set folders based on the original category labels and the specified partition ratio. Image partitioning allows users to create a well-balanced dataset for training and evaluation purposes, ensuring that the model is tested on unseen data. This process aids in assessing the model's performance and generalization capabilities.
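The partitioning step for classification datasets can be sketched as a per-category split, as below. The folder scanning and file moving are omitted, and the shuffling and seeding behavior is an assumption.

```python
import random

def split_by_category(files_by_category, train_ratio=0.8, seed=0):
    """Split each category's file list into train/test partitions
    so the original class labels are preserved in both."""
    rng = random.Random(seed)
    train, test = {}, {}
    for category, files in files_by_category.items():
        shuffled = list(files)
        rng.shuffle(shuffled)           # avoid ordering bias
        cut = int(len(shuffled) * train_ratio)
        train[category] = shuffled[:cut]
        test[category] = shuffled[cut:]
    return train, test
```

Splitting within each category keeps the class distribution of the training and testing sets close to that of the full dataset.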
We look forward to worldwide comments, suggestions and guidance from colleagues and peers with common research interests. We also invite the scientific community to submit their datasets to OPIA and to build collaborations in improving the functionalities of OPIA.
If you have any questions or comments, we would love to hear from you. Please find our contact information here.
OPIA is free for academic use only. For any commercial use, please contact us for commercial licensing terms.