aws-s3-integrity-check: an open-source bash tool to verify the integrity of a dataset stored on Amazon S3.

Sonia García-Ruiz, Regina Hertfelder Reynolds, Melissa Grant-Peters, Emil Karl Gustavsson, Aine Fairbrother-Browne, Zhongbo Chen, Jonathan William Brenton, Mina Ryten
Author Information
  1. Sonia García-Ruiz: Department of Genetics and Genomic Medicine Research & Teaching, UCL GOS Institute of Child Health, London, UK.
  2. Regina Hertfelder Reynolds: Department of Genetics and Genomic Medicine Research & Teaching, UCL GOS Institute of Child Health, London, UK.
  3. Melissa Grant-Peters: Department of Genetics and Genomic Medicine Research & Teaching, UCL GOS Institute of Child Health, London, UK.
  4. Emil Karl Gustavsson: Department of Genetics and Genomic Medicine Research & Teaching, UCL GOS Institute of Child Health, London, UK.
  5. Aine Fairbrother-Browne: Department of Genetics and Genomic Medicine Research & Teaching, UCL GOS Institute of Child Health, London, UK.
  6. Zhongbo Chen: Department of Genetics and Genomic Medicine Research & Teaching, UCL GOS Institute of Child Health, London, UK.
  7. Jonathan William Brenton: Department of Genetics and Genomic Medicine Research & Teaching, UCL GOS Institute of Child Health, London, UK.
  8. Mina Ryten: Department of Genetics and Genomic Medicine Research & Teaching, UCL GOS Institute of Child Health, London, UK.

Abstract

Amazon Simple Storage Service (Amazon S3) is a widely used platform for storing large biomedical datasets. Unintended data alterations can occur during data writing and transmission, corrupting the original content and producing unexpected results. However, no open-source and easy-to-use tool exists to verify end-to-end data integrity. Here, we present aws-s3-integrity-check, a user-friendly, lightweight, and reliable bash tool to verify the integrity of a dataset stored in an Amazon S3 bucket. Using this tool, we needed only ~114 min to verify the integrity of 1,045 records ranging between 5 bytes and 10 gigabytes and occupying ~935 gigabytes of the Amazon S3 cloud. Our tool also provides file-by-file on-screen and log-file-based information about the status of each integrity check. To our knowledge, this is the only open-source tool that can verify the integrity of a dataset uploaded to Amazon S3 storage quickly, reliably, and efficiently. The tool is freely available for download and use at https://github.com/SoniaRuiz/aws-s3-integrity-check and https://hub.docker.com/r/soniaruiz/aws-s3-integrity-check.
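
The abstract does not describe how each integrity check is performed. As an illustration only, and not necessarily the tool's own method, one common approach is to recompute locally the ETag that Amazon S3 reports for an object and compare the two values. The Python sketch below assumes an object uploaded without SSE-KMS or SSE-C encryption, for which the ETag is the MD5 of the content (single upload) or the MD5 of the concatenated per-part MD5 digests followed by "-" and the part count (multipart upload). The function name `s3_style_etag` and the 8 MiB default part size are assumptions made for this example; the part size must match the one used at upload time.

```python
import hashlib
import io


def s3_style_etag(stream, part_size=8 * 1024 * 1024):
    """Compute an S3-style ETag for the bytes readable from `stream`.

    Hypothetical helper for illustration. Assumes:
      * the object was not encrypted with SSE-KMS or SSE-C,
      * `part_size` equals the part size used for the upload,
      * content that fits in one part was sent as a single upload
        (a one-part multipart upload would instead end in "-1").
    """
    part_digests = []
    while True:
        chunk = stream.read(part_size)
        if not chunk:
            break
        # Keep the raw 16-byte digest of every part.
        part_digests.append(hashlib.md5(chunk).digest())

    if not part_digests:
        # Empty object: plain MD5 of zero bytes.
        return hashlib.md5(b"").hexdigest()
    if len(part_digests) == 1:
        # Single upload: the ETag is the plain MD5 of the content.
        return part_digests[0].hex()

    # Multipart upload: MD5 over the concatenated part digests,
    # followed by a dash and the number of parts.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


if __name__ == "__main__":
    # Tiny in-memory demonstration with an artificially small part size.
    print(s3_style_etag(io.BytesIO(b"hello")))
    print(s3_style_etag(io.BytesIO(b"a" * 10), part_size=4))
```

For a real file, the stream would be opened in binary mode, and the result compared with the ETag that S3 returns for the object (for example via `aws s3api head-object`), after stripping the surrounding double quotes from the reported value. A mismatch indicates either altered content or a different part size at upload.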
