An assessment of the taxonomic reliability of DNA barcode sequences in publicly available databases
Jin S, Kim KY, Kim MS, Park C
Algae. (2020) 35(3): 293-301
With rapid technological advances in molecular biology, the applications of DNA barcoding have increased. The technique has a wide range of uses, such as taxonomic studies that help elucidate cryptic species and phylogenetic relationships, and the analysis of environmental samples for biodiversity monitoring and conservation assessments of species. After DNA barcode sequences are obtained by polymerase chain reaction amplification, sequence similarity-based homology analysis is commonly used; that is, the obtained barcode sequences are compared against DNA barcode reference databases. This bioinformatic analysis implies that the overall quantity and quality of the reference databases must be stringently monitored so that they do not adversely affect the accuracy of species identification. With the development of high-throughput, next-generation sequencing techniques, a large number of DNA barcode sequences have been produced and stored in online databases, but their validity, accuracy, and reliability have not been thoroughly investigated. In this study, we investigated the amount and types of erroneous barcode sequences deposited in publicly accessible databases. Over 4.1 million sequences were examined in three large-scale DNA barcode reference databases (NCBI GenBank, BOLD, and PR2) for four major DNA barcodes (COI, ITS, rbcL, and 18S rRNA); approximately 2% of the barcode sequences were found to be erroneous and, intriguingly, their taxonomic distribution was uneven. Our findings thus provide compelling evidence of data quality problems, along with insufficient and unreliable annotation of taxonomic data, in DNA barcode reference databases. Therefore, we suggest that when ambiguous taxa are encountered during barcoding analysis, further validation with other DNA barcode loci or morphological characters should be mandated.
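To make the similarity-based comparison described in the abstract concrete, the following is a minimal sketch and not the authors' pipeline. It assumes the query and reference sequences are already aligned and of equal length, whereas real analyses typically rely on an alignment search tool such as BLAST. The record identifiers, taxon labels, sequences, and the 90% identity threshold are all hypothetical values chosen for illustration. The sketch scores a query against each reference and flags the result as ambiguous when the retained hits carry conflicting taxon labels, which is the situation in which the abstract recommends further validation.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (record id, taxon label, percent identity)


def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / len(a)


def top_hits(
    query: str,
    references: Dict[str, Tuple[str, str]],
    threshold: float = 90.0,  # hypothetical cutoff, not taken from the paper
) -> List[Hit]:
    """Return reference hits at or above the identity threshold, best first."""
    hits = [
        (rec_id, taxon, percent_identity(query, seq))
        for rec_id, (taxon, seq) in references.items()
    ]
    kept = [h for h in hits if h[2] >= threshold]
    return sorted(kept, key=lambda h: h[2], reverse=True)


def is_ambiguous(hits: List[Hit]) -> bool:
    """True when the retained hits carry more than one distinct taxon label."""
    return len({taxon for _, taxon, _ in hits}) > 1


# Made-up toy records: ref1 and ref2 share a sequence but differ in label,
# mimicking a conflicting annotation in a reference database.
REFERENCES = {
    "ref1": ("Genus alpha", "ACGTACGTACGTACGTACGT"),
    "ref2": ("Genus beta", "ACGTACGTACGTACGTACGT"),
    "ref3": ("Genus gamma", "TTTTACGTACGAACGTCCCC"),
}

QUERY = "ACGTACGTACGTACGTACGA"
HITS = top_hits(QUERY, REFERENCES)

if __name__ == "__main__":
    for rec_id, taxon, identity in HITS:
        print(f"{rec_id}\t{taxon}\t{identity:.1f}%")
    if is_ambiguous(HITS):
        print("Conflicting labels: validate with another locus or morphology.")
```

In this toy case the two best hits match the query equally well but disagree on the taxon, so a similarity search alone cannot resolve the identification.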