DNA barcoding

Review and Interpretation of Trends in DNA Barcoding

The doing is often more important than the outcome.

—Arthur Ashe


Widely heralded as a revolutionary taxonomic discovery tool, DNA barcoding represents perhaps the most reliable framework available for organizing specimens and specimen-based data for systematic research. Arranging specimens by barcode haplotype early in the study process allows for efficient inspection of material, and facilitates the organization and management of a wealth of character data and life history information, depending on how much is available for the barcoded specimens. While DNA sequences have been used to identify specimens or parts of specimens since the 1980s, their use as a broader natural history tool was not formalized until 2003. Three organizational meetings sponsored by the Sloan Foundation at the Banbury Center at Cold Spring Harbor and seminal publications that year (Hebert et al., 2003a,b; Stoeckle, 2003) christened DNA barcoding and launched the program that would globalize its application. Since then, over 3,700 peer-reviewed papers have been published with “DNA barcoding” in their title. These studies range from taxonomic works in which DNA barcodes are used to elucidate cryptic species, to surveys of environmental samples (e.g., marine sediments, ocean water) that feature estimates of phyletic diversity and regional comparisons of genetic variation, and finally to forensic and conservation applications. Many of the early papers can be characterized as proof-of-concept studies in which the utility of the COI barcoding region was being tested for particular taxonomic groups or in different study designs. To the extent controversy emerged around barcode data, it was generally associated with the taxonomic interpretation and applicability of their analyses. The points of contention included the uniformity and generalizability of criteria for circumscribing species, the phylogenetic implications of dendrograms, and the proliferation of informal specific epithets in reference to species that were discovered through DNA barcodes but which remained undescribed. Many of these concerns were mitigated by increasingly sophisticated treatments that incorporated barcodes with morphological, behavioral and ecological data under the rubric of integrative taxonomy and, for groups such as Lepidoptera in which extensive taxonomic coverage has been achieved (Hajibabaei et al., 2006; Hausmann et al., 2016; Zahiri et al., 2017), barcode data have become commonplace if not critical to taxonomic revisionary works.

As a paradigm, DNA barcoding engendered a democratization of molecular data (or at least metadata) by automating analytical steps that might otherwise have deterred some practicing taxonomists. This quickened the pace of alpha taxonomy by enabling the rapid and unambiguous discovery of new species in many groups. One possible drawback has been that in co-opting the terminology of phylogenetics, DNA barcode endeavors may have inadvertently broadened the meaning of or even re-branded terminology in a manner inconsistent with its formal interpretation. Taxonomic papers incorporating DNA barcode data routinely present metrics or tree graphics as self-evident while conflating descriptions with diagnoses or barcode trees with phylogenies. Semantics aside, we wished to understand whether such usage reflected a manifestation of some trend in how systematics is perceived by the scientific community at large.

The rapid growth of the DNA barcode paradigm thus invites an examination of how, during a 15-year period, its ontology and application developed with respect to technological, analytical, and terminological preferences that had until only recently fallen exclusively within the purview of molecular systematists. Our purpose here is to examine the development of DNA barcoding through a coarse examination of search terms and explore whether they reflect trends in how DNA barcoding practices may have evolved to accommodate analytical and practical considerations. To the extent they have not, we highlight those considerations at the empirical intersection of DNA barcoding, taxonomy and phylogenetics that are not simply semantic.

A Conceptual Framework for Examining the Ontology of DNA Barcoding

For clarity and transparency both, it is necessary to establish a conceptual framework on which to arrange this discussion. DNA barcoding intersects with systematics most conspicuously at the level of alpha taxonomy, that is, in the discovery, diagnosis, and description of new species. “Description” and “diagnosis” are formal terms defined in nomenclatural codes (e.g., ICZN) that govern the naming of species and other taxa and the means of tracking and stabilizing taxonomic nomenclature. They represent components of taxonomic refinement and formalized nomenclatural change, and correspond to the character-based empirical work of substantiating named groups as historical or natural entities. It is generally understood that taxonomic rank does not of itself confer natural comparability: any rank above species is a function of convention and discretion as well as actual data, and as long as monophyletic groups are recognized, the fact that families or tribes are not uniformly or evolutionarily equivalent does not hamper studies unless they make the mistake of treating such groups as equivalent, e.g., by inferring evolutionary trends from numbers of genera, families, etc. A named species, on the other hand, is a different sort of construct that may correspond to a range of biological entities consistent with historical, reproductive, or genetic criteria. Biological or historical comparability is perhaps more easily justified for species than for higher taxa because their identity as species can at least be tested by universal criteria, namely the establishment of diagnostic characters. At supra-specific taxonomic levels, in contrast, common ancestry is depicted hierarchically and articulated with reference to apomorphy, and independently derived diagnostic characters recognized as synapomorphies provide evidence both for a given species' inclusion in a given group and for that group's monophyly.

However, the usage of monophyly has been broadened to include its graphic depiction on trees, just as the traditional use of “phylogeny” as an abstract term for evolutionary history has been expanded and pluralized to include any tree-like graphics (“phylogenies”). At least one general consequence of this usage bears directly on the practice of DNA barcoding: the perception that species can be legitimately represented, and expected to appear, as monophyletic. Whether one disputes this on the grounds that individual organisms are not related hierarchically even if mitochondria are (Doyle, 1995), or on the grounds that species often appear paraphyletic (Funk and Omland, 2003), the disconnection between the graphic representation of a monophyletic group and the characters underlying it is amplified when trees are treated as arbiters of species boundaries. When phylogenetics began to enjoy popularity, it was because there was consensus that empirical phylogenetic considerations were important to classification and evolutionary biology, but there remained strong methodological debates to the point where trees were judged less by what they said than by how they were generated. The opposite experience seems to characterize DNA barcoding as a field. How barcode data, or any sequence data, are analyzed to generate trees bears directly on how those trees may be interpreted and on the scope of how DNA barcode data are ultimately used.

The ~3,700 DNA barcoding studies published over the past 15 years represent a prodigious record of peer-reviewed research, notwithstanding the variance in their intent or in the analyses and interpretations espoused. By examining the cohort of natural history and biodiversity science that incorporated DNA barcodes over this period, we explored the extent to which their purposes, premises, rationale and application have evolved.

3756 Barcoding Papers Since 2004

We compiled a glossary of terms used in DNA barcoding from our knowledge of the literature. We attempted to be as inclusive as possible with these terms and even included some from the literature on species boundaries and speciation mechanisms. We next used PubMed at NCBI (https://www.ncbi.nlm.nih.gov/pubmed/) to search for peer-reviewed papers with abstracts published since 2003. We used December 31, 2018 as a cutoff for inclusion in our database. In all, we compiled the abstracts from the 3,756 peer-reviewed papers retrieved with “DNA Barcode” as a query (Figure 1A), and used the resulting database (Supplementary Folder 1) to track the usage of specific terms as described below. Perhaps naïvely, all papers retrieved by the search are assumed to have been peer-reviewed as they are included in the PubMed database. Because only a few papers appeared in 2003 and 2004, we combined 2003, 2004, and 2005 into a single data point and cataloged papers by year from 2005 to 2018. Abstracts from each of the papers were compiled in text files by year. Word searches were done in BBEdit, an efficient text editor that reports the number and location of search-term hits. The location of each hit allowed us to eliminate duplicate hits in single papers. The number of hits for each search term (or combination of terms) was compiled in Excel spreadsheets. Each of the terms in the glossary (Table 1) was searched and tabulated. Figure 1 provides more detail on the search strategies for the terms we used for generating graphs. For example, the raw number of hits for the general category “Neighbor Joining” was a combination of searches for “neighbor joining” plus “NJ.”
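The tabulation described above was carried out by hand in a text editor and spreadsheets. As a rough illustration only, the same logic (pooled synonyms, one hit per paper, early years combined) might be sketched in Python as follows. The function name, the data layout, the hyphen normalization, and the toy abstracts are all assumptions made for this sketch and are not taken from the study.

```python
import re
from collections import Counter


def count_papers_with_terms(abstracts_by_year, terms, pool_before=2005):
    """Count abstracts per year that mention at least one of `terms`.

    abstracts_by_year: {year: [abstract text, ...]}  (hypothetical layout)
    Years earlier than `pool_before` are folded into `pool_before`.
    """
    # Case-insensitive, whole-word match so "NJ" is not found inside longer words.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for year, abstracts in abstracts_by_year.items():
        bin_year = max(year, pool_before)
        for text in abstracts:
            # Treat hyphenated and spaced spellings alike (an assumption).
            normalized = text.replace("-", " ")
            # At most one hit per abstract, however often the term recurs.
            counts[bin_year] += 1 if pattern.search(normalized) else 0
    return dict(counts)


# Invented abstracts, for illustration only.
toy_abstracts = {
    2004: ["A neighbor-joining tree was built.", "No tree method named."],
    2005: ["NJ and NJ again; neighbor joining throughout."],
    2010: ["Parsimony analysis only."],
}
nj_counts = count_papers_with_terms(toy_abstracts, ["neighbor joining", "NJ"])
print(nj_counts)
```

Counting each abstract once mirrors the elimination of duplicate hits within single papers; like the original searches, the sketch takes no account of the context in which a term appears.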


Figure 1. Line plots of number of “hits” for keywords in the DNA barcode vocabulary subcategories established in the text. In all graphs the number of citations is given on the Y-axis and year is given on the X-axis. We also computed relative percentage of citations per year and these results are shown in Supplemental Figure 1. (A) Graph of the occurrence of scientific papers with the search word “DNA barcoding” in the title from 2003 to 2018. The “blip” in number of papers in 2016 that disrupts an otherwise smooth increase in number of papers by year might represent an increase in reports for the several international meetings that occurred in 2015. (B) The results of this analysis compare character-based approaches to similarity/distance approaches. For this analysis we also use fixation as a character-based term and show its usage in the graph. Search terms: “similarity” and “distance” combined into “simdis” and “character” and “fixation” combined into “char.” We show the usage of “fixation” alone to demonstrate that this term is rarely used. (C) The results of this analysis compare the three major criteria for phylogenetic analysis: distance, parsimony and likelihood. Search terms: “NJ” and “neighbor joining” combined into “NJTOT,” “parsimony” listed as “pars,” “likelihood” listed as “like.” Bayesian phylogenetic inference methods have also been used and these are listed under “bayes.” (D) This figure shows comparison of the usage of terms that imply an examination of the robustness of the DNA barcode analysis. Such measures of robustness can be metrics such as bootstrap, or posterior probabilities such as in Bayesian phylogenetic inference. Search terms: “bootstrap” listed as “boot,” “support” listed as “sup,” “statistic,” and “bayes.” (E) The figure compares various methods of treating DNA barcode data. We include tree to demonstrate the use of tree relative to these other approaches. Search terms: “barcode index number” and “BIN” combined into “BIN,” “barcode gap” listed as “BCG,” “tree” listed as “tree,” “blast” listed as “blast” and “character aggregation organization system” and “CAOS” combined into “CAOS.” (F) This figure shows the usage of species discovery vocabulary in DNA barcoding. As we point out in the text, species description is a technical term used in taxonomy, while other terms like circumscription, delimitation and delineation are terms used by biologists studying speciation and species boundaries. Search terms: “species discovery” listed as “disc,” “species delimitation” listed as “delim,” “species delineation” listed as “delin” and “species circumscription” listed as “circum.” (G) This figure compares the usage of “species discovery” terms with “specimen identification.” We also compare the usage of “flagging” listed as “flag” and “integrative taxonomy” listed as “inttax.” Search terms: “species discovery” or “totdisc” is the sum of counts for “species discovery,” “species delineation,” “species delimitation” and “species circumscription.” (H) This figure compares the focus of papers in five areas that are generally listed by DNA barcode studies. DNA barcoding has been used in forensic studies, biodiversity studies, taxonomy, cryptic species studies and conservation biology. Search terms: “forensic” listed as “forensic,” “cryptic” listed as “cryptic,” “conservation” listed as “cons,” “taxonomy” listed as “taxon” and “biodiversity” listed as “biod.”


Table 1. A glossary of DNA barcoding terms.

An eclectic lexicon has grown around DNA barcoding, comprising a range of terms from taxonomy, phylogenetic and molecular systematics, and population genetics as well as a smattering of neologisms. The database we developed was queried for 29 terms based on our own extensive reading of the barcode literature. These terms span a range of purposes and methods, which we grouped according to (1) general disciplines (conservation/conservation biology/conservation genetics, forensic, taxonomy/systematics/integrative taxonomy, phylogeography); (2) biological terms (character, crypsis/cryptic species, fixation/fixed character, population); (3) graphic terms (clade, cluster, tree); (4) tree-building methods (Bayesian, likelihood, neighbor-joining, parsimony); (5) general purpose operational terms (diagnosis, species circumscription/delimitation/delineation, species description, species discovery, specimen identification/determination, flag); and finally (6) tools and metrics (barcode gap, BIN, BLAST, bootstrap, phylogenetic support). The queried terms comprise a combination of rudimentary verbiage commonly used in systematics and molecular evolution, with that specific to DNA barcoding. Neither their groupings nor the underlying terms are mutually exclusive, but we have tried to arrange the terms as coherently as possible. We did not account for context or whether the terms were used correctly or with approbation. In some cases, to facilitate broader comparisons we combined counts for intrinsically related terms such as similarity/distance, or terms used interchangeably such as species delimitation, circumscription and delineation. These are detailed in Figure 1, Table 1, and in Supplementary File 1.

Inevitably, this exercise is influenced by our own perspective, which favors an integrative taxonomic approach to corroborating the results of barcode analyses with other observations. It is our impression that this perspective is reasonably widespread. In general, we prefer to think of DNA barcode variation as having the potential to reveal corroborating patterns in morphology and behavior rather than as a necessary or sufficient requirement for discovering species or as a means of generating universal distance thresholds as criteria for demarcating them. Our choice of queried terms also, therefore, reflects the distinction between indirect or tree-based interpretations that rely on inspecting dendrograms, and direct analyses of diagnostic characters. To the extent that trends may be evinced from our seemingly chimeric exploration of language, we hope that occasional inventories such as this serve to take stock of and even illuminate the direction of a field regardless of perspective.

We present the results in two ways: (1) as raw counts by year to track raw usage (Figure 1; search terms themselves in Supplementary File 1); and (2) as scaled percentages of the occurrence of all terms per year (Supplementary File 1). Although crude, this approach affords context for cross-comparison of year-to-year usage; we suspect more complex analysis of data such as these would simply obfuscate any observable trends.
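For readers who wish to reproduce the second presentation, a minimal sketch of one plausible scaling is given here: each term's count in a year is divided by the summed counts of all compared terms in that year. That denominator is our reading of "scaled percentages of the occurrence of all terms per year" and is an assumption, as are the function name and the invented counts.

```python
def scale_to_percentages(raw_counts):
    """Convert {term: {year: hits}} into {term: {year: percent of that year's hits}}.

    The denominator (sum over all compared terms in a year) is an assumed
    reading of the scaling; years with no hits at all are reported as 0.0.
    """
    years = sorted({y for per_year in raw_counts.values() for y in per_year})
    totals = {
        y: sum(per_year.get(y, 0) for per_year in raw_counts.values())
        for y in years
    }
    return {
        term: {
            y: (100.0 * per_year.get(y, 0) / totals[y]) if totals[y] else 0.0
            for y in years
        }
        for term, per_year in raw_counts.items()
    }


# Invented counts, for illustration only.
toy_counts = {
    "NJTOT": {2010: 30, 2011: 45},
    "pars": {2010: 10, 2011: 5},
}
scaled = scale_to_percentages(toy_counts)
print(scaled)
```

Scaling within a year removes the effect of the growing number of papers, so a rising raw count can be distinguished from a rising share of usage.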

Trends in DNA Barcoding Based on Its Vocabulary

Characters, Distance Measures, and Tree-Building Functions

An important comparison concerns the use of direct character information, which corresponds to the empirical treatment of observable data, vs. lumped (phenetic) summaries in the form of similarity or distance measures. By compressing character state information into a single measure of genetic similarity, distance measures mask changes in specific loci. As such, they do not enable one to discriminate homologous character state changes, much the way a mathematical average hides partitioned variation. For this reason, such methods have been eschewed in phylogenetic reconstruction for several decades and represent perhaps the most contentious points of discussion surrounding DNA barcodes.
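The masking effect can be made concrete with a toy example. In the sketch below, two query sequences differ from a reference at entirely different positions, yet both yield the same uncorrected distance. The sequences are invented, the function names are ours, and the uncorrected p-distance stands in for whatever distance model a given study might use.

```python
def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def differing_sites(a, b):
    """Zero-based positions at which two aligned sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Invented 10-bp "barcodes", for illustration only.
reference = "ACGTACGTAC"
query_1 = "TCGTACGTAA"  # differs from the reference at positions 0 and 9
query_2 = "ACGAACCTAC"  # differs from the reference at positions 3 and 6

print(p_distance(reference, query_1), differing_sites(reference, query_1))
print(p_distance(reference, query_2), differing_sites(reference, query_2))
```

Both comparisons reduce to the same single number, while the character-level view shows that no substitution is shared between the two queries; only the latter can be carried into a diagnosis.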

The explosion of DNA barcode data and distance-based dendrograms did occasion certain remedial presentations (e.g., Prendini, 2005) of such methodological issues that had been debated and largely settled in the early decades of phylogenetic systematics. From our perspective, tree-building methods in the context of DNA barcoding are not, as they are in systematics, at issue on the grounds of their legitimacy as phylogenetic inference tools, if only because most studies suggest that COI analyzed in isolation is a fundamentally insufficient source of decisive phylogenetic information. Rather, distance methods fall short specifically in the realm of identification and diagnosis. The practical implications are (1) that above the level of very closely related species, the COI gene typically realizes its greatest contribution to phylogenetic matrices that include a combination of other organellar and nuclear genes (Cameron et al., 2007; Leavitt et al., 2013) and (2) that no level of parameterization can compensate for the levels of saturation that inevitably appear in datasets with distantly related species or particularly in datasets with more terminals than characters. The immediate concern for the purposes of DNA barcoding is not that COI is necessarily inadequate as a sole phylogenetic marker, but that any data analyzed via distance are as impeded in serving the goals of DNA barcoding as they are in phylogeny reconstruction. This is a function of the incompatibility of distance data with the transmission of diagnostic information. Simply put, a properly rooted parsimoniously optimized tree represents the most efficient summary possible of the available data, and enables the direct diagnosis of would-be species based on observable character state changes. This is a matter of mathematics, not opinion (Farris, 1980). The ostensible advantage of Neighbor-joining is its computational ease and straightforward presentation (a single tree is generated). Interpretive issues may arise only if such analyses are accepted as decisive without further exploration.
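What "direct diagnosis based on observable character states" can mean in its simplest form is sketched below: for a focal group, report every alignment position at which the group is fixed for a state found in no other group. This is a deliberately naive illustration with invented sequences and hypothetical names; it is not CAOS or any published algorithm, and it ignores rooting, optimization, missing data, and compound characters.

```python
def diagnostic_sites(groups, focal):
    """Return {position: state} for sites that diagnose `focal`.

    groups: {group name: [aligned sequences of equal length]}  (hypothetical layout)
    A site qualifies when every focal sequence shares one state and no
    sequence outside the focal group shows that state.
    """
    focal_seqs = groups[focal]
    others = [s for name, seqs in groups.items() if name != focal for s in seqs]
    diagnostics = {}
    for i in range(len(focal_seqs[0])):
        states = {s[i] for s in focal_seqs}
        if len(states) != 1:
            continue  # polymorphic within the focal group, so not fixed
        state = next(iter(states))
        if all(s[i] != state for s in others):
            diagnostics[i] = state
    return diagnostics


# Invented alignment, for illustration only.
toy_groups = {
    "sp_A": ["ACGTAC", "ACGTAT"],
    "sp_B": ["ACCTAC", "ACCTAG"],
}
print(diagnostic_sites(toy_groups, "sp_A"))
print(diagnostic_sites(toy_groups, "sp_B"))
```

The output is a statement about particular positions and states, which is the kind of information a distance value or a dendrogram derived from distances does not retain.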

Figure 1B compares the occurrence of the search terms “character” and “similarity+distance” and suggests a consistent preference for Neighbor-joining (NJ), a tree-building algorithm. This is of course at least in part a function of the tools available in BOLD (Ratnasingham and Hebert, 2007), and we do not suggest that these analyses are all interpreted identically or for the same purposes. Two empirically linked search terms, “fixed” and “character,” align with diagnostic approaches and track their usage (Figure 1B).

Explicit mentions of methods of sequence analysis, namely Neighbor-joining (NJ), parsimony or “maximum parsimony” (MP), maximum likelihood (ML), and Bayesian (Figure 1C), appear erratically prior to 2008. Since then, the mentions of ML and Bayesian analysis have risen but not approached those of NJ, with parsimony (MP) appearing least frequently. This result is not surprising given the initial availability of NJ as the default tool in the Barcode of Life Data System (BOLD).

Visualization and Interpretation of Trees

In our reading of the barcode literature we noted many cases where taxonomic decisions were based either directly on distance measures (e.g., the barcode gap, discussed below) or on trees generated by such measures, but effectively decoupled from justification or discussion of those methods. Following Goldstein and DeSalle (2011), we distinguish the strictly graphic, tree-based approaches from tree-independent approaches, among which we further differentiate distance-based (e.g., BIN, barcode gap, BLAST searches) from diagnostic (e.g., CAOS; Figure 1D). Despite occasional papers in which barcode NJ trees are referred to as phylogenies, many authors have been careful to stress the utility of DNA barcoding for identification and discovery, and not as explicit phylogenetic statements. To be clear, tree-based approaches are valuable both as inferential tools for visualizing prospective species delimitation, and as provisional road maps of where to direct further research in delimiting species boundaries.

The interpretation of a barcode tree as a visual first pass for demarcating species vs. a phylogeny properly focuses attention on the integrity of the species themselves rather than the groups to which they belong (see Introduction), and perhaps for this reason—as well as the nature of variation within the COI gene, the often high number of individual sequences under analysis, and the types of analysis employed—measures of nodal support tend to find limited relevance in typical barcode analyses. Measures of nodal support have been presented with increasing frequency among DNA barcoding studies (Figure 1E), but in our survey the search terms reflecting such use (bootstrap, Bayes and statistic) appear less than a fifth as frequently as the term “support” itself.

Tree graphics and BLAST searches have each been used steadily since the inception of DNA barcoding (Figure 1D). The term “barcode gap” (BCG), coined in 2005 (Meyer and Paulay, 2005) and reiterated by Wiemers and Fiedler (2007), appears steadily after 2009 and is the most frequently used of the terms referring to tree-independent analytics. The most recently minted tree-independent approach (BIN; Ratnasingham and Hebert, 2013) is unique to DNA barcoding and its use has increased slightly since its introduction in 2010. In our survey there appears to be a preference for tree-based approaches accompanying the preference for NJ trees, and limited growth in the use of tree-independent terms (even distance-based ones) after 2015. Diagnostic algorithms (e.g., CAOS, Sarkar et al., 2008) appear rarely, consistent with the infrequent reliance on character-based tree-independent approaches relative to BIN, BLAST, and BCG. Table 2 summarizes the intersection between tree- and character-based (diagnostic) methods.


Table 2. A (non-exhaustive) categorization of the analysis space for DNA barcoding.

Specimen Identification and Species Delimitation

At the inception of DNA barcoding, two of its most frequently stressed benefits were specimen identification (or determination) and species discovery (Figure 1F). Specimen identification has been used interchangeably with “species identification” in some publications, as have a number of terms related to identification and discovery. DeSalle (2006) used the term “identification” only in the context of assigning taxonomic information. Although in the present paper we refer to this as “determination” (of specimens, not species), the published usage is too broad in intent to be parsed with any great deal of precision. Since the power of DNA barcoding resides in the coverage of the available database, the conclusion that a given species is new to science, for example, is a function of whether a queried sequence corresponds to those from authoritatively identified specimens. The discovery of species new to science is thus a function of failure to assign a valid name to a given sequence under the assumption that identical (or highly similar) available sequences represent conspecific individuals. As such, “discovery” has for some authors been more controversial than identification (Matz and Nielsen, 2005), and that controversy may easily be amplified by the use of barcoding to estimate species richness in bulk samples (Andersen et al., 2012; Shokralla et al., 2012; Kress et al., 2015; Sickel et al., 2015). Specimen identification, particularly for thoroughly studied and well-sampled groups, holds broader appeal, particularly outside the academic community.
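The dependence of "discovery" on database coverage can be illustrated with a minimal sketch of distance-based determination: a query receives the name of its closest reference sequence if that sequence is similar enough, and is otherwise left unnamed. Everything here is an assumption made for illustration, including the function names, the invented sequences, and above all the 2% cutoff, which is an arbitrary placeholder of the kind of universal threshold that remains contentious.

```python
def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def determine(query, reference, threshold=0.02):
    """Return (name, distance) for the closest reference sequence.

    reference: list of (species name, aligned sequence) pairs (hypothetical layout).
    If the closest match exceeds `threshold`, the name is None: the query
    could not be assigned a name from this particular library.
    """
    best_name, best_dist = None, None
    for name, seq in reference:
        d = p_distance(query, seq)
        if best_dist is None or d < best_dist:
            best_name, best_dist = name, d
    if best_dist is not None and best_dist <= threshold:
        return best_name, best_dist
    return None, best_dist


# Invented reference library and queries, for illustration only.
toy_library = [
    ("Species alpha", "ACGTACGTAC"),
    ("Species beta", "TTGTACGAAC"),
]
print(determine("ACGTACGTAC", toy_library))
print(determine("ACGGGCGTCC", toy_library))
```

An unnamed result says only that the library lacks a close match; whether the specimen represents an undescribed species, an unsampled described one, or an artifact is a question for subsequent taxonomic work.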

Incorporating DNA barcoding with taxonomy has been discussed and widely adopted as a form of integrative taxonomy, which simply refers to simultaneous analysis of disparate sources of data (Figure 1G). DNA barcodes are among the more readily obtained and appealing forms of data that may be used to flag specimens as warranting taxonomic attention (Goldstein and DeSalle, 2011). Based on their occurrences summarized in Figure 1F, “integrative taxonomy” and “flag” are not often used explicitly in connection with species “discovery.” This may suggest a disconnect between the appeal of species discovery in the abstract and its actual undertaking. If so, it highlights the important point that cryptic species discovered from DNA barcodes are not always accompanied by taxonomic revisionary work.

Since its inception, DNA barcoding has been bolstered by its utility for discovering cryptic species specifically as well as in taxonomic revision, forensics, conservation and biodiversity studies generally. Recognizing the potential bearing of cryptic species on each of these fields, Figure 1H illustrates that the study of cryptic species has consistently played a focal role in a range of fields over the 15-year period we examined, with explicit mention of conservation and taxonomy appearing with less frequent emphasis, followed by “forensic” and “biodiversity.”


Examinations of word usage are productive only to the degree that common ground in both meaning and intent is well understood, and inferences from any compendium of word usage are only as good as the precision with which the search terms were originally used. Loose usage of terms like “diagnosis” or “tree” seems inevitable as barcoding tools become increasingly accessible. As genomic data are generated with increasing ease, it remains to be seen whether the enthusiasm for DNA barcoding as it is currently practiced will transition to the larger endeavor of archiving accessible genomic data.

The most obvious and important result of the exercises performed here is that distance or phenetic approaches have prevailed in DNA barcoding practices for reasons that appear to be more practical than scientific. Conflating distance data with diagnoses and algorithms with tree graphics are not uncommon mistakes in the taxonomic literature. Although the use of NJ trees or distances to diagnose species appears in the literature, we would argue that doing so obviates the real diagnostic value of barcode data that would meet the requirements of diagnoses set forth in the ICZN and elsewhere.

Distance-based methods have a well-established place in population genetics, where they play important roles in evaluating raw divergence among related individuals or populations. In the context of phylogenetic inference, however, clustering operations based on phenetic similarity have for several decades been rejected by systematists for empirical and statistical reasons, not the least of which is that since they combine available character data into a single ensemble metric, they cannot test or summarize specific character homologies that would otherwise contribute to a diagnosis (Ferguson, 2002; DeSalle, 2007; Little and Stevenson, 2007). Distance metrics are nevertheless easy to calculate and methods such as NJ generate dendrograms with a seeming minimum of ambiguity. The development of DNA barcode databases hinged on NJ precisely because of this computational ease, and because any lack of decisiveness among the data is not transparent in the seemingly unambiguous single tree that obtains from every NJ analysis.

There exists quite a bit of variation in the handling of dendrograms (distance-based figures) generated by DNA barcodes for purposes following the organization of specimens. Many draw empirical conclusions directly from a given NJ tree instead of using it recursively to examine or interpret other characters or pieces of information. But how researchers use the tree to summarize variation and evaluate actual support for would-be relationships varies considerably. Phenetic trees, rapidly generated as they are, risk yielding spurious representations of data, and represent liabilities to the extent that apparent tree structure is uncorroborated.

Clustering algorithms and dendrograms are used throughout biology for purposes ranging from ecological community analysis to visualizing gene expression data. The use of trees in phylogenetic science is distinguished from other applications by the implied superposition of a temporal dimension that enables testing hypotheses of character evolution. At its simplest, this is achieved by establishing polarity, or the direction of character state change, through the operation of rooting, followed by optimization of hypothetical character states at nodes. Regardless of whether scientists imagine distance-generated trees to be “phylogenies,” neither of these operations is possible on such trees without violating the fundamental assumptions of rooting and optimization. A raw dendrogram, however it is generated, is simply a form of metadata that summarizes similarity using a given metric or optimality criterion; it cannot by itself serve to “diagnose” anything with reference to observable character states much less evaluate synapomorphy, establish monophyly, or test ideas of character evolution.

To the credit of DNA barcoding's architects, it has been stressed that barcode trees are not intended to serve as phylogenies, and as the menu of tools available on BOLD has expanded to include features that enable proper diagnoses, it is our hope that the number of taxonomic papers perpetuating that error will one day subside. Our purpose is not to belabor this any further, but to stress that despite their computational ease, NJ trees render barcode data under-utilized.


Inevitably, whenever a new tool is developed that expedites a set of tasks, the training required prior to that development becomes at least partly obsolete, and it becomes easy to overlook standards, obsolete or not, that went along with it. In this case those standards range from matters as straightforward as species diagnosis to the more nuanced interpretation of molecular phylogenetic trees. It has at times appeared as though the antiquated view of systematics as an exercise in naming things, rather than an empirical endeavor to reconcile classifications with evolutionary hypotheses, has persisted. Graphic summary statements of phylogenetic data are rarely as decisive as they appear when stripped of their analytical details, and from the taxonomy-as-nomenclature perspective, systematics is seen as a pedantic holdover of Victorian pseudo-science, its practices the relics of a bygone era, and the very existence of undescribed species or unstable classification the function of some intrinsic psycho-intellectual flaw known collectively as the “taxonomic impediment” rather than a reflection of the raw magnitude of biodiversity. Similar brands of taxonomic naïveté have manifested elsewhere, as in recent debates over the wisdom of taxonomic descriptions using photographs as “types” (Garraffoni and Freitas, 2017; see also Amorim et al., 2016; Ceríaco et al., 2016; Pape, 2016; Santos et al., 2016). Although hailed as a possible solution to the taxonomic impediment, DNA barcoding performed uncritically risks the encumbrance of subsequent efforts and defeats its own purpose.

It seems generally accepted that, with exceptions in various groups ranging from genera to families, conventional barcode analyses work quite well in circumscribing potentially recognizable species that can be further corroborated with other characters. Why then be concerned about using distance measures as arbiters of identity? Although this paper is no place to resurrect a discussion on species concepts, there is nothing mysterious about the fact that barcode analyses tend to predict species that are ultimately recognizable by other means; certainly the rigorous evaluation of candidate loci undertaken before settling on COI has resolved that much. But it is important to separate the statement that NJ analyses “work” to identify species from the supposition that they allow us to infer anything about species in the abstract. The premise of the claim that NJ works to identify species united by some abstracted metaphysical property is that the species criterion is unspecified. This is not mere sophistry: without establishing or allowing for an independent criterion for corroboration, there can be no means of evaluating what works and what does not, because the claim is fundamentally unfalsifiable. If we adopt the perspective that species (whatever evolutionary concepts to which they may or may not conform) can be palatably recognized by congruent character data, then accepting provisional clusters as working hypotheses subject to further corroboration is quite reasonable. In other words, the fact that a very high proportion of diagnosable species are captured by NJ analyses is encouraging, but not sufficient. We maintain simply that even a small percentage of species overlooked or misdiagnosed warrants acknowledgment, and that the arbitrariness of inferring a universal distance measure is unnecessary when the means exist for quantifying diagnostic features directly.
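To make concrete what quantifying diagnostic features directly could look like, the following is a minimal sketch and not the procedure of any particular published tool. It assumes pre-aligned sequences of equal length grouped by provisional species, and it reports alignment positions where the target group is fixed for a state found in no other group. The function name and the toy data are hypothetical, and gaps or ambiguity codes are treated as ordinary states for simplicity.

```python
def diagnostic_positions(aligned, target):
    """Return {position: state} for sites that diagnose `target`.

    `aligned` maps a group label to a list of aligned sequences of equal
    length (an assumption of this sketch). A site is reported when every
    sequence in the target group shares one state and that state occurs
    in no sequence of any other group.
    """
    length = len(aligned[target][0])
    result = {}
    for pos in range(length):
        target_states = {seq[pos] for seq in aligned[target]}
        other_states = {
            seq[pos]
            for label, seqs in aligned.items()
            if label != target
            for seq in seqs
        }
        # Fixed within the target group and absent everywhere else.
        if len(target_states) == 1 and not (target_states & other_states):
            result[pos] = next(iter(target_states))
    return result


# Hypothetical toy alignment: two provisional species, four sites.
toy = {
    "species_A": ["ACGT", "ACGT"],
    "species_B": ["ACGA", "ACTA"],
}
print(diagnostic_positions(toy, "species_A"))
```

Unlike a percent-difference threshold, the output names specific character states that can be cited in a diagnosis and checked against new specimens.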

DNA barcoding represents a tool with a range of empirical uses as broad as the array of taxa and available specimens with accompanying barcodes. Although these empirical uses do not extend to rigorous phylogenetic testing, barcode data realize their greatest potential throughout the recursive process of taxonomic investigation. In our view, the coupling of DNA barcoding with distance methods rendered its potential as a taxonomic tool under-realized. Although we actively embrace DNA barcoding in our own taxonomic research and as a near-universal advance for taxonomic research in general, we reject the premise that DNA barcoding serves to repair some inherent flaw in the practice of systematics. We view the taxonomic impediment not as a manifestation of human-induced shortcomings but as a reflection of the magnitude of global species richness.

We hope to have distinguished methodological issues from semantic ones by pointing out, for example, that percent differences are by definition mathematically non-diagnostic. But our primary aim is not to redress common practices; it is to suggest that more could be gained from additional analyses that would serve the formal taxonomic goals of diagnosis. It is not our intent to cast a pall over the use of barcode data to uncover diversity at fine scales, but to articulate how those data may continue to be enhanced. We stress the importance of not over-stating the implications of a word survey; our hope is merely to have provided a crude calibration of how quickly we might reasonably expect to see significant shifts in how barcode data are analyzed. A conclusion of this exercise is that researchers are more likely to follow the examples of their peers and use the tools most readily available than they are to ponder the minutiae of evolutionary analyses.

Author Contributions

Both authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.


The authors are solely responsible for the writing of this paper.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor and reviewer, RH, declared their involvement as co-editors in the Research Topic, and confirm the absence of any other collaboration.


RD acknowledges the Institute for Comparative Genomics at the AMNH (ICG-AMNH) and the Lewis and Dorothy Cullman Program in Molecular Systematics and the Korein Family for continued support. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA; USDA is an equal opportunity provider and employer.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fevo.2019.00302/full#supplementary-material


Amorim, D. S., Santos, C. M., Krell, F. T., Dubois, A., Nihei, S. S., Oliveira, O. M., et al. (2016). Timeless standards for species delimitation. Zootaxa 4137, 121–128. doi: 10.11646/zootaxa.4137.1.9

Andersen, K., Bird, K. L., Rasmussen, M., Haile, J., Breuning-Madsen, H., Kjaer, K. H., et al. (2012). Meta-barcoding of ‘dirt’ DNA from soil reflects vertebrate biodiversity. Mol. Ecol. 21, 1966–1979. doi: 10.1111/j.1365-294X.2011.05261.x

Avise, J. C., Arnold, J., Ball, R. M., Bermingham, E., Lamb, T., Neigel, J. E., et al. (1987). Bridge between population, genetics and systematics. Ann. Rev. Ecol. Syst. 18, 489–522. doi: 10.1146/annurev.es.18.110187.002421

Brower, A. V. Z. (1999). Delimitation of phylogenetic species with DNA sequences: a critique of Davis and Nixon's population aggregation analysis. Syst. Biol. 48, 199–213. doi: 10.1080/106351599260535

Cameron, S. L., Lambkin, C. L., Barker, S. C., and Whiting, M. F. (2007). A mitochondrial genome phylogeny of Diptera: Whole genome sequence data accurately resolve relationships over broad timescales with high precision. Syst. Entomol. 32, 40–59. doi: 10.1111/j.1365-3113.2006.00355.x

Ceríaco, L. M., Gutiérrez, E. E., and Dubois, A. (2016). Photography-based taxonomy is inadequate, unnecessary, and potentially harmful for biological sciences. Zootaxa 4196, 435–445. doi: 10.11646/zootaxa.4196.3.9

Cheng, L., Connor, T. R., Sirén, J., Aanensen, D. M., and Corander, J. (2013). Hierarchical and spatially explicit clustering of DNA sequences with BAPS software. Mol. Biol. Evol. 30, 1224–1228. doi: 10.1093/molbev/mst028

Davis, J. I., and Nixon, K. C. (1992). Populations, genetic variation, and the delimitation of phylogenetic species. Syst. Biol. 41, 421–435. doi: 10.1093/sysbio/41.4.421

DeSalle, R. (2006). Species discovery versus species identification in DNA barcoding efforts: response to Rubinoff. Conserv. Biol. 20, 1545–1547. doi: 10.1111/j.1523-1739.2006.00543.x

Doyle, J. J. (1995). The irrelevance of allele tree topologies for species delimitation, and a non-topological alternative. Syst. Bot. 20, 574–588.

Farris, J. S. (1980). The efficient diagnoses of the phylogenetic system. Syst. Zool. 29, 386–401. doi: 10.2307/2992344

Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791.

Ferguson, J. W. H. (2002). On the use of genetic divergence for identifying species. Biol. J. Linn. Soc. 75, 509–516. doi: 10.1046/j.1095-8312.2002.00042.x

Fujita, M. K., Leaché, A. D., Burbrink, F. T., McGuire, J. A., and Moritz, C. (2012). Coalescent-based species delimitation in an integrative taxonomy. Trends Ecol. Evol. 27, 480–488. doi: 10.1016/j.tree.2012.04.012

Funk, D. J., and Omland, K. E. (2003). Species-level paraphyly and polyphyly: Frequency, causes, and consequences, with insights from animal mitochondrial DNA. Annu. Rev. Ecol. Evol. Syst. 34, 397–423. doi: 10.1146/annurev.ecolsys.34.011802.132421

Goldstein, P. Z., and DeSalle, R. (2011). Integrating DNA barcode data with taxonomic practice: Determination, discovery, and description. Bioessays 33, 135–147. doi: 10.1002/bies.201000036

Hajibabaei, M., Janzen, D. H., Burns, J. M., Hallwachs, W., and Hebert, P. D. (2006). DNA barcodes distinguish species of tropical Lepidoptera. Proc Natl Acad Sci U.S.A. 103, 968–971. doi: 10.1073/pnas.0510466103

Hausmann, A., Miller, S. E., Holloway, J. D., deWaard, J. R., Pollock, D., Prosser, S. W., et al. (2016). Calibrating the taxonomy of a megadiverse insect family: 3000 DNA barcodes from geometrid type specimens (Lepidoptera, Geometridae). Genome 59, 671–684. doi: 10.1139/gen-2015-0197

Hebert, P. D., Cywinska, A., Ball, S. L., and deWaard, J. R. (2003a). Biological identifications through DNA barcodes. Proceedings of the Royal Society of London. Series B: Biological Sciences 270, 313–321. doi: 10.1098/rspb.2002.2218

Hebert, P. D., Ratnasingham, S., and deWaard, J. R. (2003b). Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proc. R. Soc. Lond. Ser. B Biol. Sci. 270, S96–S99. doi: 10.1098/rsbl.2003.0025

Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S., and Madden, T. L. (2008). NCBI BLAST: a better web interface. Nucleic Acids Res. 36, W5–W9. doi: 10.1093/nar/gkn201

Jombart, T., Devillard, S., and Balloux, F. (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 11:94. doi: 10.1186/1471-2156-11-94

Jörger, K. M., and Schrödl, M. (2013). How to describe a cryptic species? Practical challenges of molecular taxonomy. Front. Zool. 10:59. doi: 10.1186/1742-9994-10-59

Kress, W. J., García-Robledo, C., Uriarte, M., and Erickson, D. L. (2015). DNA barcodes for ecology, evolution, and conservation. Trends Ecol. Evol. 30, 25–35. doi: 10.1016/j.tree.2014.10.008

Leavitt, J. R., Hiatt, K. D., Whiting, M. F., and Song, H. (2013). Searching for the optimal data partitioning strategy in mitochondrial phylogenomics: a phylogeny of Acridoidea (Insecta: Orthoptera: Caelifera) as a case study. Mol. Phylogenet. Evol. 67, 494–508. doi: 10.1016/j.ympev.2013.02.019

Little, D. P., and Stevenson, D. W. (2007). A comparison of algorithms for the identification of specimens using DNA barcodes: examples from gymnosperms. Cladistics 23, 1–21. doi: 10.1111/j.1096-0031.2006.00126.x

Matz, M. V., and Nielsen, R. (2005). A likelihood ratio test for species membership based on DNA sequence data. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360, 1969–1974. doi: 10.1098/rstb.2005.1728

Monaghan, M. T., Wild, R., Elliot, M., Fujisawa, T., Balke, M., Inward, D. J., et al. (2009). Accelerated species inventory on Madagascar using coalescent-based models of species delineation. Syst. Biol. 58, 298–311. doi: 10.1093/sysbio/syp027

Prendini, L. (2005). Comment on “Identifying spiders through DNA barcodes”. Can. J. Zool. 83, 498–504. doi: 10.1139/z05-025

Puillandre, N., Lambert, A., Brouillet, S., and Achaz, G. (2012). ABGD, Automatic Barcode Gap Discovery for primary species delimitation. Mol. Ecol. 21, 1864–1877. doi: 10.1111/j.1365-294X.2011.05239.x

Rubinoff, D. (2006b). Barcodes, integrated. DNA barcoding evolves into the familiar. Conserv. Biol. 20, 1548–1549. doi: 10.1111/j.1523-1739.2006.00542.x

Santos, C. M. D., Amorim, D. S., Klassa, B., Fachin, D. A., Nihei, S. S., De Carvalho, C. J. B., et al. (2016). On typeless species and the perils of fast taxonomy. Syst. Entomol. 41, 511–515. doi: 10.1111/syen.12180

Sarkar, I. N., Planet, P. J., and DeSalle, R. (2008). CAOS software for use in character-based DNA barcoding. Mol. Ecol. Resour. 8, 1256–1259. doi: 10.1111/j.1755-0998.2008.02235.x

Shokralla, S., Spall, J. L., Gibson, J. F., and Hajibabaei, M. (2012). Next-generation sequencing technologies for environmental DNA research. Mol. Ecol. 21, 1794–1805. doi: 10.1111/j.1365-294X.2012.05538.x

Sickel, W., Ankenbrand, M. J., Grimmer, G., Holzschuh, A., Härtel, S., Lanzen, J., et al. (2015). Increased efficiency in identifying mixed pollen samples by meta-barcoding with a dual-indexing approach. BMC Ecol. 15:20. doi: 10.1186/s12898-015-0051-y

Stoeckle, M. (2003). Taxonomy, DNA, and the bar code of life. Bioscience 53, 796–797. doi: 10.1641/0006-3568(2003)053[0796:TDATBC]2.0.CO;2

Wiemers, M., and Fiedler, K. (2007). Does the DNA barcoding gap exist?–a case study in blue butterflies (Lepidoptera: Lycaenidae). Front. Zool. 4:8. doi: 10.1186/1742-9994-4-8

Zahiri, R., Lafontaine, J. D., Schmidt, B. C., deWaard, J. R., Zakharov, E. V., and Hebert, P. D. N. (2017). Probing planetary biodiversity with DNA barcodes: The Noctuoidea of North America. PLoS ONE 12:e0178548. doi: 10.1371/journal.pone.0178548

Zhang, J., Kapli, P., Pavlidis, P., and Stamatakis, A. (2013). A general species delimitation method with applications to phylogenetic placements. Bioinformatics 29, 2869–2876. doi: 10.1093/bioinformatics/btt499

Keywords: DNA barcode, phylogenetics, diagnosis, species delimitation, specimen identification

Citation: DeSalle R and Goldstein P (2019) Review and Interpretation of Trends in DNA Barcoding. Front. Ecol. Evol. 7:302. doi: 10.3389/fevo.2019.00302

Received: 15 March 2019; Accepted: 26 July 2019;
Published: 10 September 2019.

Copyright © 2019 DeSalle and Goldstein. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Rob DeSalle, desalle@amnh.org

Source: https://www.frontiersin.org/articles/10.3389/fevo.2019.00302/full

Biological identifications through DNA barcodes.

Although much biological research depends upon species diagnoses, taxonomic expertise is collapsing. We are convinced that the sole prospect for a sustainable identification capability lies in the construction of systems that employ DNA sequences as taxon 'barcodes'. We establish that the mitochondrial gene cytochrome c oxidase I (COI) can serve as the core of a global bioidentification system for animals. First, we demonstrate that COI profiles, derived from the low-density sampling of higher taxonomic categories, ordinarily assign newly analysed taxa to the appropriate phylum or order. Second, we demonstrate that species-level assignments can be obtained by creating comprehensive COI profiles. A model COI profile, based upon the analysis of a single individual from each of 200 closely allied species of lepidopterans, was 100% successful in correctly identifying subsequent specimens. When fully developed, a COI identification system will provide a reliable, cost-effective and accessible solution to the current problem of species identification. Its assembly will also generate important new insights into the diversification of life and the rules of molecular evolution.

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1691236/

DNA barcoding: a six-question tour to improve users' awareness about the method


DNA barcoding is a recent and widely used molecular-based identification system that aims to identify biological specimens and to assign them to a given species. However, DNA barcoding is more than this: besides its many practical uses, it can be considered the core of an integrated taxonomic system in which bioinformatics plays a key role. DNA barcoding data can be interpreted in different ways depending on the taxa examined, but the technique relies on standardized approaches, methods and analyses. The existing reference for a common way to treat DNA barcoding data, analyses and results is the Barcode of Life Data Systems. However, in recent years the scientific community has produced a number of alternative methods to manage barcoding data. The present work starts from this point, because users should be aware of the consequences their choices have on the results. Although strict standardization is the essence of DNA barcoding, we propose a tour of six questions to improve users' awareness of the method, the correct use of concepts and the alternative tools provided by the scientific community.

DNA barcoding, molecular identification, species identification, DNA taxonomy


DNA barcoding is a molecular and bioinformatic tool that aims to identify biological species [1]. The basic idea is quite simple: through the analysis of the variability in a single (or few) standard molecular marker(s), it is possible to discriminate biological entities (hopefully belonging to the species taxonomic rank). This method relies on the assumption that the genetic variation between species exceeds that within species. Consequently, the ideal DNA barcoding analysis mirrors the distributions of intra- and inter-specific variabilities separated by a distance called ‘DNA barcoding gap’ [2, 3].
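The barcoding-gap assumption stated above can be checked numerically on a labelled reference set. The following is a minimal sketch under stated assumptions: sequences are already aligned to equal length, distances are uncorrected p-distances, and the function names and toy records are hypothetical rather than part of any barcoding toolkit.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(records):
    """Return (max_intra, min_inter, gap) for (species, sequence) records.

    A positive gap means the smallest between-species distance exceeds
    the largest within-species distance in this particular sample.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    if not inter:
        raise ValueError("need sequences from at least two species")
    max_intra = max(intra) if intra else 0.0
    min_inter = min(inter)
    return max_intra, min_inter, min_inter - max_intra


# Hypothetical toy records; real barcodes are several hundred bp long.
records = [
    ("sp_a", "AAAAAAAAAA"),
    ("sp_a", "AAAAAAAAAC"),
    ("sp_b", "AAAAACCCCC"),
    ("sp_b", "AAAAACCCCC"),
]
print(barcode_gap(records))
```

A gap observed in one sample is only as good as the sampling behind it: adding specimens can raise the largest within-species distance or lower the smallest between-species distance until the gap closes.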

The original idea was to apply DNA barcoding systematically to all metazoans, using one or a few (mitochondrial) markers (e.g. coxI [1]; see also question 3). Rapidly, but with less coherent results, the idea was extended to flowering plants [4, 5] and fungi [6], and the DNA barcoding initiative can now be considered a tool suitable for all branches of the tree of life (even for Bacteria and Archaea, using multi-locus sequence typing, see [7]). At present, however, DNA barcoding works successfully in metazoans and fungi but remains problematic in plants. Efforts in DNA barcoding development and management are coordinated by the Consortium for the Barcode of Life (CBoL; http://barcoding.si.edu/).

One of the major properties of a DNA barcode is the possibility to easily associate all life history stages and genders (in particular when the morphology, living behaviour, habitat are consistently different) or to identify organisms from part/pieces (for example, when parasites are recovered by physicians or veterinarians, or in the case of slices of food analysed for traceability purposes) or to discriminate a matrix containing a mixture of biological species (i.e. in the technique known as ‘environmental pyrosequencing’) [8–10]. Quite soon it became clear that DNA barcoding was suitable for two different purposes: (i) the molecular identification of already described species [1] and (ii) the discovery of undescribed species [11].

DNA barcoding has generated a vast debate in the scientific community, which has been deeply divided into supporters and critics from the beginning. It is not the aim of our work to follow this debate, but an idea of it can be obtained from the references [2, 12–20]. Our goal is rather the opposite. Given that DNA barcoding is a method, and that like all methods it can be more or less fallible, we describe and critically classify the principal analytical approaches proposed by the scientific community, taking into account that the failures lie mainly in the nature of biological species, in the patterns of molecular evolution, in the completeness of sampling, in hybridization events and in the heteroplasmy of sequences from different tissues rather than in the method [16, 21–32].

Hence, what is the revolution introduced by DNA barcoding? In our opinion, the big leap is not only in the discrimination power itself, but also resides in the conjugation of three innovations of modern taxonomy: (i) molecularization (i.e. the use of the variability in a molecular marker as a discriminator); (ii) computerization (i.e. the not redundant transposition of the data using informatics supports) and (iii) standardization (i.e. the extension of the approach to vast groups of organisms not deeply related) of the taxonomic approach. Molecularization [33–35] and computerization [36–39] have been independently present in the taxonomic world for a long time; from this derives some superficial criticisms relative to the absence of innovation in DNA barcoding (see for instance [17]). Standardization was randomly present in the taxonomic world: for instance, the international codes for nomenclature were big efforts in the direction of a common language among researchers. However, from the practical point of view, standardization of the approaches was less a demand in the specialized taxonomic world. For the first time in years, by DNA barcoding, it is possible to introduce in taxonomy a generalization, allowing researchers specialized in different fields to work on a shared framework.

In the space of a few years DNA barcoding has moved from fantasy to reality. In some of the first enthusiastic reports, DNA barcoding was even claimed as the way to realize the dreams of Gene Roddenberry, the creator of the science fiction drama Star Trek, by proposing the creation of a tool for organism identification, the DNA barcoder, a homologue of the fictional Tricorder [40]. A few years later we are not yet aboard the starship Enterprise, but DNA barcoding has deeply impacted the scientific community, becoming a widely used approach.

The aim of the present work is to draw a sort of ‘behavioural code’ for the users, raising some key questions about the focal DNA barcoding aspects. These questions will be also a pretext to follow and summarize the development of the techniques and bioinformatics used in DNA barcoding analyses, in order to allow users to wittingly choose among the different approaches to DNA barcoding, in absence of a shared analytical procedure.


DNA barcoding is a standardized and automated system for the identification of living beings [1, 41]. However, at the present state of the art, the most relevant DNA barcoding tool, the Barcode of Life Data Systems, BOLD (http://www.barcodinglife.org/ [42]), is still being constantly developed and updated. The majority of published works perform a simple distance matrix analysis, using a Neighbour Joining (NJ) algorithm with a Kimura 2-parameter (K2P) correction (see for instance [9, 43–46]). The impression is that NJ K2P is used more by routine than by reasoned choice. This approach is, at the least, disputable [29], even if the Kimura correction has been claimed to be the best DNA substitution model for low genetic distances [47].
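For readers who use the K2P correction by routine, it may help to see what it computes. With P the proportion of sites differing by a transition and Q the proportion differing by a transversion, the standard Kimura 2-parameter distance is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)]. The sketch below is illustrative only; it assumes pre-aligned sequences, skips any site that is not A, C, G or T in both sequences, and the function name is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes (a simplification)
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


# One transition in ten sites: slightly larger than the raw 0.1 difference.
print(k2p_distance("AAAAAAAAAA", "AAAAAAAAAG"))
```

At low divergence the corrected value stays close to the uncorrected proportion of differing sites, which is one reason the choice of correction matters less than the choice of what is done with the distances afterwards.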

Most of the questions raised by the use of DNA barcoding are directly linked to the essence of an identification method. In a strict sense, to identify means simply to differentiate (i.e. a species could be defined only in relation to other species [48]). The choice of the discriminator is essential, because it is (almost) always possible to differentiate: the difficulty is in giving a biological meaning to what it has been discriminated.

Even if not always fully acknowledged, DNA barcoding implies two different approaches to discrimination. DNA barcoding sensu stricto is a simple sorting method that could differentiate biological entities. We underline: discriminate, not define them. It should be clear that DNA barcoding does not claim for a new species concept, this being absolutely not necessary to the success of the method. Indeed, the species concepts used to deal with the molecular entities identified in DNA barcoding analyses are already available in the scientific community. However, the choice of a proper species concept is crucial to perform a reasoned DNA barcoding analysis (see also question 2). DNA barcoding sensu stricto is not significantly different from a dichotomic key in the traditional taxonomy framework. DNA barcoding sensu lato represents a system that is the true sense of taxonomy [49–51]. The discrimination method itself can be considered the epiphenomenon, and the subject of major criticisms (DNA barcoding sensu stricto), but it becomes a system implementing all the aspects of taxonomy towards the representation of the living world as a whole (DNA barcoding sensu lato) [52–56].

It should be clear to the users which kind of DNA barcoding philosophy they are going to use.


Provocatively, one could state that it is never possible (or, at least, that it is potentially misleading) to identify species through the analysis of genetic divergence [26, 57, 58]. It is well known that no identification method (morphological, biochemical, genetic or otherwise) can truly identify species, because species are entities in continuous evolution and it is theoretically impossible to define such dynamic matter statically. DNA barcoding, in its original generalization, follows the typological species approach, a concept that theoretically fails because it freezes the evolutionary continuum of species. To cope with these limitations, some developments of DNA barcoding have shifted towards other species concepts (see below).

The entities identified by molecular approaches have been named in several ways: ‘Genospecies’; ‘Phylospecies’, sensu [59]; ‘Recognizable Taxonomic Units’, RTUs, sensu [60]; ‘Phylotypes’ sensu [61]; ‘Molecular Operational Taxonomic Units’, ‘MOTUs’, sensu [62].

The major issue is how close those molecular entities, whatever they are named, are to what we are used to calling ‘species’. Even if the point has been clearly treated by different authors (see, for instance [19, 63–66]), a general naïve assumption considers ‘molecular entities’ and ‘species’ as synonyms. But when is a molecular entity equal to a species? And what is the meaning of a molecular entity that does not match a described species? DNA barcoding deals with the boundaries among races, varieties, demes, populations and species, a well-known tricky field. Darwin [67] defined both the species and the variety as groups of individuals arbitrarily named by experts ‘for the sake of convenience’. Mallet [58] has significantly supported Darwin's thoughts. This is the (almost) insurmountable problem for DNA barcoding sensu stricto: the biological meaning of the identified ranks cannot be directly derived, unless we have clearly and unequivocally linked a species to the variability pattern of a single DNA barcoding marker. In all the other cases, we need DNA barcoding sensu lato [53, 68, 69].

The identification and then the interpretation of molecular entities is the main goal of DNA barcoding that could be reached only by users with a sound theoretical background on what is identifiable by this technique. We believe that the users do not need to mention this debate in their works, but its consequences should be clear in their results and discussions, to avoid confusion and a misleading interpretation of the data.


DNA barcoding is not coxI only. A precise portion in the 5′ end region of this mitochondrial gene has been proposed as a standard for metazoans ([70]; Table 1). Even if coxI has proven to be useful to discriminate species in most groups tested, its limits in some animal taxa are already evident [2, 46, 71]. In spite of these problems, coxI is the main marker for DNA barcoding purposes in metazoans, as revealed by the high number of published projects (see Project section at http://www.boldsystems.org). The choice of regions usable for DNA barcoding has been little investigated in many other eukaryotes. For instance, a marker was already available in fungi: the nuclear ITS region, which has now been confirmed as the main DNA barcode for this group ([72]; Table 1), even if coxI has been successfully tested on these organisms [73]. In land plants, compared to animals, mitochondrial DNA has slower substitution rates and shows intra-molecular recombination [74]; plant coxI thus exhibits a low rate of evolution, limiting its ability to identify plant taxa unambiguously. The search in land plants for an analogue of coxI or ITS that meets the DNA barcoding criteria has focused attention on the plastid genome. Several plastid genes have been proposed, such as the more conserved rpoB, rpoC1 and rbcL or a section of matK showing a rapid rate of evolution, but in some plant families these genes showed amplification problems (see [5]). At the same time, intergenic spacers such as trnH-psbA, atpF-atpH and psbK-psbI were tested for their rapid evolution [75, 76], but showed standardization and sequence alignment problems. Recently, the CBoL Plant Working Group [77] provided a recommendation on a standard plant barcode, suggesting the 2-locus combination of rbcL and matK (see Table 1).

Table 1:

List of the main taxa-specific DNA barcoding campaigns published on BOLD and related successfully tested markers

DNA barcoding data are meant to be easily and widely accessed and to gain this purpose a proper sequence submission procedure is available for GenBank (http://www.ncbi.nlm.nih.gov/WebSub/?tool=barcode). This procedure slightly modifies the standard sequence submission procedure, introducing a DNA barcoding label to the sequence to simplify database querying and searching. Moreover, additional data are requested to link barcode sequence data to its voucher specimen. This standardization is mirrored by the establishment of the Registry of Biological Repositories initiative (http://www.biorepositories.org/), an online registry of organisms linked to DNA sequences.

DNA barcoding sequences can also be deposited as projects in BOLD databases that are characterized by an automatic submission tool to publish sequences to GenBank.


DNA barcoding is a standardized method that should be influenced as little as possible by the taxon under study. Designing ad hoc primers may be the only option in some cases, but as a general principle, primers that work on only a few genera, or even a single genus, are not in keeping with the DNA barcoding philosophy, which is based on the possibility of amplifying the same barcode region in different taxa in a single PCR run and, ideally, with the same primers. As a first approach to a DNA barcoding study, it is appropriate to test some widely used primers. In BOLD the registration of DNA barcoding primers for the different barcodes is available and encouraged (http://www.boldsystems.org/views/primerlist.php). In animals, pairs and/or cocktails of coxI primers have been successfully used [78]. In conclusion, a DNA barcoding user approaching the characterization of an animal group never tested before should find it easy to try a vast collection of primers and conditions. The situation is quite different in plants, where most of the difficulties in primer design are currently encountered. Considering the two best candidates, rbcL and matK, a universal primer combination suitable for all plant species has been defined only for the former [76]. In contrast, matK has been analysed in different plants and several reports have been published regarding the universality of the primers, ranging from routine success [79] to more patchy recovery [5, 39, 75]. In conclusion, plant DNA barcoding users face more uncertainty. In any case, for the past few months there has at least been a standard proposal to follow [77].

In fungi there has been extensive debate about the best barcode region but, at present, the community has decided to focus mainly on the ITS region, for which a list of primers and related amplification conditions is now available.


A proper DNA barcode must have a minimum length. According to BOLD-IDS (see question 6, first answer), a queried coxI, rbcL or matK sequence has to be at least 500 bp long, while an ITS query sequence has to be at least 100 bp. Shorter coxI barcode sequences have also been considered as ‘minibarcodes’, especially for degraded samples of metazoans [80].
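As a minimal illustration of the length requirement above, the following Python sketch checks a query against the minimum lengths quoted in the text. The function name, the dictionary keys and the decision to ignore gap characters are assumptions made for this example; they are not part of BOLD-IDS.

```python
# Minimum query lengths (in bp) as quoted in the text.
# The marker names used as keys are illustrative, not an official vocabulary.
MIN_QUERY_LENGTH_BP = {"coxI": 500, "rbcL": 500, "matK": 500, "ITS": 100}


def meets_minimum_length(sequence, marker):
    """Return True if the ungapped sequence is long enough for the given marker."""
    # Alignment gaps and whitespace are not nucleotides, so they are not counted
    # (an assumption of this sketch).
    ungapped = [c for c in sequence if c not in "-. \n\t"]
    return len(ungapped) >= MIN_QUERY_LENGTH_BP[marker]
```

For example, `meets_minimum_length(query, "ITS")` would accept a 100 bp ITS query that the 500 bp coxI limit would reject.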

Nuclear mitochondrial DNA sequences (numts) and, more generally, pseudogenes are non-functional copies of an original gene that can be amplified instead of, or together with, the functional copy. When these entities are copies of a DNA barcode, they are referred to as ‘DNA barcode-like sequences’. Several studies show that: (i) these sequences are widespread among different taxa [23, 27, 81]; (ii) including them in a DNA barcoding study can strongly affect the accuracy of subsequent analyses (e.g. by overestimating inter-specimen variability, see [27]). It is hence fundamental to detect pseudogenes and remove them from the reference dataset.

Song et al. [27] and Buhay [81] suggested step-by-step procedures to identify possible pseudogenes. BOLD itself provides a quality control tool that checks sequences for the presence of stop codons and verifies that they derive from coxI by comparing them against a Hidden Markov Model. It should be noted, however, that even with these procedures the impact of numts can be overlooked [27]. At the moment, numts are a serious problem for analyses based on mitochondrial and plastid genes.
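The stop-codon part of such a quality check can be sketched in a few lines of Python. The sketch below is not the BOLD tool and does not include the Hidden Markov Model comparison. It only counts in-frame stop codons in the three forward reading frames, using the stop codons of the NCBI vertebrate and invertebrate mitochondrial genetic codes (translation tables 2 and 5). The function names and the rule of flagging a sequence when no frame is free of stops are assumptions made for illustration.

```python
# Stop codons of two NCBI mitochondrial genetic codes (translation tables 2 and 5).
STOP_CODONS = {
    "vertebrate_mito": {"TAA", "TAG", "AGA", "AGG"},
    "invertebrate_mito": {"TAA", "TAG"},
}


def stop_codons_per_frame(sequence, code="invertebrate_mito"):
    """Count in-frame stop codons in each of the three forward reading frames."""
    seq = sequence.upper().replace("-", "")
    stops = STOP_CODONS[code]
    counts = []
    for frame in range(3):
        # Walk the sequence codon by codon, starting at offset 0, 1 or 2.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(codon in stops for codon in codons))
    return counts


def looks_like_pseudogene(sequence, code="invertebrate_mito"):
    """Flag a sequence when none of the forward frames is free of stop codons."""
    # Amplicons do not necessarily start in frame, so all three frames are tried.
    return min(stop_codons_per_frame(sequence, code)) > 0
```

A clean reading frame does not prove that a sequence is functional, in line with the caveat that numts can be overlooked even when such checks are applied [27].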


As stated before, there is no consensus on the best method to analyse DNA barcoding data. This is confirmed by the publication of works in which researchers tested different approaches on the same dataset [82–84]. In most cases the result is the same: no analytical method outperforms the others, and the ‘best method’ is case dependent. In such a dynamic and fluid situation, users need to become increasingly familiar with the bioinformatics of DNA barcoding. At the same time, users should learn how to manage data properly, to avoid errors and incorrect interpretation of the results. It should be underlined that the objective of DNA barcoding analyses is relatively simple: to assign each query sequence to a set of referenced (tagged-specimen) sequences extracted, for instance, from databases like BOLD (see below). A profusion of bioinformatics approaches is available to reach this aim. In this section, we propose a classification of the main cases a DNA barcoding user can face and of the bioinformatics methods that can be used, highlighting pros and cons and the species concept each method implements. Indeed, in disagreement with Ferguson [57], we think that a species concept is always implicit in a method aiming at the identification of biological entities.

Advances in sequencing and computational technologies have transformed DNA barcoding into an ambitious initiative, made up of different projects converging on the single aim of CBoL: to create a universal system for the inventory of living beings, the BOLD system [42]. By February 2010 the BOLD database encompassed more than 790 000 sequences, corresponding to more than 67 000 formally described ‘species’. The amount of data managed by the BOLD database is impressive: for a large number of deposited barcode sequences it collects specimen details such as morphology, photographs, geographical distribution, collection points and others [42].

Another Web-based tool recently made available is BioBarcode, a data-processing system based on open source software [85]. BioBarcode is directed to the collection of Asiatic organisms, and currently encompasses around 11 300 specimen entries.

For every method considered in the following parts, Table 2 gives a summary of Web resources, if available, and references.

Table 2:

Summary of the methods classified in question 6, with details on availability, Web resources and author contacts

| Typology | Method(s) | Software/tool(s) | Resources | References |
| --- | --- | --- | --- | --- |
| Threshold (distance) | Similarity | Blastall–BLASTn | ftp://ftp.ncbi.nih.gov/blast/ | [93] |
| | Similarity | BLAT | http://genome-test.cse.ucsc.edu/~kent/exe/ | [94] |
| | Similarity | Blastall–megaBLAST | ftp://ftp.ncbi.nih.gov/blast/ | [90] |
| | Pairwise distance | TaxI | [email protected] | [97] |
| | Pairwise distance | TaxonDNA | http://taxondna.sf.net/ | [100] |
| | K2P distance | MUSCLE, MEGA | [email protected] | [69] |
| | K2P distance | BOLD-IDS | http://www.barcodinglife.org/views/idrequest.php | [42] |
| | Patristic distance | MrBayes, PAUP, APE, Perl scripts | [email protected] | [101] |
| Phylogenetic | Neighbour Joining | MUSCLE, MEGA | [email protected] | [82] |
| | Parsimony | MUSCLE, TNT | [email protected] | [90] |
| | Maximum likelihood | MUSCLE, SPR1, PHYML2 | http://atgc.lirmm.fr/spr/ | [82] |
| | Bayesian inference | SAP | http://fisher.berkeley.edu/cteg/software/munch | [102, 103] |
| | Coalescent based | | [email protected] | [108] |
| | Coalescent based | | [email protected] | [87] |
| | Coalescent based | COALESCENCE, FLUCTUATE, PAUP, Seq-Gen | [email protected] | [88] |
| | Coalescent based | COAL, MESQUITE | [email protected] | [109] |
| | Coalescent based | general mixed Yule-coalescent (GMYC) model | [email protected] | [112] |
| Character based | Diagnostic | CAOS | http://www.genomecurator.org/CAOS/CAOSindex.html | [118] |
| | Diagnostic | MATLAB, local perl scripts | [email protected] | [121] |
| | Diagnostic | DNA-BAR (degenbar) | http://dna.engr.uconn.edu/~software/DNA-BAR/ | [120] |
| | Diagnostic | DOME ID (local perl scripts) | [email protected] | [90] |
| Combined | Yule model/coalescence | TCS, MEGA, Arlequin, PAUP, PAUPRat script, Phylip, r8s, R | http://www.imedea.uib.es/~jpons/JPWPhome.htm | [124] |
| | BLAST/parsimony ratchet | BLAST, MUSCLE, TNT | [email protected] | [90] |
| | BLAST/SPR | BLAST, MUSCLE, SPR | [email protected] | [90] |
| | BLAST/Neighbour Joining | BLAST, MUSCLE, neighbour | [email protected] | [90] |
| Alignment-free | Tree-based | ATIM: TNT, local scripts | [email protected] | [90] |
| | Component vector | CVTree alpha 1.0 | http://cvtree.cbi.pku.edu.cn | [127, 128] |
| | Spectrum kernel method | Spectrum | [email protected] | [129] |
| Web tool | | Web browser | http://www.ibarcode.org | [56] |
| | | Web browser | http://www.dnabarcodelinker.com/ | [56] |
| | | Web browser | http://www.asianbarcode.org/ | [85] |
| Other | | ConFind, Python | http://www.colorado.edu/chemistry/RGHP/software/ | [119] |

Single query, large dataset with good intra-specific sampling, or trivial identifications, no idea of sequence variability: methods based on threshold

These methods are based on the analysis of similarity between barcode sequences and a reference dataset. In a strict sense they follow a typological species concept, and discriminate entities that exceed a certain level of variability, called the threshold value. Threshold approaches rely on the assumption that intra-specific sequence variation does not exceed a certain distance value; sequences that diverge by more than this value are considered to belong to different species. In general, these methods perform DNA barcoding sensu stricto, and are usually chosen because they are faster and require little knowledge of population structure or phylogenetic relationships. They can be considered the ‘first choice’ for new users approaching DNA barcoding. However, to work they require a reference dataset, generated through the coordinated work of traditional and molecular taxonomists. Hebert et al. [43] first proposed the use of a divergence threshold following the ‘10-fold rule’: the gap corresponds to 10 times the value of intra-specific divergence. This rule has been heavily criticized (see for instance [2, 13]).
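A minimal Python sketch of such a rule is given below. It assumes that the ‘value of intra-specific divergence’ is summarized by the mean of the observed intra-specific distances; this choice and the function names are assumptions of the example, not a reproduction of the procedure in [43].

```python
from statistics import mean


def tenfold_threshold(intraspecific_distances):
    """Threshold set at 10 times the mean intra-specific divergence (assumed summary)."""
    return 10 * mean(intraspecific_distances)


def exceeds_threshold(distance, intraspecific_distances):
    """True if a pairwise distance is larger than the 10-fold threshold."""
    return distance > tenfold_threshold(intraspecific_distances)
```

With intra-specific distances of 0.2% and 0.4%, for instance, the threshold would be 3%, so a pair of sequences diverging by 5% would be placed in different species.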

The BOLD system uses a threshold approach, allowing a Web user to perform species identification by querying the BOLD database with an appropriate sequence (see question 5): this tool is called the identification system engine (BOLD-IDS [42]). BOLD-IDS is based on similarity methods and distance tree reconstruction (for details see [19, 86]). BOLD uses a K2P distance of 1% as a universal threshold value for discriminating metazoans [42]. At the final stage of the analysis, the query sequence is assigned the species name of its nearest-neighbouring referenced sequence. BOLD-IDS is continuously upgraded and faces the challenge of implementing the best algorithms developed by scientists. Statistical tests for species assignment and character-based clustering methods are the latest methods that BOLD is considering [52, 54, 87, 88]. Furthermore, in order to manage intra-specific variability correctly, it was proposed to complement the current data format for submission to BOLD with fields on the known presence of endosymbionts, insights on molecular clocks, population genetic structure and geographical distribution [19, 89]. It must be underlined that these data could also be helpful if other analytical methods are implemented in BOLD.
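The logic of a nearest-neighbour assignment under a K2P threshold can be sketched as follows. This is not the BOLD-IDS implementation, whose details are given in [19, 86]; it is a simplified illustration that assumes the query and the references are already aligned to the same length. The K2P distance is computed as d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function names and the choice of returning no name above the threshold are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gaps and ambiguity codes are skipped
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log / math.sqrt raise ValueError for saturated (too divergent) pairs.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def identify(query, references, threshold=0.01):
    """references: list of (species_name, aligned_sequence) pairs.

    Returns (species_name or None, distance to the nearest reference).
    """
    name, dist = min(
        ((n, k2p_distance(query, s)) for n, s in references),
        key=lambda pair: pair[1],
    )
    return (name if dist <= threshold else None), dist
```

The default threshold of 0.01 corresponds to the 1% value quoted above; any other value can be passed explicitly.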

Despite their initial success (see, e.g. [86]), pure distance-based methods may not be the most appropriate for species identification, for several reasons clearly explained in [57]: the ‘enchainment on the percent divergence’ [90]; the lack of strong biological support [2]; the loss of character information [91]; and the strong influence of incomplete taxonomic sampling, at both the species and intra-specific levels [2], and of the parameters chosen for sequence alignment [92].

The same group includes similarity methods like BLAST [93], BLAT [94] or FASTA [95], which have been widely used to infer similarity between a query sequence and barcode reference sequences. These methods deal with unaligned sequences in the reference database, use partial pairwise alignment or nearly exact matches of short strings (motifs), and are typically very fast. For users dealing with very large datasets, and when high precision is not needed, similarity-based methods allow the datasets to be analysed rapidly. Despite being extensively used in species identification (see for instance [68, 71]), approaches like BLAST have been shown to cause incorrect or inconsistent identifications [96]. Steinke et al. [97] proposed a method, named TaxI, based on pairwise alignment of the query sequence against a prealigned reference database. Even though the method is fast, it has two clear limits: (i) variable regions are difficult to align, especially for large datasets; (ii) the query could match two different reference species equally well.
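The idea of matching short strings against unaligned references, and the problem of a query matching two references equally well, can be illustrated with a toy Python sketch. It is not BLAST, BLAT, FASTA or TaxI: it simply scores each reference by the number of distinct words of length k shared with the query and reports every reference that reaches the top score. The function names and the default word length are assumptions.

```python
def kmers(sequence, k=8):
    """Set of all distinct words of length k found in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hits(query, references, k=8):
    """references: dict mapping a name to an unaligned sequence.

    Returns (sorted list of names sharing the top score, top score).
    More than one name in the list means the identification is ambiguous.
    """
    query_words = kmers(query, k)
    scores = {
        name: len(query_words & kmers(seq, k)) for name, seq in references.items()
    }
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score == top), top
```

Returning all tied references, rather than an arbitrary first hit, makes the ambiguity visible to the user instead of hiding it.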

With the aim of circumventing the ‘enchainment on the percent divergence’ problem, Baccam et al. [98] developed PAQ; Blaxter et al. [65] implemented a modified version of CLOBB [99] and Meier et al. [100] published TaxonDNA. Nevertheless, for DNA barcoding purposes all these methods have their limits: PAQ produces overlapping groups, TaxonDNA violates the threshold, whilst the modified CLOBB algorithm produces unstable groups [90].

In order to maximize the coherence between traditional taxonomy and DNA-based identification, Lefébure et al. [101] formally tested, on a Crustacea dataset, the correlation between taxonomic ranks and genetic divergences, with the aim of defining a molecular threshold to help taxonomic decisions. Both pairwise and patristic molecular distances were computed. Similarly, Ferri et al. [69] developed Perl scripts that identify the optimal threshold value, i.e. the one associated with the minimum cumulative error (the minimum degree of discrepancy between two identification approaches), for rapid species diagnosis of filarial nematodes. Both works reported that a threshold-based approach to coxI sequence variation is in overall agreement with the current taxonomy of the taxa studied. If the user’s goal is to maximize the coherence between DNA barcoding and other identification systems, these integrated methods can be the best choice.
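One way to picture the search for an optimal threshold is sketched below in Python. It is not the Perl code of Ferri et al. [69]: the definition of cumulative error used here (the fraction of conspecific pairs split by the threshold plus the fraction of heterospecific pairs lumped by it) is an assumption of the example, as are the function and variable names.

```python
def optimal_threshold(intra, inter, candidates):
    """Return (threshold, cumulative_error) with the smallest cumulative error.

    intra: distances between pairs assigned to the same species by taxonomy.
    inter: distances between pairs assigned to different species by taxonomy.
    candidates: threshold values to be tested.
    """
    best = None
    for t in candidates:
        split = sum(d > t for d in intra)    # conspecific pairs above the threshold
        lumped = sum(d <= t for d in inter)  # heterospecific pairs below it
        error = split / len(intra) + lumped / len(inter)
        if best is None or error < best[1]:
            best = (t, error)  # ties keep the first (lowest listed) candidate
    return best
```

When the intra- and inter-specific distance distributions do not overlap, some candidate yields a cumulative error of zero; overlapping distributions leave a residual error whatever threshold is chosen.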

Concerning plants, Little and Stevenson [90] compared the performance of clustering, diagnostic and combined methods against a similarity-based method on two gymnosperm datasets, one for a coding gene (matK) and one for a non-coding region (ITS2). The results showed that the best accuracy for genus- and species-level identification was reached with BLAST for the coding gene.

Small datasets relative to groups that experienced different evolutionary histories: phylogenetic approach

These methods follow a phylogenetic species concept. In general, they are time consuming because of the high computational effort, and are directed to users familiar with phylogenetic reconstruction techniques. They have been developed and proposed for DNA barcoding data analysis to overcome the limits of threshold-based approaches. However, their application leads to some confusion about the relationship between DNA barcoding and molecular phylogeny (see also [13, 66]). Moreover, the publication of trees as the only output of papers tagged as DNA barcoding has contributed to increasing and spreading criticism of the technique. It is worth remembering that DNA barcoding is not, in a strict sense, phylogenetic reconstruction: as stated before (questions 1 and 2), identifying is different from solving phylogenetic issues or classifying. These methods should be used by users with a sound background in the phylogeny versus identification debate in the DNA barcoding world.

The methods are here organized in two subclasses: (i) pure phylogenetic methods and (ii) methods based on the coalescent theory.

Small–medium-sized datasets: pure phylogenetic approaches

NJ, maximum parsimony (MP), maximum likelihood (ML) and Bayesian inference (BI) have been extensively used to identify query sequences through the reconstruction of a set of topologies. In order to include the query sequence in a specific group, it is crucial to identify group membership variables in the reconstructed hierarchy. This procedure is carried out subjectively and can consequently cause disagreement in particular identification cases. Other problems can occur when: (i) a query sequence forms a new lineage; (ii) the gene tree does not match the previous classification (e.g. because of incomplete lineage sorting or conservation of ancestral polymorphisms); (iii) poorly resolved trees impede unambiguous identification.

Elias et al. [82] analysed 58 species of tropical butterflies, comparing a similarity method (BLAST) with clustering methods (NJ, ML and BI). Their results showed almost equivalent accuracy for the four methods, although BLAST and NJ were considerably faster.

A statistical assignment of DNA sequences using BI has been developed by Munch et al. [102, 103]. The most important advance of this method is the introduction of a statistical measure of confidence. The Statistical Assignment Package (SAP) was developed to overcome limitations typical of both similarity and coalescent-based approaches (i.e. the need for extensive sampling and for multiple-gene reconstructions). The authors proposed a purely phylogenetic solution, automating a Bayesian approach based on the placement of the sample sequence in a particular clade of an established phylogeny, with an associated measure of statistical confidence. However, the BI approach to tree sampling (necessary to obtain a statistically meaningful confidence) is computationally demanding and, for large datasets, alternative sampling approaches could be explored. To overcome this computational limit, Munch et al. [103] published an alternative approach consisting of a heuristic method for taxonomic rank assignment based on NJ and bootstrapping. This method can be interpreted as a rough and fast approximation to a full Bayesian approach.
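How bootstrapping can attach a measure of confidence to an assignment may be illustrated with a deliberately simplified Python sketch. It is not SAP and it does not build NJ trees: as a stand-in, each bootstrap replicate resamples alignment columns with replacement and records the species of the reference closest to the query by uncorrected distance. The function names, the number of replicates and the use of a fixed random seed are assumptions.

```python
import random
from collections import Counter


def p_distance(a, b, columns):
    """Proportion of differing sites over a (resampled) list of column indices."""
    return sum(a[i] != b[i] for i in columns) / len(columns)


def bootstrap_support(query, references, replicates=200, seed=0):
    """references: list of (species, aligned sequence), same length as the query.

    Returns {species: fraction of replicates in which it was the nearest reference}.
    """
    rng = random.Random(seed)  # fixed seed so that results are reproducible
    length = len(query)
    votes = Counter()
    for _ in range(replicates):
        # Resample alignment columns with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        # Ties are resolved in favour of the first reference in the list.
        nearest = min(references, key=lambda ref: p_distance(query, ref[1], columns))
        votes[nearest[0]] += 1
    return {species: count / replicates for species, count in votes.items()}
```

A species supported by nearly all replicates is a stable assignment, whereas support split between several species signals that the data do not discriminate between them.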

It is known that, in order to obtain robust phylogenetic reconstructions, the use of more taxa (see, for instance [104, 105]) can perform better than the use of more genes [106, 107]. In DNA barcoding the intrinsic need for large datasets entails a computational effort that may not be easily sustainable. Given these considerations, it is clear that phylogenetic-based approaches should implement heuristics and simplified analytical methods.

Small datasets and good intra-specific sampling: methods based on the coalescent theory

These methods rely on population genetics assumptions and generally require large sampling, and consequently a considerable collection of DNA sequences for each biological entity. Population genetic theory shows that molecular threshold approaches do not take into account how the time since speciation influences patterns of genetic differentiation. In other words, there is a time lag between the speciation event and the observation of reciprocal monophyly at a specific locus. Coalescent methods take this time lag into account, introducing an essential correction into the analyses: unlike threshold-based methods, ML inference based on coalescence should provide a more realistic model of species [108].

Matz and Nielsen [108] made one of the first efforts to introduce statistical formalism into DNA barcoding data analysis. Their tree-based method takes phylogenetic uncertainty into account and uses population genetic theory to determine cut-offs for species assignment in ambiguous cases. A likelihood ratio test makes it possible to evaluate possible boundaries of intra-specific variation (for each species) on the basis of reference datasets, using population genetic inferences based on coalescent theory (implemented with the Markov Chain Monte Carlo method). This should help to calculate divergence times and/or gene flow events. The main drawbacks of this method are the requirement for large intra-specific sampling (in order to extract population genetics data), the long computation times (as a direct consequence) and the use of a single model of intra-specific variability that cannot be applied to all species.

In a different work, the same authors analysed two datasets known to be critical because of extremely low and extremely high sequence variability (butterflies and frogs, respectively) at both intra- and inter-specific levels [87]. The authors implemented a K-test and Bayesian assignment, simultaneously taking into account phylogenetic information from all the (relevant) species in the database and population genetic information (including the coalescent process within species). Abdo and Golding [88] replied with a faster and accurate model-based decision-theoretic framework (again based on coalescent theory) that associates a degree of confidence with the species assignments. Compared to Nielsen and Matz [87], this approach removes the need for multiple testing and handles more than two populations at a time, thus speeding up the analyses. Given the sequences from members of a reference group and a query sequence from a new individual, Abdo and Golding [88] inferred the assignment of the new individual by using both the distance and the posterior probability of the reference group. Tested on both real and simulated datasets (the real dataset is the same used by Nielsen and Matz [87]), the coalescent assigner proved more powerful than a distance-based method in species identification with a small sample size. However, with large sample sizes and a high rate of intra-specific substitutions, this method becomes computationally expensive. Moreover, the authors promised a new version that adds a measure of confidence to the assignments.

Knowles and Carstens [109] developed a probabilistic model to infer the relationship between gene trees and species history. Their preliminary study suggests that recently derived species can also be accurately identified long before the time necessary to achieve reciprocal monophyly. The model relies on previous works showing that gene genealogies provide information about the history of a species despite widespread incomplete lineage sorting [110, 111].

More recently, Monaghan et al. [112] developed a modified general mixed Yule-coalescent (GMYC) model for the analysis of high-throughput DNA sequencing, showing a convincing congruence with morphology (97%).

Large dataset and low intra-specific sampling: character-based approach

When species are identified through character states (i.e. presence/absence of discrete nucleotide substitutions), DNA barcoding implements character-based methods. Unlike distance- and classical phylogenetic-based approaches, character-based methods rely only on diagnostic sites which, being a small percentage of the total characters, typically make the analysis faster. These methods are considered consistent with the phylogenetic species concept [113] and can also handle other sources of data, such as morphological or ecological data. This integration leads to a synergy that has the advantage of minimizing the discrepancies between classical taxonomy and DNA barcoding. Character-based methods sidestep the distance ‘nearest neighbour problem’ by reconstructing hierarchical relationships (i.e. the common ancestor is inferred when two entities share derived characters). In addition, character-based methods are probably the best choice for datasets with few sequences for each taxonomic group. Users working on organisms for which sampling is difficult and limiting should consider these methods as a valid choice.

Sarkar et al. [114, 115] developed a tool named CAOS (Characteristic Attributes Organization System) for placing a query sequence based on the presence of shared characters that are diagnostic for nodes on the tree. Assuming that molecular information should be an active component of modern taxonomy (but should not be the sole source of information), DeSalle et al. [52] presented an operational, integrative approach to taxonomy that could allow DNA barcoding to be consistent with classical taxonomy, reconciling molecular data with other sources of characters. DeSalle et al. [52] implemented CAOS in order to identify ‘pure’, ‘private’ and ‘compound pure’ character states in three different datasets (mammals, fishes and invertebrates) and integrated these DNA characters with geographical, ecological and morphological data.
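The notion of a diagnostic character state can be made concrete with a short Python sketch. It is not CAOS and it ignores the guide tree: it only scans an alignment for positions at which one group shows a single, fixed nucleotide that is absent from every other group, which is how ‘pure’ is understood in this example (an assumption). The function name and the data layout are also assumptions; positions are 0-based.

```python
from collections import defaultdict


def pure_diagnostic_sites(alignment):
    """alignment: list of (species, aligned sequence) pairs of equal length.

    Returns {species: {position: base}} for states fixed in one species
    and absent from all the others.
    """
    # position -> species -> set of bases observed at that position
    states = defaultdict(lambda: defaultdict(set))
    length = len(alignment[0][1])
    for species, seq in alignment:
        for pos in range(length):
            states[pos][species].add(seq[pos].upper())

    diagnostics = defaultdict(dict)
    for pos, by_species in states.items():
        for species, bases in by_species.items():
            if len(bases) != 1:
                continue  # polymorphic within the species: not a fixed state
            base = next(iter(bases))
            others = set().union(
                *(b for s, b in by_species.items() if s != species)
            )
            if base not in others:
                diagnostics[species][pos] = base
    return dict(diagnostics)
```

A query can then be compared with these few positions only, which is why such approaches remain usable when each group is represented by a small number of sequences.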

Kelly et al. [116] compared the efficacy of BLAST, BOLD-IDS and CAOS on a mollusc dataset. Interestingly, although CAOS appeared to be more robust to missing data or small datasets, all three methods achieved an accuracy of 100%.

Rach et al. [117] measured the potential and effectiveness of CAOS in identifying diagnostic DNA barcodes at different taxonomic levels on a dataset of 833 insect specimens. The authors underlined that, unlike distance-based methods (where several specimens per species are needed to estimate intra-specific variability), a diagnostic method can make use of DNA barcodes obtained from a single specimen in the process of species identification.

Sarkar et al. [118] then implemented the CAOS method, presenting a set of software tools to perform a character-based diagnosis that defines characteristic attributes (CA) for every clade at each branching node within a guide tree. This version of CAOS reduced computational costs by considering only diagnostically informative CA. Current versions of CAOS are command-line applications, but Sarkar et al. [118] planned to publish a graphical Web interface for late 2009.

Even if not directly designed to work on DNA barcoding data, several applications like ConFind [119] can be used to identify conserved regions in multiple sequence alignments that can be used as diagnostic targets.

DasGupta et al. [120] published a software package, called DNA-BAR, for selecting DNA probes usable in the molecular characterization of microorganisms. The script, called degenbar, works iteratively, finding a near-minimum number of distinguishers and then allowing users to select sets of these distinguishers.

Little and Stevenson [90] modified the script degenbar [120] in order to generate a matrix of distinguishers (sequence strings) accepting a maximum of 10 000 nucleotides and proposed a tool called Diagnostic Oligo Motifs for Explicit Identification (DOME ID) [90]. According to the authors, DOME ID can be used when the user requires little ambiguity.

Richardson et al. [121] developed a Perl script that implements the functions of MBEToolbox (a MATLAB script published by [122]) for the analysis of sequences in 96-well plate format. Their script processes the *.phd.1 files produced by the sequencer and takes into account Phred scores for diagnostic sites [123]. The results are files that can be opened in spreadsheet programs and show sequence quality, comparison data and suggested identifications. The incorporation of base-call quality scores (Phred values) allows the user to perform unidirectional sequencing, halving sequencing costs.
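To show why base-call quality matters at diagnostic sites, the Python sketch below converts Phred scores into error probabilities using the standard definition Q = -10 log10(P) and lists the diagnostic positions whose quality falls below a cut-off. It is not the script of Richardson et al. [121]; the function names and the default cut-off of Q20 are assumptions.

```python
def phred_to_error_probability(q):
    """Probability that a base call is wrong, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def low_quality_diagnostic_sites(qualities, diagnostic_positions, min_phred=20):
    """Return the diagnostic positions (0-based) whose Phred score is below min_phred.

    qualities: list of per-base Phred scores for one read.
    """
    return [pos for pos in diagnostic_positions if qualities[pos] < min_phred]
```

A Phred score of 20 corresponds to a 1% chance of a wrong call and a score of 30 to 0.1%, so a single read can support an identification only where its diagnostic sites are well above the chosen cut-off.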

Particular situations: combination of methods and avoiding alignments

Pons et al. [124] proposed combining a coalescent population model with a Yule model of speciation [125], which makes it possible to define ‘species’ as clusters of specimens within a particular coalescent time frame.

Recently, a group of Web-based tools was made available to allow users to couple DNA barcoding with basic phylogenetic purposes. Some tools using the search algorithm called GoogleGene are now available: ‘iBarcode’ [56] (at http://www.ibarcode.org) and ‘DNA Barcode Linker’ (at http://www.dnabarcodelinker.com). These methods represent an appreciable attempt to support the DNA barcoding sensu lato philosophy, reconciling the systematics community.

Poor alignment quality and the user's influence on alignment procedures could be responsible for the relatively poor performance of hierarchical clustering methods. In particular, NJ may fall victim to low-quality alignments. For these reasons, when large variable loci are present, it may be desirable to avoid methods that rely on alignments.

Since no universal alignment parameters are defined (hence gap assignment in alignments is quite subjective, see [126]) and there is no consensus on what defines a good or a best alignment, Chu et al. [127, 128] explored the feasibility of grouping taxa based on component vector (CV) analysis, which does not require alignment. The authors showed that the analysis of CVs can be a reliable clustering strategy for DNA barcoding purposes, since their results were in global agreement with more sophisticated, alignment-based phylogenetic reconstructions. The method seems encouraging, but it should be tested on more challenging datasets.

In order to avoid alignments, Kuksa and Pavlovic [129] developed alignment-free methods leading to accurate and fast identifications.

Finally, ATIM is a hybrid method between hierarchical clustering and similarity methods. Each sequence in the reference database is scored for presence-absence of all possible short sequence motifs. The query is added and similarly scored, and subjected to cladistic analysis. Methods are available in Little and Stevenson [90].
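The presence-absence scoring step described above can be sketched in Python. This is only an illustration of the idea, not the ATIM implementation of Little and Stevenson [90]: the motif length `k`, the function names and the toy sequences are all assumptions, and the subsequent cladistic analysis is not shown.

```python
from itertools import product


def all_motifs(k):
    """Return every possible DNA motif of length k, in a fixed order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def presence_absence(seq, motifs):
    """Score a sequence 1/0 for the presence/absence of each motif."""
    seq = seq.upper()
    return [1 if m in seq else 0 for m in motifs]


def score_matrix(references, query, k=3):
    """Build a presence-absence matrix for the references plus the query.

    `references` maps a (hypothetical) taxon label to its sequence.
    The query is added as an extra row under the label "query".
    """
    motifs = all_motifs(k)
    rows = {name: presence_absence(s, motifs) for name, s in references.items()}
    rows["query"] = presence_absence(query, motifs)
    return motifs, rows


# Toy data, invented for illustration only.
refs = {"taxon_A": "ACGTACGTAA", "taxon_B": "TTTTGGGGCC"}
motifs, matrix = score_matrix(refs, "ACGTACGTAC", k=3)
```

Each row of `matrix` is a binary character vector; in the approach described above, such a matrix would then be handed to a cladistic analysis.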

Key Points

  • The present work stems from a pressing need of the DNA barcoding community: the standardization of the method. We do not propose a single way to treat data, but a shared way to approach a DNA barcoding analysis, through a six-question tour towards a better knowledge of the technique. On this basis, the main points of our work are:

  • DNA barcoding implies two different approaches to discrimination (see question 1) but, in any case, it can be considered a catholic (i.e. universal) method to discriminate biological entities (see question 3).

  • In DNA barcoding, molecular entities are set against biological entities. This can generate problems, but bioinformatics tools allow us to cope with them. Some of the major criticisms of the method are due only to the unavoidable complexity of biological matters (see questions 2 and 6).

  • Bioinformatics plays a key role in supporting and consolidating DNA barcoding, being central to the choice of the right primers (see question 4), to the evaluation of sequence quality (see question 5) and, obviously, to data analysis (see question 6).


The authors are grateful to the ZooPlantLab staff, students and supporters. They are indebted to Neil Campbell for language editing and revision of the manuscript.



Source: https://academic.oup.com/bib/article/11/4/440/230525

DNA barcoding

Method of species identification using a short section of DNA

Not to be confused with the DNA barcode involved in optical mapping of DNA.

DNA barcoding is a method of species identification using a short section of DNA from a specific gene or genes. The premise of DNA barcoding is that, by comparison with a reference library of such DNA sections (also called "sequences"), an individual sequence can be used to uniquely identify an organism to species, in the same way that a supermarket scanner uses the familiar black stripes of the UPC barcode to identify an item in its stock against its reference database.[1] These "barcodes" are sometimes used in an effort to identify unknown species, parts of an organism, or simply to catalog as many taxa as possible, or to compare with traditional taxonomy in an effort to determine species boundaries.

Different gene regions are used to identify the different organismal groups using barcoding. The most commonly used barcode region for animals and some protists is a portion of the cytochrome c oxidase I (COI or COX1) gene, found in mitochondrial DNA. Other genes suitable for DNA barcoding are the internal transcribed spacer (ITS) rRNA, often used for fungi, and RuBisCO, used for plants.[2][3] Microorganisms are detected using different gene regions. The 16S rRNA gene, for example, is widely used in the identification of prokaryotes, whereas the 18S rRNA gene is mostly used for detecting microbial eukaryotes. These gene regions are chosen because they have less intraspecific (within species) variation than interspecific (between species) variation, which is known as the "Barcoding Gap".[4]

Some applications of DNA barcoding include: identifying plant leaves even when flowers or fruits are not available; identifying pollen collected on the bodies of pollinating animals; identifying insect larvae which may have fewer diagnostic characters than adults; or investigating the diet of an animal based on its stomach content, saliva or feces.[5] When barcoding is used to identify organisms from a sample containing DNA from more than one organism, the term DNA metabarcoding is used,[6][7] e.g. DNA metabarcoding of diatom communities in rivers and streams, which is used to assess water quality.[8]


DNA barcoding techniques were developed from early DNA sequencing work on microbial communities using the 5S rRNA gene.[9] In 2003, specific methods and terminology of modern DNA barcoding were proposed as a standardized method for identifying species, as well as potentially allocating unknown sequences to higher taxa such as orders and phyla, in a paper by Paul D.N. Hebert et al. from the University of Guelph, Ontario, Canada.[10] Hebert and his colleagues demonstrated the utility of the cytochrome c oxidase I (COI) gene, first utilized by Folmer et al. in 1994, using their published DNA primers as a tool for phylogenetic analyses at the species level,[10] as a suitable tool for discriminating between metazoan invertebrates.[11] The "Folmer region" of the COI gene is commonly used to distinguish between taxa based on its patterns of variation at the DNA level. The relative ease of retrieving the sequence, and its mix of variability and conservation between species, are some of the benefits of COI. Calling the profiles "barcodes", Hebert et al. envisaged the development of a COI database that could serve as the basis for a "global bioidentification system".


Sampling and preservation

Barcoding can be done from tissue from a target specimen, from a mixture of organisms (bulk sample), or from DNA present in environmental samples (e.g. water or soil). The methods for sampling, preservation or analysis differ between those different types of sample.

Tissue samples

To barcode a tissue sample from the target specimen, a small piece of skin, a scale, a leg or antenna is likely to be sufficient (depending on the size of the specimen). To avoid contamination, it is necessary to sterilize used tools between samples. It is recommended to collect two samples from one specimen, one to archive, and one for the barcoding process. Sample preservation is crucial to overcome the issue of DNA degradation.

Bulk samples

A bulk sample is a type of environmental sample containing several organisms from the taxonomic group under study. The difference between bulk samples (in the sense used here) and other environmental samples is that the bulk sample usually provides a large quantity of good-quality DNA.[12] Examples of bulk samples include aquatic macroinvertebrate samples collected by kick-net, or insect samples collected with a Malaise trap. Filtered or size-fractionated water samples containing whole organisms like unicellular eukaryotes are also sometimes defined as bulk samples. Such samples can be collected by the same techniques used to obtain traditional samples for morphology-based identification.

eDNA samples

The environmental DNA (eDNA) method is a non-invasive approach to detect and identify species from cellular debris or extracellular DNA present in environmental samples (e.g. water or soil) through barcoding or metabarcoding. The approach is based on the fact that every living organism leaves DNA in the environment, and this environmental DNA can be detected even for organisms that are at very low abundance. Thus, for field sampling, the most crucial part is to use DNA-free material and tools on each sampling site or sample to avoid contamination, if the DNA of the target organism(s) is likely to be present in low quantities. On the other hand, an eDNA sample always includes the DNA of whole-cell, living microorganisms, which are often present in large quantities. Therefore, microorganism samples taken in the natural environment also are called eDNA samples, but contamination is less problematic in this context due to the large quantity of target organisms. The eDNA method is applied on most sample types, like water, sediment, soil, animal feces, stomach content or blood from e.g. leeches.[13]

DNA extraction, amplification and sequencing

DNA barcoding requires that DNA in the sample is extracted. Several different DNA extraction methods exist, and factors like cost, time, sample type and yield affect the selection of the optimal method.

When DNA from organismal or eDNA samples is amplified using polymerase chain reaction (PCR), the reaction can be affected negatively by inhibitor molecules contained in the sample.[14] Removal of these inhibitors is crucial to ensure that high-quality DNA is available for subsequent analysis.

Amplification of the extracted DNA is a required step in DNA barcoding. Typically, only a small fragment of the total DNA material (400–800 base pairs)[15] is sequenced to obtain the DNA barcode. Amplification of eDNA material usually targets smaller fragments (<200 base pairs), as eDNA is more likely to be fragmented than DNA material from other sources. However, some studies argue that there is no relationship between amplicon size and detection rate of eDNA.[16][17]
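As a rough illustration of how a primer pair defines an amplicon and how its length can be checked against the size ranges mentioned above, the following Python sketch performs a naive "in-silico PCR". It assumes exact primer matches on an unambiguous A/C/G/T template, which real PCR does not require; the function names, template and primers are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def in_silico_amplicon(template, fwd_primer, rev_primer):
    """Return the region between two exactly matching primers, or None.

    The forward primer is searched on the given strand; the reverse primer
    binds the opposite strand, so its reverse complement is searched
    downstream of the forward primer. Primer sequences are excluded.
    """
    template = template.upper()
    start = template.find(fwd_primer.upper())
    if start == -1:
        return None
    inner_start = start + len(fwd_primer)
    end = template.find(reverse_complement(rev_primer), inner_start)
    if end == -1:
        return None
    return template[inner_start:end]


def fits_size_range(amplicon, min_len, max_len):
    """Check that an amplicon length falls inside a target size range."""
    return amplicon is not None and min_len <= len(amplicon) <= max_len


# Toy template and primers, invented for illustration only.
template = "GGGAAACCC" + "ATATATATAT" + "TTTGGGCCC"
amplicon = in_silico_amplicon(template, "GGGAAACCC", "GGGCCCAAA")
```

With real data, `fits_size_range(amplicon, 400, 800)` would test for a standard barcode length, and an upper bound of 200 would reflect the shorter fragments usually targeted for eDNA.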

HiSeq sequencers at SciLifeLab in Uppsala, Sweden. The photo was taken during the excursion of SLU course PNS0169 in March 2019.

When the DNA barcode marker region has been amplified, the next step is to sequence the marker region using DNA sequencing methods.[18] Many different sequencing platforms are available, and technical development is proceeding rapidly.

Marker selection

A schematic view of primers and target region, demonstrated on 16S rRNA gene in Pseudomonas. As primers, one typically selects short conserved sequences with low variability, which can thus amplify most or all species in the chosen target group. The primers are used to amplify a highly variable target region in between the two primers, which is then used for species discrimination. Modified from »Variable Copy Number, Intra-Genomic Heterogeneities and Lateral Transfers of the 16S rRNA Gene in Pseudomonas« by Bodilis, Josselin; Nsigue-Meilo, Sandrine; Besaury, Ludovic; Quillet, Laurent, used under CC BY, available from: https://www.researchgate.net/figure/Hypervariable-regions-within-the-16S-rRNA-gene-in-Pseudomonas-The-plotted-line-reflects_fig2_224832532.

Markers used for DNA barcoding are called barcodes. In order to successfully characterize species based on DNA barcodes, selection of informative DNA regions is crucial. A good DNA barcode should have low intra-specific and high inter-specific variability[10] and possess conserved flanking sites for developing universal PCR primers for wide taxonomic application. The goal is to design primers that will detect and distinguish most or all the species in the studied group of organisms (high taxonomic resolution). The length of the barcode sequence should be short enough to be used with current sampling source, DNA extraction, amplification and sequencing methods.[19]
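The criterion of low intra-specific and high inter-specific variability can be made concrete with a small Python sketch that compares the largest within-species distance with the smallest between-species distance. It is a minimal illustration under stated assumptions: sequences are already aligned to equal length, the uncorrected p-distance is used, and the species labels and sequences are invented.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcoding_gap(records):
    """Return (max intraspecific distance, min interspecific distance).

    `records` is a list of (species_label, aligned_sequence) pairs.
    A gap exists when the second value exceeds the first.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    if not intra or not inter:
        raise ValueError("need at least one within- and one between-species pair")
    return max(intra), min(inter)


# Toy aligned sequences, invented for illustration only.
records = [
    ("sp_A", "ACGTACGTAC"),
    ("sp_A", "ACGTACGTAT"),
    ("sp_B", "ACGTTTTTAC"),
    ("sp_B", "ACGTTTTTAA"),
]
max_intra, min_inter = barcoding_gap(records)
```

A candidate marker for which `min_inter` is clearly larger than `max_intra` across the target group would satisfy the criterion; overlapping values suggest the marker cannot separate those species reliably.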

Ideally, one gene sequence would be used for all taxonomic groups, from viruses to plants and animals. However, no such gene region has been found yet, so different barcodes are used for different groups of organisms,[20] or depending on the study question.

For animals, the most widely used barcode is the mitochondrial cytochrome c oxidase I (COI) locus.[21] Other mitochondrial genes, such as Cytb, 12S or 16S, are also used. Mitochondrial genes are preferred over nuclear genes because of their lack of introns, their haploid mode of inheritance and their limited recombination.[21][22] Moreover, each cell has many mitochondria (up to several thousand) and each of them contains several circular DNA molecules. Mitochondria can therefore offer an abundant source of DNA even when sample tissue is limited.[20]

In plants, however, mitochondrial genes are not appropriate for DNA barcoding because they exhibit low mutation rates.[23] A few candidate genes have been found in the chloroplast genome, the most promising being maturase K gene (matK) by itself or in association with other genes. Multi-locus markers such as ribosomal internal transcribed spacers (ITS DNA) along with matK, rbcL, trnH or other genes have also been used for species identification.[20] The best discrimination between plant species has been achieved when using two or more chloroplast barcodes.[24]

For bacteria, the small subunit of ribosomal RNA (16S) gene can be used for different taxa, as it is highly conserved.[25] Some studies suggest COI,[26] type II chaperonin (cpn60)[27] or β subunit of RNA polymerase (rpoB)[28] also could serve as bacterial DNA barcodes.

Barcoding fungi is more challenging, and more than one primer combination might be required.[29] The COI marker performs well in certain fungal groups,[30] but not equally well in others.[31] Therefore, additional markers are being used, such as ITS rDNA and the large subunit of nuclear ribosomal RNA (28S LSU rRNA).[32]

Within the group of protists, various barcodes have been proposed, such as the D1–D2 or D2–D3 regions of 28S rDNA, the V4 subregion of the 18S rRNA gene, ITS rDNA and COI. Additionally, some specific barcodes can be used for photosynthetic protists, for example the large subunit of the ribulose-1,5-bisphosphate carboxylase-oxygenase gene (rbcL) and the chloroplastic 23S rRNA gene.[20]

Organism group | Marker gene/locus
Animals | COI,[33] Cytb,[34] 12S,[35] 16S[36]
Plants | matK,[37] rbcL,[38] psbA-trnH,[39] ITS[40]
Bacteria | COI,[26] rpoB,[28] 16S,[41] cpn60,[27] tuf,[42] RIF,[43] gnd[44]
Fungi | ITS,[2][45] TEF1α,[46][47] RPB1, RPB2, 18S,[32] 28S[48]
Protists | ITS,[49] COI,[50] rbcL,[51] 18S,[52] 28S,[51] 23S[20]

Reference libraries and bioinformatics

Reference libraries are used for the taxonomic identification, also called annotation, of sequences obtained from barcoding or metabarcoding. These databases contain the DNA barcodes assigned to previously identified taxa. Most reference libraries do not cover all species within an organism group, and new entries are continually created. In the case of macro- and many microorganisms (such as algae), these reference libraries require detailed documentation (sampling location and date, person who collected it, image, etc.) and authoritative taxonomic identification of the voucher specimen, as well as submission of sequences in a particular format. However, such standards are fulfilled for only a small number of species. The process also requires the storage of voucher specimens in museum collections, herbaria and other collaborating institutions. Both taxonomically comprehensive coverage and content quality are important for identification accuracy.[53] In the microbial world, there is no DNA information for most species names, and many DNA sequences cannot be assigned to any Linnaean binomial.[54] Several reference databases exist depending on the organism group and the genetic marker used. There are smaller, national databases (e.g. FinBOL), and large consortia like the International Barcode of Life Project (iBOL).[55]


Launched in 2007, the Barcode of Life Data System (BOLD)[56] is one of the biggest databases, containing more than 450 000 BINs (Barcode Index Numbers) in 2019. It is a freely accessible repository for the specimen and sequence records for barcode studies, and it is also a workbench aiding the management, quality assurance and analysis of barcode data. The database mainly contains BIN records for animals based on the COI genetic marker.


The UNITE database[57] was launched in 2003 and is a reference database for the molecular identification of fungal species with the internal transcribed spacer (ITS) genetic marker region. This database is based on the concept of species hypotheses: the user chooses the percentage threshold at which to work, and the sequences are sorted in comparison to sequences obtained from voucher specimens identified by experts.


The Diat.barcode[58] database was first published under the name R-syst::diatom[59] in 2016, starting with data from two sources: the Thonon culture collection (TCC) in the hydrobiological station of the French National Institute for Agricultural Research (INRA), and the NCBI (National Center for Biotechnology Information) nucleotide database. Diat.barcode provides data for two genetic markers, rbcL (Ribulose-1,5-bisphosphate carboxylase/oxygenase) and 18S (18S ribosomal RNA). The database also includes additional trait information on species, such as morphological characteristics (biovolume, size dimensions, etc.), life-forms (mobility, colony-type, etc.) or ecological features (pollution sensitivity, etc.).

Bioinformatic analysis

In order to obtain well-structured, clean and interpretable data, raw sequencing data must be processed using bioinformatic analysis. The FASTQ file with the sequencing data contains two types of information: the sequences detected in the sample (which on their own would form a FASTA file) and the quality scores (PHRED scores) associated with each nucleotide of each DNA sequence. The PHRED score indicates the probability that the associated nucleotide has been called correctly.

PHRED score | Probability that the base call is correct
10 | 90%
20 | 99%
30 | 99.9%
40 | 99.99%
50 | 99.999%
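The values in the table follow from the standard definition of the PHRED score, Q = -10 log10(P), where P is the probability that the base call is wrong. A minimal Python sketch of this relationship is given here; the function names are illustrative, and the quality-string decoder assumes the common "Phred+33" ASCII encoding, which the text above does not specify.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q):
    """Base-call accuracy (1 - error probability) for a Phred score."""
    return 1 - phred_to_error_prob(q)


def error_prob_to_phred(p):
    """Inverse mapping: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    An ASCII offset of 33 ("Phred+33") is an assumption of this sketch.
    """
    return [ord(ch) - offset for ch in qual]


# Print the accuracy for the scores listed in the table.
for q in (10, 20, 30, 40, 50):
    print(q, f"{phred_to_accuracy(q):.5%}")
```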

In general, the PHRED score decreases towards the end of each DNA sequence. Thus some bioinformatics pipelines simply cut the end of the sequences at a defined threshold.
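The simplest form of such threshold-based end trimming can be sketched in Python as follows. This is a deliberately naive rule (cut at the first base below the threshold); the function name, the default threshold of 20 and the toy read are assumptions, and actual pipelines often apply more robust criteria.

```python
def trim_3prime(seq, quals, threshold=20):
    """Cut a read at the first base whose Phred score is below `threshold`.

    `quals` is a list of integer Phred scores, one per base. Everything
    from the first low-quality base to the end of the read is discarded.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    for i, q in enumerate(quals):
        if q < threshold:
            return seq[:i], quals[:i]
    return seq, quals


# Toy read with quality declining towards the 3' end (invented values).
seq = "ACGTACGTAC"
quals = [38, 38, 37, 36, 35, 30, 25, 18, 12, 8]
trimmed_seq, trimmed_quals = trim_3prime(seq, quals, threshold=20)
```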

Some sequencing technologies, like MiSeq, use paired-end sequencing, in which sequencing is performed from both directions, producing better quality. The overlapping sequences are then aligned into contigs and merged. Usually, several samples are pooled in one run, and each sample is characterized by a short DNA fragment, the tag. In a demultiplexing step, sequences are sorted using these tags to reassemble the separate samples. Before further analysis, tags and other adapters are removed from the barcoding sequence DNA fragment. During trimming, poor-quality sequences (low PHRED scores), or sequences that are much shorter or longer than the targeted DNA barcode, are removed. The following dereplication step is the process in which all of the quality-filtered sequences are collapsed into a set of unique reads (individual sequence units, ISUs) together with information on their abundance in the samples. After that, chimeras (i.e. compound sequences formed from pieces of mixed origin) are detected and removed. Finally, the sequences are clustered into OTUs (Operational Taxonomic Units), using one of many clustering strategies. The most frequently used bioinformatic software includes Mothur,[60] Uparse,[61] Qiime,[62] Galaxy,[63] Obitools,[64] JAMP,[65] Barque,[66] and DADA2.[67]
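Three of the steps described above (demultiplexing, dereplication and OTU clustering) can be illustrated with a toy Python sketch. It does not reproduce the algorithms of any of the software packages listed: it assumes exact tag matches at the start of each read, compares only equal-length sequences position by position, uses a simple greedy centroid rule, and omits merging, trimming and chimera removal. All names, tags and reads are invented for illustration.

```python
from collections import Counter, defaultdict


def demultiplex(reads, tags):
    """Sort reads into samples by their leading tag and strip the tag.

    `tags` maps a tag sequence to a sample name; reads whose start matches
    no tag are dropped. Exact tag matching is assumed.
    """
    samples = defaultdict(list)
    for read in reads:
        for tag, sample in tags.items():
            if read.startswith(tag):
                samples[sample].append(read[len(tag):])
                break
    return dict(samples)


def dereplicate(seqs):
    """Collapse identical sequences into unique reads with abundances."""
    return Counter(seqs)


def identity(a, b):
    """Fraction of matching positions; unequal lengths count as no match."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(unique_reads, threshold=0.97):
    """Greedy centroid clustering of dereplicated reads into OTUs.

    Reads are visited from most to least abundant; each joins the first
    centroid it matches at >= threshold identity, else founds a new OTU.
    Returns a dict mapping centroid sequence to total read count.
    """
    otus = {}
    for seq, count in unique_reads.most_common():
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += count
                break
        else:
            otus[seq] = count
    return otus


# Toy run with two tagged samples (all sequences invented).
tags = {"AA": "sample_1", "CC": "sample_2"}
reads = (
    ["AA" + "ACGTACGTAC"] * 3
    + ["AA" + "ACGTACGTAT"]
    + ["AA" + "TTTTTTTTTT"] * 2
    + ["CC" + "ACGTACGTAC"]
)
samples = demultiplex(reads, tags)
unique = dereplicate(samples["sample_1"])
# A 90% threshold is used only because the toy reads are 10 bases long.
otus = greedy_cluster(unique, threshold=0.9)
```

Visiting reads in order of decreasing abundance is a design choice of this sketch: it makes the most common sequence of each cluster act as its centroid.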

Comparing the abundance of reads, i.e. sequences, between different samples is still a challenge because both the total number of reads in a sample as well as the relative amount of reads for a species can vary between samples, methods, or other variables. For comparison, one may then reduce the number of reads of each sample to the minimal number of reads of the samples to be compared – a process called rarefaction. Another way is to use the relative abundance of reads.[68]
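Both normalization options mentioned above can be sketched in a few lines of Python. The sketch assumes read counts per OTU are already available as dictionaries; the function names, the fixed random seed and the toy counts are illustrative choices, not part of any particular pipeline.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample a sample's reads without replacement to a fixed depth.

    `counts` maps a taxon/OTU label to its read count.
    """
    pool = [label for label, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))


def relative_abundance(counts):
    """Convert read counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy read counts, invented for illustration only.
sample_a = {"otu1": 500, "otu2": 300, "otu3": 200}
sample_b = {"otu1": 40, "otu2": 50, "otu3": 10}

# Rarefy both samples to the size of the smallest one.
depth = min(sum(sample_a.values()), sum(sample_b.values()))
rarefied_a = rarefy(sample_a, depth)
rarefied_b = rarefy(sample_b, depth)
```

Rarefaction discards reads from the larger sample, whereas `relative_abundance` keeps all reads but loses the information on sequencing depth.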

Species identification and taxonomic assignment

The taxonomic assignment of the OTUs to species is achieved by matching of sequences to reference libraries. The Basic Local Alignment Search Tool (BLAST) is commonly used to identify regions of similarity between sequences by comparing sequence reads from the sample to sequences in reference databases.[69] If the reference database contains sequences of the relevant species, then the sample sequences can be identified to species level. If a sequence cannot be matched to an existing reference library entry, DNA barcoding can be used to create a new entry.
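The logic of matching a query against a reference library, including the case where no adequate match exists, can be illustrated with a toy Python sketch. It is not BLAST and does not use its algorithm or scoring: similarity is approximated here by the fraction of the query's k-mers found in each reference sequence, and the k-mer length, the 0.9 cut-off, the labels and the sequences are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_taxon(query, reference_library, k=4, min_score=0.9):
    """Return (best_label, score), or (None, score) if below `min_score`.

    The score is the fraction of the query's k-mers that also occur in
    the reference sequence. `reference_library` maps a label to a sequence.
    """
    q = kmers(query, k)
    if not q:
        raise ValueError("query is shorter than k")
    best_label, best_score = None, 0.0
    for label, ref in reference_library.items():
        score = len(q & kmers(ref, k)) / len(q)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_score:
        return None, best_score
    return best_label, best_score


# Toy reference library with hypothetical labels and sequences.
library = {
    "species_1": "ACGTACGTACGGTTAACC",
    "species_2": "TTGACCATGGCATTGACA",
}
hit = assign_taxon("ACGTACGTACGG", library)
no_hit = assign_taxon("GGGGGGGGGG", library)
```

A result of `(None, score)` corresponds to the situation described above in which a sequence has no matching entry in the reference library.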

In some cases, due to the incompleteness of reference databases, identification can only be achieved at higher taxonomic levels, such as assignment to a family or class. In some organism groups such as bacteria, taxonomic assignment to species level is often not possible. In such cases, a sample may be assigned to a particular operational taxonomic unit (OTU).
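One common way to express an assignment at a higher taxonomic level is to report the most specific rank on which all of the best-matching reference entries agree. The Python sketch below illustrates that idea only; the rank list, the function name and the placeholder lineages are assumptions, and the text above does not prescribe this particular rule.

```python
RANKS = ["class", "order", "family", "genus", "species"]


def lowest_common_rank(hit_lineages):
    """Return (rank, name) of the most specific rank shared by all hits.

    Each lineage is a dict mapping every rank in RANKS to a name.
    Returns (None, None) if the hits disagree even at the highest rank.
    """
    if not hit_lineages:
        raise ValueError("at least one lineage is required")
    result = (None, None)
    for rank in RANKS:
        names = {lineage[rank] for lineage in hit_lineages}
        if len(names) != 1:
            break
        result = (rank, names.pop())
    return result


def lineage(cls, order, family, genus, species):
    """Helper building a lineage dict from names given in rank order."""
    return dict(zip(RANKS, (cls, order, family, genus, species)))


# Placeholder lineages, invented for illustration only.
hit_1 = lineage("Class_A", "Order_A", "Family_A", "Genus_A1", "Genus_A1 sp. 1")
hit_2 = lineage("Class_A", "Order_A", "Family_A", "Genus_A1", "Genus_A1 sp. 2")
hit_3 = lineage("Class_A", "Order_A", "Family_A", "Genus_A2", "Genus_A2 sp. 1")

genus_level = lowest_common_rank([hit_1, hit_2])
family_level = lowest_common_rank([hit_1, hit_2, hit_3])
```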


Applications of DNA barcoding include identification of new species, safety assessment of food, identification and assessment of cryptic species, detection of alien species, identification of endangered and threatened species,[70] linking egg and larval stages to adult species, securing intellectual property rights for bioresources, framing global management plans for conservation strategies and elucidating feeding niches.[71] DNA barcode markers can be applied to address basic questions in systematics, ecology, evolutionary biology and conservation, including community assembly, species interaction networks, taxonomic discovery, and assessing priority areas for environmental protection.

Identification of species

Specific short DNA sequences or markers from a standardized region of the genome can provide a DNA barcode for identifying species.[72] Molecular methods are especially useful when traditional methods are not applicable. DNA barcoding has great applicability in the identification of larvae, for which there are generally few diagnostic characters available, and in the association of different life stages (e.g. larval and adult) in many animals.[73] Identification of species listed in the appendices of the Convention on International Trade in Endangered Species (CITES) using barcoding techniques is used in monitoring illegal trade.[74]

Detection of invasive species

Alien species can be detected via barcoding.[75][76] Barcoding can be suitable for detection of species in e.g. border control, where rapid and accurate morphological identification is often not possible due to similarities between different species, lack of sufficient diagnostic characteristics[75] and/or lack of taxonomic expertise. Barcoding and metabarcoding can also be used to screen ecosystems for invasive species, and to distinguish between an invasive species and native, morphologically similar, species.[77]

Delimiting cryptic species

DNA barcoding enables the identification and recognition of cryptic species.[78] The results of DNA barcoding analyses depend, however, on the choice of analytical methods, so the process of delimiting cryptic species using DNA barcodes can be as subjective as any other form of taxonomy. Hebert et al. (2004) concluded that the butterfly Astraptes fulgerator in north-western Costa Rica actually consists of 10 different species.[79] These results, however, were subsequently challenged by Brower (2006), who pointed out numerous serious flaws in the analysis, and concluded that the original data could support no more than the possibility of three to seven cryptic taxa rather than ten cryptic species.[80] Smith et al. (2007) used cytochrome c oxidase I DNA barcodes for species identification of the 20 morphospecies of Belvosia parasitoid flies (Diptera: Tachinidae) reared from caterpillars (Lepidoptera) in Area de Conservación Guanacaste (ACG), north-western Costa Rica. These authors discovered that barcoding raises the species count to 32, by revealing that each of the three parasitoid species previously considered generalists is actually an array of highly host-specific cryptic species.[81] For 15 morphospecies of polychaetes within the deep Antarctic benthos studied through DNA barcoding, cryptic diversity was found in 50% of the cases. Furthermore, 10 previously overlooked morphospecies were detected, increasing the total species richness in the sample by 233%.[82]

Barcoding is a tool to vouch for food quality. Here, DNA from traditional Norwegian Christmas food is extracted at the molecular systematic lab at NTNU University Museum.

Diet analysis and food web application

DNA barcoding and metabarcoding can be useful in diet analysis studies,[83] and are typically used if prey specimens cannot be identified based on morphological characters.[84][85] There is a range of sampling approaches in diet analysis: DNA metabarcoding can be conducted on stomach contents,[86] feces,[85][87] saliva[88] or whole bodies.[70][89] In fecal samples or highly digested stomach contents, it is often not possible to distinguish tissue from single species, and therefore metabarcoding can be applied instead.[85][90] Feces and saliva represent non-invasive sampling approaches, while whole-body analysis often means that the individual needs to be killed first. For smaller organisms, the stomach content is then often analyzed by sequencing the entire animal.

Barcoding for food safety

DNA barcoding represents an essential tool to evaluate the quality of food products. The purpose is to guarantee food traceability, to minimize food piracy, and to valuate local and typical agro-food production. Another purpose is to safeguard public health; for example, metabarcoding offers the possibility to identify groupers causing Ciguatera fish poisoning from meal remnants,[91] or to separate poisonous mushrooms from edible ones (Ref).

Biomonitoring and ecological assessment

DNA barcoding can be used to assess the presence of endangered species for conservation efforts (Ref), or the presence of indicator species reflective of specific ecological conditions (Ref), for example excess nutrients or low oxygen levels.

Potentials and shortcomings


Traditional bioassessment methods are well established internationally and serve biomonitoring well, for example in aquatic bioassessment within the EU Directives WFD and MSFD. However, DNA barcoding could improve on traditional methods for the following reasons: DNA barcoding (i) can increase taxonomic resolution and harmonize the identification of taxa which are difficult to identify or lack experts, (ii) can more accurately/precisely relate environmental factors to specific taxa, (iii) can increase comparability among regions, (iv) allows for the inclusion of early life stages and fragmented specimens, (v) allows delimitation of cryptic/rare species, (vi) allows for the development of new indices, e.g. based on rare/cryptic species which may be sensitive/tolerant to stressors, (vii) increases the number of samples which can be processed and reduces processing time, resulting in increased knowledge of species ecology, and (viii) is a non-invasive way of monitoring when using eDNA methods.[92]

Time and cost

DNA barcoding is faster than traditional morphological methods all the way from training through to taxonomic assignment. It takes less time to gain expertise in DNA methods than becoming an expert in taxonomy. In addition, the DNA barcoding workflow (i.e. from sample to result) is generally quicker than traditional morphological workflow and allows the processing of more samples.

Taxonomic resolution

DNA barcoding allows the resolution of taxa from higher (e.g. family) to lower (e.g. species) taxonomic levels that are otherwise too difficult to identify using traditional morphological methods, such as identification via microscopy. For example, Chironomidae (the non-biting midges) are widely distributed in both terrestrial and freshwater ecosystems. Their richness and abundance make them important for ecological processes and networks, and they are one of many invertebrate groups used in biomonitoring. Invertebrate samples can contain as many as 100 species of chironomids, which often make up as much as 50% of a sample. Despite this, they are usually not identified below the family level because of the taxonomic expertise and time required.[93] This may result in different chironomid species with different ecological preferences being grouped together, resulting in an inaccurate assessment of water quality.

DNA barcoding provides the opportunity to resolve taxa, and to directly relate stressor effects to specific taxa such as individual chironomid species. For example, Beermann et al. (2018) DNA barcoded Chironomidae to investigate their response to multiple stressors: reduced flow, increased fine sediment and increased salinity.[94] After barcoding, it was found that the chironomid sample consisted of 183 Operational Taxonomic Units (OTUs), i.e. barcodes (sequences) that are often equivalent to morphological species. These 183 OTUs displayed 15 response types, rather than the two response types previously reported[95] when all chironomids were grouped together in the same multiple-stressor study. A similar trend was found in a study by Macher et al. (2016), which discovered cryptic diversity within the New Zealand mayfly species Deleatidium sp. This study found different response patterns to stressors among 12 molecularly distinct OTUs, which may change the consensus that this mayfly is sensitive to pollution.[96]


Despite the advantages offered by DNA barcoding, it has also been suggested that DNA barcoding is best used as a complement to traditional morphological methods.[92] This recommendation is based on multiple perceived challenges.

Physical parameters

It is not completely straightforward to connect DNA barcodes with ecological preferences of the barcoded taxon in question, as is needed if barcoding is to be used for biomonitoring. For example, detecting target DNA in aquatic systems depends on the concentration of DNA molecules at a site, which in turn can be affected by many factors. The presence of DNA molecules also depends on dispersion at a site, e.g. direction or strength of currents. It is not really known how DNA moves around in streams and lakes, which makes sampling difficult. Another factor might be the behavior of the target species, e.g. fish can have seasonal changes of movements, crayfish or mussels will release DNA in larger amounts just at certain times of their life (moulting, spawning). For DNA in soil, even less is known about distribution, quantity or quality.

The major limitation of the barcoding method is that it relies on barcode reference libraries for the taxonomic identification of the sequences. The taxonomic identification is accurate only if a reliable reference is available. However, most databases are still incomplete, especially for smaller organisms, e.g. fungi, phytoplankton, nematodes etc. In addition, current databases contain misidentifications, spelling mistakes and other errors. A massive curation and completion effort is necessary for the databases of all organisms, involving large barcoding projects (for example the iBOL project for the Barcode of Life Data Systems (BOLD) reference database).[97][98] However, completion and curation are difficult and time-consuming. Without vouchered specimens, there can be no certainty about whether the sequence used as a reference is correct.

DNA sequence databases such as GenBank contain many sequences that are not tied to vouchered specimens (for example, herbarium specimens, cultured cell lines, or sometimes images). This is problematic when taxonomic questions arise, such as whether several species should be split or combined, or whether past identifications were sound. Reusing sequences from initially misidentified organisms that are not tied to vouchered specimens may support incorrect conclusions and must be avoided.[99] Best practice for DNA barcoding is therefore to sequence vouchered specimens.[100][101] For many taxa, however, reference specimens can be difficult to obtain, for example when specimens are hard to catch, available specimens are poorly preserved, or adequate taxonomic expertise is lacking.[99]

Importantly, DNA barcodes can also be used to create interim taxonomy, in which case OTUs can be used as substitutes for traditional Latin binomials – thus significantly reducing dependency on fully populated reference databases.[102]

Technological bias

DNA barcoding also carries methodological bias at every stage, from sampling to bioinformatic data analysis. Besides the risk of contamination of the DNA sample by PCR inhibitors, primer bias is one of the major sources of error in DNA barcoding.[103][104] Isolating an efficient DNA marker and designing primers is a complex process, and considerable effort has gone into developing primers for DNA barcoding in different taxonomic groups.[105] However, primers often bind preferentially to some sequences, leading to differential primer efficiency and specificity, unrepresentative community assessments, and richness inflation.[106] Thus, the sequence composition of a sample's communities is altered mainly at the PCR step. In addition, PCR replication is often required but leads to an exponential increase in the risk of contamination. Several studies have highlighted the possibility of using mitochondria-enriched samples [107][108] or PCR-free approaches to avoid these biases, but to date DNA metabarcoding is still based on the sequencing of amplicons.[105] Other biases arise during sequencing and during the bioinformatic processing of the sequences, such as the creation of chimeras.
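Preferential primer binding can be made concrete with a small Python sketch that counts mismatches between a degenerate primer and the binding sites of different taxa. The primer, the binding-site sequences, and the function name `primer_mismatches` are hypothetical; the mismatch count is only a crude proxy for amplification efficiency, since in practice the position of a mismatch (especially near the 3' end) also matters.

```python
# Standard IUPAC nucleotide codes: each primer symbol maps to the bases it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer: str, site: str) -> int:
    """Count positions where the binding site is not covered by the primer symbol."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have the same length")
    return sum(1 for p, s in zip(primer, site) if s not in IUPAC[p])


# Hypothetical degenerate primer and binding sites from three taxa.
PRIMER = "GGWACWGG"
BINDING_SITES = {
    "taxon_1": "GGAACTGG",
    "taxon_2": "GGTACAGG",
    "taxon_3": "GGCACGGG",
}

for taxon, site in BINDING_SITES.items():
    print(taxon, primer_mismatches(PRIMER, site))
```

Under these assumptions the first two taxa match the primer perfectly while the third carries two mismatches, so it would be expected to amplify less efficiently and to be under-represented among the amplicons.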

Lack of standardization

Even as DNA barcoding becomes more widely used, there is no agreement on methods for DNA preservation or extraction, the choice of DNA markers and primer sets, or PCR protocols. The parameters of bioinformatics pipelines (for example, OTU clustering, taxonomic assignment algorithms, or thresholds) are the subject of much debate among DNA barcoding users.[105] Sequencing technologies are also evolving rapidly, together with the tools for analyzing the massive amounts of DNA data generated, and standardization of methods is urgently needed to enable collaboration and data sharing at larger spatial and temporal scales. Standardization of barcoding methods at the European scale is one of the objectives of the European COST Action DNAqua-net [109] and is also addressed by CEN (the European Committee for Standardization).[110]
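Why clustering thresholds attract debate can be seen in a short Python sketch. It uses a simple greedy centroid clustering, chosen here for illustration and not taken from any particular pipeline; the reads, the function names, and the threshold values are hypothetical, and reads are assumed to be pre-aligned and of equal length.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold):
    """Assign each read to the first centroid it matches at >= threshold;
    a read that matches no centroid becomes the centroid of a new OTU."""
    centroids = []
    clusters = []
    for read in reads:
        for i, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                clusters[i].append(read)
                break
        else:  # no centroid was similar enough
            centroids.append(read)
            clusters.append([read])
    return clusters


# Hypothetical reads: three close variants and one clearly different sequence.
READS = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "ACGTACGTACGTACGTAAAA",
    "TTTTTTTTTTGGGGGGGGGG",
]

for threshold in (0.97, 0.90, 0.80):
    print(threshold, len(greedy_otus(READS, threshold)))
```

The same four reads yield a different number of OTUs at each threshold, so two studies using different settings can report different richness for identical data. This is one reason shared parameter conventions are needed for comparable results.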

Another criticism of DNA barcoding is its limited ability to discriminate accurately below the species level (for example, to distinguish between varieties) and to detect hybrids, and that it can be affected by evolutionary rates.

Mismatches between conventional (morphological) and barcode-based identification

Taxa lists derived from conventional (morphological) identification are not, and may never be, directly comparable to taxa lists derived from barcode-based identification, for several reasons. The most important is probably the incompleteness and inaccuracy of molecular reference databases, which prevents correct taxonomic assignment of eDNA sequences. Taxa absent from reference databases will not be found by eDNA, and sequences linked to a wrong name will lead to incorrect identifications.[92] Other known causes are differences in sampling scale and sample size between traditional and molecular samples; the possible analysis of dead organisms, which can occur in different ways in both methods depending on the organism group; and method-specific selectivity in identification, i.e. varying taxonomic expertise or ability to identify certain organism groups in the morphological approach, and primer bias in the molecular approach, which can likewise bias the taxa that are analysed.[92]

Estimates of richness/diversity

DNA barcoding can over- or underestimate species richness and diversity. Some studies suggest that artifacts (the identification of species not present in a community) are a major cause of inflated biodiversity.[111][112] The most problematic taxa are those represented by low numbers of sequencing reads. These reads are usually removed during data filtering, since several studies suggest that most low-frequency reads are artifacts.[113] However, genuinely rare taxa may exist among these low-abundance reads.[114] Rare sequences can reflect unique lineages in communities, which makes them informative and valuable. Thus, there is a strong need for more robust bioinformatics algorithms that can differentiate informative reads from artifacts. Complete reference libraries would also allow better testing of bioinformatics algorithms by permitting better filtering of artifacts (i.e. the removal of sequences lacking a counterpart among extant species), and would therefore make more accurate species assignment possible.[115] Cryptic diversity can also inflate biodiversity estimates, as one morphological species may actually split into many distinct molecular sequences.[92]
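The trade-off in removing low-abundance reads can be sketched in Python. The OTU names, read counts, and the `min_reads` and `min_fraction` parameters are hypothetical and chosen only for illustration; real pipelines apply their own, often more elaborate, filtering rules.

```python
def filter_otus(read_counts, min_reads=2, min_fraction=0.0):
    """Split an OTU table into kept and removed OTUs using simple abundance cut-offs.

    An OTU is kept only if it has at least `min_reads` reads and makes up
    at least `min_fraction` of all reads in the sample.
    """
    total = sum(read_counts.values())
    if total == 0:
        return {}, dict(read_counts)
    kept, removed = {}, {}
    for otu, n in read_counts.items():
        if n >= min_reads and n / total >= min_fraction:
            kept[otu] = n
        else:
            removed[otu] = n
    return kept, removed


# Hypothetical read counts per OTU in one sample.
COUNTS = {"OTU_1": 5200, "OTU_2": 310, "OTU_3": 12, "OTU_4": 1, "OTU_5": 1}

kept, removed = filter_otus(COUNTS, min_reads=2)
print("richness before filtering:", len(COUNTS))
print("richness after filtering:", len(kept))
print("removed as possible artifacts:", sorted(removed))
```

The filter cannot tell whether the removed singletons are sequencing artifacts or genuinely rare taxa. A stricter cut-off lowers the risk of inflated richness but raises the risk of discarding real lineages.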


Differences in the standard methods for DNA barcoding and metabarcoding. While DNA barcoding aims to identify a specific species, metabarcoding characterizes the whole community.


Metabarcoding is defined as the barcoding of DNA or eDNA (environmental DNA) in a way that allows the simultaneous identification of many taxa within the same (environmental) sample, though often within the same organism group. The main difference between the approaches is that metabarcoding, in contrast to barcoding, does not focus on one specific organism but instead aims to determine the species composition of a sample.


The metabarcoding procedure, like general barcoding, covers the steps of DNA extraction, PCR amplification, sequencing, and data analysis. A barcode consists of a short variable gene region that is useful for taxonomic assignment, flanked by highly conserved gene regions that can be used for primer design.[12] Different genes are used depending on whether the aim is to barcode a single species or to metabarcode several species; in the latter case, a more universal gene is used. Metabarcoding does not use single-species DNA/RNA as a starting point, but DNA/RNA from several different organisms derived from one environmental or bulk sample.
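The structure of a barcode, a variable region between conserved primer sites, can be sketched as a simple in silico extraction in Python. The primers, the template sequences, and the function names are hypothetical; exact string matching is used only for clarity, whereas real primers tolerate some mismatches.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extract_barcode(template: str, fwd_primer: str, rev_primer: str):
    """Return the region between the forward primer site and the site bound by
    the reverse primer, or None if either conserved site is missing."""
    start = template.find(fwd_primer)
    if start == -1:
        return None
    region_start = start + len(fwd_primer)
    # The reverse primer binds the opposite strand, so on the template
    # strand its site appears as the reverse complement of the primer.
    end = template.find(reverse_complement(rev_primer), region_start)
    if end == -1:
        return None
    return template[region_start:end]


# Hypothetical conserved primers and two templates sharing the same flanks.
FWD = "GGTCAAC"
REV = "AGGTTCA"
TAXON_A = "TT" + "GGTCAAC" + "ACGTTAGC" + "TGAACCT" + "AA"
TAXON_B = "TT" + "GGTCAAC" + "ACGATAGG" + "TGAACCT" + "AA"

print(extract_barcode(TAXON_A, FWD, REV))
print(extract_barcode(TAXON_B, FWD, REV))
```

Both templates are captured by the same primer pair because their flanks are identical, while the extracted regions differ and can therefore be used to tell the two taxa apart.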


Metabarcoding has the potential to complement biodiversity measures, and even replace them in some instances, especially as the technology advances and procedures gradually become cheaper, more optimized and widespread.[116][117]

DNA metabarcoding applications include:

Advantages and challenges

The general advantages and shortcomings of barcoding reviewed above also apply to metabarcoding. One particular drawback of metabarcoding studies is that there is not yet a consensus on the optimal experimental design and bioinformatics criteria to apply in eDNA metabarcoding.[118] However, joint efforts are under way, such as the EU COST network DNAqua-Net, to exchange experience and knowledge and to establish best-practice standards for biomonitoring.[92]



  1. ^"What is DNA Barcoding?". iBOL. Retrieved 2019-03-26.
  2. ^ abSchoch, Conrad L.; Seifert, Keith A.; Huhndorf, Sabine; Robert, Vincent; Spouge, John L.; Levesque, C. André; Chen, Wen; Fungal Barcoding Consortium (2012). "Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi"(PDF). Proceedings of the National Academy of Sciences. 109 (16): 6241–6246. doi:10.1073/pnas.1117018109. ISSN 0027-8424. PMC 3341068. PMID 22454494.
  3. ^CBOL Plant Working Group; Hollingsworth, P. M.; Forrest, L. L.; Spouge, J. L.; Hajibabaei, M.; Ratnasingham, S.; van der Bank, M.; Chase, M. W.; Cowan, R. S. (2009-08-04). "A DNA barcode for land plants". Proceedings of the National Academy of Sciences. 106 (31): 12794–12797. Bibcode:2009PNAS..10612794H. doi:10.1073/pnas.0905845106. ISSN 0027-8424. PMC 2722355. PMID 19666622.
  4. ^Paulay, Gustav; Meyer, Christopher P. (2005-11-29). "DNA Barcoding: Error Rates Based on Comprehensive Sampling". PLOS Biology. 3 (12): e422. doi:10.1371/journal.pbio.0030422. ISSN 1545-7885. PMC 1287506. PMID 16336051.
  5. ^Soininen, Eeva M; Valentini, Alice; Coissac, Eric; Miquel, Christian; Gielly, Ludovic; Brochmann, Christian; Brysting, Anne K; Sønstebø, Jørn H; Ims, Rolf A (2009). "Analysing diet of small herbivores: the efficiency of DNA barcoding coupled with high-throughput pyrosequencing for deciphering the composition of complex plant mixtures". Frontiers in Zoology. 6 (1): 16. doi:10.1186/1742-9994-6-16. ISSN 1742-9994. PMC 2736939. PMID 19695081.
  6. ^Creer, Simon; Deiner, Kristy; Frey, Serita; Porazinska, Dorota; Taberlet, Pierre; Thomas, W. Kelley; Potter, Caitlin; Bik, Holly M. (2016). Freckleton, Robert (ed.). "The ecologist's field guide to sequence-based identification of biodiversity"(PDF). Methods in Ecology and Evolution. 7 (9): 1008–1018. doi:10.1111/2041-210X.12574.
  7. ^"ScienceDirect". Advances in Ecological Research. 58: 63–99. January 2018. doi:10.1016/bs.aecr.2018.01.001. hdl:1822/72852.
  8. ^Vasselon, Valentin; Rimet, Frédéric; Tapolczai, Kálmán; Bouchez, Agnès (2017). "Assessing ecological status with diatoms DNA metabarcoding: Scaling-up on a WFD monitoring network (Mayotte island, France)". Ecological Indicators. 82: 1–12. doi:10.1016/j.ecolind.2017.06.024. ISSN 1470-160X.
  9. ^Woese, Carl R.; Kandler, Otto; Wheelis, Mark L. (1990). "Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya"(PDF). Proceedings of the National Academy of Sciences. 87 (12): 4576–4579. Bibcode:1990PNAS...87.4576W. doi:10.1073/pnas.87.12.4576. OCLC 678728346. PMC 54159. PMID 2112744.
  10. ^ abcHebert, Paul D. N.; Cywinska, Alina; Ball, Shelley L.; deWaard, Jeremy R. (2003-02-07). "Biological identifications through DNA barcodes". Proceedings of the Royal Society B: Biological Sciences. 270 (1512): 313–321. doi:10.1098/rspb.2002.2218. ISSN 1471-2954. PMC 1691236. PMID 12614582.
  11. ^Folmer, O.; Black, M.; Hoeh, W.; Lutz, R.; Vrijenhoek, R. (October 1994). "DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates". Molecular Marine Biology and Biotechnology. 3 (5): 294–299. ISSN 1053-6426. PMID 7881515.
  12. ^ abPierre, Taberlet (2018-02-02). Environmental DNA : for biodiversity research and monitoring. Bonin, Aurelie, 1979-. Oxford. ISBN . OCLC 1021883023.
  13. ^Jelger Herder; A. Valentini; E. Bellemain; T. Dejean; J.J.C.W. Van Delft; P.F. Thomsen; P. Taberlet (2014), Environmental DNA - a review of the possible applications for the detection of (invasive) species., RAVON, doi:10.13140/rg.2.1.4002.1208
  14. ^Schrader, C.; Schielke, A.; Ellerbroek, L.; Johne, R. (2012). "PCR inhibitors – occurrence, properties and removal". Journal of Applied Microbiology. 113 (5): 1014–1026. doi:10.1111/j.1365-2672.2012.05384.x. ISSN 1365-2672. PMID 22747964. S2CID 30892831.
  15. ^Savolainen, Vincent; Cowan, Robyn S; Vogler, Alfried P; Roderick, George K; Lane, Richard (2005-10-29). "Towards writing the encyclopaedia of life: an introduction to DNA barcoding". Philosophical Transactions of the Royal Society B: Biological Sciences. 360 (1462): 1805–1811. doi:10.1098/rstb.2005.1730. ISSN 0962-8436. PMC 1609222. PMID 16214739.
  16. ^Piggott, Maxine P. (2016). "Evaluating the effects of laboratory protocols on eDNA detection probability for an endangered freshwater fish". Ecology and Evolution. 6 (9): 2739–2750. doi:10.1002/ece3.2083. ISSN 2045-7758. PMC 4798829. PMID 27066248.
  17. ^Ma, Hongjuan; Stewart, Kathryn; Lougheed, Stephen; Zheng, Jinsong; Wang, Yuxiang; Zhao, Jianfu (2016). "Characterization, optimization, and validation of environmental DNA (eDNA) markers to detect an endangered aquatic mammal". Conservation Genetics Resources. 8 (4): 561–568. doi:10.1007/s12686-016-0597-9. ISSN 1877-7252. S2CID 1613649.
  18. ^D’Amore, Rosalinda; Ijaz, Umer Zeeshan; Schirmer, Melanie; Kenny, John G.; Gregory, Richard; Darby, Alistair C.; Shakya, Migun; Podar, Mircea; Quince, Christopher (2016-01-14). "A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling". BMC Genomics. 17 (1): 55. doi:10.1186/s12864-015-2194-9. ISSN 1471-2164. PMC 4712552. PMID 26763898.
  19. ^Kress, W. J.; Erickson, D. L. (2008-02-26). "DNA barcodes: Genes, genomics, and bioinformatics". Proceedings of the National Academy of Sciences. 105 (8): 2761–2762. Bibcode:2008PNAS..105.2761K. doi:10.1073/pnas.0800476105. ISSN 0027-8424. PMC 2268532. PMID 18287050.
  20. ^ abcdefPurty RS, Chatterjee S. "DNA Barcoding: An Effective Technique in Molecular Taxonomy". Austin Journal of Biotechnology & Bioengineering. 3 (1): 1059.
  21. ^ abHebert, Paul D.N.; Ratnasingham, Sujeevan; de Waard, Jeremy R. (2003-08-07). "Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species". Proceedings of the Royal Society B: Biological Sciences. 270 (suppl_1): S96-9. doi:10.1098/rsbl.2003.0025. ISSN 1471-2954. PMC 1698023. PMID 12952648.
  22. ^Blaxter, Mark L. (2004-04-29). Godfray, H. C. J.; Knapp, S. (eds.). "The promise of a DNA taxonomy". Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences. 359 (1444): 669–679. doi:10.1098/rstb.2003.1447. ISSN 1471-2970. PMC 1693355. PMID 15253352.
  23. ^Fazekas, Aron J.; Burgess, Kevin S.; Kesanakurti, Prasad R.; Graham, Sean W.; Newmaster, Steven G.; Husband, Brian C.; Percy, Diana M.; Hajibabaei, Mehrdad; Barrett, Spencer C. H. (2008-07-30). DeSalle, Robert (ed.). "Multiple Multilocus DNA Barcodes from the Plastid Genome Discriminate Plant Species Equally Well". PLOS ONE. 3 (7): e2802. Bibcode:2008PLoSO...3.2802F. doi:10.1371/journal.pone.0002802. ISSN 1932-6203. PMC 2475660. PMID 18665273.
  24. ^Kress, W. John; Erickson, David L. (2007-06-06). Shiu, Shin-Han (ed.). "A Two-Locus Global DNA Barcode for Land Plants: The Coding rbcL Gene Complements the Non-Coding trnH-psbA Spacer Region". PLOS ONE. 2 (6): e508. Bibcode:2007PLoSO...2..508K. doi:10.1371/journal.pone.0000508. ISSN 1932-6203. PMC 1876818. PMID 17551588.
  25. ^Janda, J. M.; Abbott, S. L. (2007-09-01). "16S rRNA Gene Sequencing for Bacterial Identification in the Diagnostic Laboratory: Pluses, Perils, and Pitfalls". Journal of Clinical Microbiology. 45 (9): 2761–2764. doi:10.1128/JCM.01228-07. ISSN 0095-1137. PMC 2045242. PMID 17626177.
  26. ^ abSmith, M. Alex; Bertrand, Claudia; Crosby, Kate; Eveleigh, Eldon S.; Fernandez-Triana, Jose; Fisher, Brian L.; Gibbs, Jason; Hajibabaei, Mehrdad; Hallwachs, Winnie (2012-05-02). Badger, Jonathan H. (ed.). "Wolbachia and DNA barcoding insects: Patterns, potential, and problems". PLOS ONE. 7 (5): e36514. Bibcode:2012PLoSO...736514S. doi:10.1371/journal.pone.0036514. ISSN 1932-6203. PMC 3342236. PMID 22567162.
  27. ^ abLinks, Matthew G.; Dumonceaux, Tim J.; Hemmingsen, Sean M.; Hill, Janet E. (2012-11-26). Neufeld, Josh (ed.). "The Chaperonin-60 Universal Target Is a Barcode for Bacteria That Enables De Novo Assembly of Metagenomic Sequence Data". PLOS ONE. 7 (11): e49755. Bibcode:2012PLoSO...749755L. doi:10.1371/journal.pone.0049755. ISSN 1932-6203. PMC 3506640. PMID 23189159.
  28. ^ abCase, R. J.; Boucher, Y.; Dahllof, I.; Holmstrom, C.; Doolittle, W. F.; Kjelleberg, S. (2007-01-01). "Use of 16S rRNA and rpoB Genes as Molecular Markers for Microbial Ecology Studies". Applied and Environmental Microbiology. 73 (1): 278–288. doi:10.1128/AEM.01177-06. ISSN 0099-2240. PMC 1797146. PMID 17071787.
  29. ^Bellemain, Eva; Carlsen, Tor; Brochmann, Christian; Coissac, Eric; Taberlet, Pierre; Kauserud, Håvard (2010). "ITS as an environmental DNA barcode for fungi: an in silico approach reveals potential PCR biases". BMC Microbiology. 10 (1): 189. doi:10.1186/1471-2180-10-189. ISSN 1471-2180. PMC 2909996. PMID 20618939.
  30. ^Seifert, K. A.; Samson, R. A.; deWaard, J. R.; Houbraken, J.; Levesque, C. A.; Moncalvo, J.-M.; Louis-Seize, G.; Hebert, P. D. N. (2007-03-06). "Prospects for fungus identification using CO1 DNA barcodes, with Penicillium as a test case". Proceedings of the National Academy of Sciences. 104 (10): 3901–3906. doi:10.1073/pnas.0611691104. ISSN 0027-8424. PMC 1805696. PMID 17360450.
  31. ^Dentinger, Bryn T. M.; Didukh, Maryna Y.; Moncalvo, Jean-Marc (2011-09-22). Schierwater, Bernd (ed.). "Comparing COI and ITS as DNA Barcode Markers for Mushrooms and Allies (Agaricomycotina)". PLOS ONE. 6 (9): e25081. Bibcode:2011PLoSO...625081D. doi:10.1371/journal.pone.0025081. ISSN 1932-6203. PMC 3178597. PMID 21966418.
  32. ^ abKhaund, Polashree; Joshi, S.R. (October 2014). "DNA barcoding of wild edible mushrooms consumed by the ethnic tribes of India". Gene. 550 (1): 123–130. doi:10.1016/j.gene.2014.08.027. PMID 25130907.
  33. ^Lobo, Jorge; Costa, Pedro M; Teixeira, Marcos AL; Ferreira, Maria SG; Costa, Maria H; Costa, Filipe O (2013). "Enhanced primers for amplification of DNA barcodes from a broad range of marine metazoans". BMC Ecology. 13 (1): 34. doi:10.1186/1472-6785-13-34. ISSN 1472-6785. PMC 3846737. PMID 24020880.
  34. ^Yacoub, Haitham A.; Fathi, Moataz M.; Sadek, Mahmoud A. (2015-03-04). "Using cytochrome b gene of mtDNA as a DNA barcoding marker in chicken strains". Mitochondrial DNA. 26 (2): 217–223. doi:10.3109/19401736.2013.825771. ISSN 1940-1736. PMID 24020964. S2CID 37802920.
  35. ^Siddappa, Chandra Mohan; Saini, Mohini; Das, Asit; Doreswamy, Ramesh; Sharma, Anil K.; Gupta, Praveen K. (2013). "Sequence characterization of mitochondrial 12S rRNA gene in mouse deer (Moschiola indica) for PCR-RFLP based species identification". Molecular Biology International. 2013: 783925. doi:10.1155/2013/783925. ISSN 2090-2182. PMC 3885226. PMID 24455258.
  36. ^Vences, Miguel; Thomas, Meike; van der Meijden, Arie; Chiari, Ylenia; Vieites, David R. (2005-03-16). "Comparative performance of the 16S rRNA gene in DNA barcoding of amphibians". Frontiers in Zoology. 2 (1): 5. doi:10.1186/1742-9994-2-5. ISSN 1742-9994. PMC 555853. PMID 15771783.
  37. ^Chen, Shilin; Yao, Hui; Han, Jianping; Liu, Chang; Song, Jingyuan; Shi, Linchun; Zhu, Yingjie; Ma, Xinye; Gao, Ting (2010-01-07). Gilbert, M. Thomas P (ed.). "Validation of the ITS2 Region as a Novel DNA Barcode for Identifying Medicinal Plant Species". PLOS ONE. 5 (1): e8613. Bibcode:2010PLoSO...5.8613C. doi:10.1371/journal.pone.0008613. ISSN 1932-6203. PMC 2799520. PMID 20062805.
  38. ^Theodoridis, Spyros; Stefanaki, Anastasia; Tezcan, Meltem; Aki, Cuneyt; Kokkini, Stella; Vlachonasios, Konstantinos E. (July 2012). "DNA barcoding in native plants of the Labiatae (Lamiaceae) family from Chios Island (Greece) and the adjacent Çeşme-Karaburun Peninsula (Turkey)". Molecular Ecology Resources. 12 (4): 620–633. doi:10.1111/j.1755-0998.2012.03129.x. PMID 22394710. S2CID 2227349.
  39. ^Yang, Ying; Zhai, Yanhong; Liu, Tao; Zhang, Fangming; Ji, Yunheng (January 2011). "Detection of Valeriana jatamansi as an adulterant of medicinal Paris by length variation of chloroplast psbA-trnH region"(PDF). Planta Medica. 77 (1): 87–91. doi:10.1055/s-0030-1250072. ISSN 0032-0943. PMID 20597045.
  40. ^Gao, Ting; Yao, Hui; Song, Jingyuan; Liu, Chang; Zhu, Yingjie; Ma, Xinye; Pang, Xiaohui; Xu, Hongxi; Chen, Shilin (July 2010). "Identification of medicinal plants in the family Fabaceae using a potential DNA barcode ITS2". Journal of Ethnopharmacology. 130 (1): 116–121. doi:10.1016/j.jep.2010.04.026. PMID 20435122.
  41. ^Weisburg WG; Barns SM; Pelletier DA; Lane DJ (1991). "16S ribosomal DNA amplification for phylogenetic study". Journal of Bacteriology. 173 (2): 697–703. doi:10.1128/jb.173.2.697-703.1991. PMC 207061. PMID 1987160.
  42. ^Makarova, Olga; Contaldo, Nicoletta; Paltrinieri, Samanta; Kawube, Geofrey; Bertaccini, Assunta; Nicolaisen, Mogens (2012-12-18). Woo, Patrick CY. (ed.). "DNA Barcoding for Identification of 'Candidatus Phytoplasmas' Using a Fragment of the Elongation Factor Tu Gene". PLOS ONE. 7 (12): e52092. Bibcode:2012PLoSO...752092M. doi:10.1371/journal.pone.0052092. ISSN 1932-6203. PMC 3525539. PMID 23272216.
  43. ^Schneider, Kevin L.; Marrero, Glorimar; Alvarez, Anne M.; Presting, Gernot G. (2011-04-21). Bereswill, Stefan (ed.). "Classification of Plant Associated Bacteria Using RIF, a Computationally Derived DNA Marker". PLOS ONE. 6 (4): e18496. Bibcode:2011PLoSO...618496S. doi:10.1371/journal.pone.0018496. ISSN 1932-6203. PMC 3080875. PMID 21533033.
  44. ^Liu, Lin; Huang, Xiaolei; Zhang, Ruiling; Jiang, Liyun; Qiao, Gexia (January 2013). "Phylogenetic congruence between Mollitrichosiphum (Aphididae: Greenideinae) and Buchnera indicates insect-bacteria parallel evolution". Systematic Entomology. 38 (1): 81–92. doi:10.1111/j.1365-3113.2012.00647.x. S2CID 84702103.
  45. ^Gao, Ruifang; Zhang, Guiming (November 2013). "Potential of DNA Barcoding for Detecting Quarantine Fungi". Phytopathology. 103 (11): 1103–1107. doi:10.1094/PHYTO-12-12-0321-R. ISSN 0031-949X. PMID 23718836.
  46. ^Stielow, J. B.; Lévesque, C. A.; Seifert, K. A.; Meyer, W.; Irinyi, L.; Smits, D.; Renfurm, R.; Verkley, G. J. M.; Groenewald, M.; Chaduli, D.; Lomascolo, A.; Welti, S.; Lesage-Meessen, L.; Favel, A.; Al-Hatmi, A. M. S.; Damm, U.; Yilmaz, N.; Houbraken, J.; Lombard, L.; Quaedvlieg, W.; Binder, M.; Vaas, L. A. I.; Vu, D.; Yurkov, A.; Begerow, D.; Roehl, O.; Guerreiro, M.; Fonseca, A.; Samerpitak, K.; van Diepeningen, A. D.; Dolatabadi, S.; Moreno, L. F.; Casaregola, S.; Mallet, S.; Jacques, N.; Roscini, L.; Egidi, E.; Bizet, C.; Garcia-Hermoso, D.; Martín, M. P.; Deng, S.; Groenewald, J. Z.; Boekhout, T.; de Beer, Z. W.; Barnes, I.; Duong, T. A.; Wingfield, M. J.; de Hoog, G. S.; Crous, P. W.; Lewis, C. T.; Hambleton, S.; Moussa, T. A. A.; Al-Zahrani, H. S.; Almaghrabi, O. A.; Louis-Seize, G.; Assabgui, R.; McCormick, W.; Omer, G.; Dukik, K.; Cardinali, G.; Eberhardt, U.; de Vries, M.; Robert, V. (2015). "One fungus, which genes? Development and assessment of universal primers for potential secondary fungal DNA barcodes". Persoonia. 35: 242–263. doi:10.3767/003158515X689135. PMC 4713107. PMID 26823635.
  47. ^Meyer, Wieland; Irinyi, Laszlo; Minh, Thuy Vi Hoang; Robert, Vincent; Garcia-Hermoso, Dea; Desnos-Ollivier, Marie; Yurayart, Chompoonek; Tsang, Chi-Ching; Lee, Chun-Yi; Woo, Patrick C. Y.; Pchelin, Ivan Mikhailovich; Uhrlaß, Silke; Nenoff, Pietro; Chindamporn, Ariya; Chen, Sharon; Hebert, Paul D. N.; Sorrell, Tania C.; ISHAM barcoding of pathogenic fungi working group (2018). "Database establishment for the secondary fungal DNA barcode translational elongation factor 1α (TEF1α)". Genome. 62 (3): 160–169. doi:10.1139/gen-2018-0083. PMID 30465691.
  48. ^Brown, Shawn P.; Rigdon-Huss, Anne R.; Jumpponen, Ari (June 2014). "Analyses of ITS and LSU gene regions provide congruent results on fungal community responses". Fungal Ecology. 9: 65–68. doi:10.1016/j.funeco.2014.02.002.
  49. ^Gile, Gillian H.; Stern, Rowena F.; James, Erick R.; Keeling, Patrick J. (August 2010). "DNA barcoding of Chlorarachniophytes using nucleomorph ITS sequences". Journal of Phycology. 46 (4): 743–750. doi:10.1111/j.1529-8817.2010.00851.x. S2CID 26529105.
  50. ^Strüder-Kypke, Michaela C.; Lynn, Denis H. (2010-03-25). "Comparative analysis of the mitochondrial cytochrome c oxidase subunit I (COI) gene in ciliates (Alveolata, Ciliophora) and evaluation of its suitability as a biodiversity marker". Systematics and Biodiversity. 8 (1): 131–148. doi:10.1080/14772000903507744. ISSN 1477-2000. S2CID 83996912.
  51. ^ abHamsher, Sarah E.; LeGresley, Murielle M.; Martin, Jennifer L.; Saunders, Gary W. (2013-10-09). Crandall, Keith A. (ed.). "A comparison of morphological and molecular-based surveys to estimate the species richness of Chaetoceros and Thalassiosira (Bacillariophyta), in the Bay of Fundy". PLOS ONE. 8 (10): e73521. Bibcode:2013PLoSO...873521H. doi:10.1371/journal.pone.0073521. ISSN 1932-6203. PMC 3794052. PMID 24130665.
  52. ^Kaczmarska, Irena; Ehrman, James Michael; Moniz, Monica Barros Joyce; Davidovich, Nikolai (September 2009). "Phenotypic and genetic structure of interbreeding populations of the diatom Tabularia fasciculata (Bacillariophyta)". Phycologia. 48 (5): 391–403. doi:10.2216/08-74.1. ISSN 0031-8884. S2CID 84919305.
  53. ^Weigand, Hannah; Beermann, Arne J.; Čiampor, Fedor; Costa, Filipe O.; Csabai, Zoltán; Duarte, Sofia; Geiger, Matthias F.; Grabowski, Michał; Rimet, Frédéric (2019-03-14). "DNA barcode reference libraries for the monitoring of aquatic biota in Europe: Gap-analysis and recommendations for future work". bioRxiv. 678: 499–524. Bibcode:2019ScTEn.678..499W. doi:10.1101/576553. hdl:11250/2608962. PMID 31077928. S2CID 92160002.
Source: https://en.wikipedia.org/wiki/DNA_barcoding

Barcoding dna


The Using DNA Barcodes to Identify and Classify Living Things laboratory demonstrates several important concepts of modern biology. During the course of this laboratory, you will:

  • Collect and analyze sequence data from plants, fungi, or animals – or products made from them.
  • Use DNA sequence to identify species.
  • Explore relationships between species.

In addition, the laboratory experiment utilizes several experimental and bioinformatics methods in modern biological research. You will:

  • Collect plants, animals, or products in your local environment or neighborhood.
  • Extract and purify DNA from tissue or processed material.
  • Amplify a specific region of the chloroplast, mitochondrial, or nuclear genome by polymerase chain reaction (PCR), and analyze PCR products by gel electrophoresis.
  • Use the Basic Local Alignment Search Tool (BLAST) to identify sequences in databases.
  • Use multiple sequence alignment and tree-building tools to analyze phylogenetic relationships.


Taxonomy, the science of classifying living things according to shared features, has always been a part of human society. Carl Linnaeus formalized biological classification with his system of binomial nomenclature, which assigns each organism a genus and species name.

Identifying organisms has grown in importance as we monitor the biological effects of global climate change and attempt to preserve species diversity in the face of accelerating habitat destruction. We know very little about the diversity of plants and animals – let alone microbes – living in many unique ecosystems on earth. Fewer than two million of the estimated 5-50 million plant and animal species have been identified. Scientists agree that the yearly rate of extinction has increased from about one species per million to 100-1,000 per million. This means that thousands of plants and animals are lost each year, most of them before they have been identified.

Classical taxonomy falls short in this race to catalog biological diversity before it disappears. Specimens must be carefully collected and handled to preserve their distinguishing features. Differentiating subtle anatomical differences between closely related species requires the subjective judgment of a highly trained specialist – and few are being produced in colleges today.

Now, DNA barcodes allow non-experts to objectively identify species – even from small, damaged, or industrially processed material. Just as the unique pattern of bars in a universal product code (UPC) identifies each consumer product, a “DNA barcode” is a unique pattern of DNA sequence that can potentially identify each living thing. Short DNA barcodes, about 700 nucleotides in length, can be quickly processed from thousands of specimens and unambiguously analyzed by computer programs.

DNA barcoding revealed that what was once thought to be one species of butterfly is really ten species with caterpillars that eat different plants.

The International Barcode of Life (iBOL) organizes collaborators from more than 150 countries to participate in a variety of “campaigns” to census diversity among plant and animal groups – including ants, bees, butterflies, fish, birds, mammals, fungi, and flowering plants – and within ecosystems – including the seas, poles, rain forests, kelp forests, and coral reefs. The 10-year Census of Marine Life, completed in 2010, provided the first comprehensive list of more than 190,000 marine species and identified 6,000 potentially new species.

There is a surprising level of biological diversity, literally in front of our eyes. For example, DNA barcodes showed that a well-known skipper butterfly (Astraptes fulgerator), identified in 1775, is actually ten distinct species. DNA barcodes have revolutionized the classification of orchids, a complex and widespread plant family with an estimated 20,000 members. The urban environment is also unexpectedly diverse; DNA barcodes were used to catalogue 54 species of bees and 24 species of butterflies in community gardens in New York City.

DNA barcodes are also used to detect food fraud and products taken from conserved species. Working with researchers from Rockefeller University and the American Museum of Natural History, students from Trinity High School found that 25% of 60 seafood items purchased in grocery stores and restaurants in New York City were mislabeled as more expensive species. One mislabeled fish was the endangered species, Acadian redfish. Another group identified three protected whale species as the source of sushi sold in California and Korea. However, using DNA barcodes to identify potential biological contraband among products seized by customs is now well established.

Barcoding relies on short, highly variable regions of the genome. Although there is no universal barcode, a growing list of variable regions can help differentiate species from diverse taxonomic groups. With thousands of copies per cell, mitochondrial and chloroplast sequences are readily amplified by polymerase chain reaction, even from very small or degraded specimens.

Regions of chloroplast genes, including rbcL (the large subunit of RuBisCo, ribulose-1,5-bisphosphate carboxylase/oxygenase) and matK (maturase K), are used for barcoding plants. The most abundant protein on earth, RuBisCo catalyzes the first step of carbon fixation, while matK encodes maturase K, a protein that assists in splicing RNA. A region of the mitochondrial gene COI (cytochrome c oxidase subunit I) is used for barcoding animals. COI is involved in the electron transport phase of respiration. Thus, many genes used for barcoding are involved in the key reactions of life: storing energy in carbohydrates and releasing it to form ATP.

In fungi and lichens, COI is difficult to amplify and insufficiently variable, and some fungal groups lack mitochondria. Instead, the nuclear internal transcribed spacer (ITS), a variable region that surrounds the 5.8S ribosomal RNA gene, is targeted. Like organelle genes, ITS is present in many copies per genome, and its variability in fungi and lichens allows for their identification. The ITS region is also used for barcoding plants when rbcL and matK do not work.

Some organisms need other taxon-specific primers for identification. For instance, green macroalgae lack matK and are difficult to barcode with rbcL and ITS. For these algae, another chloroplast gene, tufA, which codes for elongation factor Tu (EF-Tu), a protein involved in protein synthesis, can be used. DNA barcoding to the species level is sometimes difficult with a single barcode, as species may share identical barcodes. Using multiple barcoding regions can help differentiate these closely related species.

This laboratory uses DNA barcoding to identify plants, fungi, or animals – or products made from them. First, a sample of tissue is collected, preserving the specimen whenever possible and noting its geographical location and local environment. A small leaf disc, a whole insect, or a sample of muscle is a suitable source. DNA is extracted from the tissue sample, and the barcode portion of the rbcL, COI, or ITS region is amplified by PCR. The amplified sequence (amplicon) is submitted for sequencing in one or both directions.
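When an amplicon is sequenced in both directions, the reverse read reports the opposite strand, so it has to be reverse-complemented before it can be compared with the forward read. The sketch below assumes clean, unambiguous reads of equal length; the read strings and the function name `reverse_complement` are illustrative only, and real reads would first need quality trimming and alignment.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Invented example reads covering the same stretch of the amplicon.
forward_read = "ATGGCATTCCTA"
reverse_read = "TAGGAATGCCAT"

# After reverse-complementing, the two reads should agree base for base.
reads_agree = reverse_complement(reverse_read) == forward_read
print(reads_agree)  # True
```

Agreement between the two directions gives some confidence in the base calls; a disagreement flags a position to inspect in the trace files.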

The sequencing results are then used to search a DNA database. A close match quickly identifies a species that is already represented in the database. However, some barcodes will be entirely new, and identification may rely on placing the unknown species in a phylogenetic tree with near relatives. Novel DNA barcodes can be submitted to GenBank® (http://www.ncbi.nlm.nih.gov).
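The idea of a "close match" can be sketched as a percent-identity comparison against a small reference set. This is a simplification under stated assumptions: real database searches use alignment tools such as BLAST, whereas the code below assumes the query and references are already aligned and of equal length. The reference names, sequences, and function names are all invented for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two pre-aligned, equal-length sequences match."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def best_match(query: str, references: dict[str, str]) -> tuple[str, float]:
    """Return the reference name with the highest identity to the query, and that identity."""
    name = max(references, key=lambda n: percent_identity(query, references[n]))
    return name, percent_identity(query, references[name])


# Toy reference library (hypothetical names and sequences).
REFERENCES = {
    "species_A": "ATGGCATTCCTACGAATAAA",
    "species_B": "ATGGCTTTCCTTCGTATGAA",
}
query = "ATGGCATTCCTACGAATGAA"

name, identity = best_match(query, REFERENCES)
print(name, identity)  # species_A 95.0
```

How high the identity must be before a match counts as a species-level identification depends on the marker and the taxonomic group, so no cutoff is hard-coded here.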

Further Reading

  • Hebert P.D.N., Cywinska A., Ball S.L., deWaard J.R. (2003). Biological identifications through DNA barcodes. Proceedings of the Royal Society B: Biological Sciences 270(1512): 313-21.
  • Hebert P.D.N., Penton E.H., Burns J.M., Janzen D.H., Hallwachs W. (2004). Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proc Natl Acad Sci USA 101(41): 14812-7.
  • Hollingsworth P.M. et al. (2009). A DNA barcode for land plants. Proc Natl Acad Sci USA 106(31): 12794-7.
  • Ratnasingham S., Hebert P.D.N. (2007). BOLD: The Barcode of Life Data System. Molecular Ecology Notes 7(3): 355-64.
  • Stoeckle M. (2003). Taxonomy, DNA, and the Bar Code of Life. BioScience 53(9): 2-3.
  • Van Den Berg C., Higgins W.E., Dressler R.L., Whitten W.M., Soto-Arenas M.A., Chase M.W. (2009). A phylogenetic study of Laeliinae (Orchidaceae) based on combined nuclear and plastid DNA sequences. Annals of Botany 104(3): 417-30.
  • Benson D.A., Cavanaugh M., Clark K., Karsch-Mizrachi I., Lipman D.J., Ostell J., Sayers E.W. (2013). GenBank. Nucleic Acids Res 41(D1): D36–D42.
Source: https://dnabarcoding101.org/lab/

DNA barcodes: Genes, genomics, and bioinformatics


1. Savolainen V, Cowan RS, Vogler AP, Roderick GK, Lane R. Phil Trans R Soc London Ser B. 2005;360:1805–1811.[PMC free article] [PubMed] [Google Scholar]

2. Lahaye R, van der Bank M, Bogarin D, Warner J, Pupulin F, Gigot G, Maurin O, Duthoit S, Barraclough TG, Savolainen V. Proc Natl Acad Sci USA. 2008;105:2923–2928.[PMC free article] [PubMed] [Google Scholar]

3. Kembel SW, Hubbell SP. Ecology. 2006;87:S86–S99. [PubMed] [Google Scholar]

4. Westoby M, Wright IJ. Trends Ecol Evol. 2006;21:261–268. [PubMed] [Google Scholar]

5. Webb CO, Ackerly DD, McPeek MA, Donoghue MJ. Annu Rev Ecol Syst. 2002;33:475–505.[Google Scholar]

6. van Straalen NM, Roelofs D. An Introduction to Ecological Genomics. London: Oxford Univ Press; 2006. [Google Scholar]

7. Hebert PDN, Ratnasingham S, deWaard JR. Proc Roy Soc B. 2003;270(suppl):S96–SS99.[PMC free article] [PubMed] [Google Scholar]

8. Cho Y, Mower JP, Qiu YL, Palmer JD. Proc Natl Acad Sci USA. 2004;101:17741–17746.[PMC free article] [PubMed] [Google Scholar]

9. Pennisi E. Science. 2007;318:190–191. [PubMed] [Google Scholar]

10. Kress WJ, Wurdack KJ, Zimmer EA, Weigt LA, Janzen DH. Proc Natl Acad Sci USA. 2005;102:8369–8374.[PMC free article] [PubMed] [Google Scholar]

11. Taberlet P, Coissac E, Pompanon F, Gielly L, Miquel C, Valentini A, Vermat T, Corthier G, Brochmann C, Willerslev E. Nucleic Acids Res. 2007;35(3):e14.[PMC free article] [PubMed] [Google Scholar]

12. Chase MW, Cowan RS, Hollingsworth PM, van den Berg C, Madriñán S, Petersen G, Seberg O, Jorgsensen T, Cameron KN, Carine M, et al. Taxon. 2007;56:295–299.[Google Scholar]

13. Kress WJ, Erickson DL. PLoS ONE. 2007;2(6):e508.[PMC free article] [PubMed] [Google Scholar]

14. Sass C, Little DP, Stevenson DW, Specht CD. PLoS ONE. 2007;2(11):e1154.[PMC free article] [PubMed] [Google Scholar]

15. Ekrem T, Willassen E, Stur E. Mol Phylogenet Evol. 2007;43:530–542. [PubMed] [Google Scholar]

16. Shaw J, Lickey EB, Schilling EE, Small RL. Am J Bot. 2007;94:275–288. [PubMed] [Google Scholar]

17. Little D, Stevenson DW. Cladistics. 2006;22:1–21.[Google Scholar]

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2268532/

Counting the species: how DNA barcoding is rewriting the book of life

Guanacaste conservation area in north-west Costa Rica is the most DNA barcoded place on Earth. On its western frontier, jaguars hunt turtles from the mangrove swamps that line the Pacific coast. Endangered spider monkeys swing through dry tropical forest, the remnants of a rapidly disappearing ecosystem that once ran from northern Mexico to Panama.

On the slopes of volcanoes, the last before Lake Nicaragua to the north, rainforest covers the land. High on the volcanic peaks, cool, moist air brought by the Atlantic trade winds forms cloud forests. There is a lot of life to document in this world heritage site, which is roughly the size of New York City.

As the sixth mass extinction of life on Earth gathers pace, humanity can only manage a well-informed guess about the true magnitude of the loss. We have identified around 2 million species on the planet. We know their abundance has plummeted. But with estimates for the total ranging from 8.7 million to a trillion, we are still unable to answer a fundamental question: how many species are there on Earth?

Until recently, there was little hope of a quick solution to the so-called “taxonomic impediment”, the phrase used to describe our inadequate account of the world’s library of life and the scarcity of taxonomists. Detailed species knowledge was routinely lost when experts died. Most plants and animals that went extinct slipped away unnoticed and unrecorded, anonymous casualties of human overconsumption and overpopulation.

But that was before the invention of DNA barcoding. In 2003, Canadian scientist Paul Hebert published a study claiming to have developed a technique that could identify and differentiate between all animal species on Earth. Using common moths collected in his own backyard, he successfully identified 200 closely related species using the mitochondrial gene cytochrome c oxidase I (COI), which is present in all aerobic life.

Hebert, known primarily for his expertise in water fleas at the time, had cracked it. The short genetic sequence would serve as a DNA barcode for all animals, separating species by their genetic divergence. An equivalent section of DNA could be used to discriminate between plants and fungi. Museum collections could be identified, too. Barcoding was also cheap. All he needed now was $1bn to identify the millions of animals unknown to science, a fraction of the cost of the International Space Station or the Human Genome Project, the paper concluded. But Hebert’s study was not met with universal acclaim.

“I was surprised. I had anticipated harsh criticism from morphologists. But I had not expected critiques from my peers in evolutionary biology,” Hebert recounts. He was accused of acting like a “creationist”. Others said his findings were uninteresting.

But nearly two decades later DNA barcoding has become mainstream. In August, Hebert, a professor at the University of Guelph in Ontario, was awarded the prestigious Midori prize for creating “a research alliance which is revolutionising our understanding of planetary biodiversity”.

DNA barcoding has been used to track the illegal trade in wildlife and plants, monitor water quality and even uncover the sale of endangered sharks in fish and chips. The technique has unmasked so-called cryptic species that were identified as one animal by traditional taxonomic approaches but are in fact many distinct creatures that appear the same to the human eye.

So far the reference library of species overseen by the International Barcode of Life (iBOL), where Hebert is the scientific director, numbers around 750,000 species. Last year the group launched a $180m project to barcode two million more species around the world, approximately the total number of flora and fauna already described using traditional taxonomy. While estimates for the number of plants, animals and fungi species range from eight to 20 million, insects are believed to account for a huge number of undiscovered species. Around $60m has been raised for the project so far.


The benefits of knowing the Earth’s library of life are not limited to understanding the extent of biodiversity loss. Discoveries in medicine, agriculture, food, engineering and even beauty products are hidden in the genomes of the species that will be barcoded. A complete library of life could underpin food distribution networks, allow a smartphone attachment to identify any piece of organic material on Earth and integrate natural history into the social, cultural and economic fabric of society.

Now Hebert has turned his attention to the creation of a global biosurveillance system underpinned by barcoding that will continuously monitor the planet and check the health of global ecosystems in near real time. A network of satellites, underwater drones and DNA sequencers would patrol Earth, alerting scientists and governments to any dangerous changes, intercepting new diseases and highlighting harmful human activity. He estimates it would cost $1bn over 20 years.

There are good reasons to create such a system. Compared with the atmospheric monitoring infrastructure and billions of research money for combating the climate crisis, the resources dedicated to measuring the ongoing biological annihilation of life on Earth are pitiful. The tale of our heating planet is based on more than 150 years of weather records, while it is not uncommon for studies on insect collapse to be based on figures compiled by amateur entomologists.

“It’s a million centuries between every mass extinction event and we’re living in the century that brings the next one,” Hebert says. “We’re talking about the irrevocable loss of knowledge on the largest scale ever experienced by humanity – driven by humanity. Because every one of those genomes and every one of those species: that’s a book of life, and it’s about 10 times bigger than the longest book ever written by any human. So I think history will indict us severely for allowing this erosion of knowledge on an absolutely unprecedented scale.”


Guanacaste conservation area World Heritage site (ACG in its Spanish acronym) exists largely thanks to a lifetime of work by University of Pennsylvania professors Daniel Janzen and Winnie Hallwachs. Dan and Winnie, as they are known to everyone, split their lives between Philadelphia and a forest cabin in Santa Rosa national park, which is part of the ACG. They immediately understood the potential of Hebert’s innovation and are the major drivers behind Costa Rica’s bid to become the most extensively barcoded country on Earth with a new project: BioAlfa.

“For me, the invention of DNA barcoding is easily as significant as the discovery of DNA,” Janzen says as we sit outside their forest home. “And you could even go further back to some bigger discovery that we’ve had – the microscope, for example.” The 81-year-old evolutionary ecologist is a generational talent in his field, recipient of the Crafoord prize and a MacArthur fellow. His peers also admire his bravery, hard work and excellent salesmanship.

BioAlfa aims to systematically record and describe all of Costa Rica’s biodiversity, with barcoding at its heart. In 2019, President Carlos Alvarado Quesada designated the scheme of national importance, but it still needs $100m to make its goals a self-sustaining reality. While temperate countries have launched similar schemes, the sheer abundance of life in the tropics makes BioAlfa a completely different challenge.

The Central American country is home to an estimated 4% of the world’s biodiversity. Coexistence with nature is part of Costa Rica’s essence and it promotes ambitious decarbonisation plans and wields international influence in the environmental arena. Overcoming the taxonomic impediment within its borders by identifying and understanding all of its flora and fauna would be an unprecedented achievement. Hebert has reserved half of his barcoding capacity for BioAlfa this year.

Janzen began documenting life in the dry tropical forests of northern Costa Rica after collecting leaves to feed Rufus, an excitable teenage tapir, in the mid-70s. The pig-like herbivore had been orphaned and entrusted to friends, surviving on scraps from the kitchen table. But Rufus was no longer welcome at dinner after he learned that a swift tug on the tablecloth would bring a feast crashing to the floor.

“When he was banished outside, I came to the question of what kinds of leaves he would eat,” Janzen says, chuckling as he recounts the tale.

Janzen drove to the forest of Santa Rosa national park, which now forms part of the ACG, and filled plastic bags with an array of leaves for Rufus. But when he returned to the corral, he realised he could not identify the leaves the grateful tapir was devouring. So he returned to Santa Rosa with a botanist and spent the next six months identifying the plants in the forests. Then he moved on to insects.

‘Butterfly factories’

A network of malaise traps, moth lamp stations and rearing barns – jokingly known as “butterfly factories” to those who work in them – has been established across the different ecosystems to record insect life. The painstaking research will help make a global biosurveillance system possible but it needs to be conducted everywhere.

A former water buffalo shed is filled with carefully organised rows of plastic bags, each containing a caterpillar feasting on leaves from the nearby rainforest. Osvaldo, a former shark fisherman and field assistant to the couple for 30 years, is holding a hungry caterpillar hidden under a leaf. The insect will be carefully reared and ultimately sent to Canada for DNA barcoding analysis in Hebert’s lab once it has completed its life cycle.

The caterpillar is a chaotic creature that writhes in the light when Osvaldo turns over the leaf, the end of its body quivering like the rattle of a poisonous snake. Its shades of brown and beige combine like a cubist artwork. Barcoding might show it is a new species.

“There are much bigger ones than that,” Osvaldo tells me, disappearing back into the lines of plastic bags.

The next caterpillar is huge, covered in orange and blue spikes. It makes a low-level, muttered clicking sound as Osvaldo strokes its back. We cross to the other side of the rearing barn to inspect pupae undergoing their final stage of development. Osvaldo delights in the range of chrysalis shapes and colours.

But not all become butterflies and moths. The bags filled with dead pupae are moved to another line in the barn. From them, parasitoids emerge from eggs that were laid inside the unsuspecting hosts they have slowly devoured.

In the main building on a hill above the rearing barn, Gloria, another parataxonomist, shows me photos of how the pupae are changed by the parasitoids. Some look like they’ve been stuffed with polystyrene. Others look normal apart from small groups of black bubbles on the pupa. Sometimes flies emerge from them, sometimes wasps.

Gloria is carefully inspecting glass jars filled with the parasitoids’ pupae, inputting information about the host specimen they emerged from and taking photos. They, too, will be sent for DNA barcoding to better understand the web of life in the ACG.

The results of this process have been astonishing. Almost 200 new parasitoid wasp species were discovered where only three had previously been described. At least 3,000 more species have been barcoded and are awaiting the attention of a taxonomist to formally introduce them to science.

While the thrill of discovery is an end in itself, the library of species that BioAlfa will help create will also be of economic importance. A DNA barcode is just a way to identify an organism but the genome – its entire sequence – can prove lucrative: the basis for new discoveries in medicine, agriculture, food and beauty.

Alongside conservation and sustainability, sharing the benefits from genetic resources is the third and often ignored pillar of the UN convention on biological diversity which will hopefully produce the “Paris agreement for nature” in 2021. Developing countries, which are normally the most biodiverse, want just payment for the riches that might hide in their ecosystems.

Janzen and Hallwachs, alongside the Costa Rican government, are well aware of this issue and the potential economic benefits of BioAlfa. Anyone who spends long enough with Janzen will see his trusty comb emerge from his back pocket – the beginning of a story about the future of being able to identify any organism anywhere with a device that connects to an iPhone. Using a sensor the size of a comb, he says, farmers will be able to calculate the economic cost of cutting down rainforest for cattle or monoculture crops by rapidly checking areas for potential discoveries.

Costa Rica already has had early success bioprospecting. South of the ACG is the Nicoya peninsula, one of the four blue zones on Earth where humans routinely live above the age of 100. In 2017, Chanel launched its Blue Serum skincare range, which uses ingredients from here. Antioxidants from the region’s green coffee were used and the Costa Rican government received payment. BioAlfa’s library of life might bring many more paydays.

Unknown extinctions

Heading out into the dry tropical forest a short drive from Janzen and Hallwachs’ cabin, we inspect the number of moths that have emerged in the first few weeks of the rainy season.

To the untrained eye, the hundreds of insects on the white sheet in the darkness overwhelm and exhilarate in equal measure. Moths the size of birds flutter around my head, brushing my ears, legs and every uncovered body part. Geckos lurk on the corner of the sheet picking off the smaller moths. Mexican burrowing toads belch in unison in the valley below the lamp station. But the couple are quick to temper my naive exuberance.

There used to be many more, Hallwachs quietly assures me as we stand with the darkness at our backs, looking at the spectacular display. “There are all kinds of species missing,” she says.

The next day Janzen shows me a picture of the same light station in 1984 – it is barely possible to see the white linen under the layer of moths.

As Hallwachs shows me to my room on the first night of my second visit to the park, I point through stormy weather conditions to fireflies blinking around the trees. It is a species of firefly that only appears in the first few weeks of the rainy season, she says.

“Firefly numbers are going down around the world. And they’re not nearly as abundant as they used to be. But they are magical. They’re totally magical,” she says, as we crouch together in the rain admiring them.

The ACG is marked by human extraction: scar marks on the chicle trees, which were targeted in the second world war to provide chewing gum; stumps of mahogany, still rock solid decades after they were felled; the mangrove forest that was cut down for textile dye. All are indicators of the overconsumption driving biodiversity loss around the world.

On my final day with the couple, they indulge my interest in the beach on the western flank of the ACG, which might have the largest concentration of jaguars in Central America. In the middle of another rainy season storm, Janzen stops the 4x4 we are travelling in to explain why.

“When I got here in about 1971, I met an old jaguar hunter who hunted with dogs. And he said to me, not bragging, just matter-of-factly, that he normally got five to six jaguars per year out of this valley. So a few years later, I’m exploring this valley for caterpillars and all that. And I look around me as a hunter, as somebody who understands wild food. And I say to myself, ‘no way does this valley support five to six jaguars a year’.

“Years later, a biologist named Luis Fonseca started studying the nesting of sea turtles on this beach down here. And right away he discovered the jaguars were killing the sea turtles – not the eggs – but the whole adult.

“There are four species of turtles that nest on this beach. Two are regular all year round. So there’s the food! We have endangered species eating endangered species to keep themselves going.”

There used to be more of everything, everyone is certain, but quantifying what else might be slipping away is hard when there are millions of species left to document. Maybe DNA barcoding can rectify that.

Source: https://www.theguardian.com/environment/2020/oct/07/counting-the-species-how-dna-barcoding-is-rewriting-the-book-of-life-aoe