The Data Model
Arctos is based on a deeply relational, highly normalized data model. “Relational” and “normalized” are means by which data are made more predictable, and predictable data are inherently discoverable. Normalization (the process of reducing redundancy and dependency) manifests itself in self-customizing, highly extensible code. For example, adding “fields” to record information about a new type of measurement or fact (such as “clutch size of nest parasite”) requires only an entry in a code table, with no programming or structural changes. Linking identifiers such as catalog number and locality identifier works the same way: an entry in a code table turns all new and existing identifiers of that type into links.
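The code-table pattern described above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module, not the actual Arctos schema: the table names (`ctattribute_type`, `attribute`, `ctidentifier_type`, `identifier`), the column names, and the `example.org` base URL are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: two code tables ("ct...") control which attribute
# and identifier types exist; the data tables only reference them.
SCHEMA = """
CREATE TABLE ctattribute_type (attribute_type TEXT PRIMARY KEY);
CREATE TABLE attribute (
    specimen_id     INTEGER NOT NULL,
    attribute_type  TEXT NOT NULL REFERENCES ctattribute_type(attribute_type),
    attribute_value TEXT NOT NULL
);
CREATE TABLE ctidentifier_type (
    identifier_type TEXT PRIMARY KEY,
    base_url        TEXT  -- NULL means "display as plain text"
);
CREATE TABLE identifier (
    specimen_id      INTEGER NOT NULL,
    identifier_type  TEXT NOT NULL REFERENCES ctidentifier_type(identifier_type),
    identifier_value TEXT NOT NULL
);
"""


def make_db():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_attribute_type(conn, name):
    """Adding a new 'field' is a single row in the code table."""
    conn.execute("INSERT INTO ctattribute_type VALUES (?)", (name,))


def record_attribute(conn, specimen_id, attribute_type, value):
    """Fails with sqlite3.IntegrityError if the type is not in the code table."""
    conn.execute(
        "INSERT INTO attribute VALUES (?, ?, ?)",
        (specimen_id, attribute_type, value),
    )


def add_identifier_type(conn, identifier_type, base_url=None):
    conn.execute(
        "INSERT INTO ctidentifier_type VALUES (?, ?)", (identifier_type, base_url)
    )


def add_identifier(conn, specimen_id, identifier_type, value):
    conn.execute(
        "INSERT INTO identifier VALUES (?, ?, ?)",
        (specimen_id, identifier_type, value),
    )


def set_identifier_base_url(conn, identifier_type, base_url):
    """One code-table update changes how every identifier of that type renders."""
    conn.execute(
        "UPDATE ctidentifier_type SET base_url = ? WHERE identifier_type = ?",
        (base_url, identifier_type),
    )


def identifier_links(conn, specimen_id):
    """Return each identifier as a URL if its type has a base URL, else as text."""
    rows = conn.execute(
        """SELECT i.identifier_value, c.base_url
           FROM identifier AS i
           JOIN ctidentifier_type AS c ON c.identifier_type = i.identifier_type
           WHERE i.specimen_id = ?
           ORDER BY i.identifier_value""",
        (specimen_id,),
    ).fetchall()
    return [base + value if base else value for value, base in rows]
```

In this sketch, no table is altered when a new measurement type appears, and identifiers recorded before a base URL was set become links as soon as the code-table row is updated, because the link is assembled at read time rather than stored with each identifier.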
Standardized, predictable data may be used to form resolvable, reciprocal links to internal and external relationships between data objects, such as hosts to parasites or specimens to GenBank accessions. Those links, in turn, allow asking cross-resource questions, such as “what parasites of Canis are documented in Arctos?” Arctos was the first collection management system to develop reciprocal linkages between specimens and GenBank sequence data, thereby establishing the current standard for specimen-specific unique identifiers.
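To make the idea of reciprocal links concrete, here is a small in-memory sketch. It assumes a hypothetical `SpecimenGraph` class, made-up `EX:` record identifiers, and illustrative sample data; it is not how Arctos stores relationships, only one way the "record once, traverse from either side" behavior could look.

```python
# Each relationship term has an inverse, so recording one direction
# implies the other.
INVERSE = {"parasite of": "host of", "host of": "parasite of"}


class SpecimenGraph:
    def __init__(self):
        self.taxon = {}  # record identifier -> scientific name
        self.links = {}  # record identifier -> set of (relationship, other id)

    def add_specimen(self, guid, scientific_name):
        self.taxon[guid] = scientific_name
        self.links.setdefault(guid, set())

    def relate(self, subject, relationship, obj):
        """Store the relationship and its inverse in one step."""
        self.links[subject].add((relationship, obj))
        self.links[obj].add((INVERSE[relationship], subject))

    def parasites_of_genus(self, genus):
        """Answer a cross-resource question by following 'host of' links."""
        found = set()
        for guid, name in self.taxon.items():
            if name.split()[0] != genus:
                continue
            for relationship, other in self.links[guid]:
                if relationship == "host of":
                    found.add(self.taxon[other])
        return sorted(found)


# Illustrative sample data only; these are not real Arctos records.
graph = SpecimenGraph()
graph.add_specimen("EX:Mamm:1", "Canis lupus")
graph.add_specimen("EX:Para:7", "Echinococcus granulosus")
graph.relate("EX:Para:7", "parasite of", "EX:Mamm:1")
print(graph.parasites_of_genus("Canis"))
```

The relationship is entered only from the parasite's side, yet the host record can answer the question, which is the property that makes cross-resource queries possible.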
Arctos data are available online as HTML and through various services (e.g., RDF, JSON, XML). Data are updated frequently, and new or changed records are refreshed within 24 hours. All Arctos data, unless explicitly restricted by Curators, are available for use. Upon request, Arctos will provide DataCite DOIs for arbitrary Arctos “nodes,” including specimens and media. DOIs are persistent, globally unique, protocol-agnostic identifiers that remain stable through URL, system, or protocol changes, making DOI-bearing data resolvable to specimen records into the foreseeable future.
Because Arctos is highly normalized, the number of data objects is effectively unbounded, and there is seldom only one path to any of them. For example, specimens may be located through specimen search, as citations in publications, as representatives of taxonomy, or as part of loans used in projects; specimens contribute to all of those things, and all of those things are part of specimens. For this reason, we prefer to view Arctos as an ecosystem of components (specimens, loans, projects, citations, etc.) rather than as a database. These components “plug into” each other and also freely communicate with third-party services. Adding new services as they become available is generally trivial. All of these components work together to document the value of specimens and the products produced through the use of specimens, making Arctos a uniquely rich center of specimen-related data and of the tools to explore and visualize those data in novel ways.
- GeoLocate provides semi-automated georeferencing. Specimens and media have an inline “thumbnail map” that is dynamically generated via Google Enterprise tools. BerkeleyMapper provides more sophisticated mapping capabilities, including error representation and range map overlays. Spatial query capabilities are powered by Google Maps, and coordinates are transformed into standardized descriptive searchable text via reverse georeferencing services. Similar services check and suggest coordinates in locality editing forms.
- Taxonomic classification data come from both local data and GlobalNames services.
- GenBank is crawled nightly for new, otherwise-undocumented sequences representing Arctos specimens, and reciprocal hyperlinks are created.
- The Texas Advanced Computing Center (TACC) provides secure petabyte-scale storage. TACC provides media hosting and processing, including automated Optical Character Recognition (OCR) on text (e.g., herbarium sheet images).
- Google Custom Search provides text-match search of specimens, taxonomy, publications, and projects.
- Arctos Projects group transactions to provide a self-maintaining summary of specimen usage, document collecting efforts, provide linkages to related projects, and document the work of individuals.
- EZID provides globally unique, persistent identifiers for Arctos data, ensuring future discoverability.
- Arctos collaborates with other resources aimed at aggregating biodiversity data, improving data quality, and making the data available in broader contexts. These third-party systems complement Arctos by making the data more discoverable, and are in no way competitors of, or alternatives to, Arctos. Data sharing is based on the Darwin Core standard (Wieczorek et al. 2012). Summary data are published as Darwin Core Archives using the Integrated Publishing Toolkit (IPT) to VertNet, the Global Biodiversity Information Facility (GBIF), Integrated Digitized Biocollections (iDigBio), Biodiversity Information Serving our Nation (BISON), Berkeley Ecoinformatics Engine, and other portals. It is important to note that these Darwin Core-based systems are neither “databases” nor content management systems.
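As a rough illustration of what "summary data" means when sharing through Darwin Core, the sketch below flattens a record into a row of Darwin Core terms. The Darwin Core term names (`occurrenceID`, `catalogNumber`, `scientificName`, `decimalLatitude`, `decimalLongitude`) are from the standard; the local field names on the left of the mapping and the sample record are assumptions, and a real Darwin Core Archive also carries descriptor and metadata files that this sketch does not produce.

```python
import csv
import io

# Hypothetical local field names mapped to standard Darwin Core terms.
DWC_FIELDS = {
    "guid": "occurrenceID",
    "cat_num": "catalogNumber",
    "scientific_name": "scientificName",
    "dec_lat": "decimalLatitude",
    "dec_long": "decimalLongitude",
}


def to_darwin_core(record):
    """Keep only mapped fields; anything without a Darwin Core term is dropped."""
    return {dwc: record.get(local, "") for local, dwc in DWC_FIELDS.items()}


def write_occurrences(records):
    """Serialize records as a flat CSV table with Darwin Core column headers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(DWC_FIELDS.values()), lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(to_darwin_core(record))
    return buffer.getvalue()


# Illustrative sample record only; not a real Arctos specimen.
sample = {
    "guid": "EX:Mamm:1",
    "cat_num": "1",
    "scientific_name": "Canis lupus",
    "dec_lat": 64.8,
    "dec_long": -147.7,
    "loan_history": "not representable in this flat row",
}
print(write_occurrences([sample]))
```

The dropped `loan_history` field is the point of the example: a flat exchange format makes records discoverable in aggregators, while the relational detail stays in the source system.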
Arctos and iDigBio
iDigBio is tasked with digitizing U.S. holdings of specimen data and making those data available in a common portal using the Darwin Core standard. Hardware and software development are explicitly excluded from iDigBio’s mission (https://www.idigbio.org/about/project-scope).
Arctos contains data digitized through iDigBio projects, and in turn provides those data to the iDigBio portal through the Integrated Publishing Toolkit. The iDigBio portal is a way of exploring data from a wide variety of sources, including Arctos, through a common access point. iDigBio helps make Arctos data, and the specimens represented by Arctos data, more visible to the biodiversity community and more available for research. The relationship between Arctos and iDigBio is mutually beneficial and noncompetitive.
Additionally, tools developed in Arctos, such as photo-based digitization, help meet iDigBio’s goals by providing exceptionally rapid first-pass digitization capabilities (projects funded by the National Science Foundation: NSF PLR-1023407, NSF EF-1115056, NSF DBI-1057426). These tools are available to any Arctos participant.