Arctos in the online specimen data ecosystem

Summary of Arctos

Arctos is a deeply relational, highly normalized, web-based, community-driven collections management system. “Relational” and “normalized” are means by which data are made more predictable, and predictable data are inherently discoverable. Normalization (the process of reducing redundancy and dependency) manifests itself in self-customizing, highly extensible code. For example, adding the “fields” to record information about a new type of measurement or fact (such as “clutch size of nest parasite”) requires only an entry in a code table, with no programming or structural changes. Linking identifiers (such as catalog number and locality identifier) to remote resources is similar; an entry in a code table turns all new and existing identifiers into links. These standardized, predictable data may then be used to form resolvable, reciprocal links to internal and external relationships between data objects, such as hosts to parasites or specimens to GenBank accessions. Those links in turn allow asking cross-resource questions, such as “what parasites of Canis are documented in Arctos?” Curators are not required to write code, patch operating systems, deal with failed drives, or develop backup strategies. The Arctos community provides a collaborative environment for sharing data vocabulary and standards, curating shared data (such as agents and taxonomy), forming strong links between related specimens, and guiding future development of Arctos.
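
The code-table idea described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not Arctos code: the table names, the `Specimen` class, the example.org base URL, and the sample identifier values are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical code tables: plain data that curators can extend,
# not program logic. None of these names come from Arctos itself.
ATTRIBUTE_TYPES = {"total length", "weight"}
IDENTIFIER_BASE_URLS = {}  # identifier type -> base URL


@dataclass
class Specimen:
    catalog_number: str
    attributes: dict = field(default_factory=dict)
    identifiers: dict = field(default_factory=dict)


def add_attribute(specimen, attr_type, value):
    """Record an attribute only if its type is present in the code table."""
    if attr_type not in ATTRIBUTE_TYPES:
        raise ValueError(f"unknown attribute type: {attr_type!r}")
    specimen.attributes[attr_type] = value


def render_identifier(id_type, value):
    """Return a link if the code table has a base URL, else the bare value."""
    base = IDENTIFIER_BASE_URLS.get(id_type)
    return f"{base}{value}" if base else value


# Placeholder record; the identifier value is invented for illustration.
specimen = Specimen("EX:Bird:1", identifiers={"GenBank": "XX000000"})

# Before the code-table entry exists, the identifier is plain text.
before = render_identifier("GenBank", specimen.identifiers["GenBank"])

# One new code-table entry turns new and existing identifiers into links.
IDENTIFIER_BASE_URLS["GenBank"] = "https://example.org/accession/"
after = render_identifier("GenBank", specimen.identifiers["GenBank"])

# Likewise, a new attribute type is a data entry, not a schema change.
ATTRIBUTE_TYPES.add("clutch size of nest parasite")
add_attribute(specimen, "clutch size of nest parasite", 3)
```

Neither function changes when a new attribute type or identifier type is introduced; only the tables grow, which is the sense in which the code is "self-customizing."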

The Ecosystem

As a highly normalized system, Arctos holds a practically unbounded number of data objects, and there is seldom only one path to any of them. Specimens may be located through specimen search, as citations in publications, as representatives of taxonomy, or as part of loans used in projects, for example; specimens contribute to all of those things, and all of those things are part of specimens. We find that this is often somewhat overwhelming to new users, and it is perhaps better to view Arctos as an ecosystem of components (specimens, loans, projects, citations, etc.), rather than “a database.” To extend that idea, along with “plugging into” each other, Arctos components also freely communicate with various external services. Visual georeferencing tools call GeoLocate as a service. Most locality-bearing objects (such as specimens and media) have an inline “thumbnail map,” dynamically generated via Google Enterprise tools. BerkeleyMapper provides more sophisticated mapping capabilities, including error representation and rangemap overlays. Spatial query capabilities are powered by Google Maps, and coordinates are transformed into standardized descriptive searchable text via reverse georeferencing services. Similar services check and suggest coordinates in locality editing forms. Taxonomy classification data come from both local data and GlobalNames services. GenBank is crawled nightly for new, otherwise-undocumented sequences representing Arctos specimens. The Texas Advanced Computing Center provides secure petabyte-scale storage. Google Custom Search provides text-match search of specimens, taxonomy, publications, and projects. Projects group transactions, thereby providing a self-maintaining summary of specimen usage, documenting collecting efforts, providing linkages to related projects, documenting the work of individuals, and various other curatorially defined uses. EZID provides persistent identifiers.
Adding new services as they become available is generally trivial, again thanks to strong normalization. All of these components work together to rigorously document the value of specimens and the products produced through the use of specimens, making Arctos a uniquely rich center of various specimen-related data and the tools to explore and visualize those data in novel ways.

Data Availability

All Arctos data (unless explicitly restricted by Curators) are available on the internet, as HTML and through various services as RDF, JSON, and XML. “Public” data are refreshed upon change; the public interfaces contain data less than 1 minute “stale.” Summary data are provided in the DarwinCore Standard (DWC) format through IPT, and included in projects such as VertNet and GBIF. Arctos will upon request provide DataCite DOIs for arbitrary Arctos “nodes,” including specimens and media. DOIs are persistent, globally-unique, protocol-agnostic identifiers that remain stable through URL, system, or protocol changes, making DOI-bearing data resolvable to specimen records into the foreseeable future.

Relationship between Arctos and iDigBio

iDigBio is tasked with digitizing specimen data and making the results of that digitization available through DWC in a common portal. Hardware and software development are explicitly excluded from iDigBio’s mission (https://www.idigbio.org/about/project-scope).

Arctos contains data digitized through iDigBio projects and in turn provides those data to the iDigBio portal through IPT. It is important to note that DWC is neither a “database” nor a content management system, but rather an exchange standard (Wieczorek et al. 2012). Arctos is primarily a specimen data management system. The portal provided by iDigBio is a way of exploring data from a wide variety of sources, including Arctos, through a common access point. iDigBio helps make Arctos data, and the specimens represented by Arctos data, more visible to the biodiversity community and more available for research. The relationship between Arctos and iDigBio is mutually beneficial and noncompetitive.
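
To make the distinction between a management system and an exchange standard concrete, the following Python sketch flattens one normalized specimen record into a single flat row keyed by Darwin Core term names. The input structure, the `to_darwin_core` helper, and all sample values are hypothetical and invented for illustration; only the output keys are actual Darwin Core terms, and this is not how Arctos or IPT actually performs the export.

```python
# Hypothetical normalized record: one specimen with several
# identifications and a linked locality. All values are invented.
specimen = {
    "institution": "EX",
    "collection": "Mamm",
    "catalog_number": "123",
    "identifications": [
        {"name": "Canis lupus", "accepted": False},
        {"name": "Canis latrans", "accepted": True},
    ],
    "locality": {"lat": 64.8, "lon": -147.7},
}


def to_darwin_core(rec):
    """Flatten a normalized record into one row of Darwin Core terms."""
    # Only the accepted identification fits in the flat row; the
    # identification history stays behind in the source system.
    accepted = next(i for i in rec["identifications"] if i["accepted"])
    return {
        "institutionCode": rec["institution"],
        "collectionCode": rec["collection"],
        "catalogNumber": rec["catalog_number"],
        "scientificName": accepted["name"],
        "decimalLatitude": rec["locality"]["lat"],
        "decimalLongitude": rec["locality"]["lon"],
    }


row = to_darwin_core(specimen)
```

The flat row is easy to aggregate in a common portal, but the earlier identification is not part of it, which is one reason portals built on exchanged summary data complement rather than replace the source system.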

Additionally, tools developed in Arctos (such as photo-based digitization) help meet iDigBio’s goals by providing exceptionally rapid first-pass digitization capabilities (NSF PLR-1023407, NSF EF-1115056, NSF DBI-1057426). These tools are available to any Arctos participant.

Alternatives to Arctos

Several products provide alternatives to core Arctos specimen data management functionality, and several products and services replicate some Arctos data in different environments. We know of no product reasonably comparable to Arctos in depth, scope, or connectivity.

* Specify is a locally installed (i.e., software) collection management system. Development direction is little influenced by users, and there is no capability to form resolvable links between Internet resources. Database and server administration, vocabulary management, and publication to other resources are left up to the user. As software (vs. applications such as Arctos), public access is available only through third-party services (such as GBIF) against summary data published through protocols such as IPT+DarwinCore, and therefore deep-query capability is inherently limited.

* CollectionSpace is an online collections management system that differs from Arctos in being middleware based (“database rules” are maintained outside the database, limiting the ways in which users can interact with data without compromising referential integrity), in lacking a data standards community, and in lacking normalization. We know of no CollectionSpace data served via DWC.

* KE-EMU is a commercial, locally installed collections management system built on a proprietary data store. The licensing fee of ~$1400/user/year includes user support; local technical support, hardware, administration, backup strategy development, and infrastructure maintenance are still required. Additional licenses are required for services, such as publishing DWC via DiGIR.

* Arctos software and data definition language (DDL) are freely available, and local installations are possible. One such endeavor (MCZBase) uses “forked” code, derived from but no longer contributing to Arctos. We can provide no support for such installations. All core Arctos software is managed in a collaborative environment, and less-divergent independent installations are possible.

* GBIF, VertNet, FishNet2, BISON, iDigBio, etc. repackage and rehost data from systems such as Arctos, making those data available in broader contexts than any single system can provide. These systems complement Arctos by making specimen data more discoverable, and are in no way “competitors” of or alternatives to Arctos.

* DarwinCore is an exchange standard through which Arctos publishes data to various portals; DWC and the tools built upon DWC-formatted data serve a purpose very different from that of any content management system.

* Symbiota is a DarwinCore-based “portal” intended for publishing specimen data in a common environment (much like VertNet, GBIF, etc.). Symbiota also includes “basic online support for managing specimen data” but, as with all non-relational data structures, has severe inherent limits in normalization and therefore searchability.

Data Requirements

Various “levels” of digitization are possible or desirable for various purposes. A rapid digitization project might photograph herbarium sheets and capture verbatim taxonomy, or photograph paper data and specimens while entering minimal data, such as Locality ID and geological information. At the other end of the spectrum, a specimen might have several publication-backed identifications, links to and from GenBank and BOLD, relationships to various media, be included in several projects, be linked to and from parasites and co-collected specimens in other collections, and contain user-provided annotations. Largely due to the inherent normalization of Arctos, the entire spectrum of “digitized” data may exist in (or be linked to and from) Arctos, and “upgrading” is a simple (and, for most specimens, continuous) process. “Finalized” specimen data occur only in static systems, those which cannot record usage data, or those which do not make the specimen data findable and useful.
