Lookup Tables

This pages serves as an example of one approach to cleaning specimen data from a pre-existing source, here Excel, for import to Arctos.

Data are typically stored in a single table/spreadsheet, with a single assertive column, e.g., “Sex.”

That will map to an Attribute “sex” in Arctos. That value is controlled, and Arctos won’t accept things that aren’t in the appropriate code table. We need to standardize the data.

We first run a DISTINCT query

which returns distinct values of sex into a table, which we can then export as CSV. The query returns 29 rows, including NULL.

Somewhere in the data, sex is given as “SO” (which looks like the old Skin Only code – data have probably been shifted at some point, a common occurence). I can’t tell (and it doesn’t matter) if “SO” is used once or eleventy-bajillion times – this is why it’s generally the same amount of work to clean up 1 record or a million, and it’s (partially) why we store the data relationally.

We can add count to the query to satisfy our curiosity

The Arctos code table – the list of acceptable values for attribute “sex” and (hopefully) their definitions is

http://arctos.database.museum/info/ctDocumentation.cfm?table=CTSEX_CDE

or collection-type-specific values, when appropriate, may be found at e.g.,

http://arctos.database.museum/info/ctDocumentation.cfm?table=CTSEX_CDE&coln=Bird

The value “M” in these data needs mapped to the Arctos value “male,” which w do by adding a “shouldBe” column to the table of distinct values, and populating it with values from the Arctos Code Table.

For initial import, where legacy data have been stored as text-string assertions, that is usually sufficient. Other necessary values are, with the guidance of collection’s personnel, simply defaulted in from existing data – determiner=collector, determined date=collecting date, etc. However, Arctos Attributes can do more. Attributes are structured as

or in the bulkloader, from where data will be loaded to Arctos,

Using that, rather than the simple lookup table described above, we could create the Attributes column set, or parts thereof, from the given values

The returned lookup table is then used to update specimen data and/or populate the bulkloader structure. This is often done by adding columns to the original data export, then using the lookup table to form an update query. Whatever technique is used, a join is made between the specimens and the lookup table, so it is critically important to never update the extracted values.