Sample results are measurement results (i.e., numeric) associated with an ex-situ procedure performed on a physical specimen collected at a site in the field. This is represented in the datbase as:
Results are associated with the “feature action” of analyzing the sample. Multiple variables can be associated with the same procedure in the lab, such as multiple peak values from the GC. Each variable is described as a separate result with its own units and missing data value.
See the bottom of this page for a diagram.
Prepare data
Dataset to upload must have columns for:
- Timestamp formatted as: YYYY-MM-DD HH:MM:SS
- Site
- Sample ID
- One column for each variable.
Timestamp will be used for the time the sample was collected. If you want to specify the time the sample was analyzed in the lab, use a column named “Timestamp_analysis” (otherwise this defaults to the current time). Make sure the time is formatted as YYYY-MM-DD HH:MM:SS. If you need to specify a time zone other than the current one, use a POSIXct data type with the correct time zone specified. To convert from any other (non-ambiguous) date format, you can use the strptime()
functon.
Example input data set:
soil_ph <- data.frame(
Timestamp = "2018-02-05 23:59:59",
Site = "QB SC-A",
Sample = "CEM 2",
pH = 4.5,
stringsAsFactors = FALSE
)
2018-02-05 23:59:59 |
QB SC-A |
CEM 2 |
4.5 |
Click here to see a list of the site codes currently in the database.
Variables and units
Metadata about variables needs to be stored in a list with the format:
list("Variable1 Name" = list(column = "ColumnName", units = "Units Name"),
"Variable2 Name" = list(column = "ColumnName", units = "Units Name"))
vars_list = list("pH" = list(column = "pH", units = "pH"))
If the column name is the exact name of the variable from the controlled vocabulary (i.e. with spaces and correct capitalization), column does not need to be specified. If there are columns in your data that have data quality codes or censor codes, those should be specified here as qualitycodecol and censorcodecol respectively. You can also specify a variabletypecv, variabledefinition and nodatavalue for variables that haven’t been entered to the database before.
Variable names and units names need to be from the controlled vocabularies. To see these options, use the get_cv_terms()
function (requires having the rodm2
package loaded, see below).
get_cv_terms("variablename")
get_cv_terms("units")
Connect
Load the rodm2
package, and connect to the database. Or for testing or other purposes you can make a new sqlite database. In this example the database connection object is db
.
devtools::install_github("khondula/rodm2")
library(rodm2)
db <- create_sqlite(connect = TRUE)
Insert data
Use the db_insert_results_samples()
function from the rodm2 package. You must supply the database connection, data frame, variables list, code names for the methods used in the field and lab, and the sampled medium from the controlled vocabulary (e.g. Air, Soil, etc.). Sampled medium choices can be viewed using get_cv_terms("medium")
.
db_insert_results_samples(db = db,
datavalues = soil_ph,
field_method = "test field method",
lab_method = "test lab method",
variables = vars_list,
sampledmedium = "Soil")
Raw data has been inserted into the processinglevel table.
Sample CEM 2 and associated data have been entered.
[[1]]
NULL
Arguments with defaults
Use these arguments if needed to change defaults:
- site_code_col = “Site”
- sample_code_col = “Sample”
- processinglevel = “Raw data”
- aggregationstatistic = “Unknown”
- time_aggregation_interval = list(1, “Minute”)
Optional arguments
- field_actionby: specify who led the field action
- lab_actionby: specify who led the lab action
- field_equipment_name: specify the name of the equipment used in the field
- lab_equipment_name: specify the name of equipment used in the lab
- zlocation: specify a vertical offset from the site location to associate with the results
- zlocationunits: specify the units of the vertical offset
Note: field_actionby & lab_actionby must be people who are already in the database (can be added using the db_describe_person()
function). Similarly, field_equipment_name & lab_equipment_name must be equipment already in the database (can be added using the db_describe_equipment()
function).
Query results
will need to make this a little more complex to get site collected from and lab method, but for now this proves that the data was inserted.
SELECT datavalue, res.resultid, var.variablenamecv, units.unitsname, valuedatetime, sf.samplingfeaturecode, meth.methodcode
FROM measurementresultvalues mrv
INNER JOIN results res ON mrv.resultid = res.resultid
INNER JOIN variables var ON res.variableid = var.variableid
INNER JOIN featureactions fa ON res.featureactionid = fa.featureactionid
INNER JOIN samplingfeatures sf ON sf.samplingfeatureid = fa.samplingfeatureid
INNER JOIN actions acts ON acts.actionid = fa.actionid
INNER JOIN units units ON units.unitsid = res.unitsid
INNER JOIN methods meth ON meth.methodid = acts.methodid
4.5 |
1 |
pH |
pH |
2018-02-05 23:59:59 |
CEM 1 |
test lab method |
Under the hood
This diagram shows the connections between the tables that are potentially updated when adding samples. Gray boxes are reference tables, blue boxes describe the actual data values, and yellow tables describe the sample collection and analysis.
---
title: "Insert sample results"
output: html_notebook
---


Sample results are measurement results (i.e., numeric) associated with an ex-situ procedure performed on a physical specimen collected at a site in the field. This is represented in the datbase as:

* a **sample collection action**, specifying how (method) and when (Timestamp) the sample (sample code) was taken. For example, "Sample 001 collected on 2018-02-05 13:00:00 using the 1 inch 30 cm soil core method."
* Where the sample was collected is recorded using the "was collected at" relationship type: "Sample 001 was collected at Site A".
* A **sample analysis action**, specifying how and when the results were produced, i.e. how and when the sample was analyzed in the lab. 

Results are associated with the "feature action" of analyzing the sample. Multiple variables can be associated with the same procedure in the lab, such as multiple peak values from the GC. Each variable is described as a separate result with its own units and missing data value. 

See the bottom of this page for a diagram.

# Prepare data

Dataset to upload must have columns for:

* Timestamp formatted as: YYYY-MM-DD HH:MM:SS
* Site
* Sample ID
* One column for each variable. 

Timestamp will be used for the time the sample was collected. If you want to specify the time the sample was analyzed in the lab, use a column named "Timestamp_analysis" (otherwise this defaults to the current time). Make sure the time is formatted as YYYY-MM-DD HH:MM:SS. If you need to specify a time zone other than the current one, use a POSIXct data type with the correct time zone specified. To convert from any other (non-ambiguous) date format, you can use the `strptime()` functon. 

Example input data set:

```{r}
soil_ph <- data.frame(
  Timestamp = "2018-02-05 23:59:59",
  Site = "QB SC-A",
  Sample = "CEM 2",
  pH = 4.5,
  stringsAsFactors = FALSE
)
```

```{r echo=FALSE, results='asis'}
knitr::kable(soil_ph)
```
Click [here](https://palmerlab.umd.edu/choptank-db/current-sites.nb.html) to see a list of the site codes currently in the database. 

## Variables and units 

Metadata about variables needs to be stored in a list with the format:

```
list("Variable1 Name" = list(column = "ColumnName", units = "Units Name"),
     "Variable2 Name" = list(column = "ColumnName", units = "Units Name"))
```

```{r}
vars_list = list("pH" = list(column = "pH", units = "pH"))
```

If the column name is the exact name of the variable from the controlled vocabulary (i.e. with spaces and correct capitalization), column does not need to be specified. If there are columns in your data that have data quality codes or censor codes, those should be specified here as  qualitycodecol and censorcodecol respectively. You can also specify a variabletypecv, variabledefinition and nodatavalue for variables that haven’t been entered to the database before.

Variable names and units names need to be from the controlled vocabularies. To see these options, use the `get_cv_terms()` function (requires having the `rodm2` package loaded, see below).

```{r, eval = FALSE}
get_cv_terms("variablename")
get_cv_terms("units")
```

## Connect 

Load the `rodm2` package, and connect to the database. Or for testing or other purposes you can make a new sqlite database. In this example the database connection object is `db`.

```{r, message=FALSE, comment=FALSE, warning=FALSE}
devtools::install_github("khondula/rodm2")
library(rodm2)
db <- create_sqlite(connect = TRUE)
```


## Insert data

Use the `db_insert_results_samples()` function from the rodm2 package. You must supply the database connection, data frame, variables list, code names for the methods used in the field and lab, and the sampled medium from the controlled vocabulary (e.g. Air, Soil, etc.). Sampled medium choices can be viewed using `get_cv_terms("medium")`.

```{r}
db_insert_results_samples(db = db,
                          datavalues = soil_ph,
                          field_method = "test field method",
                          lab_method = "test lab method",
                          variables = vars_list,
                          sampledmedium = "Soil")
```


**Arguments with defaults**

Use these arguments if needed to change defaults:

* site_code_col = "Site"
* sample_code_col = "Sample"
* processinglevel = "Raw data"
* aggregationstatistic = "Unknown"
* time_aggregation_interval = list(1, "Minute")

**Optional arguments**

* field_actionby: specify who led the field action
* lab_actionby: specify who led the lab action
* field_equipment_name: specify the name of the equipment used in the field
* lab_equipment_name: specify the name of equipment used in the lab
* zlocation: specify a vertical offset from the site location to associate with the results
* zlocationunits: specify the units of the vertical offset

> Note: field_actionby & lab_actionby must be people who are already in the database (can be added using the `db_describe_person()` function).
> Similarly, field_equipment_name & lab_equipment_name must be equipment already in the database (can be added using the `db_describe_equipment()` function).

## Query results

will need to make this a little more complex to get site collected from and lab method, but for now this proves that the data was inserted. 

```{sql, connection=db, output.var = "dbout"}
SELECT datavalue, res.resultid, var.variablenamecv, units.unitsname, valuedatetime, sf.samplingfeaturecode, meth.methodcode
FROM measurementresultvalues mrv
INNER JOIN results res ON mrv.resultid = res.resultid
INNER JOIN variables var ON res.variableid = var.variableid
INNER JOIN featureactions fa ON res.featureactionid = fa.featureactionid
INNER JOIN samplingfeatures sf ON sf.samplingfeatureid = fa.samplingfeatureid
INNER JOIN actions acts ON acts.actionid = fa.actionid
INNER JOIN units units ON units.unitsid = res.unitsid
INNER JOIN methods meth ON meth.methodid = acts.methodid
```

```{r, echo=FALSE, results='asis'}
dbout %>% knitr::kable()
```

## Under the hood

This diagram shows the connections between the tables that are potentially updated when adding samples. Gray boxes are reference tables, blue boxes describe the actual data values, and yellow tables describe the sample collection and analysis. 

![](img/samples.png)
