In a tidy dataset, observations are organised into rows, and they
have an identifier. The base R data.frame comes with row.names(). The
modernised tibble allows the easy addition of the row identifier (as a
primary key to the dataset’s tabular form). The data.table makes
indexing more efficient. The ts time-series object or its more modern
tsibble form adds clear identifiers for the time dimensions of the
data.
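As a minimal sketch in base R only, the identifiers carried by a data.frame and by a ts object can be inspected as follows. The monthly series is made-up illustration data and is not part of the original text.

```r
# Row identifiers of a base R data.frame: by default they are the
# character forms of the row positions.
head(row.names(iris), 3)
#> [1] "1" "2" "3"

# A made-up monthly series (assumption for illustration only).
monthly <- ts(c(10, 12, 11, 13), start = c(2024, 1), frequency = 12)

# A ts object identifies its observations by time rather than by row name.
start(monthly)      # first period: year 2024, month 1
frequency(monthly)  # 12 observations per year
time(monthly)       # the time index of every observation
```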
The usability of a dataset increases if we can easily and
unambiguously add new observations with rbind(...) or a similar
function. Without logical or semantic confusion, this is only possible
if the new observations use the same identifier scheme as the existing
ones. Ultimately, a dataset is most usable if its identifiers are
global, persistent and unique, enabling data linking via the web.
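The point about unambiguous binding can be sketched in base R. The helper bind_observations() and the two small data frames below are hypothetical and only illustrate the idea: new rows are appended only when none of their identifiers already exists.

```r
# Hypothetical example data: existing and incoming observations that share
# one identifier scheme (the "eg:o" prefix plus a running number).
existing <- data.frame(value = c(5.1, 4.9), row.names = c("eg:o1", "eg:o2"))
incoming <- data.frame(value = c(4.7, 4.6), row.names = c("eg:o3", "eg:o4"))

# Hypothetical helper (not part of any package): refuse to bind when an
# observation identifier would occur twice.
bind_observations <- function(x, y) {
  clash <- intersect(row.names(x), row.names(y))
  if (length(clash) > 0) {
    stop("Duplicate observation identifiers: ", paste(clash, collapse = ", "))
  }
  rbind(x, y)
}

combined <- bind_observations(existing, incoming)
row.names(combined)
#> [1] "eg:o1" "eg:o2" "eg:o3" "eg:o4"
```

Calling bind_observations(existing, existing) stops with an error instead of silently producing ambiguous rows.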
Recall the first observation from Example 9 of the RDF Data Cube
Vocabulary definition. The eg: abbreviation is shorthand for a Uniform
Resource Identifier (URI) or Internationalized Resource Identifier
(IRI) in the https://example.com/ domain. Normally, this observation
identifier should resolve to a globally unique identifier that is
available on the World Wide Web in human- and machine-readable form
(with an optional description).
# Example 9 of the RDF Data Cube Vocabulary definition
eg:o1 a qb:Observation;
    qb:dataSet                 eg:dataset-le1 ;
    eg:refArea                 ex-geo:newport_00pr ;
    eg:refPeriod               <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
    sdmx-dimension:sex         sdmx-code:sex-M ;
    sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
    eg:lifeExpectancy          76.7 ;
    .
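How a prefixed name such as eg:o1 expands to a full identifier can be sketched in a few lines of R. The function expand_curie() and the prefix table are hypothetical; the namespace assigned to eg: simply follows the domain named in the text above.

```r
# Hypothetical prefix table: maps each shorthand to its namespace.
prefixes <- c(eg = "https://example.com/")

# Hypothetical helper: replace the prefix before the first colon with the
# namespace it stands for.
expand_curie <- function(curie, prefixes) {
  prefix <- sub(":.*$", "", curie)     # text before the first colon
  local  <- sub("^[^:]*:", "", curie)  # text after the first colon
  unknown <- setdiff(prefix, names(prefixes))
  if (length(unknown) > 0) {
    stop("Unknown prefix: ", paste(unknown, collapse = ", "))
  }
  paste0(unname(prefixes[prefix]), local)
}

expand_curie("eg:o1", prefixes)
#> [1] "https://example.com/o1"
```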
To go back to the iris dataset: the observation (row) identifiers are
certainly not globally unique, because the row identifiers 1, 2, …, 150
are present by default in any R data frame with 150 rows. They would
become unique if the eg: shorthand resolved to a globally unique
identifier.
The simplest way to create such a unique identifier is to derive the
root of the observation (row) identifier from a globally unique
identifier of the dataset.
data("iris")
eg_iris <- iris
# Prefix the default row identifiers with the eg:o shorthand
row.names(eg_iris) <- paste0("eg:o", row.names(iris))
head(eg_iris)
#>       Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> eg:o1          5.1         3.5          1.4         0.2  setosa
#> eg:o2          4.9         3.0          1.4         0.2  setosa
#> eg:o3          4.7         3.2          1.3         0.2  setosa
#> eg:o4          4.6         3.1          1.5         0.2  setosa
#> eg:o5          5.0         3.6          1.4         0.2  setosa
#> eg:o6          5.4         3.9          1.7         0.4  setosa
We deposited a copy of the (semantically enhanced) version of the
famous iris dataset in the Zenodo data repository, which promises many
decades of availability for this copy, under the digital object
identifier (DOI) 10.5281/zenodo.10396807. The DOI is designed so that
the https://doi.org/10.5281/zenodo.10396807 URI, used as a URL (uniform
resource locator), resolves to the actual location of the dataset as a
resource.
# Derive the row identifiers from the DOI-based URI of the dataset
row.names(eg_iris) <- paste0("https://doi.org/10.5281/zenodo.10396807:o", row.names(iris))
head(eg_iris)
#>                                            Sepal.Length Sepal.Width
#> https://doi.org/10.5281/zenodo.10396807:o1          5.1         3.5
#> https://doi.org/10.5281/zenodo.10396807:o2          4.9         3.0
#> https://doi.org/10.5281/zenodo.10396807:o3          4.7         3.2
#> https://doi.org/10.5281/zenodo.10396807:o4          4.6         3.1
#> https://doi.org/10.5281/zenodo.10396807:o5          5.0         3.6
#> https://doi.org/10.5281/zenodo.10396807:o6          5.4         3.9
#>                                            Petal.Length Petal.Width Species
#> https://doi.org/10.5281/zenodo.10396807:o1          1.4         0.2  setosa
#> https://doi.org/10.5281/zenodo.10396807:o2          1.4         0.2  setosa
#> https://doi.org/10.5281/zenodo.10396807:o3          1.3         0.2  setosa
#> https://doi.org/10.5281/zenodo.10396807:o4          1.5         0.2  setosa
#> https://doi.org/10.5281/zenodo.10396807:o5          1.4         0.2  setosa
#> https://doi.org/10.5281/zenodo.10396807:o6          1.7         0.4  setosa
Replacing the eg: shorthand (prefix) with
https://doi.org/10.5281/zenodo.10396807 uniquely identifies each
observation (row) in our semantically enriched version of the iris
dataset.
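The same step can be wrapped in a small base R function. This is only a sketch: with_observation_ids() and the rowid column name are hypothetical and are not the API of any package. It stores the identifiers in an ordinary column, which tends to survive export better than row names, and checks that they are unique.

```r
doi <- "10.5281/zenodo.10396807"
dataset_uri <- paste0("https://doi.org/", doi)

# Hypothetical helper: derive one identifier per row from the dataset URI
# and store it in a column named rowid (the column name is an assumption).
with_observation_ids <- function(df, dataset_uri, sep = ":o") {
  ids <- paste0(dataset_uri, sep, seq_len(nrow(df)))
  stopifnot(anyDuplicated(ids) == 0)  # identifiers must be unique
  data.frame(rowid = ids, df, stringsAsFactors = FALSE)
}

iris_ids <- with_observation_ids(iris, dataset_uri)
iris_ids$rowid[1]
#> [1] "https://doi.org/10.5281/zenodo.10396807:o1"
```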
The dataset package aims to add this functionality to R data frames so that they can be serialised into formats used on the semantic web (web of data). The addition can already be made with version 0.2.9; we will develop more helper functions after user feedback.