Using data from Alone to test drive the {dm} package as part of week 4 of TidyTuesday
Author
James H Wade
Published
January 25, 2023
Working with relational data tables doesn’t sound like the most exciting topic, but it’s one I always could do better in my data science projects. Kirill Müller drew in quite an audience for his overview of the {dm} package at his rstudio::conf talk in 2022. {dm} is designed to bridge the gap between individual data frames and relational databases, making it a powerful tool for anyone working with large or complex datasets.
{dm} provides a consistent set of verbs for consuming, creating, and deploying relational data models. It makes working with data a lot easier by capturing a relational data models constructed from local data frames or “lazy tables” connected to an RDBMS (Relational Database Management System). With {dm} you can use {dplyr} data manipulation verbs, along with additional methods for constructing and verifying relational data models, including key selection, key creation, and rigorous constraint checking.
One of the most powerful features of {dm} is its ability to scale from datasets that fit in memory to databases with billions of rows. This means that even if your dataset is too large to fit in memory, you can still use {dm} to work with it efficiently.
Creating dm from Dataframes
In this tutorial, we will use the {alone} package, part of week for of #TidyTuesday. Our first step is to convert data in the {alone} package into a dm object.
library(dm)
Attaching package: 'dm'
The following object is masked from 'package:stats':
filter
In practice, we should always inspect our data to ensure we are joining data in a sensible manner, but dm_enum_pk_candidates() can suggest a primary key for us.
# A tibble: 16 × 3
columns candidate why
<keys> <lgl> <chr>
1 season FALSE has duplicate values: 4 (14), 1 (10), 2 (10), …
2 name FALSE has duplicate values: Brad Richardson (2), Bri…
3 age FALSE has duplicate values: 31 (7), 40 (6), 44 (6), …
4 gender FALSE has duplicate values: Male (74), Female (20)
5 city FALSE has duplicate values: Fox (3), Fox Lake (3), S…
6 state FALSE has duplicate values: Alaska (11), Maine (6), …
7 country FALSE has duplicate values: United States (79), Cana…
8 result FALSE has duplicate values: 1 (10), 2 (10), 3 (10), …
9 days_lasted FALSE has duplicate values: 8 (4), 1 (3), 2 (3), 4 (…
10 medically_evacuated FALSE has duplicate values: FALSE (69), TRUE (25)
11 reason_tapped_out FALSE has 10 missing values, and duplicate values: S…
12 reason_category FALSE has 10 missing values, and duplicate values: M…
13 team FALSE has 80 missing values, and duplicate values: B…
14 day_linked_up FALSE has 86 missing values, and duplicate values: 9…
15 profession FALSE has duplicate values: Carpenter (4), Blacksmit…
16 url FALSE has duplicate values: alex-and-logan-ribar (2)…
{loadouts} and {survivalists} lack a single column that serves as a primary key, but we can use a column tuple (i.e., multiple columns) to make a primary key.
To create the relationships between tables, we need to identify foreign keys. We can use the same approach as we did with primary keys, this time with dm_enum_fk_candidates().
# A tibble: 11 × 3
columns candidate why
<keys> <lgl> <chr>
1 season TRUE ""
2 version FALSE "Can't combine `value1` <character> and `va…
3 episode_number_overall FALSE "values of `episodes$episode_number_overall…
4 episode FALSE "values of `episodes$episode` not in `seaso…
5 title FALSE "Can't combine `value1` <character> and `va…
6 air_date FALSE "Can't combine `value1` <character> and `va…
7 viewers FALSE "values of `episodes$viewers` not in `seaso…
8 quote FALSE "Can't combine `value1` <character> and `va…
9 author FALSE "Can't combine `value1` <character> and `va…
10 imdb_rating FALSE "values of `episodes$imdb_rating` not in `s…
11 n_ratings FALSE "values of `episodes$n_ratings` not in `sea…
dm_add_fk() is the same as dm_add_pk() except for foreign keys.
Two powerful features included with {dm} are relational table visualization and integrity checks. dm_draw() gives us a visualization to inspect the relationships between tables and the keys used to do so.
• Table `loadouts`: foreign key `name`, `season` into table `survivalists`: Can't combine `value1` <character> and `value1` <double>.
We can see that there is an issue with constraints for joining loadouts and survivalists.
Table Flattening
In my own projects, the power of a well organized tidy data structure is most evident when I join tidy tables to answer a particular question about the project. The joined table can sometimes be the most valuable step in the product. From these joins, I can usually build visuals and summaries that becomes the most visible artifact of the product.
dm_flatten_to_tbl() uses a table of our choosing as starting point and produces a wide table that brings in information from our other tables. Importantly, columns with the same name but no relationship (i.e., they are not primary <-> foreign keys) are disambiguated. This seems like it could explode for more complicated data structures, but four tables seems manageable.
# A tibble: 94 × 23
season name age gender city state count…¹ result days_…² medic…³ reaso…⁴
<dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <lgl> <chr>
1 1 Alan … 40 Male Blai… Geor… United… 1 56 FALSE <NA>
2 1 Sam L… 22 Male Linc… Nebr… United… 2 55 FALSE Lost t…
3 1 Mitch… 34 Male Bell… Mass… United… 3 43 FALSE Realiz…
4 1 Lucas… 32 Male Quas… Iowa United… 4 39 FALSE Felt c…
5 1 Dusti… 37 Male Pitt… Penn… United… 5 8 FALSE Fear o…
6 1 Brant… 44 Male Albe… Nort… United… 6 6 FALSE Consum…
7 1 Wayne… 46 Male Sain… New … Canada 7 4 FALSE Fear o…
8 1 Joe R… 24 Male Wind… Onta… Canada 8 4 FALSE Loss o…
9 1 Chris… 41 Male Umat… Flor… United… 9 1 FALSE Fear o…
10 1 Josh … 31 Male Jack… Ohio United… 10 0 FALSE Fear o…
# … with 84 more rows, 12 more variables: reason_category <chr>, team <chr>,
# day_linked_up <dbl>, profession <chr>, url <chr>, version <chr>,
# location <chr>, country.seasons <chr>, n_survivors <dbl>, lat <dbl>,
# lon <dbl>, date_drop_off <chr>, and abbreviated variable names
# ¹country.survivalists, ²days_lasted, ³medically_evacuated,
# ⁴reason_tapped_out
The renaming of ambiguous columns is important in this case since seasons$country refers the location of the show and survivalists$country refers to the nationality of the survivalist.
A simple plot
Usign flattened data, we can make a simple plot of days lasted versus age colored by country.
library(ggplot2)flat_survivors |>ggplot(aes(x = age, y = days_lasted, color = country.survivalists)) +geom_point(size =2, alpha =0.7) +theme_minimal() +labs(x ="Survivalist Age", y ="Days Alone", color ="Nationality",title ="Days on Alone vs Survivalist Age",subtitle ="Color Indicates Nationality" ) +theme(plot.title.position ="plot")
In a future post, I’d like to explore going from files to dm to a database.