Exploring {dm} Alone

database

RDBMS

data management

TidyTuesday

Using data from Alone to test drive the {dm} package as part of week 4 of TidyTuesday

Author

James H Wade

Published

January 25, 2023

Working with relational data tables doesn’t sound like the most exciting topic, but it’s one I always could do better in my data science projects. Kirill Müller drew in quite an audience for his overview of the {dm} package at his rstudio::conf talk in 2022. {dm} is designed to bridge the gap between individual data frames and relational databases, making it a powerful tool for anyone working with large or complex datasets.

{dm} provides a consistent set of verbs for consuming, creating, and deploying relational data models. It makes working with data a lot easier by capturing a relational data models constructed from local data frames or “lazy tables” connected to an RDBMS (Relational Database Management System). With {dm} you can use {dplyr} data manipulation verbs, along with additional methods for constructing and verifying relational data models, including key selection, key creation, and rigorous constraint checking.

One of the most powerful features of {dm} is its ability to scale from datasets that fit in memory to databases with billions of rows. This means that even if your dataset is too large to fit in memory, you can still use {dm} to work with it efficiently.

Creating `dm` from Dataframes

In this tutorial, we will use the {alone} package, part of week for of #TidyTuesday. Our first step is to convert data in the {alone} package into a dm object.

library(dm)


Attaching package: 'dm'

The following object is masked from 'package:stats':

    filter

library(alone)

alone_no_keys <- dm(episodes, loadouts, seasons, survivalists)
alone_no_keys

── Metadata ────────────────────────────────────────────────────────────────────
Tables: `episodes`, `loadouts`, `seasons`, `survivalists`
Columns: 41
Primary keys: 0
Foreign keys: 0

Primary Keys

In practice, we should always inspect our data to ensure we are joining data in a sensible manner, but dm_enum_pk_candidates() can suggest a primary key for us.

dm_enum_pk_candidates(
  dm = alone_no_keys,
  table = episodes
)

# A tibble: 11 × 3
   columns                candidate why                                         
   <keys>                 <lgl>     <chr>                                       
 1 episode_number_overall TRUE      ""                                          
 2 title                  TRUE      ""                                          
 3 version                FALSE     "has duplicate values: US (98)"             
 4 season                 FALSE     "has duplicate values: 2 (13), 1 (11), 6 (1…
 5 episode                FALSE     "has duplicate values: 1 (9), 2 (9), 3 (9),…
 6 air_date               FALSE     "has duplicate values: 2016-07-14 (2)"      
 7 viewers                FALSE     "has 15 missing values, and duplicate value…
 8 quote                  FALSE     "has duplicate values: In nature there are …
 9 author                 FALSE     "has duplicate values: John Muir (4), Ameli…
10 imdb_rating            FALSE     "has duplicate values: 7.7 (16), 7.6 (9), 7…
11 n_ratings              FALSE     "has 5 missing values, and duplicate values…

episode_number and title are the two candidates for primary keys. We can look at the other tables, as well.

dm_enum_pk_candidates(
  dm = alone_no_keys,
  table = loadouts
)

# A tibble: 6 × 3
  columns       candidate why                                                   
  <keys>        <lgl>     <chr>                                                 
1 version       FALSE     has duplicate values: US (940)                        
2 season        FALSE     has duplicate values: 4 (140), 1 (100), 2 (100), 3 (1…
3 name          FALSE     has duplicate values: Brad Richardson (20), Britt Aha…
4 item_number   FALSE     has duplicate values: 1 (94), 2 (94), 3 (94), 4 (94),…
5 item_detailed FALSE     has duplicate values: Ferro rod (66), Sleeping bag (6…
6 item          FALSE     has duplicate values: Pot (92), Fishing gear (90), Sl…

dm_enum_pk_candidates(
  dm = alone_no_keys,
  table = seasons
)

# A tibble: 8 × 3
  columns       candidate why                                                   
  <keys>        <lgl>     <chr>                                                 
1 season        TRUE      ""                                                    
2 version       FALSE     "has duplicate values: US (9)"                        
3 location      FALSE     "has duplicate values: Quatsino (3), Great Slave Lake…
4 country       FALSE     "has duplicate values: Canada (7)"                    
5 n_survivors   FALSE     "has duplicate values: 10 (8)"                        
6 lat           FALSE     "has duplicate values: 50.72444 (3), 61.50028 (2)"    
7 lon           FALSE     "has duplicate values: -127.4981 (3), -114.0011 (2)"  
8 date_drop_off FALSE     "has 6 missing values"

dm_enum_pk_candidates(
  dm = alone_no_keys,
  table = survivalists
)

# A tibble: 16 × 3
   columns             candidate why                                            
   <keys>              <lgl>     <chr>                                          
 1 season              FALSE     has duplicate values: 4 (14), 1 (10), 2 (10), …
 2 name                FALSE     has duplicate values: Brad Richardson (2), Bri…
 3 age                 FALSE     has duplicate values: 31 (7), 40 (6), 44 (6), …
 4 gender              FALSE     has duplicate values: Male (74), Female (20)   
 5 city                FALSE     has duplicate values: Fox (3), Fox Lake (3), S…
 6 state               FALSE     has duplicate values: Alaska (11), Maine (6), …
 7 country             FALSE     has duplicate values: United States (79), Cana…
 8 result              FALSE     has duplicate values: 1 (10), 2 (10), 3 (10), …
 9 days_lasted         FALSE     has duplicate values: 8 (4), 1 (3), 2 (3), 4 (…
10 medically_evacuated FALSE     has duplicate values: FALSE (69), TRUE (25)    
11 reason_tapped_out   FALSE     has 10 missing values, and duplicate values: S…
12 reason_category     FALSE     has 10 missing values, and duplicate values: M…
13 team                FALSE     has 80 missing values, and duplicate values: B…
14 day_linked_up       FALSE     has 86 missing values, and duplicate values: 9…
15 profession          FALSE     has duplicate values: Carpenter (4), Blacksmit…
16 url                 FALSE     has duplicate values: alex-and-logan-ribar (2)…

{loadouts} and {survivalists} lack a single column that serves as a primary key, but we can use a column tuple (i.e., multiple columns) to make a primary key.

alone_only_pks <-
  alone_no_keys %>%
  dm_add_pk(table = episodes, columns = episode_number_overall) %>%
  dm_add_pk(loadouts, c(version, season, name, item_number)) %>%
  dm_add_pk(seasons, season) %>%
  dm_add_pk(survivalists, c(season, name))

alone_only_pks

── Metadata ────────────────────────────────────────────────────────────────────
Tables: `episodes`, `loadouts`, `seasons`, `survivalists`
Columns: 41
Primary keys: 4
Foreign keys: 0

Foreign Keys

To create the relationships between tables, we need to identify foreign keys. We can use the same approach as we did with primary keys, this time with dm_enum_fk_candidates().

dm_enum_fk_candidates(
  dm = alone_only_pks,
  table = episodes,
  ref_table = seasons
)

# A tibble: 11 × 3
   columns                candidate why                                         
   <keys>                 <lgl>     <chr>                                       
 1 season                 TRUE      ""                                          
 2 version                FALSE     "Can't combine `value1` <character> and `va…
 3 episode_number_overall FALSE     "values of `episodes$episode_number_overall…
 4 episode                FALSE     "values of `episodes$episode` not in `seaso…
 5 title                  FALSE     "Can't combine `value1` <character> and `va…
 6 air_date               FALSE     "Can't combine `value1` <character> and `va…
 7 viewers                FALSE     "values of `episodes$viewers` not in `seaso…
 8 quote                  FALSE     "Can't combine `value1` <character> and `va…
 9 author                 FALSE     "Can't combine `value1` <character> and `va…
10 imdb_rating            FALSE     "values of `episodes$imdb_rating` not in `s…
11 n_ratings              FALSE     "values of `episodes$n_ratings` not in `sea…

dm_add_fk() is the same as dm_add_pk() except for foreign keys.

alone_with_keys <-
  alone_only_pks |>
  dm_add_fk(table = episodes, columns = season, ref_table = seasons) |>
  dm_add_fk(loadouts, c(name, season), survivalists) |>
  dm_add_fk(loadouts, season, seasons) |>
  dm_add_fk(survivalists, season, seasons)

alone_with_keys

── Metadata ────────────────────────────────────────────────────────────────────
Tables: `episodes`, `loadouts`, `seasons`, `survivalists`
Columns: 41
Primary keys: 4
Foreign keys: 4

Visualizing Relationships

Two powerful features included with {dm} are relational table visualization and integrity checks. dm_draw() gives us a visualization to inspect the relationships between tables and the keys used to do so.

rlang::check_installed("DiagrammeR")
dm_draw(alone_with_keys)

Integrity Checks

dm_examine_constraints(alone_no_keys)

ℹ No constraints defined.

dm_examine_constraints(alone_only_pks)

ℹ All constraints satisfied.

dm_examine_constraints(alone_with_keys)

! Unsatisfied constraints:

• Table `loadouts`: foreign key `name`, `season` into table `survivalists`: Can't combine `value1` <character> and `value1` <double>.

We can see that there is an issue with constraints for joining loadouts and survivalists.

Table Flattening

In my own projects, the power of a well organized tidy data structure is most evident when I join tidy tables to answer a particular question about the project. The joined table can sometimes be the most valuable step in the product. From these joins, I can usually build visuals and summaries that becomes the most visible artifact of the product.

dm_flatten_to_tbl() uses a table of our choosing as starting point and produces a wide table that brings in information from our other tables. Importantly, columns with the same name but no relationship (i.e., they are not primary <-> foreign keys) are disambiguated. This seems like it could explode for more complicated data structures, but four tables seems manageable.

flat_survivors <- dm_flatten_to_tbl(alone_with_keys, .start = survivalists)

Renaming ambiguous columns: %>%
  dm_rename(survivalists, country.survivalists = country) %>%
  dm_rename(seasons, country.seasons = country)

flat_survivors

# A tibble: 94 × 23
   season name     age gender city  state count…¹ result days_…² medic…³ reaso…⁴
    <dbl> <chr>  <dbl> <chr>  <chr> <chr> <chr>    <dbl>   <dbl> <lgl>   <chr>  
 1      1 Alan …    40 Male   Blai… Geor… United…      1      56 FALSE   <NA>   
 2      1 Sam L…    22 Male   Linc… Nebr… United…      2      55 FALSE   Lost t…
 3      1 Mitch…    34 Male   Bell… Mass… United…      3      43 FALSE   Realiz…
 4      1 Lucas…    32 Male   Quas… Iowa  United…      4      39 FALSE   Felt c…
 5      1 Dusti…    37 Male   Pitt… Penn… United…      5       8 FALSE   Fear o…
 6      1 Brant…    44 Male   Albe… Nort… United…      6       6 FALSE   Consum…
 7      1 Wayne…    46 Male   Sain… New … Canada       7       4 FALSE   Fear o…
 8      1 Joe R…    24 Male   Wind… Onta… Canada       8       4 FALSE   Loss o…
 9      1 Chris…    41 Male   Umat… Flor… United…      9       1 FALSE   Fear o…
10      1 Josh …    31 Male   Jack… Ohio  United…     10       0 FALSE   Fear o…
# … with 84 more rows, 12 more variables: reason_category <chr>, team <chr>,
#   day_linked_up <dbl>, profession <chr>, url <chr>, version <chr>,
#   location <chr>, country.seasons <chr>, n_survivors <dbl>, lat <dbl>,
#   lon <dbl>, date_drop_off <chr>, and abbreviated variable names
#   ¹country.survivalists, ²days_lasted, ³medically_evacuated,
#   ⁴reason_tapped_out

The renaming of ambiguous columns is important in this case since seasons$country refers the location of the show and survivalists$country refers to the nationality of the survivalist.

A simple plot

Usign flattened data, we can make a simple plot of days lasted versus age colored by country.

library(ggplot2)
flat_survivors |>
  ggplot(aes(x = age, y = days_lasted, color = country.survivalists)) +
  geom_point(size = 2, alpha = 0.7) +
  theme_minimal() +
  labs(
    x = "Survivalist Age", y = "Days Alone", color = "Nationality",
    title = "Days on Alone vs Survivalist Age",
    subtitle = "Color Indicates Nationality"
  ) +
  theme(plot.title.position = "plot")

In a future post, I’d like to explore going from files to dm to a database.

Creating dm from Dataframes