Understanding Left Join and NA Issues in R: A Case Study on Handling Non-Breaking Spaces

Understanding Left Join and NA Issues in R

In this article, we will delve into the world of data manipulation using the tidyverse library in R. Specifically, we will explore an issue with left join that resulted in unexpected results, and how to resolve it.

Background on Tidy Tuesday Data

The problem is based on data from this week’s Tidy Tuesday. The data is divided into two datasets: breed_traits and breed_rank_all. Both datasets contain information about dog breeds, including their rank in 2013.

Understanding Left Join

A left join is a type of join that combines rows from two tables based on a common column. In this case, we are trying to combine the breed_traits table (which contains information about affection levels) with the breed_rank_all table (which contains rank information).

When performing a left join, R will return all rows from the first table and match them with the corresponding rows in the second table. If there is no matching row in the second table, the result will be NA.

The Problem: NAs in Left Join

In this case, we are experiencing NAs in our left join results, even though both datasets have data for each breed. To investigate further, let’s examine the breed_traits and breed_rank_all datasets:

library(tidyverse)

breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv') %>%
  select(Breed, `Affectionate With Family`)

# A tibble: 195 × 2
   Breed                         `Affectionate With Family`
   <chr>                              <dbl>
 1 Retrievers (Labrador)               5
 2 French Bulldogs                     5
 3 German Shepherd Dogs                5
 4 Retrievers (Golden)                 5
 5 Bulldogs                            4
 6 Poodles                             5
 7 Beagles                             3
 8 Rottweilers                         5
 9 Pointers (German Shorthaired)       5
10 Dachshunds                          5

breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv') %>%
  select(Breed, `Rank 2013`)

# A tibble: 195 × 2
   Breed                         `2013 Rank`
   <chr>                           <dbl>
 1 Retrievers (Labrador)             1
 2 French Bulldogs                  11
 3 German Shepherd Dogs             2
 4 Retrievers (Golden)               3
 5 Bulldogs                          5
 6 Poodles                           8
 7 Beagles                           4
 8 Rottweilers                       9
 9 Pointers (German Shorthaired)     13
10 Dachshunds                      10

Initial Investigation: Checking for Extra White Spaces

The user initially suspected that the issue might be due to extra white spaces in the Breed column. However, after checking, they found that there were no such issues:

# Check for extra white spaces
utf8::utf8_print(breed_traits$Breed[1], utf8 = FALSE)
# [1] "Retrievers (Labrador)"

Investigation: Using Stringi and iconv

The user then used stringi to investigate the encoding of the data. They found that the non-breaking space was not visible when using UTF-8 encoding:

# Use stringi to print character in UTF-8 format
iconv(breed_traits$Breed[1], toRaw = TRUE)
[[1]]
 [1] 52 65 74 72 69 65 76 65 72 73 c2 a0 28 4c 61 62 72 61 64 6f 72 29

However, when they used stringi to print the character in ASCII format, the non-breaking space was visible:

# Use stringi to print character in ASCII format
iconv(breed_rank_all$Breed[1], toRaw = TRUE)
[[1]]
 [1] 52 65 74 72 69 65 76 65 72 73 c2 a0 28 4c 61 62 72 61 64 6f 72 29

Solution: Replacing Non-Breaking Spaces with Regex

To solve the issue, we can use regular expressions to replace non-breaking spaces in the Breed column:

# Define a function to replace non-breaking spaces
replSp <- function(x) {
  str_replace_all(string = x,
                 pattern = "[[:space:]]",
                 replacement = " ")
}

# Use the function on the Breed column of breed_traits
breed_traits <- breed_traits %>% 
  mutate(Breed = replSp(Breed))

# Verify that the replacement was successful
breed_rank_all$Breed[[1]] == replSp(breed_rank_all$Breed[[1]])
# [1] TRUE

Solution: Applying the Replacement to Both Datasets

To ensure consistency, we can apply the same replacement to both datasets:

# Define a function to replace non-breaking spaces
replSp <- function(x) {
  str_replace_all(string = x,
                 pattern = "[[:space:]]",
                 replacement = " ")
}

# Use the function on the Breed column of breed_rank_all
breed_rank_all <- breed_rank_all %>% 
  mutate(Breed = replSp(Breed))

# Verify that the replacement was successful
breed_traits$Breed == replSp(breed_traits$Breed)
# [1] TRUE

Conclusion

In this article, we explored an issue with left join in R where NAs were appearing despite both datasets having data for each breed. We investigated the use of non-breaking spaces and their impact on the data, and ultimately found that using regular expressions to replace them was the solution.

By understanding how to handle character encodings and non-breaking spaces, we can avoid similar issues in our own data manipulation tasks.

Last modified on 2023-06-26