Join two tables based on fuzzy string matching of their columns

Join two tables based on fuzzy string matching of their columns. This is useful, for example, in matching free-form inputs in a survey or online form, where it can catch misspellings and small personal changes.

stringdist_join(
  x,
  y,
  by = NULL,
  max_dist = 2,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  mode = "inner",
  ignore_case = FALSE,
  distance_col = NULL,
  ...
)

stringdist_inner_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_left_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_right_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_full_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_semi_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_anti_join(x, y, by = NULL, distance_col = NULL, ...)

Arguments

x	A tbl
y	A tbl
by	Columns by which to join the two tables
max_dist	Maximum distance to use for joining
method	Method for computing string distance, see `stringdist-metrics` in the stringdist package.
mode	One of "inner", "left", "right", "full" "semi", or "anti"
ignore_case	Whether to be case insensitive (default yes)
distance_col	If given, will add a column with this name containing the difference between the two
...	Arguments passed on to `stringdist`

Details

If method = "soundex", the max_dist is automatically set to 0.5, since soundex returns either a 0 (match) or a 1 (no match).

Examples


library(dplyr)
library(ggplot2)
data(diamonds)

d <- data_frame(approximate_name = c("Idea", "Premiums", "Premioom",
                                     "VeryGood", "VeryGood", "Faiir"),
                type = 1:6)

# no matches when they are inner-joined:
diamonds %>%
  inner_join(d, by = c(cut = "approximate_name"))
#> # A tibble: 0 x 11
#> # … with 11 variables: carat <dbl>, cut <chr>, color <ord>, clarity <ord>,
#> #   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>,
#> #   type <int>

# but we can match when they're fuzzy joined
diamonds %>%
 stringdist_inner_join(d, by = c(cut = "approximate_name"))
#> # A tibble: 74,907 x 12
#>    carat cut   color clarity depth table price     x     y     z
#>    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2 0.21  Prem… E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3 0.21  Prem… E     SI1      59.8    61   326  3.89  3.84  2.31
#>  4 0.290 Prem… I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5 0.290 Prem… I     VS2      62.4    58   334  4.2   4.23  2.63
#>  6 0.24  Very… J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7 0.24  Very… J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  8 0.24  Very… I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  9 0.24  Very… I     VVS1     62.3    57   336  3.95  3.98  2.47
#> 10 0.26  Very… H     SI1      61.9    55   337  4.07  4.11  2.53
#> # … with 74,897 more rows, and 2 more variables: approximate_name <chr>,
#> #   type <int>