Join two tables based on fuzzy string matching of their columns. This is useful, for example, in matching free-form inputs in a survey or online form, where it can catch misspellings and small personal changes.

stringdist_join(
  x,
  y,
  by = NULL,
  max_dist = 2,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  mode = "inner",
  ignore_case = FALSE,
  distance_col = NULL,
  ...
)

stringdist_inner_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_left_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_right_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_full_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_semi_join(x, y, by = NULL, distance_col = NULL, ...)

stringdist_anti_join(x, y, by = NULL, distance_col = NULL, ...)

Arguments

x

A tbl

y

A tbl

by

Columns by which to join the two tables

max_dist

Maximum distance to use for joining

method

Method for computing string distance, see stringdist-metrics in the stringdist package.

mode

One of "inner", "left", "right", "full" "semi", or "anti"

ignore_case

Whether to be case insensitive (default yes)

distance_col

If given, will add a column with this name containing the difference between the two

...

Arguments passed on to stringdist

Details

If method = "soundex", the max_dist is automatically set to 0.5, since soundex returns either a 0 (match) or a 1 (no match).

Examples

library(dplyr) library(ggplot2) data(diamonds) d <- data_frame(approximate_name = c("Idea", "Premiums", "Premioom", "VeryGood", "VeryGood", "Faiir"), type = 1:6) # no matches when they are inner-joined: diamonds %>% inner_join(d, by = c(cut = "approximate_name"))
#> # A tibble: 0 x 11 #> # … with 11 variables: carat <dbl>, cut <chr>, color <ord>, clarity <ord>, #> # depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>, #> # type <int>
# but we can match when they're fuzzy joined diamonds %>% stringdist_inner_join(d, by = c(cut = "approximate_name"))
#> # A tibble: 74,907 x 12 #> carat cut color clarity depth table price x y z #> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> #> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 #> 2 0.21 Prem… E SI1 59.8 61 326 3.89 3.84 2.31 #> 3 0.21 Prem… E SI1 59.8 61 326 3.89 3.84 2.31 #> 4 0.290 Prem… I VS2 62.4 58 334 4.2 4.23 2.63 #> 5 0.290 Prem… I VS2 62.4 58 334 4.2 4.23 2.63 #> 6 0.24 Very… J VVS2 62.8 57 336 3.94 3.96 2.48 #> 7 0.24 Very… J VVS2 62.8 57 336 3.94 3.96 2.48 #> 8 0.24 Very… I VVS1 62.3 57 336 3.95 3.98 2.47 #> 9 0.24 Very… I VVS1 62.3 57 336 3.95 3.98 2.47 #> 10 0.26 Very… H SI1 61.9 55 337 4.07 4.11 2.53 #> # … with 74,897 more rows, and 2 more variables: approximate_name <chr>, #> # type <int>