R/stringdist_join.R
stringdist_join.Rd
Join two tables based on fuzzy string matching of their columns. This is useful, for example, in matching free-form inputs in a survey or online form, where it can catch misspellings and small personal changes.
stringdist_join( x, y, by = NULL, max_dist = 2, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), mode = "inner", ignore_case = FALSE, distance_col = NULL, ... ) stringdist_inner_join(x, y, by = NULL, distance_col = NULL, ...) stringdist_left_join(x, y, by = NULL, distance_col = NULL, ...) stringdist_right_join(x, y, by = NULL, distance_col = NULL, ...) stringdist_full_join(x, y, by = NULL, distance_col = NULL, ...) stringdist_semi_join(x, y, by = NULL, distance_col = NULL, ...) stringdist_anti_join(x, y, by = NULL, distance_col = NULL, ...)
x | A tbl |
---|---|
y | A tbl |
by | Columns by which to join the two tables |
max_dist | Maximum distance to use for joining |
method | Method for computing string distance, see
|
mode | One of "inner", "left", "right", "full" "semi", or "anti" |
ignore_case | Whether to be case insensitive (default yes) |
distance_col | If given, will add a column with this name containing the difference between the two |
... | Arguments passed on to |
If method = "soundex"
, the max_dist
is
automatically set to 0.5, since soundex returns either a 0 (match)
or a 1 (no match).
library(dplyr) library(ggplot2) data(diamonds) d <- data_frame(approximate_name = c("Idea", "Premiums", "Premioom", "VeryGood", "VeryGood", "Faiir"), type = 1:6) # no matches when they are inner-joined: diamonds %>% inner_join(d, by = c(cut = "approximate_name"))#> # A tibble: 0 x 11 #> # … with 11 variables: carat <dbl>, cut <chr>, color <ord>, clarity <ord>, #> # depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>, #> # type <int># but we can match when they're fuzzy joined diamonds %>% stringdist_inner_join(d, by = c(cut = "approximate_name"))#> # A tibble: 74,907 x 12 #> carat cut color clarity depth table price x y z #> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> #> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 #> 2 0.21 Prem… E SI1 59.8 61 326 3.89 3.84 2.31 #> 3 0.21 Prem… E SI1 59.8 61 326 3.89 3.84 2.31 #> 4 0.290 Prem… I VS2 62.4 58 334 4.2 4.23 2.63 #> 5 0.290 Prem… I VS2 62.4 58 334 4.2 4.23 2.63 #> 6 0.24 Very… J VVS2 62.8 57 336 3.94 3.96 2.48 #> 7 0.24 Very… J VVS2 62.8 57 336 3.94 3.96 2.48 #> 8 0.24 Very… I VVS1 62.3 57 336 3.95 3.98 2.47 #> 9 0.24 Very… I VVS1 62.3 57 336 3.95 3.98 2.47 #> 10 0.26 Very… H SI1 61.9 55 337 4.07 4.11 2.53 #> # … with 74,897 more rows, and 2 more variables: approximate_name <chr>, #> # type <int>