Title: | String Similarity Joins for Hamming and Levenshtein Distances |
---|---|
Description: | This project is a tool for words edit similarity joins (a.k.a. all-pairs similarity search) under small (< 3) edit distance constraints. It works for Levenshtein/Hamming distances and words from any alphabet. The software was originally developed for joining amino-acid/nucleotide sequences from Adaptive Immune Repertoires, where the number of words is relatively large (10^5-10^6) and the average length of words is relatively small (10-100). |
Authors: | Daniil Matveev [aut, cre], Martin Leitner-Ankerl [ctb, cph], Gene Harvey [ctb, cph] |
Maintainer: | Daniil Matveev <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.0 |
Built: | 2024-11-26 05:07:35 UTC |
Source: | https://github.com/matveevdaniil/rpatternjoin |
This project is a tool for words edit similarity joins under small () edit distance constraints.
It works for Levenshtein distance and Hamming (with allowed insertions/deletions to the end) distance.
The package offers several similarity join algorithms, all of which can be accessed through the similarityJoin
function.
The software was originally developed for edit similarity joins of short amino-acid/nucleotide sequences from Adaptive Immune Repertoires,
where the number of words is relatively large () and the average length of words is relatively small (
).
The algorithms will work with any alphabet and any list of words, however, larger lists or word sizes can lead to memory issues.
Daniil Matveev <[email protected]>
similarityJoin
,
edit_dist1_example
library(RPatternJoin) ## Small example similarityJoin(c("ABC", "AX", "QQQ"), 2, "Hamming", output_format = "adj_pairs") # [,1] [,2] # [1,] 1 1 # [2,] 1 2 # [3,] 2 1 # [4,] 2 2 # [5,] 3 3 ## Larger example # The `edit_dist1_example` function generate a random list # of `num_strings` strings with the average string length=`avg_len`. strings <- edit_dist1_example(avg_len = 25, num_strings = 5000) # Firstly let's do it with `stringdist` package. library(stringdist) unname(system.time({ which(stringdist::stringdistmatrix(strings, strings, "lv") <= 1, arr.ind = TRUE) })["elapsed"]) # Runtime on macOS machine with 2.2 GHz i7 processor and 16GB of DDR4 RAM: # [1] 63.773 # Now let's do it with similarityJoin function. unname(system.time({ similarityJoin(strings, 1, "Levenshtein", output_format = "adj_pairs") })["elapsed"]) # Runtime on the same machine: # [1] 0.105
library(RPatternJoin) ## Small example similarityJoin(c("ABC", "AX", "QQQ"), 2, "Hamming", output_format = "adj_pairs") # [,1] [,2] # [1,] 1 1 # [2,] 1 2 # [3,] 2 1 # [4,] 2 2 # [5,] 3 3 ## Larger example # The `edit_dist1_example` function generate a random list # of `num_strings` strings with the average string length=`avg_len`. strings <- edit_dist1_example(avg_len = 25, num_strings = 5000) # Firstly let's do it with `stringdist` package. library(stringdist) unname(system.time({ which(stringdist::stringdistmatrix(strings, strings, "lv") <= 1, arr.ind = TRUE) })["elapsed"]) # Runtime on macOS machine with 2.2 GHz i7 processor and 16GB of DDR4 RAM: # [1] 63.773 # Now let's do it with similarityJoin function. unname(system.time({ similarityJoin(strings, 1, "Levenshtein", output_format = "adj_pairs") })["elapsed"]) # Runtime on the same machine: # [1] 0.105
This function generates a random list of num_strings = 5n
strings
such that each of n
strings has one duplicate, one string with a deleted letter,
one string with an inserted letter, and one string with a substituted letter.
edit_dist1_example(avg_len = 25, num_strings = 5000)
edit_dist1_example(avg_len = 25, num_strings = 5000)
avg_len |
Average length of the strings. |
num_strings |
Number of strings to generate. |
A character vector of generated strings.
Build Adjacency Matrix
similarityJoin( strings, cutoff, metric, method = "partition_pattern", drop_deg_one = FALSE, special_chars = TRUE, output_format = "adj_matrix" )
similarityJoin( strings, cutoff, metric, method = "partition_pattern", drop_deg_one = FALSE, special_chars = TRUE, output_format = "adj_matrix" )
strings |
Input vector of strings.
To avoid hidden errors, the function will give a warning if strings contain characters not in the English alphabet.
To disable this warning, change |
cutoff |
Cutoff: |
metric |
Edit distance type: |
method |
Method: |
drop_deg_one |
Drop isolated strings: |
special_chars |
Enable check for special characters in strings: |
output_format |
Output format: |
If output_format = adj_pairs
- 2-column matrix where
each row is a pair of indices of strings with an edit distance
cutoff
.
If output_format = adj_matrix
- the same output is presented as a sparse adjacency matrix with corresponding strings and
their indices in the original vector are stored in dimnames of the adjacency matrix.
I.e. (adj_matrix[i, j]=1
) distance between
dimnames(adj_matrix)[[1]][i]
and dimnames(adj_matrix)[[1]][i]
is
cutoff
.
If drop_deg_one
is FALSE
, then dimnames(adj_matrix)[[1]] = strings
and dimnames(adj_matrix)[[2]]=1:length(strings)
.
Otherwise, dimnames(adj_matrix)[[1]] = strings
without isolated strings and dimnames(adj_matrix)[[2]]
=original indices
of strings in dimnames(adj_matrix)[[1]]
(original = index in input strings
vector).
library(RPatternJoin) library(Matrix) ## Example 1 # Consider the following example with small similar words: strings <- c("cat", "ecast", "bat", "cats", "chat") # Let's find all pairs s.t. strings can be modified # to each other with at most 2 substitutions. # For this we choose our metric to be Hamming distance and cutoff to be 2. metric <- "Hamming" cutoff <- 2 # By default we use 'partition_pattern' method # since it is the most memory efficient. method <- "partition_pattern" # Let's output the result as an adjacency matrix. output_format <- "adj_matrix" drop_deg_one <- TRUE similarityJoin( strings, cutoff, metric, method = method, drop_deg_one = drop_deg_one) # 3 x 3 sparse Matrix of class "dgCMatrix" # cat bat cats # 1 1 1 1 # 3 1 1 1 # 4 1 1 1 ## Example 2 # On the same strings, let's calculate pairs of strings with edit distance \eqn{\leq} 1. cutoff <- 1 metric <- "Levenshtein" # Let's output the result as an adjacency matrix, but drop strings without any connections. drop_deg_one <- FALSE similarityJoin( strings, cutoff, metric, method = method, drop_deg_one = drop_deg_one) # cat ecast bat cats chat # 1 1 . 1 1 1 # 2 . 1 . . . # 3 1 . 1 . . # 4 1 . . 1 . # 5 1 . . . 1 ## Example 3 # Now let's simulate a larger example. # The `edit_dist1_example` function generate a random list # of `num_strings` strings with the average string length=`avg_len`. strings <- edit_dist1_example(avg_len = 25, num_strings = 5000) # Firstly let's do it with `stringdist` package. library(stringdist) system.time({ which(stringdist::stringdistmatrix(strings, strings, "lv") <= 1, arr.ind = TRUE) })["elapsed"] # Runtime on macOS machine with 2.2 GHz i7 processor and 16GB of DDR4 RAM: # elapsed # 63.773 # Now let's do it with similarityJoin function. system.time({ similarityJoin(strings, 1, "Levenshtein", output_format = "adj_pairs") })["elapsed"] # Runtime on the same machine: # elapsed # 0.105
library(RPatternJoin) library(Matrix) ## Example 1 # Consider the following example with small similar words: strings <- c("cat", "ecast", "bat", "cats", "chat") # Let's find all pairs s.t. strings can be modified # to each other with at most 2 substitutions. # For this we choose our metric to be Hamming distance and cutoff to be 2. metric <- "Hamming" cutoff <- 2 # By default we use 'partition_pattern' method # since it is the most memory efficient. method <- "partition_pattern" # Let's output the result as an adjacency matrix. output_format <- "adj_matrix" drop_deg_one <- TRUE similarityJoin( strings, cutoff, metric, method = method, drop_deg_one = drop_deg_one) # 3 x 3 sparse Matrix of class "dgCMatrix" # cat bat cats # 1 1 1 1 # 3 1 1 1 # 4 1 1 1 ## Example 2 # On the same strings, let's calculate pairs of strings with edit distance \eqn{\leq} 1. cutoff <- 1 metric <- "Levenshtein" # Let's output the result as an adjacency matrix, but drop strings without any connections. drop_deg_one <- FALSE similarityJoin( strings, cutoff, metric, method = method, drop_deg_one = drop_deg_one) # cat ecast bat cats chat # 1 1 . 1 1 1 # 2 . 1 . . . # 3 1 . 1 . . # 4 1 . . 1 . # 5 1 . . . 1 ## Example 3 # Now let's simulate a larger example. # The `edit_dist1_example` function generate a random list # of `num_strings` strings with the average string length=`avg_len`. strings <- edit_dist1_example(avg_len = 25, num_strings = 5000) # Firstly let's do it with `stringdist` package. library(stringdist) system.time({ which(stringdist::stringdistmatrix(strings, strings, "lv") <= 1, arr.ind = TRUE) })["elapsed"] # Runtime on macOS machine with 2.2 GHz i7 processor and 16GB of DDR4 RAM: # elapsed # 63.773 # Now let's do it with similarityJoin function. system.time({ similarityJoin(strings, 1, "Levenshtein", output_format = "adj_pairs") })["elapsed"] # Runtime on the same machine: # elapsed # 0.105