| Title: | Comparison Functions for Clustering and Record Linkage |
|---|---|
| Description: | Implements functions for comparing strings, sequences and numeric vectors for clustering and record linkage applications. Supported comparison functions include: generalized edit distances for comparing sequences/strings, Monge-Elkan similarity for fuzzy comparison of token sets, and L-p distances for comparing numeric vectors. Where possible, comparison functions are implemented in C/C++ to ensure good performance. |
| Authors: | Neil Marchant [aut, cre] |
| Maintainer: | Neil Marchant <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.1.4 |
| Built: | 2026-05-26 07:03:11 UTC |
| Source: | https://github.com/ngmarchant/comparator |
Compares a pair of strings or sequences based on whether they are identical or not.
BinaryComp(score = 1, similarity = FALSE, ignore_case = FALSE)BinaryComp(score = 1, similarity = FALSE, ignore_case = FALSE)
score |
a numeric of length 1. Positive distance to return if the pair of strings/sequences are not identical. Defaults to 1.0. |
similarity |
a logical. If TRUE, similarities are returned instead of
distances. Specifically |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
If similarity = FALSE (default) the scores can be interpreted
as distances. When the comparator returns a distance of 0.0,
and when the comparator returns score.
If similarity = TRUE the scores can be interpreted as similarities.
When the comparator returns score, and when
the comparator returns 0.0.
A BinaryComp instance is returned, which is an S4 class inheriting from
StringComparator.
The Chebyshev distance (a.k.a. L-Inf distance or ) between two vectors
and is the greatest of the absolute differences between each
coordinate:
Chebyshev()Chebyshev()
A Chebyshev instance is returned, which is an S4 class inheriting
from NumericComparator.
The Chebyshev distance is a limiting case of the Minkowski
distance where .
Other numeric comparators include Manhattan, Euclidean and
Minkowski.
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Chebyshev()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Chebyshev() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Chebyshev()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Chebyshev() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
This class represents a function for comparing pairs of
objects. It is the base class from which other types of comparators (e.g.
NumericComparator and StringComparator) are derived.
.Dataa function which takes a pair of arguments x and y, and
returns the elementwise scores.
symmetrica logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y) is identical to
comparator(y, x).
distancea logical of length 1. If TRUE, the comparator produces
distances and satisfies comparator(x, x) = 0. The comparator may not
satisfy all of the properties of a distance metric.
similaritya logical of length 1. If TRUE, the comparator produces
similarity scores.
tri_inequala logical of length 1. If TRUE, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE and symmetric = TRUE.
A trivial comparator that returns a constant for any pair of strings or sequences.
Constant(constant = 0)Constant(constant = 0)
constant |
a non-negative numeric vector of length 1. Defaults to zero. |
A Constant instance is returned, which is an S4 class inheriting
from StringComparator.
This class is a trait possessed by SequenceComparators that have a C++ implementation. SequenceComparators without this trait are implemented in R, and may be slower to execute.
The Damerau-Levenshtein distance between two strings/sequences
and is the minimum cost of operations (insertions, deletions,
substitutions or transpositions) required to transform
into . It differs from the Levenshtein distance by including
transpositions (swaps) among the allowable operations.
DamerauLevenshtein( deletion = 1, insertion = 1, substitution = 1, transposition = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )DamerauLevenshtein( deletion = 1, insertion = 1, substitution = 1, transposition = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
substitution |
positive cost associated with substitution of a character or sequence element. Defaults to unit cost. |
transposition |
positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x and y are strings in this section,
however the comparator is also implemented for more general sequences.
A Damerau-Levenshtein similarity is returned if similarity = TRUE, which
is defined as
where , are the number of characters in and
respectively, is the Damerau-Levenshtein
distance, is the cost of a deletion and is the cost of
an insertion.
Normalization of the Damerau-Levenshtein distance/similarity to the unit
interval is also supported by setting normalize = TRUE. The normalization
approach follows Yujian and Bo (2007), and ensures that the distance
remains a metric when the costs of insertion and deletion
are equal. The normalized distance
is defined as
and the normalized similarity is defined as
A DamerauLevenshtein instance is returned, which is an S4 class inheriting
from Levenshtein.
If the costs of deletion and insertion are equal, this comparator is
symmetric in and . In addition, the normalized and
unnormalized distances satisfy the properties of a metric.
Boytsov, L. (2011), "Indexing methods for approximate dictionary searching: Comparative analysis", ACM J. Exp. Algorithmics 16, Article 1.1.
Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091-1095.
Other edit-based comparators include Hamming, LCS,
Levenshtein and OSA.
## The Damerau-Levenshtein distance reduces to ordinary Levenshtein distance ## when the cost of transpositions is high x <- "plauge"; y <- "plague" DamerauLevenshtein(transposition = 100)(x, y) == Levenshtein()(x, y) ## Compare car names using normalized Damerau-Levenshtein similarity data(mtcars) cars <- rownames(mtcars) pairwise(DamerauLevenshtein(similarity = TRUE, normalize=TRUE), cars) ## Compare sequences using Damerau-Levenshtein distance x <- c("G", "T", "G", "C", "T", "G", "G", "C", "C", "C", "A", "T") y <- c("G", "T", "G", "C", "G", "T", "G", "C", "C", "C", "A", "T") DamerauLevenshtein()(list(x), list(y))## The Damerau-Levenshtein distance reduces to ordinary Levenshtein distance ## when the cost of transpositions is high x <- "plauge"; y <- "plague" DamerauLevenshtein(transposition = 100)(x, y) == Levenshtein()(x, y) ## Compare car names using normalized Damerau-Levenshtein similarity data(mtcars) cars <- rownames(mtcars) pairwise(DamerauLevenshtein(similarity = TRUE, normalize=TRUE), cars) ## Compare sequences using Damerau-Levenshtein distance x <- c("G", "T", "G", "C", "T", "G", "G", "C", "C", "C", "A", "T") y <- c("G", "T", "G", "C", "G", "T", "G", "C", "C", "C", "A", "T") DamerauLevenshtein()(list(x), list(y))
Computes elementwise similarities/distances between two collections of objects (strings, vectors, etc.) using the provided comparator.
elementwise(comparator, x, y, ...) ## S4 method for signature 'CppSeqComparator,list,list' elementwise(comparator, x, y, ...) ## S4 method for signature 'StringComparator,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,matrix,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,vector,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'Chebyshev,matrix,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'FuzzyTokenSet,list,list' elementwise(comparator, x, y, ...) ## S4 method for signature 'InVocabulary,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'Lookup,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'MongeElkan,list,list' elementwise(comparator, x, y, ...)elementwise(comparator, x, y, ...) ## S4 method for signature 'CppSeqComparator,list,list' elementwise(comparator, x, y, ...) ## S4 method for signature 'StringComparator,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,matrix,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,vector,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'Chebyshev,matrix,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'FuzzyTokenSet,list,list' elementwise(comparator, x, y, ...) ## S4 method for signature 'InVocabulary,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'Lookup,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'MongeElkan,list,list' elementwise(comparator, x, y, ...)
comparator |
a comparator used to compare the objects, which is a
sub-class of |
x, y
|
a collection of objects to compare, typically stored as entries
in an atomic vector, rows in a matrix, or entries in a list. The required
format depends on the type of |
... |
other parameters passed on to other methods. |
Every object in x is compared to every object in y elementwise
(with recycling) using the given comparator, to produce a numeric vector of
scores of length .
comparator = CppSeqComparator,x = list,y = list: Specialization for CppSeqComparator where
x and y are lists of sequences (vectors) to compare.
comparator = StringComparator,x = vector,y = vector: Specialization for StringComparator where
x and y are vectors of strings to compare.
comparator = NumericComparator,x = matrix,y = vector: Specialization for NumericComparator where
x is a matrix of rows (interpreted as vectors) to compare with a vector
y.
comparator = NumericComparator,x = vector,y = matrix: Specialization for NumericComparator where
x is a vector to compare with a matrix y of rows (interpreted as
vectors).
comparator = NumericComparator,x = vector,y = vector: Specialization for NumericComparator where
x and y are vectors to compare.
comparator = Chebyshev,x = matrix,y = matrix: Specialization for Chebyshev where x and y
matrices of rows (interpreted as vectors) to compare. If x any y do
not have the same number of rows, rows are recycled in the smaller matrix.
comparator = FuzzyTokenSet,x = list,y = list: Specialization for FuzzyTokenSet where x and y
are lists of token vectors to compare.
comparator = InVocabulary,x = vector,y = vector: Specialization for InVocabulary where x and
y are vectors of strings to compare.
comparator = Lookup,x = vector,y = vector: Specialization for a Lookup where x and y
are vectors of strings to compare
comparator = MongeElkan,x = list,y = list: Specialization for MongeElkan where x and y
lists of token vectors to compare.
This function is not strictly necessary, as the comparator itself is a
function that returns elementwise vectors of scores. In other words,
comparator(x, y, ...) is equivalent to
elementwise(comparator, x, y, ...).
## Compute the absolute difference between two sets of scalar observations data("iris") x <- as.matrix(iris$Sepal.Width) y <- as.matrix(iris$Sepal.Length) elementwise(Euclidean(), x, y) ## Compute the edit distance between columns of two linked data.frames col.1 <- c("Hasna Yuhanna", "Korina Zenovia", "Phyllis Haywood", "Nicky Ellen") col.2 <- c("Hasna Yuhanna", "Corinna Zenovia", "Phyllis Dorothy Haywood", "Nicole Ellen") elementwise(Levenshtein(), col.1, col.2) Levenshtein()(col.1, col.2) # equivalent to above ## Recycling is used if the two collections don't contain the same number of objects elementwise(Levenshtein(), "Cora Zenovia", col.1)## Compute the absolute difference between two sets of scalar observations data("iris") x <- as.matrix(iris$Sepal.Width) y <- as.matrix(iris$Sepal.Length) elementwise(Euclidean(), x, y) ## Compute the edit distance between columns of two linked data.frames col.1 <- c("Hasna Yuhanna", "Korina Zenovia", "Phyllis Haywood", "Nicky Ellen") col.2 <- c("Hasna Yuhanna", "Corinna Zenovia", "Phyllis Dorothy Haywood", "Nicole Ellen") elementwise(Levenshtein(), col.1, col.2) Levenshtein()(col.1, col.2) # equivalent to above ## Recycling is used if the two collections don't contain the same number of objects elementwise(Levenshtein(), "Cora Zenovia", col.1)
The Euclidean distance (a.k.a. L-2 distance) between two vectors and
is the square root of the sum of the squared differences of the
Cartesian coordinates:
Euclidean()Euclidean()
A Euclidean instance is returned, which is an S4 class inheriting
from Minkowski.
The Euclidean distance is a special case of the Minkowski
distance with .
Other numeric comparators include Manhattan, Minkowski and
Chebyshev.
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Euclidean()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Euclidean() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Euclidean()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Euclidean() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
Compares a pair of token sets and by computing the
optimal cost of transforming into using single-token
operations (insertions, deletions and substitutions). The cost of
single-token operations is determined at the character-level using an
internal string comparator.
FuzzyTokenSet( inner_comparator = Levenshtein(normalize = TRUE), agg_function = base::mean, deletion = 1, insertion = 1, substitution = 1 )FuzzyTokenSet( inner_comparator = Levenshtein(normalize = TRUE), agg_function = base::mean, deletion = 1, insertion = 1, substitution = 1 )
inner_comparator |
inner string distance comparator of class
|
agg_function |
function used to aggregate the costs of the optimal
operations. Defaults to |
deletion |
non-negative weight associated with deletion of a token. Defaults to 1. |
insertion |
non-negative weight associated insertion of a token. Defaults to 1. |
substitution |
non-negative weight associated with substitution of a token. Defaults to 1. |
A token set is an unordered enumeration of tokens, which may include
duplicates. Given two token sets and , this comparator
computes the optimal cost of transforming into using the
following single-token operations:
deleting a token from at cost
inserting a token in at cost
substituting a token in for a token
in at cost
where is an internal string comparator and
are non-negative weights, referred to as deletion,
insertion and substitution in the parameter list. By default, the
mean cost of the optimal set of operations is returned. Other methods of
aggregating the costs are supported by specifying a non-default
agg_function.
If the internal string comparator is a distance function, then the optimal set of operations minimize the cost. Otherwise, the optimal set of operations maximize the cost. The optimization problem is solved exactly using a linear sum assignment solver.
This comparator is qualitatively similar to the MongeElkan
comparator, however it is arguably more principled, since it is formulated
as a cost optimization problem. It also offers more control over the costs
of missing tokens (by varying the deletion and insertion weights).
This is useful for comparing full names, when dropping a name (e.g.
middle name) shouldn't be severely penalized.
## Compare names with heterogenous representations x <- "The University of California - San Diego" y <- "Univ. Calif. San Diego" # Tokenize strings on white space x <- strsplit(x, '\\s+') y <- strsplit(y, '\\s+') FuzzyTokenSet()(x, y) # Reduce the cost associated with missing words FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y) ## Compare full name with abbreviated name, reducing the penalty ## for dropping parts of the name fullname <- "JOSE ELIAS TEJADA BASQUES" name <- "JOSE BASQUES" # Tokenize strings on white space fullname <- strsplit(fullname, '\\s+') name <- strsplit(name, '\\s+') comparator <- FuzzyTokenSet(deletion = 0.5) comparator(fullname, name) < comparator(name, fullname) # TRUE## Compare names with heterogenous representations x <- "The University of California - San Diego" y <- "Univ. Calif. San Diego" # Tokenize strings on white space x <- strsplit(x, '\\s+') y <- strsplit(y, '\\s+') FuzzyTokenSet()(x, y) # Reduce the cost associated with missing words FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y) ## Compare full name with abbreviated name, reducing the penalty ## for dropping parts of the name fullname <- "JOSE ELIAS TEJADA BASQUES" name <- "JOSE BASQUES" # Tokenize strings on white space fullname <- strsplit(fullname, '\\s+') name <- strsplit(name, '\\s+') comparator <- FuzzyTokenSet(deletion = 0.5) comparator(fullname, name) < comparator(name, fullname) # TRUE
Geometric Mean
gmean(x, ...) ## Default S3 method: gmean(x, na.rm = FALSE, ...)gmean(x, ...) ## Default S3 method: gmean(x, na.rm = FALSE, ...)
x |
An R object. Currently there are methods for numeric/logical
vectors and date, date-time and time interval objects. Complex vectors
are allowed for |
... |
further arguments passed to or from other methods. |
na.rm |
a logical value indicating whether |
The geometric mean of the values in x is computed, as a numeric
or complex vector of length one. If x is not logical (coerced to
numeric), numeric (including integer) or complex, NA_real_ is returned,
with a warning.
mean for the arithmetic mean and hmean for the harmonic
mean.
x <- c(1:10, 50) xm <- gmean(x)x <- c(1:10, 50) xm <- gmean(x)
The Hamming distance between two strings/sequences of equal length is the number of positions where the corresponding characters/sequence elements differ. It can be viewed as a type of edit distance where the only permitted operation is substitution of characters/sequence elements.
Hamming( normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )Hamming( normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
normalize |
a logical. If TRUE, distances/similarities are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
When the input strings/sequences and are of
different lengths (), the Hamming distance
is defined to be .
A Hamming similarity is returned if similarity = TRUE. When
the similarity is defined as follows:
where is the Hamming similarity and is the Hamming
distance. When the similarity is defined to
be 0.
Normalization of the Hamming distance/similarity to the unit interval is
also supported by setting normalize = TRUE. The raw distance/similarity
is divided by the length of the string/sequence . If
the normalized distance is defined to be 1,
while the normalized similarity is defined to be 0.
A Hamming instance is returned, which is an S4 class inheriting from
StringComparator.
While the unnormalized Hamming distance is a metric, the normalized variant is not as it does not satisfy the triangle inequality.
Other edit-based comparators include LCS, Levenshtein,
OSA and DamerauLevenshtein.
## Compare US ZIP codes x <- "90001" y <- "90209" m1 <- Hamming() # unnormalized distance m2 <- Hamming(similarity = TRUE, normalize = TRUE) # normalized similarity m1(x, y) m2(x, y)## Compare US ZIP codes x <- "90001" y <- "90209" m1 <- Hamming() # unnormalized distance m2 <- Hamming(similarity = TRUE, normalize = TRUE) # normalized similarity m1(x, y) m2(x, y)
Harmonic Mean
hmean(x, ...) ## Default S3 method: hmean(x, trim = 0, na.rm = FALSE, ...)hmean(x, ...) ## Default S3 method: hmean(x, trim = 0, na.rm = FALSE, ...)
x |
An R object. Currently there are methods for numeric/logical
vectors and date, date-time and time interval objects. Complex vectors
are allowed for |
... |
further arguments passed to or from other methods. |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each
end of |
na.rm |
a logical value indicating whether |
If trim is zero (the default), the harmonic mean of the values
in x is computed, as a numeric or complex vector of length one. If x
is not logical (coerced to numeric), numeric (including integer) or
complex, NA_real_ is returned, with a warning.
If trim is non-zero, a symmetrically trimmed mean is computed with a
fraction of trim observations deleted from each end before the mean
is computed.
mean for the arithmetic mean and gmean for the geometric
mean.
x <- c(1:10, 50) xm <- hmean(x)x <- c(1:10, 50) xm <- hmean(x)
Compares a pair of strings and using a reference vocabulary.
Different scores are returned depending on whether both/one/neither of
and are in the reference vocabulary.
InVocabulary( vocab, both_in_distinct = 0.7, both_in_same = 1, one_in = 1, none_in = 1, ignore_case = FALSE )InVocabulary( vocab, both_in_distinct = 0.7, both_in_same = 1, one_in = 1, none_in = 1, ignore_case = FALSE )
vocab |
a vector containing in-vocabulary (known) strings. Any strings not in this vector are out-of-vocabulary (unknown). |
both_in_distinct |
score to return if the pair of values being
compared are both in |
both_in_same |
score to return if the pair of values being
compared are both in |
one_in |
score to return if only one of the pair of values being
compared is in |
none_in |
score to return if none of the pair of values being
compared is in |
ignore_case |
a logical. If TRUE, case is ignored when comparing the strings. |
This comparator is not intended to produce useful scores on its own. Rather, it is intended to produce multiplicative factors which can be applied to other similarity/distance scores. It is particularly useful for comparing names when a reference list (vocabulary) of known names is available. For example, it can be configured to down-weight the similarity scores of distinct (known) names like "Roberto" and "Umberto" which are semantically very different, but deceptively similar in terms of edit distance. The normalized Levenshtein similarity for these two names is 75%, but their similarity can be reduced to 53% if multiplied by the score from this comparator using the default settings.
An InVocabulary instance is returned, which is an S4 class inheriting from
StringComparator.
## Compare names with possible typos using a reference of known names known_names <- c("Roberto", "Umberto", "Alberto", "Emberto", "Norberto", "Humberto") m1 <- InVocabulary(known_names) m2 <- Levenshtein(similarity = TRUE, normalize = TRUE) x <- "Emberto" y <- c("Enberto", "Umberto") # "Emberto" and "Umberto" are likely to refer to distinct people (since # they are known distinct names) so their Levenshtein similarity is # downweighted to 0.61. "Emberto" and "Enberto" may refer to the same # person (likely typo), so their Levenshtein similarity of 0.87 is not # downweighted. similarities <- m1(x, y) * m2(x, y)## Compare names with possible typos using a reference of known names known_names <- c("Roberto", "Umberto", "Alberto", "Emberto", "Norberto", "Humberto") m1 <- InVocabulary(known_names) m2 <- Levenshtein(similarity = TRUE, normalize = TRUE) x <- "Emberto" y <- c("Enberto", "Umberto") # "Emberto" and "Umberto" are likely to refer to distinct people (since # they are known distinct names) so their Levenshtein similarity is # downweighted to 0.61. "Emberto" and "Enberto" may refer to the same # person (likely typo), so their Levenshtein similarity of 0.87 is not # downweighted. similarities <- m1(x, y) * m2(x, y)
Compares a pair of strings/sequences x and y based on the number of
greedily-aligned characters/sequence elements and the number of
transpositions. It was developed for comparing names at the U.S. Census
Bureau.
Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)
similarity |
a logical. If TRUE, similarity scores are returned (default), otherwise distances are returned (see definition under Details). |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x and y are strings in this section,
however the comparator is also implemented for more general sequences.
When similarity = TRUE (default), the Jaro similarity is computed as
where is the number of "matching" characters (defined below),
is the number of "transpositions", and are the
lengths of the strings and . The similarity takes on values
in the range , where 1 corresponds to a perfect match.
The number of "matching" characters is computed using a greedy
alignment algorithm. The algorithm iterates over the characters in ,
attempting to align the -th character with the first
matching character in . When looking for matching characters in
, the algorithm only considers previously un-matched characters
within a window
where .
The alignment process yields a subsequence of matching characters from
and . The number of "transpositions" is defined to
be the number of positions in the subsequence of which are
misaligned with the corresponding position in .
When similarity = FALSE, the Jaro distance is computed as
A Jaro instance is returned, which is an S4 class inheriting from
StringComparator.
The Jaro distance is not a metric, as it does not satisfy the
identity axiom
Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.
The JaroWinkler comparator modifies the Jaro comparator by
boosting the similarity score for strings/sequences that have matching
prefixes.
## Compare names Jaro()("Martha", "Mathra") Jaro()("Eileen", "Phyllis")## Compare names Jaro()("Martha", "Mathra") Jaro()("Eileen", "Phyllis")
The Jaro-Winkler comparator is a variant of the Jaro comparator which
boosts the similarity score for strings/sequences with matching prefixes.
It was developed for comparing names at the U.S. Census Bureau.
JaroWinkler( p = 0.1, threshold = 0.7, max_prefix = 4L, similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE )JaroWinkler( p = 0.1, threshold = 0.7, max_prefix = 4L, similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE )
p |
a non-negative numeric scalar no larger than 1/max_prefix. Similarity scores eligible for boosting are scaled by this factor. |
threshold |
a numeric scalar on the unit interval. Jaro similarities greater than this value are boosted based on matching characters in the prefixes of both strings. Jaro similarities below this value are returned unadjusted. Defaults to 0.7. |
max_prefix |
a non-negative integer scalar, specifying the size of the prefix to consider for boosting. Defaults to 4 (characters). |
similarity |
a logical. If TRUE, similarity scores are returned (default), otherwise distances are returned (see definition under Details). |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x and y are strings in this section,
however the comparator is also implemented for more general sequences.
The Jaro-Winkler similarity (computed when similarity = TRUE) is
defined in terms of the Jaro similarity. If the Jaro similarity
between strings and exceeds a
user-specified threshold ,
the similarity score is boosted in proportion to the number of matching
characters in the prefixes of and . More precisely, the
Jaro-Winkler similarity is defined as:
where is the length of the common prefix,
is a user-specified upper bound on the prefix size, and
is a scaling factor.
The Jaro-Winkler distance is computed when similarity = FALSE and is
defined as
A JaroWinkler instance is returned, which is an S4 class inheriting from
StringComparator.
Like the Jaro distance, the Jaro-Winkler distance is not a metric as it does not satisfy the identity axiom.
Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.
Winkler, W. E. (2006), "Overview of Record Linkage and Current Research Directions", Tech. report. Statistics #2006-2. Statistical Research Division, U.S. Census Bureau.
Winkler, W., McLaughlin G., Jaro M. and Lynch M. (1994), strcmp95.c, Version 2. United States Census Bureau.
This comparator reduces to the Jaro comparator when max_prefix = 0L
or threshold = 0.0.
## Compare names JaroWinkler()("Martha", "Mathra") JaroWinkler()("Eileen", "Phyllis") ## Reduce the threshold for boosting x <- "Matthew" y <- "Martin" JaroWinkler()(x, y) < JaroWinkler(threshold = 0.5)(x, y)## Compare names JaroWinkler()("Martha", "Mathra") JaroWinkler()("Eileen", "Phyllis") ## Reduce the threshold for boosting x <- "Matthew" y <- "Martin" JaroWinkler()(x, y) < JaroWinkler(threshold = 0.5)(x, y)
The Longest Common Subsequence (LCS) distance between two
strings/sequences and is the minimum cost of operations
(insertions and deletions) required to transform into .
The LCS similarity is more commonly used, which can be interpreted as the
length of the longest subsequence common to and .
LCS( deletion = 1, insertion = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )LCS( deletion = 1, insertion = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x and y are strings in this section,
however the comparator is also implemented for more general sequences.
An LCS similarity is returned if similarity = TRUE, which
is defined as
where , are the number of characters in and
respectively, is the LCS distance,
is the cost of a deletion and is the cost of an insertion.
Normalization of the LCS distance/similarity to the unit interval
is also supported by setting normalize = TRUE. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion and deletion are equal.
The normalized distance is defined as
and the normalized similarity is defined as
A LCS instance is returned, which is an S4 class inheriting from
StringComparator.
If the costs of deletion and insertion are equal, this comparator is
symmetric in and . In addition, the normalized and
unnormalized distances satisfy the properties of a metric.
Bergroth, L., Hakonen, H., & Raita, T. (2000), "A survey of longest common subsequence algorithms", Proceedings Seventh International Symposium on String Processing and Information Retrieval (SPIRE'00), 39-48.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091–1095.
Other edit-based comparators include Hamming, Levenshtein,
OSA and DamerauLevenshtein.
## There are no common substrings of size 3 for the following example, ## however there are two common substrings of size 2: "AC" and "BC". ## Hence the LCS similarity is 2. x <- "ABCDA"; y <- "BAC" LCS(similarity = TRUE)(x, y) ## Levenshtein distance reduces to LCS distance when the cost of ## substitution is high x <- "ABC"; y <- "AAA" LCS()(x, y) == Levenshtein(substitution = 100)(x, y)## There are no common substrings of size 3 for the following example, ## however there are two common substrings of size 2: "AC" and "BC". ## Hence the LCS similarity is 2. x <- "ABCDA"; y <- "BAC" LCS(similarity = TRUE)(x, y) ## Levenshtein distance reduces to LCS distance when the cost of ## substitution is high x <- "ABC"; y <- "AAA" LCS()(x, y) == Levenshtein(substitution = 100)(x, y)
The Levenshtein (edit) distance between two strings/sequences and
is the minimum cost of operations (insertions, deletions or
substitutions) required to transform into .
Levenshtein( deletion = 1, insertion = 1, substitution = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )Levenshtein( deletion = 1, insertion = 1, substitution = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
substitution |
positive cost associated with substitution of a character or sequence element. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x and y are strings in this section,
however the comparator is also implemented for more general sequences.
A Levenshtein similarity is returned if similarity = TRUE, which
is defined as
where , are the number of characters in and
respectively, is the Levenshtein distance,
is the cost of a deletion and is the cost of an
insertion.
Normalization of the Levenshtein distance/similarity to the unit interval
is also supported by setting normalize = TRUE. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion and deletion are equal.
The normalized distance is defined as
and the normalized similarity is defined as
A Levenshtein instance is returned, which is an S4 class inheriting from
StringComparator.
If the costs of deletion and insertion are equal, this comparator is
symmetric in and . In addition, the normalized and
unnormalized distances satisfy the properties of a metric.
Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091–1095.
Other edit-based comparators include Hamming, LCS,
OSA and DamerauLevenshtein.
## Compare names with potential typos x <- c("Brian Cheng", "Bryan Cheng", "Kondo Onyejekwe", "Condo Onyejekve") pairwise(Levenshtein(), x, return_matrix = TRUE) ## When the substitution cost is high, Levenshtein distance reduces to LCS distance Levenshtein(substitution = 100)("Iran", "Iraq") == LCS()("Iran", "Iraq")## Compare names with potential typos x <- c("Brian Cheng", "Bryan Cheng", "Kondo Onyejekwe", "Condo Onyejekve") pairwise(Levenshtein(), x, return_matrix = TRUE) ## When the substitution cost is high, Levenshtein distance reduces to LCS distance Levenshtein(substitution = 100)("Iran", "Iraq") == LCS()("Iran", "Iraq")
Compares a pair of strings and by retrieving
their distance/similarity score from a provided lookup table.
Lookup( lookup_table, values_colnames, score_colname, default_match = 0, default_nonmatch = NA_real_, symmetric = TRUE, ignore_case = FALSE )Lookup( lookup_table, values_colnames, score_colname, default_match = 0, default_nonmatch = NA_real_, symmetric = TRUE, ignore_case = FALSE )
lookup_table |
data frame containing distances/similarities for pairs of values |
values_colnames |
character vector containing the colnames
corresponding to pairs of values (e.g. strings) in |
score_colname |
name of column that contains distances/similarities
in |
default_match |
distance/similarity to use if the pair of values
match exactly and do not appear in |
default_nonmatch |
distance/similarity to use if the pair of values are
not an exact match and do not appear in |
symmetric |
whether the underlying distance/similarity scores are
symmetric. If TRUE |
ignore_case |
a logical. If TRUE, case is ignored when comparing the strings. |
The lookup table should contain three columns corresponding to ,
and (values_colnames below) and the distance/similarity
(score_colname below). If a pair of values and is
not in the lookup table, a default distance/similarity is returned
depending on whether (default_match below) or
(default_nonmatch below).
A Lookup instance is returned, which is an S4 class inheriting from
StringComparator.
## Measure the distance between cities lookup_table <- data.frame(x = c("Melbourne", "Melbourne", "Sydney"), y = c("Sydney", "Brisbane", "Brisbane"), dist = c(713.4, 1374.8, 732.5)) comparator <- Lookup(lookup_table, c("x", "y"), "dist") comparator("Sydney", "Melbourne") comparator("Melbourne", "Perth")## Measure the distance between cities lookup_table <- data.frame(x = c("Melbourne", "Melbourne", "Sydney"), y = c("Sydney", "Brisbane", "Brisbane"), dist = c(713.4, 1374.8, 732.5)) comparator <- Lookup(lookup_table, c("x", "y"), "dist") comparator("Sydney", "Melbourne") comparator("Melbourne", "Perth")
The Manhattan distance (a.k.a. L-1 distance) between two vectors and
is the sum of the absolute differences of their Cartesian
coordinates:
Manhattan()Manhattan()
A Manhattan instance is returned, which is an S4 class inheriting
from Minkowski.
The Manhattan distance is a special case of the Minkowski
distance with .
Other numeric comparators include Euclidean, Minkowski and
Chebyshev.
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Manhattan()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Manhattan() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Manhattan()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Manhattan() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
The Minkowski distance (a.k.a. L-p distance) between two vectors and
is the p-th root of the sum of the absolute differences of their
Cartesian coordinates raised to the p-th power:
Minkowski(p = 2)Minkowski(p = 2)
p |
a positive numeric specifying the order of the distance. Defaults
to 2 (Euclidean distance). If |
A Minkowski instance is returned, which is an S4 class inheriting
from NumericComparator.
Other numeric comparators include Manhattan, Euclidean and
Chebyshev.
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Minkowski()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Minkowski() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Minkowski()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Minkowski() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
Compares a pair of token sets and by computing similarity
scores between all pairs of tokens using an internal string comparator,
then taking the mean of the maximum scores for each token in .
MongeElkan( inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE), agg_function = base::mean, symmetrize = FALSE )MongeElkan( inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE), agg_function = base::mean, symmetrize = FALSE )
inner_comparator |
internal string comparator of class
|
agg_function |
aggregation function to use when aggregating internal
similarities/distances between tokens. Defaults to |
symmetrize |
logical indicating whether to use a symmetrized version of the Monge-Elkan comparator. Defaults to FALSE. |
A token set is an unordered enumeration of tokens, which may include
duplicates.
Given two token sets and , the Monge-Elkan comparator is
defined as:
where is the i-th token in , is the
number of tokens in and is an internal
string similarity comparator.
A generalization of the original Monge-Elkan comparator is implemented here, which allows for distance comparators in place of similarity comparators, and/or more general aggregation functions in place of the arithmetic mean. The generalized Monge-Elkan comparator is defined as:
where is an internal distance or similarity
comparator, is if
is a similarity comparator or if
it is a distance comparator, and is an aggregation
function which takes a vector of scores for each token in and
returns a scalar.
By default, the Monge-Elkan comparator is asymmetric in its arguments
and . If symmetrize = TRUE, a symmetric version of the comparator
is obtained as follows
where is defined above.
A MongeElkan instance is returned, which is an S4 class inheriting from
StringComparator.
Monge, A. E., & Elkan, C. (1996), "The Field Matching Problem: Algorithms and Applications", In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 267-270.
Jimenez, S., Becerra, C., Gelbukh, A., & Gonzalez, F. (2009), "Generalized Monge-Elkan Method for Approximate Text String Comparison", In Computational Linguistics and Intelligent Text Processing, pp. 559-570.
## Compare names with heterogenous representations x <- "The University of California - San Diego" y <- "Univ. Calif. San Diego" # Tokenize strings on white space x <- strsplit(x, '\\s+') y <- strsplit(y, '\\s+') MongeElkan()(x, y) ## The symmetrized variant is arguably more appropriate for this example MongeElkan(symmetrize = TRUE)(x, y) ## Using a different internal comparator changes the result MongeElkan(inner_comparator = BinaryComp(), symmetrize=TRUE)(x, y)## Compare names with heterogenous representations x <- "The University of California - San Diego" y <- "Univ. Calif. San Diego" # Tokenize strings on white space x <- strsplit(x, '\\s+') y <- strsplit(y, '\\s+') MongeElkan()(x, y) ## The symmetrized variant is arguably more appropriate for this example MongeElkan(symmetrize = TRUE)(x, y) ## Using a different internal comparator changes the result MongeElkan(inner_comparator = BinaryComp(), symmetrize=TRUE)(x, y)
Represents a comparator for comparing pairs of numeric vectors.
.Dataa function that calls the elementwise method for this class,
with arguments x, y and ....
symmetrica logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y) is identical to
comparator(y, x).
distancea logical of length 1. If TRUE, the comparator produces
distances and satisfies comparator(x, x) = 0. The comparator may not
satisfy all of the properties of a distance metric.
similaritya logical of length 1. If TRUE, the comparator produces
similarity scores.
tri_inequala logical of length 1. If TRUE, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE and symmetric = TRUE.
The Optimal String Alignment (OSA) distance between two strings/sequences
and is the minimum cost of operations (insertions,
deletions, substitutions or transpositions) required to transform
into , subject to the constraint that no substring/subsequence is
edited more than once.
OSA( deletion = 1, insertion = 1, substitution = 1, transposition = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )OSA( deletion = 1, insertion = 1, substitution = 1, transposition = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
substitution |
positive cost associated with substitution of a character or sequence element. Defaults to unit cost. |
transposition |
positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x and y are strings in this section,
however the comparator is also implemented for more general sequences.
An OSA similarity is returned if similarity = TRUE, which
is defined as
where , are the number of characters in and
respectively, is the OSA distance,
is the cost of a deletion and is the cost of an insertion.
Normalization of the OSA distance/similarity to the unit interval
is also supported by setting normalize = TRUE. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion and deletion are equal.
The normalized distance is defined as
and the normalized similarity is defined as
An OSA instance is returned, which is an S4 class inheriting from
StringComparator.
If the costs of deletion and insertion are equal, this comparator is
symmetric in and . The OSA distance is not a proper metric
as it does not satisfy the triangle inequality. The Damerau-Levenshtein
distance is closely related—it allows the same edit operations as OSA,
but removes the requirement that no substring can be edited more than once.
Boytsov, L. (2011), "Indexing methods for approximate dictionary searching: Comparative analysis", ACM J. Exp. Algorithmics 16, Article 1.1.
Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29: 1091–1095.
Other edit-based comparators include Hamming, LCS,
Levenshtein and DamerauLevenshtein.
## Compare strings with a transposition error x <- "plauge"; y <- "plague" OSA()(x, y) != Levenshtein()(x, y) ## Unlike Damerau-Levenshtein, OSA does not allow a substring to be ## edited more than once x <- "ABC"; y <- "CA" OSA()(x, y) != DamerauLevenshtein()(x, y) ## Compare car names using normalized OSA similarity data(mtcars) cars <- rownames(mtcars) pairwise(OSA(similarity = TRUE, normalize=TRUE), cars)## Compare strings with a transposition error x <- "plauge"; y <- "plague" OSA()(x, y) != Levenshtein()(x, y) ## Unlike Damerau-Levenshtein, OSA does not allow a substring to be ## edited more than once x <- "ABC"; y <- "CA" OSA()(x, y) != DamerauLevenshtein()(x, y) ## Compare car names using normalized OSA similarity data(mtcars) cars <- rownames(mtcars) pairwise(OSA(similarity = TRUE, normalize=TRUE), cars)
Computes pairwise similarities/distances between two collections of objects (strings, vectors, etc.) using the provided comparator.
pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Comparator,ANY,missing' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'CppSeqComparator,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'CppSeqComparator,list,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'StringComparator,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'StringComparator,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'NumericComparator,matrix,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'NumericComparator,vector,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Chebyshev,matrix,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Chebyshev,matrix,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Minkowski,matrix,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'Minkowski,matrix,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Minkowski,matrix,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'FuzzyTokenSet,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'FuzzyTokenSet,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'InVocabulary,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'InVocabulary,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Lookup,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Lookup,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'MongeElkan,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'MongeElkan,list,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...)pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Comparator,ANY,missing' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'CppSeqComparator,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'CppSeqComparator,list,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'StringComparator,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'StringComparator,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'NumericComparator,matrix,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'NumericComparator,vector,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Chebyshev,matrix,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Chebyshev,matrix,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Minkowski,matrix,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'Minkowski,matrix,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Minkowski,matrix,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'FuzzyTokenSet,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'FuzzyTokenSet,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'InVocabulary,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'InVocabulary,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Lookup,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Lookup,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'MongeElkan,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'MongeElkan,list,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...)
comparator |
a comparator used to compare the objects, which is a
sub-class of |
x, y
|
a collection of objects to compare, typically stored as entries
in an atomic vector, rows in a matrix, or entries in a list. The required
format depends on the type of |
return_matrix |
a logical of length 1. If FALSE (default), the pairwise
similarities/distances will be returned as a |
... |
other parameters passed on to other methods. |
If both x and y are specified, every object in x is compared with
every object in y using the comparator, and the resulting scores are
returned in a size(x) by size(y) matrix.
If only x is specified, then the objects in x are compared with
themselves using the comparator, and the resulting scores are returned in a
size(x) by size(y) matrix.
By default, the matrix is represented as an instance of the
PairwiseMatrix class, which is more space-efficient for symmetric
comparators when y is not specified. However, if return_matrix = TRUE,
the matrix is returned as an ordinary matrix instead.
comparator = Comparator,x = ANY,y = missing: Compute a pairwise comparator when y
comparator = CppSeqComparator,x = list,y = list: Specialization for CppSeqComparator where x
and y are lists of sequences (vectors) to compare.
comparator = CppSeqComparator,x = list,y = NULL: Specialization for CppSeqComparator where x is
a list of sequences (vectors) to compare.
comparator = StringComparator,x = vector,y = vector: Specialization for StringComparator where x
and y are vectors of strings to compare.
comparator = StringComparator,x = vector,y = NULL: Specialization for StringComparator where x
is a vector of strings to compare.
comparator = NumericComparator,x = matrix,y = vector: Specialization for NumericComparator where x
is a matrix of rows (interpreted as vectors) to compare with a vector y.
comparator = NumericComparator,x = vector,y = matrix: Specialization for NumericComparator where x
is a vector to compare with a matrix y of rows (interpreted as vectors).
comparator = Chebyshev,x = matrix,y = matrix: Specialization for Chebyshev where x and y
matrices of rows (interpreted as vectors) to compare.
comparator = Chebyshev,x = matrix,y = NULL: Specialization for Minkowski where x is a matrix
of rows (interpreted as vectors) to compare among themselves.
comparator = Minkowski,x = matrix,y = matrix: Specialization for a Minkowski where x and y
matrices of rows (interpreted as vectors) to compare.
comparator = Minkowski,x = matrix,y = matrix: Specialization for a Minkowski where x and y
matrices of rows (interpreted as vectors) to compare.
comparator = Minkowski,x = matrix,y = NULL: Specialization for Minkowski where x is a matrix
of rows (interpreted as vectors) to compare among themselves.
comparator = FuzzyTokenSet,x = list,y = list: Specialization for FuzzyTokenSet where x and y are
lists of token vectors to compare.
comparator = FuzzyTokenSet,x = vector,y = NULL: Specialization for FuzzyTokenSet where x is a list of token
vectors to compare among themselves.
comparator = InVocabulary,x = vector,y = vector: Specialization for InVocabulary where x and y
are vectors of strings to compare.
comparator = InVocabulary,x = vector,y = NULL: Specialization for InVocabulary where x is a
vector of strings to compare among themselves.
comparator = Lookup,x = vector,y = vector: Specialization for a Lookup where x and y are
vectors of strings to compare
comparator = Lookup,x = vector,y = NULL: Specialization for Lookup where x is a vector of
strings to compare among themselves
comparator = MongeElkan,x = list,y = list: Specialization for MongeElkan where x and y are
lists of token vectors to compare.
comparator = MongeElkan,x = list,y = NULL: Specialization for MongeElkan where x is a list
of token vectors to compare among themselves.
## Computing the distances between a query point y (a 3D numeric vector) ## and a set of reference points x x <- rbind(c(1,0,1), c(0,0,0), c(-1,2,-1)) y <- c(10, 5, 10) pairwise(Manhattan(), x, y) ## Computing the pairwise similarities among a set of strings x <- c("Benjamin", "Ben", "Benny", "Bne", "Benedict", "Benson") comparator <- DamerauLevenshtein(similarity = TRUE, normalize = TRUE) pairwise(comparator, x, return_matrix = TRUE) # return an ordinary matrix## Computing the distances between a query point y (a 3D numeric vector) ## and a set of reference points x x <- rbind(c(1,0,1), c(0,0,0), c(-1,2,-1)) y <- c(10, 5, 10) pairwise(Manhattan(), x, y) ## Computing the pairwise similarities among a set of strings x <- c("Benjamin", "Ben", "Benny", "Bne", "Benedict", "Benson") comparator <- DamerauLevenshtein(similarity = TRUE, normalize = TRUE) pairwise(comparator, x, return_matrix = TRUE) # return an ordinary matrix
Represents a pairwise similarity or distance matrix.
as.PairwiseMatrix(x, ...) ## S4 method for signature 'matrix' as.PairwiseMatrix(x, ...) ## S4 method for signature 'PairwiseMatrix' as.matrix(x, ...)as.PairwiseMatrix(x, ...) ## S4 method for signature 'matrix' as.PairwiseMatrix(x, ...) ## S4 method for signature 'PairwiseMatrix' as.matrix(x, ...)
x |
an R object. |
... |
additional arguments to be passed to methods. |
If the elements being compared are from the same set, the matrix may be symmetric if the comparator is symmetric. In this case, entries in the upper triangle and/or along the diagonal may not be stored in memory, since they are redundant.
as.PairwiseMatrix: Convert an R object x to a PairwiseMatrix.
as.PairwiseMatrix,matrix-method: Convert an ordinary matrix x to a PairwiseMatrix.
as.matrix,PairwiseMatrix-method: Convert a PairwiseMatrix x to an ordinary matrix.
.Dataentries of the matrix in column-major order. Entries in the upper triangle and/or on the diagonal may be omitted.
Diminteger vector of length 2. The dimensions of the matrix.
Diaglogical indicating whether the diagonal entries are stored in
.Data.
Represents a comparator for pairs of sequences.
.Dataa function that calls the elementwise method for this class,
with arguments x, y and ....
symmetrica logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y) is identical to
comparator(y, x).
distancea logical of length 1. If TRUE, the comparator produces
distances and satisfies comparator(x, x) = 0. The comparator may not
satisfy all of the properties of a distance metric.
similaritya logical of length 1. If TRUE, the comparator produces
similarity scores.
tri_inequala logical of length 1. If TRUE, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE and symmetric = TRUE.
Represents a comparator for pairs of strings.
.Dataa function that calls the elementwise method for this class,
with arguments x, y and ....
symmetrica logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y) is identical to
comparator(y, x).
distancea logical of length 1. If TRUE, the comparator produces
distances and satisfies comparator(x, x) = 0. The comparator may not
satisfy all of the properties of a distance metric.
similaritya logical of length 1. If TRUE, the comparator produces
similarity scores.
tri_inequala logical of length 1. If TRUE, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE and symmetric = TRUE.
ignore_casea logical of length 1. If TRUE, case is ignored when comparing strings. Defaults to FALSE.
use_bytesa logical of length 1. If TRUE, strings are compared byte-by-byte rather than character-by-character.
Represents a comparator for pairs of token sequences.
.Dataa function that calls the elementwise method for this class,
with arguments x, y and ....
symmetrica logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y) is identical to
comparator(y, x).
distancea logical of length 1. If TRUE, the comparator produces
distances and satisfies comparator(x, x) = 0. The comparator may not
satisfy all of the properties of a distance metric.
similaritya logical of length 1. If TRUE, the comparator produces
similarity scores.
tri_inequala logical of length 1. If TRUE, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE and symmetric = TRUE.
ordereda logical of length 1. If TRUE, the comparator treats token sequences as ordered, otherwise they are treated as unordered.