Title: | Comparison Functions for Clustering and Record Linkage |
---|---|
Description: | Implements functions for comparing strings, sequences and numeric vectors for clustering and record linkage applications. Supported comparison functions include: generalized edit distances for comparing sequences/strings, Monge-Elkan similarity for fuzzy comparison of token sets, and L-p distances for comparing numeric vectors. Where possible, comparison functions are implemented in C/C++ to ensure good performance. |
Authors: | Neil Marchant [aut, cre] |
Maintainer: | Neil Marchant <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.2 |
Built: | 2025-03-02 04:03:37 UTC |
Source: | https://github.com/ngmarchant/comparator |
Compares a pair of strings or sequences based on whether they are identical or not.
BinaryComp(score = 1, similarity = FALSE, ignore_case = FALSE)
BinaryComp(score = 1, similarity = FALSE, ignore_case = FALSE)
score |
a numeric of length 1. Positive distance to return if the pair of strings/sequences are not identical. Defaults to 1.0. |
similarity |
a logical. If TRUE, similarities are returned instead of
distances. Specifically |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
If similarity = FALSE
(default) the scores can be interpreted
as distances. When the comparator returns a distance of 0.0,
and when
the comparator returns
score
.
If similarity = TRUE
the scores can be interpreted as similarities.
When the comparator returns
score
, and when
the comparator returns 0.0.
A BinaryComp
instance is returned, which is an S4 class inheriting from
StringComparator
.
The Chebyshev distance (a.k.a. L-Inf distance or ) between two vectors
and
is the greatest of the absolute differences between each
coordinate:
Chebyshev()
Chebyshev()
A Chebyshev
instance is returned, which is an S4 class inheriting
from NumericComparator
.
The Chebyshev distance is a limiting case of the Minkowski
distance where .
Other numeric comparators include Manhattan
, Euclidean
and
Minkowski
.
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Chebyshev()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Chebyshev() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Chebyshev()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Chebyshev() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
This class represents a function for comparing pairs of
objects. It is the base class from which other types of comparators (e.g.
NumericComparator
and StringComparator
) are derived.
.Data
a function which takes a pair of arguments x
and y
, and
returns the elementwise scores.
symmetric
a logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y)
is identical to
comparator(y, x)
.
distance
a logical of length 1. If TRUE
, the comparator produces
distances and satisfies comparator(x, x) = 0
. The comparator may not
satisfy all of the properties of a distance metric.
similarity
a logical of length 1. If TRUE
, the comparator produces
similarity scores.
tri_inequal
a logical of length 1. If TRUE
, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE
and symmetric = TRUE
.
A trivial comparator that returns a constant for any pair of strings or sequences.
Constant(constant = 0)
Constant(constant = 0)
constant |
a non-negative numeric vector of length 1. Defaults to zero. |
A Constant
instance is returned, which is an S4 class inheriting
from StringComparator
.
This class is a trait possessed by SequenceComparators that have a C++ implementation. SequenceComparators without this trait are implemented in R, and may be slower to execute.
The Damerau-Levenshtein distance between two strings/sequences
and
is the minimum cost of operations (insertions, deletions,
substitutions or transpositions) required to transform
into
. It differs from the Levenshtein distance by including
transpositions (swaps) among the allowable operations.
DamerauLevenshtein( deletion = 1, insertion = 1, substitution = 1, transposition = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
DamerauLevenshtein( deletion = 1, insertion = 1, substitution = 1, transposition = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
substitution |
positive cost associated with substitution of a character or sequence element. Defaults to unit cost. |
transposition |
positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x
and y
are strings in this section,
however the comparator is also implemented for more general sequences.
A Damerau-Levenshtein similarity is returned if similarity = TRUE
, which
is defined as
where ,
are the number of characters in
and
respectively,
is the Damerau-Levenshtein
distance,
is the cost of a deletion and
is the cost of
an insertion.
Normalization of the Damerau-Levenshtein distance/similarity to the unit
interval is also supported by setting normalize = TRUE
. The normalization
approach follows Yujian and Bo (2007), and ensures that the distance
remains a metric when the costs of insertion and deletion
are equal. The normalized distance
is defined as
and the normalized similarity is defined as
A DamerauLevenshtein
instance is returned, which is an S4 class inheriting
from Levenshtein
.
If the costs of deletion and insertion are equal, this comparator is
symmetric in and
. In addition, the normalized and
unnormalized distances satisfy the properties of a metric.
Boytsov, L. (2011), "Indexing methods for approximate dictionary searching: Comparative analysis", ACM J. Exp. Algorithmics 16, Article 1.1.
Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091-1095.
Other edit-based comparators include Hamming
, LCS
,
Levenshtein
and OSA
.
## The Damerau-Levenshtein distance reduces to ordinary Levenshtein distance ## when the cost of transpositions is high x <- "plauge"; y <- "plague" DamerauLevenshtein(transposition = 100)(x, y) == Levenshtein()(x, y) ## Compare car names using normalized Damerau-Levenshtein similarity data(mtcars) cars <- rownames(mtcars) pairwise(DamerauLevenshtein(similarity = TRUE, normalize=TRUE), cars) ## Compare sequences using Damerau-Levenshtein distance x <- c("G", "T", "G", "C", "T", "G", "G", "C", "C", "C", "A", "T") y <- c("G", "T", "G", "C", "G", "T", "G", "C", "C", "C", "A", "T") DamerauLevenshtein()(list(x), list(y))
## The Damerau-Levenshtein distance reduces to ordinary Levenshtein distance ## when the cost of transpositions is high x <- "plauge"; y <- "plague" DamerauLevenshtein(transposition = 100)(x, y) == Levenshtein()(x, y) ## Compare car names using normalized Damerau-Levenshtein similarity data(mtcars) cars <- rownames(mtcars) pairwise(DamerauLevenshtein(similarity = TRUE, normalize=TRUE), cars) ## Compare sequences using Damerau-Levenshtein distance x <- c("G", "T", "G", "C", "T", "G", "G", "C", "C", "C", "A", "T") y <- c("G", "T", "G", "C", "G", "T", "G", "C", "C", "C", "A", "T") DamerauLevenshtein()(list(x), list(y))
Computes elementwise similarities/distances between two collections of objects (strings, vectors, etc.) using the provided comparator.
elementwise(comparator, x, y, ...) ## S4 method for signature 'CppSeqComparator,list,list' elementwise(comparator, x, y, ...) ## S4 method for signature 'StringComparator,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,matrix,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,vector,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'Chebyshev,matrix,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'FuzzyTokenSet,list,list' elementwise(comparator, x, y, ...) ## S4 method for signature 'InVocabulary,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'Lookup,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'MongeElkan,list,list' elementwise(comparator, x, y, ...)
elementwise(comparator, x, y, ...) ## S4 method for signature 'CppSeqComparator,list,list' elementwise(comparator, x, y, ...) ## S4 method for signature 'StringComparator,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,matrix,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,vector,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'NumericComparator,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'Chebyshev,matrix,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'FuzzyTokenSet,list,list' elementwise(comparator, x, y, ...) ## S4 method for signature 'InVocabulary,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'Lookup,vector,vector' elementwise(comparator, x, y, ...) ## S4 method for signature 'MongeElkan,list,list' elementwise(comparator, x, y, ...)
comparator |
a comparator used to compare the objects, which is a
sub-class of |
x , y
|
a collection of objects to compare, typically stored as entries
in an atomic vector, rows in a matrix, or entries in a list. The required
format depends on the type of |
... |
other parameters passed on to other methods. |
Every object in x
is compared to every object in y
elementwise
(with recycling) using the given comparator, to produce a numeric vector of
scores of length .
comparator = CppSeqComparator,x = list,y = list
: Specialization for CppSeqComparator
where
x
and y
are lists of sequences (vectors) to compare.
comparator = StringComparator,x = vector,y = vector
: Specialization for StringComparator
where
x
and y
are vectors of strings to compare.
comparator = NumericComparator,x = matrix,y = vector
: Specialization for NumericComparator
where
x
is a matrix of rows (interpreted as vectors) to compare with a vector
y
.
comparator = NumericComparator,x = vector,y = matrix
: Specialization for NumericComparator
where
x
is a vector to compare with a matrix y
of rows (interpreted as
vectors).
comparator = NumericComparator,x = vector,y = vector
: Specialization for NumericComparator
where
x
and y
are vectors to compare.
comparator = Chebyshev,x = matrix,y = matrix
: Specialization for Chebyshev
where x
and y
matrices of rows (interpreted as vectors) to compare. If x
any y
do
not have the same number of rows, rows are recycled in the smaller matrix.
comparator = FuzzyTokenSet,x = list,y = list
: Specialization for FuzzyTokenSet
where x
and y
are lists of token vectors to compare.
comparator = InVocabulary,x = vector,y = vector
: Specialization for InVocabulary
where x
and
y
are vectors of strings to compare.
comparator = Lookup,x = vector,y = vector
: Specialization for a Lookup
where x
and y
are vectors of strings to compare
comparator = MongeElkan,x = list,y = list
: Specialization for MongeElkan
where x
and y
lists of token vectors to compare.
This function is not strictly necessary, as the comparator
itself is a
function that returns elementwise vectors of scores. In other words,
comparator(x, y, ...)
is equivalent to
elementwise(comparator, x, y, ...)
.
## Compute the absolute difference between two sets of scalar observations data("iris") x <- as.matrix(iris$Sepal.Width) y <- as.matrix(iris$Sepal.Length) elementwise(Euclidean(), x, y) ## Compute the edit distance between columns of two linked data.frames col.1 <- c("Hasna Yuhanna", "Korina Zenovia", "Phyllis Haywood", "Nicky Ellen") col.2 <- c("Hasna Yuhanna", "Corinna Zenovia", "Phyllis Dorothy Haywood", "Nicole Ellen") elementwise(Levenshtein(), col.1, col.2) Levenshtein()(col.1, col.2) # equivalent to above ## Recycling is used if the two collections don't contain the same number of objects elementwise(Levenshtein(), "Cora Zenovia", col.1)
## Compute the absolute difference between two sets of scalar observations data("iris") x <- as.matrix(iris$Sepal.Width) y <- as.matrix(iris$Sepal.Length) elementwise(Euclidean(), x, y) ## Compute the edit distance between columns of two linked data.frames col.1 <- c("Hasna Yuhanna", "Korina Zenovia", "Phyllis Haywood", "Nicky Ellen") col.2 <- c("Hasna Yuhanna", "Corinna Zenovia", "Phyllis Dorothy Haywood", "Nicole Ellen") elementwise(Levenshtein(), col.1, col.2) Levenshtein()(col.1, col.2) # equivalent to above ## Recycling is used if the two collections don't contain the same number of objects elementwise(Levenshtein(), "Cora Zenovia", col.1)
The Euclidean distance (a.k.a. L-2 distance) between two vectors and
is the square root of the sum of the squared differences of the
Cartesian coordinates:
Euclidean()
Euclidean()
A Euclidean
instance is returned, which is an S4 class inheriting
from Minkowski
.
The Euclidean distance is a special case of the Minkowski
distance with .
Other numeric comparators include Manhattan
, Minkowski
and
Chebyshev
.
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Euclidean()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Euclidean() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Euclidean()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Euclidean() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
Compares a pair of token sets and
by computing the
optimal cost of transforming
into
using single-token
operations (insertions, deletions and substitutions). The cost of
single-token operations is determined at the character-level using an
internal string comparator.
FuzzyTokenSet( inner_comparator = Levenshtein(normalize = TRUE), agg_function = base::mean, deletion = 1, insertion = 1, substitution = 1 )
FuzzyTokenSet( inner_comparator = Levenshtein(normalize = TRUE), agg_function = base::mean, deletion = 1, insertion = 1, substitution = 1 )
inner_comparator |
inner string distance comparator of class
|
agg_function |
function used to aggregate the costs of the optimal
operations. Defaults to |
deletion |
non-negative weight associated with deletion of a token. Defaults to 1. |
insertion |
non-negative weight associated insertion of a token. Defaults to 1. |
substitution |
non-negative weight associated with substitution of a token. Defaults to 1. |
A token set is an unordered enumeration of tokens, which may include
duplicates. Given two token sets and
, this comparator
computes the optimal cost of transforming
into
using the
following single-token operations:
deleting a token from
at cost
inserting a token in
at cost
substituting a token in
for a token
in
at cost
where is an internal string comparator and
are non-negative weights, referred to as
deletion
,
insertion
and substitution
in the parameter list. By default, the
mean cost of the optimal set of operations is returned. Other methods of
aggregating the costs are supported by specifying a non-default
agg_function
.
If the internal string comparator is a distance function, then the optimal set of operations minimize the cost. Otherwise, the optimal set of operations maximize the cost. The optimization problem is solved exactly using a linear sum assignment solver.
This comparator is qualitatively similar to the MongeElkan
comparator, however it is arguably more principled, since it is formulated
as a cost optimization problem. It also offers more control over the costs
of missing tokens (by varying the deletion
and insertion
weights).
This is useful for comparing full names, when dropping a name (e.g.
middle name) shouldn't be severely penalized.
## Compare names with heterogenous representations x <- "The University of California - San Diego" y <- "Univ. Calif. San Diego" # Tokenize strings on white space x <- strsplit(x, '\\s+') y <- strsplit(y, '\\s+') FuzzyTokenSet()(x, y) # Reduce the cost associated with missing words FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y) ## Compare full name with abbreviated name, reducing the penalty ## for dropping parts of the name fullname <- "JOSE ELIAS TEJADA BASQUES" name <- "JOSE BASQUES" # Tokenize strings on white space fullname <- strsplit(fullname, '\\s+') name <- strsplit(name, '\\s+') comparator <- FuzzyTokenSet(deletion = 0.5) comparator(fullname, name) < comparator(name, fullname) # TRUE
## Compare names with heterogenous representations x <- "The University of California - San Diego" y <- "Univ. Calif. San Diego" # Tokenize strings on white space x <- strsplit(x, '\\s+') y <- strsplit(y, '\\s+') FuzzyTokenSet()(x, y) # Reduce the cost associated with missing words FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y) ## Compare full name with abbreviated name, reducing the penalty ## for dropping parts of the name fullname <- "JOSE ELIAS TEJADA BASQUES" name <- "JOSE BASQUES" # Tokenize strings on white space fullname <- strsplit(fullname, '\\s+') name <- strsplit(name, '\\s+') comparator <- FuzzyTokenSet(deletion = 0.5) comparator(fullname, name) < comparator(name, fullname) # TRUE
Geometric Mean
gmean(x, ...) ## Default S3 method: gmean(x, na.rm = FALSE, ...)
gmean(x, ...) ## Default S3 method: gmean(x, na.rm = FALSE, ...)
x |
An R object. Currently there are methods for numeric/logical
vectors and date, date-time and time interval objects. Complex vectors
are allowed for |
... |
further arguments passed to or from other methods. |
na.rm |
a logical value indicating whether |
The geometric mean of the values in x
is computed, as a numeric
or complex vector of length one. If x
is not logical (coerced to
numeric), numeric (including integer) or complex, NA_real_
is returned,
with a warning.
mean
for the arithmetic mean and hmean
for the harmonic
mean.
x <- c(1:10, 50) xm <- gmean(x)
x <- c(1:10, 50) xm <- gmean(x)
The Hamming distance between two strings/sequences of equal length is the number of positions where the corresponding characters/sequence elements differ. It can be viewed as a type of edit distance where the only permitted operation is substitution of characters/sequence elements.
Hamming( normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
Hamming( normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
normalize |
a logical. If TRUE, distances/similarities are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
When the input strings/sequences and
are of
different lengths (
), the Hamming distance
is defined to be
.
A Hamming similarity is returned if similarity = TRUE
. When
the similarity is defined as follows:
where is the Hamming similarity and
is the Hamming
distance. When
the similarity is defined to
be 0.
Normalization of the Hamming distance/similarity to the unit interval is
also supported by setting normalize = TRUE
. The raw distance/similarity
is divided by the length of the string/sequence . If
the normalized distance is defined to be 1,
while the normalized similarity is defined to be 0.
A Hamming
instance is returned, which is an S4 class inheriting from
StringComparator
.
While the unnormalized Hamming distance is a metric, the normalized variant is not as it does not satisfy the triangle inequality.
Other edit-based comparators include LCS
, Levenshtein
,
OSA
and DamerauLevenshtein
.
## Compare US ZIP codes x <- "90001" y <- "90209" m1 <- Hamming() # unnormalized distance m2 <- Hamming(similarity = TRUE, normalize = TRUE) # normalized similarity m1(x, y) m2(x, y)
## Compare US ZIP codes x <- "90001" y <- "90209" m1 <- Hamming() # unnormalized distance m2 <- Hamming(similarity = TRUE, normalize = TRUE) # normalized similarity m1(x, y) m2(x, y)
Harmonic Mean
hmean(x, ...) ## Default S3 method: hmean(x, trim = 0, na.rm = FALSE, ...)
hmean(x, ...) ## Default S3 method: hmean(x, trim = 0, na.rm = FALSE, ...)
x |
An R object. Currently there are methods for numeric/logical
vectors and date, date-time and time interval objects. Complex vectors
are allowed for |
... |
further arguments passed to or from other methods. |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each
end of |
na.rm |
a logical value indicating whether |
If trim
is zero (the default), the harmonic mean of the values
in x
is computed, as a numeric or complex vector of length one. If x
is not logical (coerced to numeric), numeric (including integer) or
complex, NA_real_
is returned, with a warning.
If trim
is non-zero, a symmetrically trimmed mean is computed with a
fraction of trim observations deleted from each end before the mean
is computed.
mean
for the arithmetic mean and gmean
for the geometric
mean.
x <- c(1:10, 50) xm <- hmean(x)
x <- c(1:10, 50) xm <- hmean(x)
Compares a pair of strings and
using a reference vocabulary.
Different scores are returned depending on whether both/one/neither of
and
are in the reference vocabulary.
InVocabulary( vocab, both_in_distinct = 0.7, both_in_same = 1, one_in = 1, none_in = 1, ignore_case = FALSE )
InVocabulary( vocab, both_in_distinct = 0.7, both_in_same = 1, one_in = 1, none_in = 1, ignore_case = FALSE )
vocab |
a vector containing in-vocabulary (known) strings. Any strings not in this vector are out-of-vocabulary (unknown). |
both_in_distinct |
score to return if the pair of values being
compared are both in |
both_in_same |
score to return if the pair of values being
compared are both in |
one_in |
score to return if only one of the pair of values being
compared is in |
none_in |
score to return if none of the pair of values being
compared is in |
ignore_case |
a logical. If TRUE, case is ignored when comparing the strings. |
This comparator is not intended to produce useful scores on its own. Rather, it is intended to produce multiplicative factors which can be applied to other similarity/distance scores. It is particularly useful for comparing names when a reference list (vocabulary) of known names is available. For example, it can be configured to down-weight the similarity scores of distinct (known) names like "Roberto" and "Umberto" which are semantically very different, but deceptively similar in terms of edit distance. The normalized Levenshtein similarity for these two names is 75%, but their similarity can be reduced to 53% if multiplied by the score from this comparator using the default settings.
An InVocabulary
instance is returned, which is an S4 class inheriting from
StringComparator
.
## Compare names with possible typos using a reference of known names known_names <- c("Roberto", "Umberto", "Alberto", "Emberto", "Norberto", "Humberto") m1 <- InVocabulary(known_names) m2 <- Levenshtein(similarity = TRUE, normalize = TRUE) x <- "Emberto" y <- c("Enberto", "Umberto") # "Emberto" and "Umberto" are likely to refer to distinct people (since # they are known distinct names) so their Levenshtein similarity is # downweighted to 0.61. "Emberto" and "Enberto" may refer to the same # person (likely typo), so their Levenshtein similarity of 0.87 is not # downweighted. similarities <- m1(x, y) * m2(x, y)
## Compare names with possible typos using a reference of known names known_names <- c("Roberto", "Umberto", "Alberto", "Emberto", "Norberto", "Humberto") m1 <- InVocabulary(known_names) m2 <- Levenshtein(similarity = TRUE, normalize = TRUE) x <- "Emberto" y <- c("Enberto", "Umberto") # "Emberto" and "Umberto" are likely to refer to distinct people (since # they are known distinct names) so their Levenshtein similarity is # downweighted to 0.61. "Emberto" and "Enberto" may refer to the same # person (likely typo), so their Levenshtein similarity of 0.87 is not # downweighted. similarities <- m1(x, y) * m2(x, y)
Compares a pair of strings/sequences x
and y
based on the number of
greedily-aligned characters/sequence elements and the number of
transpositions. It was developed for comparing names at the U.S. Census
Bureau.
Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)
Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)
similarity |
a logical. If TRUE, similarity scores are returned (default), otherwise distances are returned (see definition under Details). |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x
and y
are strings in this section,
however the comparator is also implemented for more general sequences.
When similarity = TRUE
(default), the Jaro similarity is computed as
where is the number of "matching" characters (defined below),
is the number of "transpositions", and
are the
lengths of the strings
and
. The similarity takes on values
in the range
, where 1 corresponds to a perfect match.
The number of "matching" characters is computed using a greedy
alignment algorithm. The algorithm iterates over the characters in
,
attempting to align the
-th character
with the first
matching character in
. When looking for matching characters in
, the algorithm only considers previously un-matched characters
within a window
where
.
The alignment process yields a subsequence of matching characters from
and
. The number of "transpositions"
is defined to
be the number of positions in the subsequence of
which are
misaligned with the corresponding position in
.
When similarity = FALSE
, the Jaro distance is computed as
A Jaro
instance is returned, which is an S4 class inheriting from
StringComparator
.
The Jaro distance is not a metric, as it does not satisfy the
identity axiom
Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.
The JaroWinkler
comparator modifies the Jaro
comparator by
boosting the similarity score for strings/sequences that have matching
prefixes.
## Compare names Jaro()("Martha", "Mathra") Jaro()("Eileen", "Phyllis")
## Compare names Jaro()("Martha", "Mathra") Jaro()("Eileen", "Phyllis")
The Jaro-Winkler comparator is a variant of the Jaro
comparator which
boosts the similarity score for strings/sequences with matching prefixes.
It was developed for comparing names at the U.S. Census Bureau.
JaroWinkler( p = 0.1, threshold = 0.7, max_prefix = 4L, similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE )
JaroWinkler( p = 0.1, threshold = 0.7, max_prefix = 4L, similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE )
p |
a non-negative numeric scalar no larger than 1/max_prefix. Similarity scores eligible for boosting are scaled by this factor. |
threshold |
a numeric scalar on the unit interval. Jaro similarities greater than this value are boosted based on matching characters in the prefixes of both strings. Jaro similarities below this value are returned unadjusted. Defaults to 0.7. |
max_prefix |
a non-negative integer scalar, specifying the size of the prefix to consider for boosting. Defaults to 4 (characters). |
similarity |
a logical. If TRUE, similarity scores are returned (default), otherwise distances are returned (see definition under Details). |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x
and y
are strings in this section,
however the comparator is also implemented for more general sequences.
The Jaro-Winkler similarity (computed when similarity = TRUE
) is
defined in terms of the Jaro
similarity. If the Jaro similarity
between strings
and
exceeds a
user-specified threshold
,
the similarity score is boosted in proportion to the number of matching
characters in the prefixes of
and
. More precisely, the
Jaro-Winkler similarity is defined as:
where is the length of the common prefix,
is a user-specified upper bound on the prefix size, and
is a scaling factor.
The Jaro-Winkler distance is computed when similarity = FALSE
and is
defined as
A JaroWinkler
instance is returned, which is an S4 class inheriting from
StringComparator
.
Like the Jaro distance, the Jaro-Winkler distance is not a metric as it does not satisfy the identity axiom.
Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.
Winkler, W. E. (2006), "Overview of Record Linkage and Current Research Directions", Tech. report. Statistics #2006-2. Statistical Research Division, U.S. Census Bureau.
Winkler, W., McLaughlin G., Jaro M. and Lynch M. (1994), strcmp95.c, Version 2. United States Census Bureau.
This comparator reduces to the Jaro
comparator when max_prefix = 0L
or threshold = 0.0
.
## Compare names JaroWinkler()("Martha", "Mathra") JaroWinkler()("Eileen", "Phyllis") ## Reduce the threshold for boosting x <- "Matthew" y <- "Martin" JaroWinkler()(x, y) < JaroWinkler(threshold = 0.5)(x, y)
## Compare names JaroWinkler()("Martha", "Mathra") JaroWinkler()("Eileen", "Phyllis") ## Reduce the threshold for boosting x <- "Matthew" y <- "Martin" JaroWinkler()(x, y) < JaroWinkler(threshold = 0.5)(x, y)
The Longest Common Subsequence (LCS) distance between two
strings/sequences and
is the minimum cost of operations
(insertions and deletions) required to transform
into
.
The LCS similarity is more commonly used, which can be interpreted as the
length of the longest subsequence common to
and
.
LCS( deletion = 1, insertion = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
LCS( deletion = 1, insertion = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x
and y
are strings in this section,
however the comparator is also implemented for more general sequences.
An LCS similarity is returned if similarity = TRUE
, which
is defined as
where ,
are the number of characters in
and
respectively,
is the LCS distance,
is the cost of a deletion and
is the cost of an insertion.
Normalization of the LCS distance/similarity to the unit interval
is also supported by setting normalize = TRUE
. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion and deletion
are equal.
The normalized distance
is defined as
and the normalized similarity is defined as
A LCS
instance is returned, which is an S4 class inheriting from
StringComparator
.
If the costs of deletion and insertion are equal, this comparator is
symmetric in and
. In addition, the normalized and
unnormalized distances satisfy the properties of a metric.
Bergroth, L., Hakonen, H., & Raita, T. (2000), "A survey of longest common subsequence algorithms", Proceedings Seventh International Symposium on String Processing and Information Retrieval (SPIRE'00), 39-48.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091–1095.
Other edit-based comparators include Hamming
, Levenshtein
,
OSA
and DamerauLevenshtein
.
## There are no common substrings of size 3 for the following example, ## however there are two common substrings of size 2: "AC" and "BC". ## Hence the LCS similarity is 2. x <- "ABCDA"; y <- "BAC" LCS(similarity = TRUE)(x, y) ## Levenshtein distance reduces to LCS distance when the cost of ## substitution is high x <- "ABC"; y <- "AAA" LCS()(x, y) == Levenshtein(substitution = 100)(x, y)
## There are no common substrings of size 3 for the following example, ## however there are two common substrings of size 2: "AC" and "BC". ## Hence the LCS similarity is 2. x <- "ABCDA"; y <- "BAC" LCS(similarity = TRUE)(x, y) ## Levenshtein distance reduces to LCS distance when the cost of ## substitution is high x <- "ABC"; y <- "AAA" LCS()(x, y) == Levenshtein(substitution = 100)(x, y)
The Levenshtein (edit) distance between two strings/sequences and
is the minimum cost of operations (insertions, deletions or
substitutions) required to transform
into
.
Levenshtein( deletion = 1, insertion = 1, substitution = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
Levenshtein( deletion = 1, insertion = 1, substitution = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
substitution |
positive cost associated with substitution of a character or sequence element. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x
and y
are strings in this section,
however the comparator is also implemented for more general sequences.
A Levenshtein similarity is returned if similarity = TRUE
, which
is defined as
where ,
are the number of characters in
and
respectively,
is the Levenshtein distance,
is the cost of a deletion and
is the cost of an
insertion.
Normalization of the Levenshtein distance/similarity to the unit interval
is also supported by setting normalize = TRUE
. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion and deletion
are equal.
The normalized distance
is defined as
and the normalized similarity is defined as
A Levenshtein
instance is returned, which is an S4 class inheriting from
StringComparator
.
If the costs of deletion and insertion are equal, this comparator is
symmetric in and
. In addition, the normalized and
unnormalized distances satisfy the properties of a metric.
Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091–1095.
Other edit-based comparators include Hamming
, LCS
,
OSA
and DamerauLevenshtein
.
## Compare names with potential typos x <- c("Brian Cheng", "Bryan Cheng", "Kondo Onyejekwe", "Condo Onyejekve") pairwise(Levenshtein(), x, return_matrix = TRUE) ## When the substitution cost is high, Levenshtein distance reduces to LCS distance Levenshtein(substitution = 100)("Iran", "Iraq") == LCS()("Iran", "Iraq")
## Compare names with potential typos x <- c("Brian Cheng", "Bryan Cheng", "Kondo Onyejekwe", "Condo Onyejekve") pairwise(Levenshtein(), x, return_matrix = TRUE) ## When the substitution cost is high, Levenshtein distance reduces to LCS distance Levenshtein(substitution = 100)("Iran", "Iraq") == LCS()("Iran", "Iraq")
Compares a pair of strings and
by retrieving
their distance/similarity score from a provided lookup table.
Lookup( lookup_table, values_colnames, score_colname, default_match = 0, default_nonmatch = NA_real_, symmetric = TRUE, ignore_case = FALSE )
Lookup( lookup_table, values_colnames, score_colname, default_match = 0, default_nonmatch = NA_real_, symmetric = TRUE, ignore_case = FALSE )
lookup_table |
data frame containing distances/similarities for pairs of values |
values_colnames |
character vector containing the colnames
corresponding to pairs of values (e.g. strings) in |
score_colname |
name of column that contains distances/similarities
in |
default_match |
distance/similarity to use if the pair of values
match exactly and do not appear in |
default_nonmatch |
distance/similarity to use if the pair of values are
not an exact match and do not appear in |
symmetric |
whether the underlying distance/similarity scores are
symmetric. If TRUE |
ignore_case |
a logical. If TRUE, case is ignored when comparing the strings. |
The lookup table should contain three columns corresponding to ,
and
(
values_colnames
below) and the distance/similarity
(score_colname
below). If a pair of values and
is
not in the lookup table, a default distance/similarity is returned
depending on whether
(
default_match
below) or
(
default_nonmatch
below).
A Lookup
instance is returned, which is an S4 class inheriting from
StringComparator
.
## Measure the distance between cities lookup_table <- data.frame(x = c("Melbourne", "Melbourne", "Sydney"), y = c("Sydney", "Brisbane", "Brisbane"), dist = c(713.4, 1374.8, 732.5)) comparator <- Lookup(lookup_table, c("x", "y"), "dist") comparator("Sydney", "Melbourne") comparator("Melbourne", "Perth")
## Measure the distance between cities lookup_table <- data.frame(x = c("Melbourne", "Melbourne", "Sydney"), y = c("Sydney", "Brisbane", "Brisbane"), dist = c(713.4, 1374.8, 732.5)) comparator <- Lookup(lookup_table, c("x", "y"), "dist") comparator("Sydney", "Melbourne") comparator("Melbourne", "Perth")
The Manhattan distance (a.k.a. L-1 distance) between two vectors and
is the sum of the absolute differences of their Cartesian
coordinates:
Manhattan()
Manhattan()
A Manhattan
instance is returned, which is an S4 class inheriting
from Minkowski
.
The Manhattan distance is a special case of the Minkowski
distance with .
Other numeric comparators include Euclidean
, Minkowski
and
Chebyshev
.
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Manhattan()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Manhattan() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Manhattan()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Manhattan() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
The Minkowski distance (a.k.a. L-p distance) between two vectors and
is the p-th root of the sum of the absolute differences of their
Cartesian coordinates raised to the p-th power:
Minkowski(p = 2)
Minkowski(p = 2)
p |
a positive numeric specifying the order of the distance. Defaults
to 2 (Euclidean distance). If |
A Minkowski
instance is returned, which is an S4 class inheriting
from NumericComparator
.
Other numeric comparators include Manhattan
, Euclidean
and
Chebyshev
.
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Minkowski()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Minkowski() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
## Distance between two vectors x <- c(0, 1, 0, 1, 0) y <- seq_len(5) Minkowski()(x, y) ## Distance between rows (elementwise) of two matrices comparator <- Minkowski() x <- matrix(rnorm(25), nrow = 5) y <- matrix(rnorm(5), nrow = 1) elementwise(comparator, x, y) ## Distance between rows (pairwise) of two matrices pairwise(comparator, x, y)
Compares a pair of token sets and
by computing similarity
scores between all pairs of tokens using an internal string comparator,
then taking the mean of the maximum scores for each token in
.
MongeElkan( inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE), agg_function = base::mean, symmetrize = FALSE )
MongeElkan( inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE), agg_function = base::mean, symmetrize = FALSE )
inner_comparator |
internal string comparator of class
|
agg_function |
aggregation function to use when aggregating internal
similarities/distances between tokens. Defaults to |
symmetrize |
logical indicating whether to use a symmetrized version of the Monge-Elkan comparator. Defaults to FALSE. |
A token set is an unordered enumeration of tokens, which may include
duplicates.
Given two token sets and
, the Monge-Elkan comparator is
defined as:
where is the i-th token in
,
is the
number of tokens in
and
is an internal
string similarity comparator.
A generalization of the original Monge-Elkan comparator is implemented here, which allows for distance comparators in place of similarity comparators, and/or more general aggregation functions in place of the arithmetic mean. The generalized Monge-Elkan comparator is defined as:
where is an internal distance or similarity
comparator,
is
if
is a similarity comparator or
if
it is a distance comparator, and
is an aggregation
function which takes a vector of scores for each token in
and
returns a scalar.
By default, the Monge-Elkan comparator is asymmetric in its arguments
and
. If
symmetrize = TRUE
, a symmetric version of the comparator
is obtained as follows
where is defined above.
A MongeElkan
instance is returned, which is an S4 class inheriting from
StringComparator
.
Monge, A. E., & Elkan, C. (1996), "The Field Matching Problem: Algorithms and Applications", In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 267-270.
Jimenez, S., Becerra, C., Gelbukh, A., & Gonzalez, F. (2009), "Generalized Monge-Elkan Method for Approximate Text String Comparison", In Computational Linguistics and Intelligent Text Processing, pp. 559-570.
## Compare names with heterogenous representations x <- "The University of California - San Diego" y <- "Univ. Calif. San Diego" # Tokenize strings on white space x <- strsplit(x, '\\s+') y <- strsplit(y, '\\s+') MongeElkan()(x, y) ## The symmetrized variant is arguably more appropriate for this example MongeElkan(symmetrize = TRUE)(x, y) ## Using a different internal comparator changes the result MongeElkan(inner_comparator = BinaryComp(), symmetrize=TRUE)(x, y)
## Compare names with heterogenous representations x <- "The University of California - San Diego" y <- "Univ. Calif. San Diego" # Tokenize strings on white space x <- strsplit(x, '\\s+') y <- strsplit(y, '\\s+') MongeElkan()(x, y) ## The symmetrized variant is arguably more appropriate for this example MongeElkan(symmetrize = TRUE)(x, y) ## Using a different internal comparator changes the result MongeElkan(inner_comparator = BinaryComp(), symmetrize=TRUE)(x, y)
Represents a comparator for comparing pairs of numeric vectors.
.Data
a function that calls the elementwise method for this class,
with arguments x
, y
and ...
.
symmetric
a logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y)
is identical to
comparator(y, x)
.
distance
a logical of length 1. If TRUE
, the comparator produces
distances and satisfies comparator(x, x) = 0
. The comparator may not
satisfy all of the properties of a distance metric.
similarity
a logical of length 1. If TRUE
, the comparator produces
similarity scores.
tri_inequal
a logical of length 1. If TRUE
, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE
and symmetric = TRUE
.
The Optimal String Alignment (OSA) distance between two strings/sequences
and
is the minimum cost of operations (insertions,
deletions, substitutions or transpositions) required to transform
into
, subject to the constraint that no substring/subsequence is
edited more than once.
OSA( deletion = 1, insertion = 1, substitution = 1, transposition = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
OSA( deletion = 1, insertion = 1, substitution = 1, transposition = 1, normalize = FALSE, similarity = FALSE, ignore_case = FALSE, use_bytes = FALSE )
deletion |
positive cost associated with deletion of a character or sequence element. Defaults to unit cost. |
insertion |
positive cost associated insertion of a character or sequence element. Defaults to unit cost. |
substitution |
positive cost associated with substitution of a character or sequence element. Defaults to unit cost. |
transposition |
positive cost associated with transposing (swapping) a pair of characters or sequence elements. Defaults to unit cost. |
normalize |
a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. |
similarity |
a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. |
ignore_case |
a logical. If TRUE, case is ignored when comparing strings. |
use_bytes |
a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. |
For simplicity we assume x
and y
are strings in this section,
however the comparator is also implemented for more general sequences.
An OSA similarity is returned if similarity = TRUE
, which
is defined as
where ,
are the number of characters in
and
respectively,
is the OSA distance,
is the cost of a deletion and
is the cost of an insertion.
Normalization of the OSA distance/similarity to the unit interval
is also supported by setting normalize = TRUE
. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion and deletion
are equal.
The normalized distance
is defined as
and the normalized similarity is defined as
An OSA
instance is returned, which is an S4 class inheriting from
StringComparator
.
If the costs of deletion and insertion are equal, this comparator is
symmetric in and
. The OSA distance is not a proper metric
as it does not satisfy the triangle inequality. The Damerau-Levenshtein
distance is closely related—it allows the same edit operations as OSA,
but removes the requirement that no substring can be edited more than once.
Boytsov, L. (2011), "Indexing methods for approximate dictionary searching: Comparative analysis", ACM J. Exp. Algorithmics 16, Article 1.1.
Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29: 1091–1095.
Other edit-based comparators include Hamming
, LCS
,
Levenshtein
and DamerauLevenshtein
.
## Compare strings with a transposition error x <- "plauge"; y <- "plague" OSA()(x, y) != Levenshtein()(x, y) ## Unlike Damerau-Levenshtein, OSA does not allow a substring to be ## edited more than once x <- "ABC"; y <- "CA" OSA()(x, y) != DamerauLevenshtein()(x, y) ## Compare car names using normalized OSA similarity data(mtcars) cars <- rownames(mtcars) pairwise(OSA(similarity = TRUE, normalize=TRUE), cars)
## Compare strings with a transposition error x <- "plauge"; y <- "plague" OSA()(x, y) != Levenshtein()(x, y) ## Unlike Damerau-Levenshtein, OSA does not allow a substring to be ## edited more than once x <- "ABC"; y <- "CA" OSA()(x, y) != DamerauLevenshtein()(x, y) ## Compare car names using normalized OSA similarity data(mtcars) cars <- rownames(mtcars) pairwise(OSA(similarity = TRUE, normalize=TRUE), cars)
Computes pairwise similarities/distances between two collections of objects (strings, vectors, etc.) using the provided comparator.
pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Comparator,ANY,missing' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'CppSeqComparator,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'CppSeqComparator,list,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'StringComparator,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'StringComparator,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'NumericComparator,matrix,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'NumericComparator,vector,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Chebyshev,matrix,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Chebyshev,matrix,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Minkowski,matrix,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'Minkowski,matrix,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Minkowski,matrix,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'FuzzyTokenSet,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'FuzzyTokenSet,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'InVocabulary,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'InVocabulary,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Lookup,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Lookup,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'MongeElkan,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'MongeElkan,list,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...)
pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Comparator,ANY,missing' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'CppSeqComparator,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'CppSeqComparator,list,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'StringComparator,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'StringComparator,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'NumericComparator,matrix,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'NumericComparator,vector,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Chebyshev,matrix,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Chebyshev,matrix,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Minkowski,matrix,matrix' elementwise(comparator, x, y, ...) ## S4 method for signature 'Minkowski,matrix,matrix' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Minkowski,matrix,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'FuzzyTokenSet,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'FuzzyTokenSet,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'InVocabulary,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'InVocabulary,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Lookup,vector,vector' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'Lookup,vector,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'MongeElkan,list,list' pairwise(comparator, x, y, return_matrix = FALSE, ...) ## S4 method for signature 'MongeElkan,list,'NULL'' pairwise(comparator, x, y, return_matrix = FALSE, ...)
comparator |
a comparator used to compare the objects, which is a
sub-class of |
x , y
|
a collection of objects to compare, typically stored as entries
in an atomic vector, rows in a matrix, or entries in a list. The required
format depends on the type of |
return_matrix |
a logical of length 1. If FALSE (default), the pairwise
similarities/distances will be returned as a |
... |
other parameters passed on to other methods. |
If both x
and y
are specified, every object in x
is compared with
every object in y
using the comparator, and the resulting scores are
returned in a size(x)
by size(y)
matrix.
If only x
is specified, then the objects in x
are compared with
themselves using the comparator, and the resulting scores are returned in a
size(x)
by size(y)
matrix.
By default, the matrix is represented as an instance of the
PairwiseMatrix
class, which is more space-efficient for symmetric
comparators when y
is not specified. However, if return_matrix = TRUE
,
the matrix is returned as an ordinary matrix
instead.
comparator = Comparator,x = ANY,y = missing
: Compute a pairwise comparator when y
comparator = CppSeqComparator,x = list,y = list
: Specialization for CppSeqComparator
where x
and y
are lists of sequences (vectors) to compare.
comparator = CppSeqComparator,x = list,y = NULL
: Specialization for CppSeqComparator
where x
is
a list of sequences (vectors) to compare.
comparator = StringComparator,x = vector,y = vector
: Specialization for StringComparator
where x
and y
are vectors of strings to compare.
comparator = StringComparator,x = vector,y = NULL
: Specialization for StringComparator
where x
is a vector of strings to compare.
comparator = NumericComparator,x = matrix,y = vector
: Specialization for NumericComparator
where x
is a matrix of rows (interpreted as vectors) to compare with a vector y
.
comparator = NumericComparator,x = vector,y = matrix
: Specialization for NumericComparator
where x
is a vector to compare with a matrix y
of rows (interpreted as vectors).
comparator = Chebyshev,x = matrix,y = matrix
: Specialization for Chebyshev
where x
and y
matrices of rows (interpreted as vectors) to compare.
comparator = Chebyshev,x = matrix,y = NULL
: Specialization for Minkowski
where x
is a matrix
of rows (interpreted as vectors) to compare among themselves.
comparator = Minkowski,x = matrix,y = matrix
: Specialization for a Minkowski
where x
and y
matrices of rows (interpreted as vectors) to compare.
comparator = Minkowski,x = matrix,y = matrix
: Specialization for a Minkowski
where x
and y
matrices of rows (interpreted as vectors) to compare.
comparator = Minkowski,x = matrix,y = NULL
: Specialization for Minkowski
where x
is a matrix
of rows (interpreted as vectors) to compare among themselves.
comparator = FuzzyTokenSet,x = list,y = list
: Specialization for FuzzyTokenSet
where x
and y
are
lists of token vectors to compare.
comparator = FuzzyTokenSet,x = vector,y = NULL
: Specialization for FuzzyTokenSet
where x
is a list of token
vectors to compare among themselves.
comparator = InVocabulary,x = vector,y = vector
: Specialization for InVocabulary
where x
and y
are vectors of strings to compare.
comparator = InVocabulary,x = vector,y = NULL
: Specialization for InVocabulary
where x
is a
vector of strings to compare among themselves.
comparator = Lookup,x = vector,y = vector
: Specialization for a Lookup
where x
and y
are
vectors of strings to compare
comparator = Lookup,x = vector,y = NULL
: Specialization for Lookup
where x
is a vector of
strings to compare among themselves
comparator = MongeElkan,x = list,y = list
: Specialization for MongeElkan
where x
and y
are
lists of token vectors to compare.
comparator = MongeElkan,x = list,y = NULL
: Specialization for MongeElkan
where x
is a list
of token vectors to compare among themselves.
## Computing the distances between a query point y (a 3D numeric vector) ## and a set of reference points x x <- rbind(c(1,0,1), c(0,0,0), c(-1,2,-1)) y <- c(10, 5, 10) pairwise(Manhattan(), x, y) ## Computing the pairwise similarities among a set of strings x <- c("Benjamin", "Ben", "Benny", "Bne", "Benedict", "Benson") comparator <- DamerauLevenshtein(similarity = TRUE, normalize = TRUE) pairwise(comparator, x, return_matrix = TRUE) # return an ordinary matrix
## Computing the distances between a query point y (a 3D numeric vector) ## and a set of reference points x x <- rbind(c(1,0,1), c(0,0,0), c(-1,2,-1)) y <- c(10, 5, 10) pairwise(Manhattan(), x, y) ## Computing the pairwise similarities among a set of strings x <- c("Benjamin", "Ben", "Benny", "Bne", "Benedict", "Benson") comparator <- DamerauLevenshtein(similarity = TRUE, normalize = TRUE) pairwise(comparator, x, return_matrix = TRUE) # return an ordinary matrix
Represents a pairwise similarity or distance matrix.
as.PairwiseMatrix(x, ...) ## S4 method for signature 'matrix' as.PairwiseMatrix(x, ...) ## S4 method for signature 'PairwiseMatrix' as.matrix(x, ...)
as.PairwiseMatrix(x, ...) ## S4 method for signature 'matrix' as.PairwiseMatrix(x, ...) ## S4 method for signature 'PairwiseMatrix' as.matrix(x, ...)
x |
an R object. |
... |
additional arguments to be passed to methods. |
If the elements being compared are from the same set, the matrix may be symmetric if the comparator is symmetric. In this case, entries in the upper triangle and/or along the diagonal may not be stored in memory, since they are redundant.
as.PairwiseMatrix
: Convert an R object x
to a PairwiseMatrix
.
as.PairwiseMatrix,matrix-method
: Convert an ordinary matrix
x
to a PairwiseMatrix
.
as.matrix,PairwiseMatrix-method
: Convert a PairwiseMatrix
x
to an ordinary matrix
.
.Data
entries of the matrix in column-major order. Entries in the upper triangle and/or on the diagonal may be omitted.
Dim
integer vector of length 2. The dimensions of the matrix.
Diag
logical indicating whether the diagonal entries are stored in
.Data
.
Represents a comparator for pairs of sequences.
.Data
a function that calls the elementwise method for this class,
with arguments x
, y
and ...
.
symmetric
a logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y)
is identical to
comparator(y, x)
.
distance
a logical of length 1. If TRUE
, the comparator produces
distances and satisfies comparator(x, x) = 0
. The comparator may not
satisfy all of the properties of a distance metric.
similarity
a logical of length 1. If TRUE
, the comparator produces
similarity scores.
tri_inequal
a logical of length 1. If TRUE
, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE
and symmetric = TRUE
.
Represents a comparator for pairs of strings.
.Data
a function that calls the elementwise method for this class,
with arguments x
, y
and ...
.
symmetric
a logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y)
is identical to
comparator(y, x)
.
distance
a logical of length 1. If TRUE
, the comparator produces
distances and satisfies comparator(x, x) = 0
. The comparator may not
satisfy all of the properties of a distance metric.
similarity
a logical of length 1. If TRUE
, the comparator produces
similarity scores.
tri_inequal
a logical of length 1. If TRUE
, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE
and symmetric = TRUE
.
ignore_case
a logical of length 1. If TRUE, case is ignored when comparing strings. Defaults to FALSE.
use_bytes
a logical of length 1. If TRUE, strings are compared byte-by-byte rather than character-by-character.
Represents a comparator for pairs of token sequences.
.Data
a function that calls the elementwise method for this class,
with arguments x
, y
and ...
.
symmetric
a logical of length 1. If TRUE, the comparator is symmetric
in its arguments—i.e. comparator(x, y)
is identical to
comparator(y, x)
.
distance
a logical of length 1. If TRUE
, the comparator produces
distances and satisfies comparator(x, x) = 0
. The comparator may not
satisfy all of the properties of a distance metric.
similarity
a logical of length 1. If TRUE
, the comparator produces
similarity scores.
tri_inequal
a logical of length 1. If TRUE
, the comparator satisfies
the triangle inequality. This is only possible (but not guaranteed) if
distance = TRUE
and symmetric = TRUE
.
ordered
a logical of length 1. If TRUE, the comparator treats token sequences as ordered, otherwise they are treated as unordered.