I had never expected such a problem, much less a solution, to exist till I was asked yesterday to solve it. The problem statement: given a dataset and a list of candidate variables, find which minimal combination, if any, is a valid identifier for the observations in the dataset.
Following is a generic with methods for matrix, data.frame, and data.table.
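A minimal sketch of what such a generic could look like, using base R only. The name findIdSketch, the candidates argument, and the return convention (a character vector of column names, or NULL when nothing qualifies) are assumptions made for illustration, not the original code. The data.table method is left out to keep the sketch free of package dependencies.

```r
# Hypothetical generic: dispatches on the class of `x`.
findIdSketch <- function(x, candidates, minCombn = 1L,
                         maxCombn = length(candidates), ...) {
  UseMethod("findIdSketch")
}

# data.frame method: test every combination of k candidate columns
# before moving on to combinations of k + 1 columns.
findIdSketch.data.frame <- function(x, candidates, minCombn = 1L,
                                    maxCombn = length(candidates), ...) {
  stopifnot(all(candidates %in% names(x)),
            minCombn >= 1L,
            minCombn <= maxCombn,
            maxCombn <= length(candidates))
  for (k in seq(minCombn, maxCombn)) {
    for (cols in utils::combn(candidates, k, simplify = FALSE)) {
      # anyDuplicated() returns 0 when no row is repeated.
      if (anyDuplicated(x[, cols, drop = FALSE], ...) == 0L) {
        return(cols)
      }
    }
  }
  NULL  # no combination in the requested size range identifies the rows
}

# matrix method: assumes the matrix has column names and
# delegates to the data.frame method.
findIdSketch.matrix <- function(x, candidates, ...) {
  findIdSketch(as.data.frame(x, stringsAsFactors = FALSE), candidates, ...)
}

# Usage: neither column alone is unique, but the pair is.
obs <- data.frame(id = c(1, 1, 2, 2), grp = c("a", "b", "a", "b"))
findIdSketch(obs, c("id", "grp"))
```

Returning at the first hit is what makes the result minimal in size: no smaller combination can qualify, because all of them were already tested.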
Here are some tests for the implementation above.
The minCombn and maxCombn arguments specify the smallest and largest number of candidate columns to consider together when looking for an identifier. The ... argument can be used to pass further arguments to anyDuplicated. The code tries every combination of k variables before attempting any combination of k + 1 variables, so any identifier it returns is minimal in size.
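To make the search order concrete, here is a small illustration in base R. The candidate names, the size bounds, and the toy data frame are made up for the example. It lists the combinations in the order they would be tried, and then shows how an extra argument such as fromLast changes what anyDuplicated reports without changing whether a duplicate exists.

```r
candidates <- c("a", "b", "c")
minCombn <- 1L
maxCombn <- 2L

# Label each combination, smallest size first.
tryOrder <- unlist(lapply(seq(minCombn, maxCombn), function(k) {
  utils::combn(candidates, k, FUN = paste, collapse = "+")
}))
print(tryOrder)
# "a" "b" "c" "a+b" "a+c" "b+c"

df <- data.frame(a = c(1, 1, 2), b = c("x", "y", "x"))
anyDuplicated(df["a"])                   # 2: row 2 repeats row 1
anyDuplicated(df["a"], fromLast = TRUE)  # 1: scanning from the end, row 1 is the repeat
anyDuplicated(df[c("a", "b")])           # 0: the pair (a, b) is unique per row
```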
Benchmarking
The data.table method for this function was implemented after this discussion at StackOverflow. Therefore, I use a similar example to show the efficiency of the two methods here:
We can see that the default method is faster when evaluating a single column as id. However, that is the only case where it does better. It scales poorly, especially compared with the data.table method. When you are searching for an identifier you will probably try various combinations, and data.table will almost always be much faster.
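A rough way to run this kind of comparison yourself is sketched below. The data, sizes, and column names are invented for illustration, and the data.table part only runs if that package is installed. It relies on data.table's anyDuplicated method accepting the columns through its by argument. No timings are quoted here because they depend on the machine and the data.

```r
set.seed(1)
n <- 5e4
df <- data.frame(a = sample(1000L, n, replace = TRUE),
                 b = sample(1000L, n, replace = TRUE),
                 c = sample(1000L, n, replace = TRUE))
cols <- c("a", "b", "c")

# Base R: subset the columns, then look for repeated rows.
tBase <- system.time(dupBase <- anyDuplicated(df[, cols, drop = FALSE]))

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  # data.table's method checks the columns named in `by` directly.
  tDt <- system.time(dupDt <- anyDuplicated(dt, by = cols))
  print(c(base = tBase[["elapsed"]], data.table = tDt[["elapsed"]]))
  # Both approaches should agree on whether a duplicate exists.
  stopifnot((dupBase == 0L) == (dupDt == 0L))
}
```

Repeating the timing for one, two, and three columns gives a feel for how each approach scales with the size of the combination.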