Hadley Wickham, with Romain Francois and Dirk Eddelbuettel, has released a new package called dplyr. Once again Wickham has redesigned an important component of what one does with R. The package is designed specifically for rectangular datasets, which may live in memory or in databases. With the new package, Hadley also convincingly closes the efficiency gap between plyr and data.table. This new efficiency comes largely from the various C++ injections into the performance-critical branches of code in dplyr (thanks to Francois and Eddelbuettel).
On the other hand, data.table, a mature and popular project started by Matthew Dowle, has been the go-to for performance-centric R programmers for some time now. While dplyr poses a serious challenge to data.table’s claim to fame, both data.table and Dowle are old hands at such competition. There has long been (very healthy) competition between Hadley’s plyr, Dowle’s data.table, and Wes McKinney’s pandas (a data munging library for Python).
In this post I add another data point to the set of benchmarks of the two packages. For the official take, see this and this.
The set up
Because both dplyr and data.table have been written with efficiency for big datasets in mind, we must consider at least a moderately sized dataset for the benchmarks. Based on the memory and patience I had available while writing the benchmarks, I created this synthetic dataset for the benchmarking.
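A minimal sketch of a dataset with the shape that the steps in this post rely on might look like the following. It is a stand-in, not the data behind the timings: the row count, the seed, the distributions, and the column names lc and uc (for the lowercase and uppercase letter columns) are all assumptions.

```r
# Hypothetical stand-in for the synthetic dataset.
# Row count, seed, distributions, and the names `lc`/`uc` are assumptions.
set.seed(42)
n <- 1e6  # assumed size; scale to the memory you have available

samp <- data.frame(
  x  = runif(n),                            # positive values, so x / sum(x) is well defined
  y  = rnorm(n),                            # values to be centred and scaled later
  z  = sample(1:100, n, replace = TRUE),    # grouping key
  lc = sample(letters, n, replace = TRUE),  # lowercase letter per row
  uc = sample(LETTERS, n, replace = TRUE),  # uppercase letter per row
  stringsAsFactors = FALSE
)

str(samp)
```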
The following are the steps we want to perform to transform this sample into a (not so) useful summary.
- Filter samp to include only the first twenty of letters and the last twenty of LETTERS.
- Select only columns x, y, and z out of the data.frame.
- Create two new columns:
  - xProp = x / sum(x)
  - yScale = (y - mean(y)) / sd(y)
- Calculate mean(xProp) and mean(yScale) by z.
- Arrange / order output by z.
Expression
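The steps above can be sketched in the three styles compared in this post: base R, dplyr, and data.table. This is an illustrative reconstruction written against current dplyr and data.table syntax, not the original listings. The small dataset, the column names lc and uc, and the helper vectors first20 and last20 are assumptions made for the example.

```r
library(dplyr)
library(data.table)

# Small hypothetical dataset; `lc` and `uc` are assumed column names.
set.seed(1)
n <- 1e4
samp <- data.frame(
  x  = runif(n),
  y  = rnorm(n),
  z  = sample(1:10, n, replace = TRUE),
  lc = sample(letters, n, replace = TRUE),
  uc = sample(LETTERS, n, replace = TRUE),
  stringsAsFactors = FALSE
)
first20 <- head(letters, 20)  # first twenty lowercase letters
last20  <- tail(LETTERS, 20)  # last twenty uppercase letters

# --- base R ---
kept <- samp[samp$lc %in% first20 & samp$uc %in% last20, c("x", "y", "z")]
kept$xProp  <- kept$x / sum(kept$x)
kept$yScale <- (kept$y - mean(kept$y)) / sd(kept$y)
base_out <- aggregate(cbind(xProp, yScale) ~ z, data = kept, FUN = mean)
base_out <- base_out[order(base_out$z), ]

# --- dplyr ---
dplyr_out <- samp %>%
  filter(lc %in% first20, uc %in% last20) %>%
  select(x, y, z) %>%
  mutate(xProp = x / sum(x), yScale = (y - mean(y)) / sd(y)) %>%
  group_by(z) %>%
  summarise(xProp = mean(xProp), yScale = mean(yScale)) %>%
  arrange(z)

# --- data.table ---
dt <- as.data.table(samp)
kept_dt <- dt[lc %in% first20 & uc %in% last20, .(x, y, z)]
kept_dt[, `:=`(xProp = x / sum(x), yScale = (y - mean(y)) / sd(y))]
# keyby groups by z and returns the result ordered by z
dt_out <- kept_dt[, .(xProp = mean(xProp), yScale = mean(yScale)), keyby = z]

# Sanity check: the three approaches should agree.
stopifnot(isTRUE(all.equal(base_out$xProp, dplyr_out$xProp)))
stopifnot(isTRUE(all.equal(base_out$yScale, dt_out$yScale)))
```

Note that the new columns are computed over the whole filtered sample before grouping, so the grouping by z only affects the final means.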
IMHO, the syntax for both data.table and dplyr is cleaner and more consistent than base R's for the kind of operations considered. But this is unsurprising because both are Domain Specific Languages (DSLs) built specifically for this limited functionality. It is also annoying that both syntaxes confuse Vim and demand manual formatting.
Both dplyr and data.table also accept column names as free variables in expressions. Of the following three expressions,
I find the second form more elegant than the first, although, personally, I also find the third the most correct in intent.
If I were to judge these three on syntax, I’d probably choose dplyr for its design of what is now being called a grammar of data manipulation. data.table, on the other hand, is notorious for being so far removed from traditional R (and everything else) that even advanced R users may find it tiresome. Despite the tiring nature of its syntax, it is quite dense in expression (perhaps a little too dense). I need more experience with dplyr to be able to find situations where this sub-language struggles in expression.
Efficiency benchmarks
Here is the code used to benchmark the three ways of expressing the desired data manipulation:
On a moderately big dataset and a realistic computation, both data.table and dplyr are pretty efficient and pretty similar. However, base R is not orders of magnitude worse. Tight code in base R is still quite competitive; the rub is that it takes significant effort to write tight R code.
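A comparison of this kind can be timed with nothing more than base R's system.time. The sketch below is not the script behind the results discussed here: the dataset, the simplified grouped-mean task, and the helper time_it are hypothetical, chosen only to show the shape of such a benchmark.

```r
library(dplyr)
library(data.table)

# Hypothetical data and task: mean of x and y by z.
set.seed(1)
n <- 2e5
samp <- data.frame(
  x = runif(n),
  y = rnorm(n),
  z = sample(1:100, n, replace = TRUE)
)
dt <- as.data.table(samp)

# One function per approach, so each can be timed the same way.
by_base  <- function() aggregate(cbind(x, y) ~ z, data = samp, FUN = mean)
by_dplyr <- function() samp %>% group_by(z) %>% summarise(x = mean(x), y = mean(y))
by_dt    <- function() dt[, .(x = mean(x), y = mean(y)), keyby = z]

# Hypothetical helper: median elapsed seconds over `reps` runs of f().
time_it <- function(f, reps = 5) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}

timings <- c(
  base       = time_it(by_base),
  dplyr      = time_it(by_dplyr),
  data.table = time_it(by_dt)
)
print(timings)
```

Taking the median over a few repetitions damps one-off noise such as garbage collection. The absolute numbers depend heavily on the machine and on package versions, so only the relative ordering is worth reading into.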
The gap between data.table and dplyr is too small at the moment to tip the scales in anyone’s favor. This essentially corroborates the official benchmarks from the dplyr and data.table communities. However, I am also looking forward to validating the claim that data.table-1.8.11 will raise the bar. Hadley has made similar comments about extending dplyr and making it more efficient. Another noteworthy point is that while data.table is at stable version 1.8.0, dplyr is at current version 0.1. This definitely makes the latter the new cool kid on the block.
Conclusion
Given that software shows significant switching (learning) costs and network benefits, this contest becomes very interesting. There are comparably sized plyr and data.table communities already. Both packages are different enough (from each other and from base R) that they are inherently incompatible standards (though dplyr is trying to subsume data.table). The individual popularity of the two lead developers, Hadley and Dowle, may play a role. I believe that this competition will not be over soon. I am also quite certain that this competition between the incumbent and the challenger will create abundant surplus for the users of R. Amen!
Here is the code gist.