I must admit to not really understanding the tidyverse attraction, or why I should be doing "tidy data evaluation" rather than using base R.
If I want to use an advanced data manipulation library, I'd typically reach for data.table. If I want to use a verb-based approach, why not SQL rather than dplyr?
I have tried dplyr, and code I wrote a few years ago still imports that package (hopefully it still works), but I just didn't find it particularly useful or helpful compared to the alternatives.
Tidyverse is more than dplyr. It also includes libraries like ggplot2, for which there really is no peer; stringr, for string manipulation; and tidyr, which has a wide variety of very useful utility functions for working with data. dplyr is simply the backbone that connects them all and lets you move between them without switching thought processes.
dplyr also offers functionality beyond basic data manipulation that many don't realize is there. Tibbles allow arbitrary data types for columns, meaning you can have data frames containing your typical strings or floats, but you can also have data frames with columns consisting of nested data frames, or fitted models, or other atypical objects. Once you start working in this way, it can really streamline a lot of complex analytical processes, make your code much cleaner and easier to work with, and lets you integrate it directly with other tidyverse packages, like ggplot2.
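To make the list-column idea concrete, here is a minimal sketch (assuming dplyr, tidyr, and purrr are installed) that nests a data frame by group and stores one fitted model per row:

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per cylinder count; 'data' is a column of nested data frames,
# 'fit' is a column of lm objects, 'slope' extracts a coefficient per model.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    fit   = map(data, ~ lm(mpg ~ wt, data = .x)),
    slope = map_dbl(fit, ~ coef(.x)[["wt"]])
  )
by_cyl
```

The result is still an ordinary tibble, so it can be filtered, joined, or passed to ggplot2 like any other data frame.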
> If I want to use an advanced data manipulation library, I'd typically reach for data.table
data.table is powerful, but from a usability/readability perspective, most people find it inferior to dplyr. And there are now packages for using the dplyr API while having it run data.table on the backend, so I personally see little to no reason to use data.table by itself anymore.
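One such package is dtplyr, which translates dplyr verbs into data.table operations. A minimal sketch, assuming dtplyr is installed:

```r
library(dplyr)
library(dtplyr)

# Wrap a data frame as a lazy data.table; verbs are translated, not executed,
# until the result is explicitly collected.
result <- lazy_dt(mtcars) %>%
  filter(cyl == 6) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  as_tibble()   # forces evaluation via data.table, returns a tibble
result
```

You keep the dplyr syntax while the heavy lifting happens in data.table.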
> If I want to use a verb based approach, why not SQL rather than dplyr.
This seems like an overly simplistic reduction. There are bigger differences between dplyr and SQL than both being a verb based approach. And even were there not, dplyr is directly integrated within R. It is still much easier to do your data manipulation directly in R and handing it back and forth between dplyr and whatever other libraries you are using, than it is to do the same in SQL (even utilizing SQL queries within R).
The beauty of the tidyverse is that it is unifying the most important aspects of the data science process under a single approach and API philosophy. It integrates in a way that is pretty unprecedented not just in R, but in any programming language (where data analysis tasks are concerned).
I don't want to say that the Tidyverse is inferior, because it's not and it works for a lot of people. Hadley will also know much more about R and always be a better programmer than me.
However, there are other tools which predate the Tidyverse, which I think do a stellar job at their core competency.
ggplot2 is great, but it was also around long before the Tidyverse concept. At the same time, base R plot() can also be pretty powerful[1] and look great.[2] As an alternative to ggplot2, I would also propose that Vega-Lite[3] could be a contender, with an excellent cross-language ecosystem.
There are also libraries for applying SQL to data frames directly within R, if that's what you want to do. sqldf[4] has been around for a long time, and now there is also the new duckdf[5], which is a bit quicker. Or one can use the DBI[6] library, which requires a bit more coding. Learning SQL is also a great skill with a lot of value outside R.
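For anyone who hasn't seen sqldf in action, a minimal sketch (assuming the sqldf package is installed): it runs a SQL query directly against data frames in your R session.

```r
library(sqldf)

# Query an in-memory data frame as if it were a SQL table.
res <- sqldf("SELECT cyl, AVG(mpg) AS mean_mpg
              FROM mtcars
              GROUP BY cyl
              ORDER BY cyl")
res
```

No database setup is required; sqldf spins up a temporary backend behind the scenes.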
Tidyverse may be useful, helpful and convenient for a lot of people, but I think we shouldn't lose sight of the wide R ecosystem which has provided a lot of alternative packages for a long time, perhaps without the marketing and profile of RStudio and the Tidyverse.
> includes libraries like ggplot2, for which there really is no peer
lattice is largely ignored, but it is quite similar to ggplot2 in terms of features, and (as a matter of opinion) the plots it produces are aesthetically more pleasing. It is also great for making subplots using conditioning variables.
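For readers who haven't used it, a minimal sketch of lattice's conditioning syntax (lattice ships with standard R distributions): the `|` operator splits a plot into one panel per level of a grouping variable.

```r
library(lattice)

# One scatterplot panel per cylinder count, via the conditioning operator '|'.
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
print(p)   # trellis objects are drawn when printed
```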
SQL is incredibly verbose compared to dplyr. Modern non-SQL query languages coming out tend to be more similar to dplyr, based on method chaining or piping data table objects through function calls, like UNIX pipes. It's much more composable than SQL, and it just makes intuitive sense as a sequence of data transformation steps in a pipeline.
My favorite feature of dplyr that makes it stand out compared to SQL is that a window function is just a group_by() without a summarize(). There's no separate syntax for "PARTITION BY".
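A minimal sketch of that point, assuming dplyr: a mutate() on grouped data computes within each group, which is exactly what `RANK() OVER (PARTITION BY cyl ORDER BY mpg DESC)` would do in SQL.

```r
library(dplyr)

# Rank cars by mpg within each cylinder group; no window-function
# syntax needed, just group_by() followed by mutate().
ranked <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_rank = min_rank(desc(mpg))) %>%
  ungroup()
ranked
```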
For data analysis, whenever possible, I don't even write SQL anymore, I use the dbplyr package, which is a dplyr-to-SQL compiler.
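A minimal sketch of dbplyr in action, assuming dbplyr and RSQLite are installed: the same dplyr verbs are translated to SQL and can be inspected with show_query() before anything runs on the database.

```r
library(dplyr)
library(dbplyr)

# An in-memory SQLite database standing in for a real backend.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

q <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

show_query(q)   # prints the generated SQL (SELECT ... GROUP BY cyl)
```

Calling collect(q) would execute the query and pull the result back into R as a tibble.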
I studied statistics, so R was the first programming language I ever learned. I didn't know what it was at the time, but the tidyverse's chaining of operations on a data frame (via magrittr's %>%) was a really cool introduction to functional programming.
>The pipe operator in R also sort of does my head in.
As someone who taught myself base R from scratch in 2013, I (used to!) agree with this. When the pipe operator was first introduced, I’d roll my eyes whenever I saw a script that used it and move along.
But I forced my brain to adapt, and now it’s probably my favorite feature of the R language. Data science is full of sequences of transformations, and in my opinion it’s more readable and bug-resistant to phrase these long chains as:
f(x) %>% g() %>% h()
rather than:
h(g(f(x)))
or certainly:
foo <- f(x)
foo <- g(foo)
foo <- h(foo)
I can comprehend and modify others' (including past versions of me) R code much more quickly with this paradigm. You can quickly debug a chain by commenting out functions sequentially (i.e. first test: "f(x) # %>% ..."). It also becomes much faster to plug new transformations into the chain, when needed.
One thing that helps to keep track of the input x as it moves through the chain is using the “.” placeholder (especially when you need to specify function arguments), like so:
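The example promised above appears to have been lost. A minimal sketch of the "." placeholder, assuming magrittr: when "." appears as a named argument, the piped value is substituted there instead of as the first argument.

```r
library(magrittr)

# '.' slots the piped data frame into lm()'s 'data' argument,
# rather than being inserted as the first argument.
fit <- mtcars %>%
  subset(cyl == 4) %>%
  lm(mpg ~ wt, data = .)
coef(fit)
```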
It's great for data analysis and terrible for programming.
The reason it sucks for programming is that you can't easily debug it or inspect intermediate variables, which is really, really annoying.
It's great for one-off transformations and plotting, but a really, really bad idea for programming.
Then you add NSE (non-standard evaluation), which makes it hard to functionalise procedural pipes (especially for people who learned on the tidyverse), and it's a recipe for unmaintainable and profoundly annoying legacy code.
That being said, I love it for interactive analysis.