I’m a little late to the party on this, but I thought I would write a post about it because it relates to my teaching and I was slightly surprised by the conclusion I came to. The topic is piping, and the strengths and weaknesses of the magrittr pipe %>% versus the (relatively) new base pipe |>.
The background to this post is that I was making a general move from %>% to |>. That is, when I started writing some brand new code in a brand new RStudio project (and git repo), I switched to using |>. Some of this new code generated a weird error, and while fixing it I ended up doing a deeper dive into the base pipe, which led to this post.
Before I get too much further, I should add that I am a big fan of the pipe and piped operations. In my opinion it produces far more readable code than base R did before the pipe. I use it all the time and introduce it to my students as soon as I think is reasonably sensible.
Anyway, onto the discussion and the secondary reason for the post - to give me something to refer back to when I forget the specifics. I’ve used the magrittr pipe %>% for ages and the example below (from the old version of the map() help page) is a good example of where I might use it.
library(tidyverse)
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared")
        4         6         8
0.5086326 0.4645102 0.4229655
On digging into the base pipe, which I was using as part of a map() call, I used ?map and got the updated version of the above code.
mtcars |>
  group_split(mtcars$cyl) |>
  map(\(df) lm(mpg ~ wt, data = df)) |>
  map(summary) |>
  map_dbl("r.squared")
[1] 0.5086326 0.4645102 0.4229655
These two pieces of code look reasonably similar, but you’ll notice the lack of . notation in the second. This led me to read about |> and the differences. In turn I found this excellent blog post that provided better examples than I found in other documentation. For me the crux of the differences between |> and %>% is that (1) |> always passes the left-hand side (LHS) into the first argument of the right-hand side (RHS), and (2) |> does not support the . notation.
The anonymous function defined above as \(df) lm(mpg ~ wt, data = df) is a way around the issue with . notation. For the most part, the first point above (LHS into RHS) is rarely going to be an issue, particularly with dplyr functions. For example:
mtcars %>%
  group_by(cyl) %>%
  summarise(av_mpg = mean(mpg))
# A tibble: 3 × 2
    cyl av_mpg
  <dbl>  <dbl>
1     4   26.7
2     6   19.7
3     8   15.1
is exactly the same as
mtcars |>
  group_by(cyl) |>
  summarise(av_mpg = mean(mpg))
# A tibble: 3 × 2
    cyl av_mpg
  <dbl>  <dbl>
1     4   26.7
2     6   19.7
3     8   15.1
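As a quick check of that claim, here is a minimal sketch that stores both results and compares them. The names res_magrittr and res_base are just illustrative, and it assumes dplyr is installed (it re-exports %>%).

```r
library(dplyr)

# Same summary, once with the magrittr pipe ...
res_magrittr <- mtcars %>%
  group_by(cyl) %>%
  summarise(av_mpg = mean(mpg))

# ... and once with the base pipe (needs R >= 4.1.0)
res_base <- mtcars |>
  group_by(cyl) |>
  summarise(av_mpg = mean(mpg))

# all.equal() returns TRUE when the two tibbles match
all.equal(res_magrittr, res_base)
```

When the data goes into the first argument, the two pipes should be interchangeable like this.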
The problem for me, and the reason that I don’t think I’ll be switching to |> in the near future, is the lack of . notation for when we don’t want to pipe into the first argument. I quite often find myself piping into a function where the data isn’t the first argument. To provide an example, let’s create a small dataset.
ex_df <- tibble(
  grp1 = rnorm(100, mean = 10, sd = 5),
  grp2 = rnorm(100, mean = 20, sd = 5)
)
head(ex_df)
# A tibble: 6 × 2
   grp1  grp2
  <dbl> <dbl>
1  1.63 14.3
2  5.09  8.97
3 24.2  19.6
4  4.13 31.0
5 10.2  27.0
6  7.96 32.1
Suppose we wanted to perform a t-test. We have a number of choices. The most direct is to pass the two columns to t.test():
t.test(ex_df$grp1, ex_df$grp2)
Welch Two Sample t-test
data: ex_df$grp1 and ex_df$grp2
t = -13.731, df = 197.69, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.784637 -8.824786
sample estimates:
mean of x mean of y
9.962203 20.266915
Alternatively we might want to change the data to long format, maybe add a cleaning step (which isn’t needed for this example) and run the test using the formula notation.
ex_df %>%
  pivot_longer(everything(), names_to = "group", values_to = "value") %>%
  # additional cleaning step ... %>%
  t.test(value ~ group, data = .)
Welch Two Sample t-test
data: value by group
t = -13.731, df = 197.69, p-value < 2.2e-16
alternative hypothesis: true difference in means between group grp1 and group grp2 is not equal to 0
95 percent confidence interval:
-11.784637 -8.824786
sample estimates:
mean in group grp1 mean in group grp2
9.962203 20.266915
Now if we try that with |>, we know in advance that it won’t work, but I’m doing it here to generate the error.
ex_df |>
  pivot_longer(everything(), names_to = "group", values_to = "value") |>
  # additional cleaning step ... |>
  t.test(value ~ group, data = .)
Error in `vec_c()`:
! Can't combine `group` <character> and `value` <double>.
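One way to see why this fails, as a minimal sketch: the base pipe is a syntax transformation applied when the code is parsed, so quote() can show the call that R actually builds. Here long_df is a hypothetical name standing in for the pivoted data; nothing is evaluated, so it does not need to exist.

```r
# quote() captures the call without evaluating it, so we can inspect
# what the base pipe rewrites the expression into.
rewritten <- quote(long_df |> t.test(value ~ group, data = .))

# The LHS has been inserted as the FIRST argument, and the `.` is left
# untouched as an ordinary symbol rather than treated as a placeholder.
print(rewritten)
```

So t.test() receives the data frame as its first argument and the formula as its second, which is not the call we intended.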
We can get around this if we really want, with an anonymous function.
ex_df |>
  pivot_longer(everything(), names_to = "group", values_to = "value") |>
  (\(df) t.test(value ~ group, data = df))()
Welch Two Sample t-test
data: value by group
t = -13.731, df = 197.69, p-value < 2.2e-16
alternative hypothesis: true difference in means between group grp1 and group grp2 is not equal to 0
95 percent confidence interval:
-11.784637 -8.824786
sample estimates:
mean in group grp1 mean in group grp2
9.962203 20.266915
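For completeness, here is a sketch of one more option, assuming R 4.2.0 or later: from that version the base pipe accepts a _ placeholder, though only when it is passed as a named argument. The data frame long_df below is a made-up, base-R stand-in for the pivoted data, so the numbers will not match the output above.

```r
# Assumes R >= 4.2.0; the `_` placeholder is not available in R 4.1.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical long-format data, built with base R only
long_df <- data.frame(
  group = rep(c("grp1", "grp2"), each = 100),
  value = c(rnorm(100, mean = 10, sd = 5), rnorm(100, mean = 20, sd = 5))
)

# Anonymous-function approach, as in the post
res_lambda <- long_df |>
  (\(df) t.test(value ~ group, data = df))()

# Placeholder approach: `_` must be supplied to a named argument
res_placeholder <- long_df |>
  t.test(value ~ group, data = _)

# Both routes should give the same test statistic
all.equal(res_lambda$statistic, res_placeholder$statistic)
```

Whether data = _ reads as well as data = . is a matter of taste, and it is more restricted than the magrittr placeholder.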
I quite like the notation of the anonymous functions but I’m not going to go into detail on them here because others have already provided excellent explanations. The blog post I linked above is one of them and I encourage you to read that too.
The conclusion I came to after this little bit of reading was that I’ll stick to the magrittr pipe %>%, as I think the notation it leads to (including the . notation) is easier to read. The exception would be if I were writing a package. In that case I do try to limit the dependencies, and it might be worthwhile using |> and anonymous functions.
As part of the reading I did, I also revisited the other pipes available from the magrittr package, and it was a really good reminder of what the options are. The ones I think I’ll end up using the most are the assignment pipe %<>% and what I’ll probably end up calling the column pipe (the exposition pipe) %$%.
The %$% pipe provides an alternative to the above t.test() notation, as it allows us to access the columns directly.
library(magrittr)
ex_df %$%
  t.test(grp1, grp2)
Welch Two Sample t-test
data: grp1 and grp2
t = -13.731, df = 197.69, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.784637 -8.824786
sample estimates:
mean of x mean of y
9.962203 20.266915
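A way I find helpful to think about %$% is that it behaves much like base R’s with(). The sketch below compares the two on a made-up data frame (demo_df is a hypothetical name, and the seed is arbitrary), assuming magrittr is installed.

```r
library(magrittr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical wide-format data, similar in shape to ex_df
demo_df <- data.frame(
  grp1 = rnorm(100, mean = 10, sd = 5),
  grp2 = rnorm(100, mean = 20, sd = 5)
)

# Exposition pipe: column names are available inside the RHS expression
res_pipe <- demo_df %$% t.test(grp1, grp2)

# Base R equivalent using with()
res_with <- with(demo_df, t.test(grp1, grp2))

# The two tests should agree
all.equal(res_pipe$p.value, res_with$p.value)
```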
The assignment pipe saves me some typing for things I do a lot by replacing
ex_df <- ex_df %>%
  pivot_longer(everything(), names_to = "group", values_to = "value")
with
ex_df %<>% pivot_longer(everything(), names_to = "group", values_to = "value")
head(ex_df)
# A tibble: 6 × 2
  group value
  <chr> <dbl>
1 grp1   1.63
2 grp2  14.3
3 grp1   5.09
4 grp2   8.97
5 grp1  24.2
6 grp2  19.6
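The same equivalence can be sketched on a plain vector, which makes it easy to check that %<>% really is just pipe-then-assign. The names x and y are illustrative only.

```r
library(magrittr)

x <- c(9, 16, 25)
y <- x

# Long form: pipe, then assign the result back to the same name
x <- x %>% sqrt()

# Assignment pipe: the same thing in one step
y %<>% sqrt()

identical(x, y)
```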
If you are reading this and haven’t already, then I would really encourage you to read the blog post. I’m a big fan of the pipe and will continue to use it, but for my personal work I think I’ll stick with the magrittr pipe rather than switch to the base pipe.