A Discussion on Pipes

May 31, 2023    dplyr

I’m a little late to the party on this, but I thought I would add a post about it because it relates to my teaching and I was slightly surprised by the conclusion I came to. The topic is piping, and the strengths and weaknesses of the magrittr pipe %>% versus the (relatively) new base pipe |>.

The background to this post is that I was making a general move towards switching from %>% to |>. That is, when I started writing some brand new code in a brand new RStudio project (and git repo), I changed to using |>. Some of this new code generated a weird error and on fixing it I ended up doing a deeper dive into the base pipe and that led to this post.

Before I get too much further, I would just add that I am a big fan of the pipe and piped operations. In my opinion it generates far more readable code than base R did before the pipe existed. I personally use it all the time and introduce it to my students as soon as I think it is sensible to do so.

Anyway, onto the discussion and the secondary reason for the post - to give me something to refer back to when I forget the specifics. I’ve used the magrittr pipe %>% for ages and the example below (from the old version of the map() help page) is a good example of where I might use it.

library(tidyverse)

mtcars %>% 
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared")
        4         6         8 
0.5086326 0.4645102 0.4229655 

While digging into the base pipe, which I was using as part of a map() call, I ran ?map and got the updated version of the above code.

mtcars |>
  group_split(mtcars$cyl) |>
  map(\(df) lm(mpg ~ wt, data = df)) |>
  map(summary) |>
  map_dbl("r.squared")
[1] 0.5086326 0.4645102 0.4229655

These two pieces of code look reasonably similar, but you’ll notice the lack of . notation in the second. This led me to read about |> and the differences. In turn I found this excellent blog post that provided better examples than I found in other documentation. For me the crux of the difference between |> and %>% is that:

  • by default, the base pipe passes the left-hand side of the pipe into the first argument of the right-hand side (since R 4.2.0 it can instead be sent to a single named argument using the _ placeholder);
  • and (because of this) we can no longer use the . notation.
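
To make the two points above concrete, here is a minimal sketch in base R (it assumes R 4.1.0 or later for |> and the \(x) shorthand). The variable names piped and stripped are purely illustrative. It shows that the parser rewrites a base-pipe expression into an ordinary call with the left-hand side as the first argument, and how an anonymous function can route the value to a later argument instead.

```r
# Requires R >= 4.1.0 for |> and the \(x) lambda shorthand.

# The base pipe is a syntax transformation: lhs |> f(y) is parsed as f(lhs, y).
piped <- quote(mtcars |> head(n = 2))
print(piped)

# sub() takes its data as the third argument (pattern, replacement, x),
# so we wrap the call in an anonymous function and call it immediately.
stripped <- c("cyl4", "cyl6") |> (\(s) sub("cyl", "", s))()
print(stripped)
```

Note that the anonymous function has to be wrapped in parentheses and followed by () so that the right-hand side of |> is a function call.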

The anonymous function defined above as \(df) lm(mpg ~ wt, data = df) is a way around the issue with . notation. For the most part, the first point above (LHS into RHS) is rarely going to be an issue, particularly with dplyr functions. For example:

mtcars %>%
  group_by(cyl) %>%
  summarise(av_mpg = mean(mpg))
# A tibble: 3 × 2
    cyl av_mpg
  <dbl>  <dbl>
1     4   26.7
2     6   19.7
3     8   15.1

is exactly the same as

mtcars |>
  group_by(cyl) |>
  summarise(av_mpg = mean(mpg))
# A tibble: 3 × 2
    cyl av_mpg
  <dbl>  <dbl>
1     4   26.7
2     6   19.7
3     8   15.1

The problem for me, and the reason that I don’t think I’ll be switching to |> in the near future, is losing the . notation for when we don’t want to pipe into the first argument. I quite often find myself piping into a function where the data isn’t the first argument. To provide an example, let’s create a small dataset.

ex_df <- tibble(
  grp1 = rnorm(100, mean = 10, sd = 5), 
  grp2 = rnorm(100, mean = 20, sd = 5)
)
head(ex_df)
# A tibble: 6 × 2
   grp1  grp2
  <dbl> <dbl>
1  1.63 14.3 
2  5.09  8.97
3 24.2  19.6 
4  4.13 31.0 
5 10.2  27.0 
6  7.96 32.1 

Suppose we wanted to perform a t-test. We have a number of choices. The most direct is to pass the two columns as vectors.

t.test(ex_df$grp1, ex_df$grp2)

	Welch Two Sample t-test

data:  ex_df$grp1 and ex_df$grp2
t = -13.731, df = 197.69, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.784637  -8.824786
sample estimates:
mean of x mean of y 
 9.962203 20.266915 

Alternatively we might want to change the data to long format, maybe add a cleaning step (which isn’t needed for this example) and run the test using the formula notation.

ex_df %>%
  pivot_longer(everything(), names_to = "group", values_to = "value") %>%
  # additional cleaning step ... %>%
  t.test(value ~ group, data = .)

	Welch Two Sample t-test

data:  value by group
t = -13.731, df = 197.69, p-value < 2.2e-16
alternative hypothesis: true difference in means between group grp1 and group grp2 is not equal to 0
95 percent confidence interval:
 -11.784637  -8.824786
sample estimates:
mean in group grp1 mean in group grp2 
          9.962203          20.266915 

Now if we try that with |>, we know in advance that it won’t work, because |> still inserts the data as the first argument and gives . no special meaning, but I’m doing it here to generate the error.

ex_df |>
  pivot_longer(everything(), names_to = "group", values_to = "value") |>
  # additional cleaning step ... |>
  t.test(value ~ group, data = .)
Error in `vec_c()`:
! Can't combine `group` <character> and `value` <double>.

We can get around this, if we really want to, with an anonymous function.

ex_df |>
  pivot_longer(everything(), names_to = "group", values_to = "value") |>
  (\(df) t.test(value ~ group, data = df))()

	Welch Two Sample t-test

data:  value by group
t = -13.731, df = 197.69, p-value < 2.2e-16
alternative hypothesis: true difference in means between group grp1 and group grp2 is not equal to 0
95 percent confidence interval:
 -11.784637  -8.824786
sample estimates:
mean in group grp1 mean in group grp2 
          9.962203          20.266915 
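
As an aside on the workaround above: from R 4.2.0 the base pipe also accepts a _ placeholder, which must be passed to a named argument and can appear only once in the call. The sketch below uses simulated stand-in data (long_df is a hypothetical name, not the ex_df used in this post, and the seed is my own choice), so the numbers will differ from the output shown here.

```r
# Requires R >= 4.2.0 for the `_` placeholder in the base pipe.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Stand-in long-format data: one grouping column and one value column.
long_df <- data.frame(
  group = rep(c("grp1", "grp2"), each = 100),
  value = c(rnorm(100, mean = 10, sd = 5), rnorm(100, mean = 20, sd = 5))
)

# The placeholder sends the piped data to the named `data` argument.
res_placeholder <- long_df |> t.test(value ~ group, data = _)

# The same test written without a pipe, for comparison.
res_direct <- t.test(value ~ group, data = long_df)

print(res_placeholder)
```

Whether this reads better than the magrittr . notation is a matter of taste. The placeholder is also more restrictive, since it only works with a named argument.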

I quite like the notation of the anonymous functions but I’m not going to go into detail on them here because others have already provided excellent explanations. The blog post I linked above is one of them and I encourage you to read that too.

The conclusion I came to after this little bit of reading was that I’ll stick to the magrittr pipe %>%, as I think the notation it leads to (including the . notation) is easier to read. The exception to this would be if I were writing a package. In that case I do try to limit the dependencies, and it might be worthwhile using |> and anonymous functions.

Other Pipes

As part of the reading I did, I also revisited the other pipes available from the magrittr package, and it was a really good reminder of what the options are. The ones I think I’ll end up using the most are the assignment pipe %<>% and what I’ll probably end up calling the column pipe (the exposition pipe) %$%.

The %$% pipe provides an alternative to the above t.test() notation, as it allows us to access the columns directly.

library(magrittr)
ex_df %$% 
  t.test(grp1, grp2)

	Welch Two Sample t-test

data:  grp1 and grp2
t = -13.731, df = 197.69, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.784637  -8.824786
sample estimates:
mean of x mean of y 
 9.962203 20.266915 
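
For comparison, base R’s with() does a similar job to %$% and happens to work with the base pipe, because the data is its first argument. This is a sketch rather than a recommendation. It uses simulated stand-in data (wide_df is a hypothetical name and the seed is arbitrary), so the numbers will not match the output above.

```r
# Requires R >= 4.1.0 for the base pipe.
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Stand-in wide-format data with one column per group.
wide_df <- data.frame(
  grp1 = rnorm(100, mean = 10, sd = 5),
  grp2 = rnorm(100, mean = 20, sd = 5)
)

# with() evaluates the expression using the columns of the piped data.
res_with <- wide_df |> with(t.test(grp1, grp2))

# The equivalent call using $ to extract each column.
res_dollar <- t.test(wide_df$grp1, wide_df$grp2)

print(res_with)
```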

The assignment pipe saves me some typing for things I do a lot by replacing

ex_df <- ex_df %>%
   pivot_longer(everything(), names_to = "group", values_to = "value")

with

ex_df %<>% pivot_longer(everything(), names_to = "group", values_to = "value")
head(ex_df)
# A tibble: 6 × 2
  group value
  <chr> <dbl>
1 grp1   1.63
2 grp2  14.3 
3 grp1   5.09
4 grp2   8.97
5 grp1  24.2 
6 grp2  19.6 
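
A smaller sketch of the same idea, assuming the magrittr package is installed. The vector scores is a made-up example, not data from this post. It shows that %<>% pipes the left-hand side into the function and then assigns the result back to the same name.

```r
library(magrittr)

# A made-up vector for illustration only.
scores <- c(9, 4, 1)

# Equivalent to: scores <- scores %>% sort()
scores %<>% sort()

print(scores)
```

Because the object is overwritten in place, re-running a line like this on data that has already been transformed can give a different result or an error, which is worth keeping in mind when working interactively.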

Final Thoughts

If you are reading this and haven’t already, then I would really encourage you to read the blog post linked above. I’m a big fan of the pipe and will continue to use it, but for my personal work I think I’ll stick with the magrittr pipe rather than switch to the base pipe.