Tuesday, August 11, 2020

dplyr arrange(): Sort/Reorder by One or More Variables

dplyr, an R package that is part of the tidyverse suite of packages, provides a great set of tools for manipulating tabular datasets. dplyr has a set of core functions for “data munging”, including select(), mutate(), filter(), summarise(), and arrange().

In this tidyverse tutorial, we will learn how to use dplyr’s arrange() function to sort a dataframe in multiple ways. First, we will sort a dataframe by the values of a single variable. Then we will sort a dataframe by more than one variable. By default, dplyr’s arrange() sorts in ascending order; we will also learn how to sort in descending order.

Let us get started by loading tidyverse, a suite of R packages from RStudio.

library("tidyverse")

We will use the fantastic Penguins dataset to illustrate how to sort a dataframe with arrange(). Let us load the data from cmdlinetips.com’s GitHub page.

path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins <- readr::read_csv(path2data)
## Parsed with column specification:
## cols(
##   species = col_character(),
##   island = col_character(),
##   bill_length_mm = col_double(),
##   bill_depth_mm = col_double(),
##   flipper_length_mm = col_double(),
##   body_mass_g = col_double(),
##   sex = col_character()
## )
head(penguins)

## # A tibble: 6 x 7
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male

How To Sort a Dataframe by a Single Variable with dplyr’s arrange()?

We can use dplyr’s arrange() function to sort a dataframe by one or more variables. Let us say we want to sort the Penguins dataframe by body mass to quickly find the lightest penguins and see how body mass relates to the other variables.

We will use the pipe operator “%>%” to feed the data to the dplyr function arrange(). We need to specify the name of the variable that we want to sort the dataframe by. In this example, we are sorting by the variable “body_mass_g”.

penguins %>% 
  arrange(body_mass_g)

dplyr’s arrange() sorts the dataframe by the variable and outputs a new dataframe (as a tibble). Notice that the rows of the resulting dataframe are in a different order from the original dataframe: the body_mass_g column is now arranged from the smallest to the largest value.

## # A tibble: 344 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
##  1 Chinst… Dream            46.9          16.6              192        2700
##  2 Adelie  Biscoe           36.5          16.6              181        2850
##  3 Adelie  Biscoe           36.4          17.1              184        2850
##  4 Adelie  Biscoe           34.5          18.1              187        2900
##  5 Adelie  Dream            33.1          16.1              178        2900
##  6 Adelie  Torge…           38.6          17                188        2900
##  7 Chinst… Dream            43.2          16.6              187        2900
##  8 Adelie  Biscoe           37.9          18.6              193        2925
##  9 Adelie  Dream            37.5          18.9              179        2975
## 10 Adelie  Dream            37            16.9              185        3000
## # … with 334 more rows, and 1 more variable: sex <chr>
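
To make the "new dataframe" point concrete, here is a minimal sketch on a small hand-made tibble. The name toy and its values are invented for illustration and are not taken from the Penguins data. It suggests that arrange() returns a reordered copy and leaves its input untouched, so the result has to be assigned if we want to keep it.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble::tibble(
  id   = c("a", "b", "c", "d"),
  mass = c(3750, 2700, 3250, 2900)
)

# arrange() returns a new tibble; assign it to keep the sorted version
toy_sorted <- toy %>%
  arrange(mass)

toy_sorted$id  # ids in order of increasing mass
toy$id         # the original row order is unchanged
```

The same pattern applies to the Penguins data: assigning the result of arrange() to a new name keeps the sorted copy alongside the original.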

How To Sort or Reorder Rows in Descending Order with dplyr’s arrange()?

By default, dplyr’s arrange() sorts in ascending order. We can sort by a variable in descending order by wrapping the variable in the desc() function. For example, to sort the dataframe by body_mass_g in descending order, we use

penguins %>%
  arrange(desc(body_mass_g))

## # A tibble: 344 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
##  1 Gentoo  Biscoe           49.2          15.2              221        6300
##  2 Gentoo  Biscoe           59.6          17                230        6050
##  3 Gentoo  Biscoe           51.1          16.3              220        6000
##  4 Gentoo  Biscoe           48.8          16.2              222        6000
##  5 Gentoo  Biscoe           45.2          16.4              223        5950
##  6 Gentoo  Biscoe           49.8          15.9              229        5950
##  7 Gentoo  Biscoe           48.4          14.6              213        5850
##  8 Gentoo  Biscoe           49.3          15.7              217        5850
##  9 Gentoo  Biscoe           55.1          16                230        5850
## 10 Gentoo  Biscoe           49.5          16.2              229        5800
## # … with 334 more rows, and 1 more variable: sex <chr>
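
The Penguins data contains rows with missing measurements (the fourth row of head(penguins) above is one), so it is worth knowing where they land. The sketch below uses a made-up tibble called toy (hypothetical names and values) to show that arrange() places NA values at the end in both ascending and descending order, and one possible way to bring them to the top instead.

```r
library(dplyr)

# Hypothetical toy data with one missing value, for illustration only
toy <- tibble::tibble(
  id   = c("a", "b", "c", "d"),
  mass = c(3750, NA, 6300, 2900)
)

# Largest mass first; the row with NA still comes last
toy_desc <- toy %>%
  arrange(desc(mass))

# One way to list missing values first: sort on is.na() in descending order,
# then on the variable itself
toy_na_first <- toy %>%
  arrange(desc(is.na(mass)), desc(mass))

toy_desc$id
toy_na_first$id
```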

How To Sort a Dataframe by Two Variables?

With dplyr’s arrange() function, we can sort by more than one variable. To sort or arrange by two variables, we specify the names of the two variables as arguments to the arrange() function, as shown below. Note that the order matters here.

penguins %>% 
  arrange(body_mass_g, flipper_length_mm)

In this example, we have body_mass_g first and flipper_length_mm second. dplyr’s arrange() sorts by the first variable and, among rows that share the same value of the first variable, sorts by the second variable.

For example, the second and third rows have the same body_mass_g (2850) and are ordered by flipper_length_mm (181, then 184). The same holds for the four rows with a body_mass_g of 2900.


## # A tibble: 344 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
##  1 Chinst… Dream            46.9          16.6              192        2700
##  2 Adelie  Biscoe           36.5          16.6              181        2850
##  3 Adelie  Biscoe           36.4          17.1              184        2850
##  4 Adelie  Dream            33.1          16.1              178        2900
##  5 Adelie  Biscoe           34.5          18.1              187        2900
##  6 Chinst… Dream            43.2          16.6              187        2900
##  7 Adelie  Torge…           38.6          17                188        2900
##  8 Adelie  Biscoe           37.9          18.6              193        2925
##  9 Adelie  Dream            37.5          18.9              179        2975
## 10 Adelie  Dream            37            16.9              185        3000
## # … with 334 more rows, and 1 more variable: sex <chr>
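
Sort directions can also be mixed across variables. As a small sketch on invented data (the tibble toy and its columns mass and flipper are hypothetical stand-ins for body_mass_g and flipper_length_mm), the first variable is sorted in ascending order while ties are broken by the second variable in descending order.

```r
library(dplyr)

# Hypothetical toy data with tied mass values, for illustration only
toy <- tibble::tibble(
  id      = c("a", "b", "c", "d", "e"),
  mass    = c(2900, 2850, 2900, 2850, 2700),
  flipper = c(187, 184, 178, 181, 192)
)

# Ascending mass; within equal mass, longest flipper first
mixed <- toy %>%
  arrange(mass, desc(flipper))

mixed$id
```

On the Penguins data, the equivalent call would be arrange(body_mass_g, desc(flipper_length_mm)).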

Notice the difference in the results when we change the order of the two variables we sort by. In the example below, we have flipper_length_mm first and body_mass_g next.

penguins %>%
  arrange(flipper_length_mm, body_mass_g)

Now our dataframe is sorted first by flipper_length_mm and then by body_mass_g.

## # A tibble: 344 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl>
##  1 Adelie  Biscoe           37.9          18.6              172        3150
##  2 Adelie  Biscoe           37.8          18.3              174        3400
##  3 Adelie  Torge…           40.2          17                176        3450
##  4 Adelie  Dream            33.1          16.1              178        2900
##  5 Adelie  Dream            39.5          16.7              178        3250
##  6 Chinst… Dream            46.1          18.2              178        3250
##  7 Adelie  Dream            37.2          18.1              178        3900
##  8 Adelie  Dream            37.5          18.9              179        2975
##  9 Adelie  Dream            42.2          18.5              180        3550
## 10 Adelie  Biscoe           37.7          18.7              180        3600
## # … with 334 more rows, and 1 more variable: sex <chr>
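
The effect of argument order is easier to check on a handful of rows. The sketch below uses an invented tibble (toy, mass, and flipper are hypothetical names, not part of the Penguins data) and sorts it both ways. The first argument decides the overall order, and the second only breaks ties.

```r
library(dplyr)

# Hypothetical toy data with ties in both columns, for illustration only
toy <- tibble::tibble(
  id      = c("a", "b", "c", "d", "e"),
  mass    = c(2900, 3250, 2900, 2850, 3900),
  flipper = c(178, 178, 187, 181, 178)
)

# mass decides the order; flipper only breaks ties in mass
by_mass_first <- toy %>%
  arrange(mass, flipper)

# flipper decides the order; mass only breaks ties in flipper
by_flipper_first <- toy %>%
  arrange(flipper, mass)

by_mass_first$id
by_flipper_first$id
```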

The post dplyr arrange(): Sort/Reorder by One or More Variables appeared first on Python and R Tips.



from Python and R Tips https://ift.tt/3ivjkSs
via Gabe's Musings