Data Frame Manipulation -- R data.table¶
Basic data frame manipulation using base R data.table.
In [1]:
Copied!
library(data.table)
library(magrittr)
packageVersion("data.table")
library(data.table)
library(magrittr)
packageVersion("data.table")
[1] ‘1.14.0’
Load csv¶
In [2]:
Copied!
dt <- data.table::fread(here::here(".data", "titanic.csv"))
head(dt, 3)
dt <- data.table::fread(here::here(".data", "titanic.csv"))
head(dt, 3)
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <dbl> | <int> | <int> | <chr> | <dbl> | <chr> | <chr> | <chr> | <int> | <chr> |
| 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NA | St Louis, MO |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON |
| 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | Montreal, PQ / Chesterville, ON |
Summary¶
In [3]:
Copied!
# Rows
nrow(dt)
# Rows
nrow(dt)
1309
In [4]:
Copied!
# Columns
length(dt)
# Columns
length(dt)
14
In [5]:
Copied!
# Inspect types
str(dt)
# Inspect types
str(dt)
Classes ‘data.table’ and 'data.frame': 1309 obs. of 14 variables: $ pclass : int 1 1 1 1 1 1 1 1 1 1 ... $ survived : int 1 1 0 0 0 1 1 0 1 0 ... $ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ... $ sex : chr "female" "male" "female" "male" ... $ age : num 29 0.917 2 30 25 ... $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ... $ parch : int 0 2 2 2 2 0 0 0 0 0 ... $ ticket : chr "24160" "113781" "113781" "113781" ... $ fare : num 211 152 152 152 152 ... $ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ... $ embarked : chr "S" "S" "S" "S" ... $ boat : chr "2" "11" "" "" ... $ body : int NA NA NA 135 NA NA NA NA NA 22 ... $ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ... - attr(*, ".internal.selfref")=<externalptr>
In [6]:
Copied!
# Descriptive statistics
summary(dt)
# Descriptive statistics
summary(dt)
pclass survived name sex
Min. :1.000 Min. :0.000 Length:1309 Length:1309
1st Qu.:2.000 1st Qu.:0.000 Class :character Class :character
Median :3.000 Median :0.000 Mode :character Mode :character
Mean :2.295 Mean :0.382
3rd Qu.:3.000 3rd Qu.:1.000
Max. :3.000 Max. :1.000
age sibsp parch ticket
Min. : 0.1667 Min. :0.0000 Min. :0.000 Length:1309
1st Qu.:21.0000 1st Qu.:0.0000 1st Qu.:0.000 Class :character
Median :28.0000 Median :0.0000 Median :0.000 Mode :character
Mean :29.8811 Mean :0.4989 Mean :0.385
3rd Qu.:39.0000 3rd Qu.:1.0000 3rd Qu.:0.000
Max. :80.0000 Max. :8.0000 Max. :9.000
NA's :263
fare cabin embarked boat
Min. : 0.000 Length:1309 Length:1309 Length:1309
1st Qu.: 7.896 Class :character Class :character Class :character
Median : 14.454 Mode :character Mode :character Mode :character
Mean : 33.295
3rd Qu.: 31.275
Max. :512.329
NA's :1
body home.dest
Min. : 1.0 Length:1309
1st Qu.: 72.0 Class :character
Median :155.0 Mode :character
Mean :160.8
3rd Qu.:256.0
Max. :328.0
NA's :1188
Slice data¶
In [7]:
Copied!
# Select rows by number
dt[2:4,]
# Select rows by number
dt[2:4,]
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <dbl> | <int> | <int> | <chr> | <dbl> | <chr> | <chr> | <chr> | <int> | <chr> |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON |
| 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NA | Montreal, PQ / Chesterville, ON | |
| 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 135 | Montreal, PQ / Chesterville, ON |
In [8]:
Copied!
# Select rows by condition
dt[survived == 1,] %>% head(4)
# Select rows by condition
dt[survived == 1,] %>% head(4)
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <dbl> | <int> | <int> | <chr> | <dbl> | <chr> | <chr> | <chr> | <int> | <chr> |
| 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NA | St Louis, MO |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON |
| 1 | 1 | Anderson, Mr. Harry | male | 48.0000 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | NA | New York, NY |
| 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0000 | 1 | 0 | 13502 | 77.9583 | D7 | S | 10 | NA | Hudson, NY |
In [9]:
Copied!
# Select rows by multiple conditions
dt[survived == 1 & sex == "female",] %>% head(3)
# Select rows by multiple conditions
dt[survived == 1 & sex == "female",] %>% head(3)
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <dbl> | <int> | <int> | <chr> | <dbl> | <chr> | <chr> | <chr> | <int> | <chr> |
| 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NA | St Louis, MO |
| 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63 | 1 | 0 | 13502 | 77.9583 | D7 | S | 10 | NA | Hudson, NY |
| 1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53 | 2 | 0 | 11769 | 51.4792 | C101 | S | D | NA | Bayside, Queens, NY |
In [10]:
Copied!
# Select columns by name
dt[, .(survived, pclass)] %>% head(3)
# Select columns by name
dt[, .(survived, pclass)] %>% head(3)
| survived | pclass |
|---|---|
| <int> | <int> |
| 1 | 1 |
| 1 | 1 |
| 0 | 1 |
Assign values¶
In [11]:
Copied!
# Assign a (new) column
dt[, sparkles := 8]
head(dt, 3)
# Assign a (new) column
dt[, sparkles := 8]
head(dt, 3)
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | sparkles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <dbl> | <int> | <int> | <chr> | <dbl> | <chr> | <chr> | <chr> | <int> | <chr> | <dbl> |
| 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NA | St Louis, MO | 8 |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON | 8 |
| 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | Montreal, PQ / Chesterville, ON | 8 |
In [12]:
Copied!
# Assign value to some rows
dt[survived == 1, sparkles := 3]
head(dt, 3)
# Assign value to some rows
dt[survived == 1, sparkles := 3]
head(dt, 3)
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | sparkles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <dbl> | <int> | <int> | <chr> | <dbl> | <chr> | <chr> | <chr> | <int> | <chr> | <dbl> |
| 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NA | St Louis, MO | 3 |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON | 3 |
| 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | Montreal, PQ / Chesterville, ON | 8 |
Group by¶
In [13]:
Copied!
dt[, .(
min_age = min(age, na.rm = TRUE),
mean_fare = mean(fare, na.rm = TRUE),
count = .N
), by = pclass]
dt[, .(
min_age = min(age, na.rm = TRUE),
mean_fare = mean(fare, na.rm = TRUE),
count = .N
), by = pclass]
| pclass | min_age | mean_fare | count |
|---|---|---|---|
| <int> | <dbl> | <dbl> | <int> |
| 1 | 0.9167 | 87.50899 | 323 |
| 2 | 0.6667 | 21.17920 | 277 |
| 3 | 0.1667 | 13.30289 | 709 |
Pivot table¶
Reshape¶
Spread / Cast / Pivot¶
In [14]:
Copied!
# Count Pivot Table
data.table::dcast(dt, sex ~ pclass)
# Count Pivot Table
data.table::dcast(dt, sex ~ pclass)
Using 'sparkles' as value column. Use 'value.var' to override Aggregate function missing, defaulting to 'length'
| sex | 1 | 2 | 3 |
|---|---|---|---|
| <chr> | <int> | <int> | <int> |
| female | 144 | 106 | 216 |
| male | 179 | 171 | 493 |
In [15]:
Copied!
# Aggegate Pivot Table
dt_wide <- data.table::dcast(dt, sex ~ pclass, value.var = "age", fun.aggregate = median, na.rm = TRUE)
dt_wide
# Aggegate Pivot Table
dt_wide <- data.table::dcast(dt, sex ~ pclass, value.var = "age", fun.aggregate = median, na.rm = TRUE)
dt_wide
| sex | 1 | 2 | 3 |
|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> |
| female | 36 | 28.0 | 22 |
| male | 42 | 29.5 | 25 |
Gather / Melt¶
In [16]:
Copied!
data.table::melt(dt_wide, id.vars = "sex", measure.vars = c("1", "2", "3"), variable.name = "pclass", value.name = "med_age")
data.table::melt(dt_wide, id.vars = "sex", measure.vars = c("1", "2", "3"), variable.name = "pclass", value.name = "med_age")
| sex | pclass | med_age |
|---|---|---|
| <chr> | <fct> | <dbl> |
| female | 1 | 36.0 |
| male | 1 | 42.0 |
| female | 2 | 28.0 |
| male | 2 | 29.5 |
| female | 3 | 22.0 |
| male | 3 | 25.0 |
Join¶
In [17]:
Copied!
left_dt = dt[1:100, .(name, sex, age)]
right_dt = dt[1:100, .(name, pclass, fare)]
merge(
left_dt,
right_dt,
by = 'name', # by.x, by.y
all = FALSE # all.x, all.y
) %>% head()
left_dt = dt[1:100, .(name, sex, age)]
right_dt = dt[1:100, .(name, pclass, fare)]
merge(
left_dt,
right_dt,
by = 'name', # by.x, by.y
all = FALSE # all.x, all.y
) %>% head()
| name | sex | age | pclass | fare |
|---|---|---|---|---|
| <chr> | <chr> | <dbl> | <int> | <dbl> |
| Allen, Miss. Elisabeth Walton | female | 29.0000 | 1 | 211.3375 |
| Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 151.5500 |
| Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 151.5500 |
| Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 151.5500 |
| Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 151.5500 |
| Anderson, Mr. Harry | male | 48.0000 | 1 | 26.5500 |
Rolling Join / As-of Join¶
https://www.r-bloggers.com/understanding-data-table-rolling-joins/
Note that R's bracket join syntax has the opposite sense of left and right. In this example, we do not use R's convention for naming. We mean left like a left join -- rows of the left dataframe will be the rows of the output.
In [18]:
Copied!
# Setup
left_dt = data.table::data.table(t = c(1, 5, 10), left_value = c('a', 'b', 'c'))
data.table::setkeyv(left_dt, "t")
right_dt = data.table::data.table(t = c(1, 2, 3, 6, 7), right_value = c(1, 2, 3, 6, 7))
data.table::setkeyv(right_dt, "t")
# Setup
left_dt = data.table::data.table(t = c(1, 5, 10), left_value = c('a', 'b', 'c'))
data.table::setkeyv(left_dt, "t")
right_dt = data.table::data.table(t = c(1, 2, 3, 6, 7), right_value = c(1, 2, 3, 6, 7))
data.table::setkeyv(right_dt, "t")
In [19]:
Copied!
# Join in rows of right_dt with latest value of t before
right_dt[left_dt, roll = Inf]
# Note bracket join syntax being backwards: select left_dt's rows, from right_dt
# This is reverse join in R's convention, but a forward join in Python's.
# Join in rows of right_dt with latest value of t before
right_dt[left_dt, roll = Inf]
# Note bracket join syntax being backwards: select left_dt's rows, from right_dt
# This is reverse join in R's convention, but a forward join in Python's.
| t | right_value | left_value |
|---|---|---|
| <dbl> | <dbl> | <chr> |
| 1 | 1 | a |
| 5 | 3 | b |
| 10 | 7 | c |
In [20]:
Copied!
# Join in rows of right_dt with next value of t after
right_dt[left_dt, roll = -Inf]
# Note bracket join syntax being backwards: select left_dt's rows, from right_dt
# This is a forward join in R's convention, but a reverse join in Python's.
# Join in rows of right_dt with next value of t after
right_dt[left_dt, roll = -Inf]
# Note bracket join syntax being backwards: select left_dt's rows, from right_dt
# This is a forward join in R's convention, but a reverse join in Python's.
| t | right_value | left_value |
|---|---|---|
| <dbl> | <dbl> | <chr> |
| 1 | 1 | a |
| 5 | 6 | b |
| 10 | NA | c |
Note that roll can be set to a value to limit the window of rolling
For rolling in both directions, use rollends argument.
Row-bind¶
In [21]:
Copied!
dt1 = dt[1:3]
dt2 = dt[4:6]
data.table::rbindlist(list(dt1, dt2), use.names = TRUE)
# fill = TRUE to include disjoint columns and fill with NA
dt1 = dt[1:3]
dt2 = dt[4:6]
data.table::rbindlist(list(dt1, dt2), use.names = TRUE)
# fill = TRUE to include disjoint columns and fill with NA
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | sparkles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <dbl> | <int> | <int> | <chr> | <dbl> | <chr> | <chr> | <chr> | <int> | <chr> | <dbl> |
| 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NA | St Louis, MO | 3 |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON | 3 |
| 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | Montreal, PQ / Chesterville, ON | 8 | |
| 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 135 | Montreal, PQ / Chesterville, ON | 8 | |
| 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | Montreal, PQ / Chesterville, ON | 8 | |
| 1 | 1 | Anderson, Mr. Harry | male | 48.0000 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | NA | New York, NY | 3 |
Column-bind¶
In [22]:
Copied!
dt_left = dt[0:3, 1:3]
dt_right = dt[0:3, 4:6]
cbind(dt_left, dt_right)
dt_left = dt[0:3, 1:3]
dt_right = dt[0:3, 4:6]
cbind(dt_left, dt_right)
| pclass | survived | name | sex | age | sibsp |
|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <dbl> | <int> |
| 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 |
| 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 |
In [ ]:
Copied!