• Designed for large projects.
  • Data and features can be updated easily.
  • Data can be preprocessed.
  • Features are calculated on each ID of a grouping variable individually.
  • Easy parallelization with future.
  • Scales nicely for larger datasets. Data is read into RAM only when needed.
library(fxtract)
xtractor = Xtractor$new("xtractor")

Add Data

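Data must be added as dataframes with $add_data, and the grouping variable must be specified via group_by. You can either add one dataframe containing all IDs or add a separate dataframe for each ID. Adding the IDs one at a time is especially helpful for large datasets, because the complete dataset never has to be held in memory at once.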

  • Add all data at once:
xtractor$add_data(iris, group_by = "Species")
  • Add datasets individually:
library(dplyr)
for (i in unique(iris$Species)) {
  iris_i = iris %>% filter(Species == i)
  xtractor$add_data(iris_i, group_by = "Species")
}

Add Features

Features must be added as functions that take a dataframe as input and return a named vector. A named list whose entries are atomic and of length 1 is also allowed as output, which is useful when a function returns both numerical and categorical values. Each function is evaluated separately for each ID of the grouping variable.

fun1 = function(data) {
  c(mean_sepal_length = mean(data$Sepal.Length),
    sd_sepal_length = sd(data$Sepal.Length))
}

fun2 = function(data) {
  list(mean_petal_length = mean(data$Petal.Length),
    sd_petal_length = sd(data$Petal.Length))
}
xtractor$add_feature(fun1)
xtractor$add_feature(fun2)
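
As a minimal sketch of the named-list output described above, the hypothetical function fun3 below (its name and both of its features are made up for illustration and are not part of fxtract) returns one numerical and one categorical value. Calling it by hand on the rows of a single ID is a cheap way to check the shape of its output before registering it with $add_feature.

```r
# Hypothetical feature function, for illustration only.
# A named list is used so that numeric and character values can be mixed.
fun3 = function(data) {
  list(
    n_obs = nrow(data),
    petal_size = if (mean(data$Petal.Length) > 3) "long" else "short"
  )
}

# Try the function on the rows of one ID before adding it to the xtractor.
setosa = iris[iris$Species == "setosa", ]
out = fun3(setosa)
str(out)
```

Every entry has length 1, so the list satisfies the output requirement stated above.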

Calculate Features

Features are calculated by the method $calc_features():

xtractor$calc_features()

Collect Results

The final dataframe, with one row per ID and one column per feature, is available via $results:

xtractor$results
##      Species mean_sepal_length sd_sepal_length mean_petal_length
## 1     setosa             5.006       0.3524897             1.462
## 2 versicolor             5.936       0.5161711             4.260
## 3  virginica             6.588       0.6358796             5.552
##   sd_petal_length
## 1       0.1736640
## 2       0.4699110
## 3       0.5518947
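
To make the per-ID calculation concrete, the following base R sketch rebuilds the same table by hand. It only illustrates the idea of "split by the grouping variable, apply each feature function, bind the rows"; it is not how fxtract is implemented internally, and the object names groups, rows and check are chosen here for illustration.

```r
# Feature functions as defined in the "Add Features" section.
fun1 = function(data) {
  c(mean_sepal_length = mean(data$Sepal.Length),
    sd_sepal_length = sd(data$Sepal.Length))
}
fun2 = function(data) {
  list(mean_petal_length = mean(data$Petal.Length),
    sd_petal_length = sd(data$Petal.Length))
}

# Split the data by the grouping variable: one dataframe per ID.
groups = split(iris, iris$Species)

# Apply both functions to each ID and build one row per ID.
rows = lapply(names(groups), function(id) {
  d = groups[[id]]
  data.frame(Species = id, as.list(fun1(d)), as.list(fun2(d)))
})

# Bind the rows into a single dataframe.
check = do.call(rbind, rows)
check
```

Comparing check with xtractor$results is a quick sanity test when writing new feature functions.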

Parallelization

Parallelization is realized with the package future. Both feature calculation and data preprocessing are parallelized. On Windows and Linux machines you can parallelize as follows:

Use all cores

library(future)
plan(multisession)
future::nbrOfWorkers()
## system 
##      2

Set number of cores

plan(multisession, workers = 4)
future::nbrOfWorkers()
## [1] 4

Stop parallelization

plan(sequential)
future::nbrOfWorkers()
## [1] 1
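
Putting the three steps together

The steps above can be combined into one pattern: pick a number of workers, run the calculation, then switch back to sequential processing. The sketch below assumes you want to leave one core free, which is a common convention rather than a requirement of fxtract; the variable name n_workers is chosen here for illustration.

```r
library(future)

# Leave one core free; availableCores() reports the cores usable by this session.
n_workers = max(1, availableCores() - 1)
plan(multisession, workers = n_workers)

# Feature calculation would go here, e.g. xtractor$calc_features()

# Return to sequential processing once the calculation is done.
plan(sequential)
nbrOfWorkers()
```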