Xtractor calculates features from raw data for each ID of a grouping variable individually. This process can be parallelized with the package future.

Format

R6Class object.

Usage

xtractor = Xtractor$new("xtractor")

Arguments

For Xtractor$new():

name:

(`character(1)`): A user-defined name of the Xtractor. All necessary data will be saved under the path ./fxtract_files/name/.

load:

(`logical(1)`): If TRUE, an existing Xtractor will be loaded.

file.dir:

(`character(1)`): Path where all files of the Xtractor are saved. Default is the current working directory.

Details

All datasets and feature functions are saved in this R6 object. Each dataset is saved as a single RDS file (one per ID), and feature functions are calculated on each dataset separately. A big advantage of this approach is that it scales well to larger datasets: data is only read into RAM when needed.
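The storage idea described above can be illustrated in base R. This is a minimal sketch, not the package's internal code: the directory `store_dir` and the helper `read_id()` are hypothetical names chosen for illustration.

```r
# Illustrative sketch only: mimics the one-RDS-file-per-ID idea,
# not fxtract's actual implementation.
store_dir <- file.path(tempdir(), "rds_per_id_sketch")  # hypothetical location
dir.create(store_dir, showWarnings = FALSE, recursive = TRUE)

# Write one RDS file per ID of the grouping variable.
groups <- split(iris, iris$Species)
for (id in names(groups)) {
  saveRDS(groups[[id]], file.path(store_dir, paste0(id, ".RDS")))
}

# Later, read only the dataset that is actually needed.
read_id <- function(id) readRDS(file.path(store_dir, paste0(id, ".RDS")))
setosa <- read_id("setosa")
mean_sepal_length <- mean(setosa$Sepal.Length)
```

Because each ID lives in its own file, a calculation on one ID never requires the full dataset to be in memory.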

Fields

error_messages:

(`data.frame()`): Active binding. A dataframe with information about error messages.

ids:

(`character()`): Active binding. A character vector with the IDs of the grouping variable.

features:

(`character()`): Active binding. A character vector with the feature functions which were added.

status:

(`data.frame()`): Active binding. A dataframe with an overview of which features have been calculated on which datasets.

results:

(`data.frame()`): Active binding. A dataframe with all calculated features of all IDs.

Methods

add_data(data, group_by)

[data: (`data.frame` | `data.table`)] A dataframe or data.table which shall be added to the R6 object.
[group_by: (`character(1)`)] The name of the grouping variable in the dataframe.

This method writes single RDS files for each group.

preprocess_data(fun)

[fun: (`function`)] A function that takes a dataframe as input and returns a dataframe.

This method loads the RDS files and applies the function to each of them. The old RDS files are overwritten.
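Since the old RDS files are overwritten, it can help to try a preprocessing function on the data of a single ID first. A minimal sketch follows; `add_sepal_ratio` is a hypothetical function used only for illustration.

```r
# Hypothetical preprocessing function: dataframe in, dataframe out.
add_sepal_ratio <- function(data) {
  data$Sepal.Ratio <- data$Sepal.Length / data$Sepal.Width
  data
}

# Try it on the data of one ID before handing it to the Xtractor.
one_id <- iris[iris$Species == "setosa", ]
processed <- add_sepal_ratio(one_id)

# With an existing Xtractor, the function would then be passed as:
# xtractor$preprocess_data(add_sepal_ratio)
```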

remove_data(ids)

[ids: (`character()`)] One or many IDs of the grouping variable.

This method deletes the RDS files of the given IDs.

get_data(ids)

[ids: (`character()`)] One or many IDs of the grouping variable.

This method returns one dataframe with the chosen IDs.

add_feature(fun, check_fun)

[fun: (`function`)] A function that takes a dataframe as input and returns a named vector or list.
[check_fun: (`logical(1)`)] If TRUE, the function is checked to make sure it returns a vector or a list. Defaults to TRUE. Disable if the calculation takes too long.

This method adds the feature function to the R6 object. It writes an RDS file of the function which can be retrieved later.
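A feature function may return several named values at once. The sketch below shows such a function and a manual check of its output; `sepal_stats` is a hypothetical name, and the check is only similar in spirit to what `check_fun = TRUE` is described to do (the check inside the package may differ).

```r
# Hypothetical feature function: one named value per feature.
sepal_stats <- function(data) {
  c(
    mean_sepal_length = mean(data$Sepal.Length),
    sd_sepal_length = sd(data$Sepal.Length)
  )
}

# Manual check on the data of one ID: the output should be a vector or list
# in which every element has a name.
out <- sepal_stats(iris[iris$Species == "versicolor", ])
is_valid <- (is.atomic(out) || is.list(out)) &&
  !is.null(names(out)) && all(names(out) != "")

# With an existing Xtractor, the function would then be added as:
# xtractor$add_feature(sepal_stats)
```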

remove_feature(fun)

[fun: (`function | character(1)`)] A function (or the name of the function as character) which shall be removed.

This method removes the function from the object and deletes all corresponding files and results.

get_feature(fun)

[fun: (`character(1)`)] The name of a function as character.

This method reads the RDS file of the function. Useful for debugging after loading an Xtractor.

calc_features(features, ids)

[features: (`character()`)] A character vector of the names of the features which shall be calculated. Defaults to all features.
[ids: (`character()`)] One or many IDs of the grouping variable. Defaults to all IDs.

This method calculates the chosen features on the chosen IDs.

retry_failed_features(features)

[features: (`character()`)] A character vector of the names of the features which shall be calculated. Defaults to all features.

This method retries calculation of failed features. Useful if calculation failed because of memory problems.
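A feature function can fail on some IDs while succeeding on others. The base R sketch below shows one way such failures could be recorded per ID, so that only those IDs need to be retried. It is an illustration under that assumption, not the package's implementation; `safe_apply` is a hypothetical helper.

```r
# Illustrative sketch only, not fxtract's implementation.
fun2 <- function(data) {
  if ("setosa" %in% data$Species) stop("my error")
  c(sd_sepal_length = sd(data$Sepal.Length))
}

# Apply a feature function to the data of one ID and capture an error
# instead of aborting the whole run.
safe_apply <- function(fun, data) {
  tryCatch(
    list(value = fun(data), error = NA_character_),
    error = function(e) list(value = NULL, error = conditionMessage(e))
  )
}

groups <- split(iris, iris$Species)
runs <- lapply(groups, function(d) safe_apply(fun2, d))

# IDs whose calculation failed, together with the captured message.
failed <- Filter(function(r) !is.na(r$error), runs)
error_messages <- data.frame(
  id = names(failed),
  error_message = vapply(failed, function(r) r$error, character(1)),
  row.names = NULL
)
```

Only the IDs listed in `error_messages` would have to be recalculated; results for the other IDs can be kept.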

print()

[internal] method to print the R6 object.

clone()

[internal] method to clone the R6 object.

initialize()

[internal] method to initialize the R6 object.

Examples

# one feature function
dir = tempdir()
xtractor = Xtractor$new("xtractor", file.dir = dir)
xtractor$add_data(iris, group_by = "Species")
#> Warning: `distinct_()` is deprecated as of dplyr 0.7.0.
#> Please use `distinct()` instead.
#> See vignette('programming') for more help
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Saving raw RDS files.
#>   |                                                                      |   0%
#>   |=======================                                               |  33%
#>   |===============================================                       |  67%
#>   |======================================================================| 100%
xtractor$ids
#> [1] "setosa" "versicolor" "virginica"
fun = function(data) {
  c(mean_sepal_length = mean(data$Sepal.Length))
}
xtractor$add_feature(fun)
xtractor$features
#> [1] "fun"
xtractor$calc_features()
#> Calculating features on 1 core(s).
#> Calculating feature function: fun
xtractor$results
#>      Species mean_sepal_length
#> 1     setosa             5.006
#> 2 versicolor             5.936
#> 3  virginica             6.588
xtractor$status
#>      Species  fun
#> 1     setosa done
#> 2 versicolor done
#> 3  virginica done
xtractor
#> R6 Object: Xtractor
#> Name: xtractor
#> Grouping variable: Species
#> IDs: setosa, versicolor, virginica
#> Feature functions: fun
#> Extraction done: 100%
#> Errors during calculation: 0
# failing function on only one ID
fun2 = function(data) {
  if ("setosa" %in% data$Species) stop("my error")
  c(sd_sepal_length = sd(data$Sepal.Length))
}
xtractor$add_feature(fun2)
xtractor$calc_features()
#> Feature function 'fun' was already applied on every ID and will be skipped.
#> Calculating features on 1 core(s).
#> Calculating feature function: fun2
xtractor$results
#>      Species mean_sepal_length sd_sepal_length
#> 1     setosa             5.006              NA
#> 2 versicolor             5.936       0.5161711
#> 3  virginica             6.588       0.6358796
xtractor$error_messages
#>   feature_function     id error_message
#> 1             fun2 setosa      my error
xtractor
#> R6 Object: Xtractor
#> Name: xtractor
#> Grouping variable: Species
#> IDs: setosa, versicolor, virginica
#> Feature functions: fun, fun2
#> Extraction done: 83.3333333333333%
#> Errors during calculation: 1
# remove feature function
xtractor$remove_feature("fun2")
xtractor$results
#>      Species mean_sepal_length
#> 1     setosa             5.006
#> 2 versicolor             5.936
#> 3  virginica             6.588
xtractor
#> R6 Object: Xtractor
#> Name: xtractor
#> Grouping variable: Species
#> IDs: setosa, versicolor, virginica
#> Feature functions: fun
#> Extraction done: 100%
#> Errors during calculation: 0
# remove ID
xtractor$remove_data("setosa")
#> Deleting RDS file setosa.RDS
#> Deleting results from id: setosa
xtractor$results
#>      Species mean_sepal_length
#> 1 versicolor             5.936
#> 2  virginica             6.588
xtractor$ids
#> [1] "versicolor" "virginica"
xtractor
#> R6 Object: Xtractor
#> Name: xtractor
#> Grouping variable: Species
#> IDs: versicolor, virginica
#> Feature functions: fun
#> Extraction done: 100%
#> Errors during calculation: 0
# get datasets and functions
fun3 = xtractor$get_feature("fun")
df = xtractor$get_data()
dplyr_wrapper(data = df, group_by = "Species", fun = fun3)
#>      Species mean_sepal_length
#> 1 versicolor             5.936
#> 2  virginica             6.588