Although dataset preprocessing (e.g. clustering GPS points) can be done inside the feature functions, it is often better to preprocess the dataset once, especially if the preprocessing is slow and would otherwise be repeated for every feature function. The method shown here also ensures that the preprocessing is applied to each ID of the grouping variable individually.

StudentLife Data

In this example, we cluster the GPS points and add a new column containing the cluster assignments.

For simplicity, we use only the GPS data of this dataset:

library(fxtract)
library(dplyr)
gps_data = studentlife_small %>% select(userId, latitude, longitude) %>% filter(!is.na(latitude))
head(gps_data)
##   userId latitude longitude
## 1     00 43.75913 -72.32924
## 2     00 43.75950 -72.32902
## 3     00 43.75913 -72.32924
## 4     00 43.75913 -72.32924
## 5     00 43.75913 -72.32924
## 6     00 43.75913 -72.32924

Define a Function

We need to define a function that takes a data frame as input and returns the preprocessed data frame as output. The method $preprocess_data then reads the RDS file for each ID of the grouping variable, applies the function to each data frame individually, and saves the results as RDS files again. Parallelization is available via the future package.
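
A minimal sketch of such a function might look as follows. The function name `cluster_gps`, the choice of k-means from base R, and the number of clusters are assumptions made for illustration; they are not necessarily what was used to produce the output shown in this document. The function assumes that each data frame has at least one row and no missing coordinates.

```r
# Hypothetical preprocessing function: data frame in, data frame out.
cluster_gps = function(data) {
  coords = data[, c("latitude", "longitude")]
  # kmeans() fails when asked for more centers than distinct points,
  # so cap the number of clusters (2 is an arbitrary illustrative choice).
  k = min(2L, nrow(unique(coords)))
  # k-means starts from random centers; fix the seed for reproducibility.
  set.seed(1)
  data$cluster = kmeans(coords, centers = k, nstart = 5)$cluster
  data
}

# Toy data for a single user: two well-separated groups of points.
toy = data.frame(
  userId = "00",
  latitude = c(43.70, 43.70, 43.70, 44.90, 44.90),
  longitude = c(-72.30, -72.30, -72.31, -71.10, -71.10)
)
result = cluster_gps(toy)
head(result)
```

Because the function only sees the rows of one ID at a time, the cluster labels are local to each user: cluster 1 of one user is unrelated to cluster 1 of another.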

Perform Preprocessing
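
The following sketch shows how the preprocessing step could be run. It assumes the usual fxtract workflow of creating an Xtractor with `Xtractor$new()` and registering the data with `$add_data()`; the preprocessing function `fun` is an illustrative k-means version defined inline so that the snippet runs on its own, and the number of clusters is an arbitrary choice.

```r
library(fxtract)
library(dplyr)

# Same GPS subset as above, repeated so the snippet runs on its own.
gps_data = studentlife_small %>%
  select(userId, latitude, longitude) %>%
  filter(!is.na(latitude))

# Create the Xtractor and register the data, grouped by user.
# Note: this creates files on disk; re-running with the same name may fail.
xtractor = Xtractor$new("xtractor")
xtractor$add_data(gps_data, group_by = "userId")

# Illustrative preprocessing function (an assumption, not necessarily
# the one used to produce the output shown in this document).
fun = function(data) {
  coords = data[, c("latitude", "longitude")]
  # Cap the number of centers at the number of distinct points.
  k = min(2L, nrow(unique(coords)))
  data$cluster = kmeans(coords, centers = k)$cluster
  data
}

# Optional: choose a parallel backend before preprocessing, e.g.
# future::plan(future::multisession)

# Apply fun to the stored data of each userId individually.
xtractor$preprocess_data(fun)
```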

The data has been preprocessed successfully and now contains the additional cluster column:

head(xtractor$get_data())
##   userId latitude longitude cluster
## 1     00 43.75913 -72.32924       1
## 2     00 43.75950 -72.32902       1
## 3     00 43.75913 -72.32924       1
## 4     00 43.75913 -72.32924       1
## 5     00 43.75913 -72.32924       1
## 6     00 43.75913 -72.32924       1