vignettes/tutorial/xtractor_preprocess_data.Rmd
Although preprocessing (e.g. clustering GPS points) can be done inside the feature functions, it is often better to preprocess the dataset once, especially if the step takes a long time and would otherwise be repeated for every feature function. The method shown here also makes sure that the preprocessing is applied to each ID of the grouping variable individually, which is safer than doing it by hand.
In this example we cluster the GPS points and add a new column with the cluster assignments.
For simplicity, we only use the GPS data of this dataset:
library(fxtract)
library(dplyr)
gps_data = studentlife_small %>%
  select(userId, latitude, longitude) %>%
  filter(!is.na(latitude))
head(gps_data)
## userId latitude longitude
## 1 00 43.75913 -72.32924
## 2 00 43.75950 -72.32902
## 3 00 43.75913 -72.32924
## 4 00 43.75913 -72.32924
## 5 00 43.75913 -72.32924
## 6 00 43.75913 -72.32924
library(fxtract)
xtractor = Xtractor$new("xtractor")
xtractor$add_data(gps_data, group_by = "userId")
We need to define a function that takes a dataframe as input and returns the preprocessed dataframe. The method $preprocess_data will then read the RDS files for each ID of the grouping variable, apply the function to each dataframe individually, and save the results as RDS files again. Parallelization is available via future.
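The function passed as fun in the call that follows is not defined in this section. A minimal sketch of what such a function could look like is given here, assuming k-means clustering from base R (stats::kmeans) with three centers and a fixed seed. Both the choice of algorithm and the number of centers are assumptions made for illustration, and the sketch also assumes that every ID has more than three distinct GPS points.

```r
# Hypothetical preprocessing function: receives the dataframe of ONE ID
# and returns it with an additional column holding the cluster assignment.
fun = function(data) {
  # k-means starts from randomly chosen centers; fix the seed for reproducibility
  set.seed(123)
  coords = data[, c("latitude", "longitude")]
  # Assumption: three clusters per ID; requires more than three distinct points
  data$cluster = kmeans(coords, centers = 3)$cluster
  data
}
```

Any function with the same shape (one dataframe in, one dataframe out) should fit here; the returned dataframe is what gets stored for that ID.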
xtractor$preprocess_data(fun = fun)
The data has successfully been preprocessed:
head(xtractor$get_data())
## userId latitude longitude cluster
## 1 00 43.75913 -72.32924 1
## 2 00 43.75950 -72.32902 1
## 3 00 43.75913 -72.32924 1
## 4 00 43.75913 -72.32924 1
## 5 00 43.75913 -72.32924 1
## 6 00 43.75913 -72.32924 1
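To picture why applying the function per ID matters, the following base R sketch mimics the split, apply and recombine logic on a toy dataset. It is a conceptual stand-in only and not how fxtract is implemented (the package works on RDS files per ID, as described above); the data values and the helper center_latitude are hypothetical.

```r
# Toy data with two IDs (hypothetical values, not taken from studentlife_small)
toy = data.frame(
  userId = c("00", "00", "01", "01"),
  latitude = c(43.0, 43.2, 10.0, 10.4)
)

# Hypothetical preprocessing step: offset of each latitude from the ID's own mean
center_latitude = function(data) {
  data$lat_centered = data$latitude - mean(data$latitude)
  data
}

# Split by ID, apply the function to each piece, and bind the pieces together again
pieces = split(toy, toy$userId)
per_id = do.call(rbind, lapply(pieces, center_latitude))
rownames(per_id) = NULL

print(per_id)
```

Because the mean is computed within each piece, the offsets of user 00 are not affected by the very different latitudes of user 01. Applying center_latitude to the whole dataframe at once would mix both users into a single mean.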