Although we can perform dataset preprocessing inside the feature functions (e.g. clustering GPS points), it can be useful to perform preprocessing on the dataset once. Especially, if this takes a long time and is done again and again for different feature functions. To make sure that this is done for each ID of a grouping variable individually, it is safer to use the method shown here.

StudentLife Data

In this example we cluster the gps points and add a new column with the clusters.

For simplicity reasons, we only use the gps data of this dataset:

library(fxtract)
library(dplyr)
gps_data = studentlife_small %>% select(userId, latitude, longitude) %>% filter(!is.na(latitude))
head(gps_data)
##   userId latitude longitude
## 1     00 43.75913 -72.32924
## 2     00 43.75950 -72.32902
## 3     00 43.75913 -72.32924
## 4     00 43.75913 -72.32924
## 5     00 43.75913 -72.32924
## 6     00 43.75913 -72.32924
library(fxtract)
xtractor = Xtractor$new("xtractor")
xtractor$add_data(gps_data, group_by = "userId")

Define a Function

We need to define a function which has a dataframe as input and the preprocessed dataframe as output. The method $preprocess_data will then read the RDS files for each ID of the grouping variable, apply the function on each dataframe individually and save those as RDS files again. Parallelization is available via future.

library(fpc)
fun = function(data) {
  lat = data$latitude
  lon = data$longitude
  clust = dbscan(cbind(lat, lon), eps = 1.5, MinPts = 3)
  data$cluster = clust$cluster
  return(data)
}

Perform Preprocessing

xtractor$preprocess_data(fun = fun)

The data has successfully been preprocessed:

head(xtractor$get_data())
##   userId latitude longitude cluster
## 1     00 43.75913 -72.32924       1
## 2     00 43.75950 -72.32902       1
## 3     00 43.75913 -72.32924       1
## 4     00 43.75913 -72.32924       1
## 5     00 43.75913 -72.32924       1
## 6     00 43.75913 -72.32924       1