Basic pre-processing of dataframes — preproc

This function performs basic pre-processing of a given dataframe:

Data cleaning:
- Handling values below the limit of detection (LOD) or quantification (LOQ).
- Handling missing values.
Variable transformations:
- Standardization (with custom center and scale functions).
- Correcting for urine dilution with creatinine.
- Outcome bounding (e.g., for TMLE).
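
As a rough illustration of two of the transformations listed above, the base R sketch below standardizes a vector with user-supplied center and scale functions and rescales an outcome to the unit interval, as is commonly done before TMLE. The helper names `standardize` and `bound_outcome` are hypothetical and are not part of this package; its internal implementation may differ.

```r
# Hypothetical helper (not part of the package): standardize a numeric
# vector with custom centering and scaling functions.
standardize <- function(x, center_fun = median, scale_fun = IQR) {
  (x - center_fun(x, na.rm = TRUE)) / scale_fun(x, na.rm = TRUE)
}

# Hypothetical helper (not part of the package): min-max rescaling of an
# outcome to [0, 1]. This is an assumption about what "bounding" means here.
bound_outcome <- function(y) {
  rng <- range(y, na.rm = TRUE)
  (y - rng[1]) / (rng[2] - rng[1])
}

x <- c(1, 2, 3, 4, 100)
z <- standardize(x)                       # median/IQR are robust to the outlier
y_bounded <- bound_outcome(c(10, 15, 20)) # smallest value maps to 0, largest to 1
```

Using the median and IQR rather than the mean and standard deviation keeps a single extreme value from dominating the scale, which is one reason custom center and scale functions can be useful.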

preproc_data(
  dat,
  dat_desc,
  covariates,
  outcome,
  dat_llodq,
  dic_steps,
  id_var,
  by_var
)

Arguments

dat

A dataframe containing the variables of interest. A dataframe.

dat_desc

An optional dataframe containing the same variables but with information on the type of values (e.g., 2 for value <LOD). This is necessary to distinguish the type of missing values (e.g., missing because <LOD or because no sample was available). A dataframe.

covariates

A dataframe containing additional variables. A dataframe.

outcome

A string indicating the outcome variable. A string.

dat_llodq

An optional dataframe containing the LOD/LOQ values (val) for the variables (var) of interest. A dataframe.
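
To make the relationship between dat, dat_desc, and dat_llodq more concrete, here is a minimal sketch with toy data. The column names `id` and `bpa`, the numeric values, and the codes 1 and 3 are hypothetical; only the code 2 for values <LOD and the column names `var` and `val` come from the descriptions above.

```r
# Toy measurements: two subjects have no reported value for the exposure.
dat <- data.frame(
  id  = c("s1", "s2", "s3"),
  bpa = c(1.2, NA, NA)
)

# Same layout, but each cell encodes the type of value.
# Assumed coding: 1 = quantified, 2 = <LOD, 3 = no sample available.
dat_desc <- data.frame(
  id  = c("s1", "s2", "s3"),
  bpa = c(1, 2, 3)
)

# One row per variable, giving its LOD/LOQ value.
dat_llodq <- data.frame(var = "bpa", val = 0.1)

# Missing values that are missing because they fell below the LOD,
# as opposed to missing because no sample was available.
below_lod <- is.na(dat$bpa) & dat_desc$bpa == 2
```

Without dat_desc, the two missing values for `s2` and `s3` would be indistinguishable, although only the first should be treated as a value below the LOD.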

dic_steps

An ordered, nested named list of steps to perform. A list. It can include the following elements:

llodq, to handle values <LOD/LOQ. A named list with elements:
- id_val, which values in dat_desc should be considered. A vector of integers.
- method, the method to be used. Currently, only a replacement approach is supported. A string.
- divide_by, .
- creatinine_threshold, subjects for which the creatinine levels are below this value will not be processed. Currently not used. A double.
- threshold_within, the threshold within each group for the fraction of values corresponding to id_val. An integer.
- threshold_overall, the overall threshold for the fraction of values corresponding to id_val. An integer.
- tune_sigma, currently not supported. A double.
missings, to handle missing values. A named list with elements:
- threshold_within, the missing value threshold within each group. An integer.
- threshold_overall, the overall missing value threshold. An integer.
- use_additional_covariates, .
- selected_covariates, a vector of covariates' names. A vector.
- method_imputation, method to be used to impute values. A string.
- k, the number of nearest neighbors to use for kNN. An integer.
- path_save_res, path to directory where to save figures. Currently, variables can be imputed in a univariate way (univariate), using selected covariates (selected), or all the covariates available in covariates (all). A string.
creatinine, to handle confounding by dilution. A named list with elements:
- method, the method to be used. Currently, only covariate-adjusted standardization is implemented (cas). A string.
- method_fit_args, options for fitting the models. Currently, only the family to be used within glm. A list.
- creatinine_covariates_names, .
- creatinine_name, .
- path_save_res, path to directory where to save figures.
transform, to transform variables. A named list with elements:
- transformation_fun, the transformation function (e.g., log).
standardization, to standardize variables. A named list with elements:
- center_fun, the centering function (e.g., median).
- scale_fun, the scaling function (e.g., IQR).
bound, to bound the outcome variable. A named list with elements:
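
As an illustration of the nested structure described above, the sketch below builds a small dic_steps list with three of the documented steps. The element names are taken from the list above; every value is an assumption chosen for illustration (in particular, the thresholds are assumed here to be percentages), and elements whose expected values are not documented are left out.

```r
# A minimal, hypothetical dic_steps list. The order of the top-level
# elements matters, since the list is described as ordered.
dic_steps <- list(
  llodq = list(
    id_val            = 2L,   # code marking values <LOD in dat_desc
    threshold_within  = 30L,  # assumed to be a percentage
    threshold_overall = 20L   # assumed to be a percentage
  ),
  transform = list(
    transformation_fun = log
  ),
  standardization = list(
    center_fun = median,
    scale_fun  = IQR
  )
)

# Inspect the step order and one of the stored functions.
step_names <- names(dic_steps)
centered   <- dic_steps$standardization$center_fun(c(1, 2, 9))
```

A list of this shape would then be supplied as the dic_steps argument of preproc_data(), together with dat, id_var, by_var, and the other arguments.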

id_var

The variable name to be used to identify subjects. A string.

by_var

The variable name to group by. A string.

Value

A pre-processed dataframe. A dataframe.