This function performs basic pre-processing of a given dataframe:

  • Data cleaning:

    • Handling values below the limit of detection (LOD) or quantification (LOQ).

    • Handling missing values.

  • Variable transformations:

    • Standardization (with custom center and scale functions).

    • Correcting for urine dilution with creatinine.

    • Outcome bounding (e.g., for TMLE).

preproc_data(
  dat,
  dat_desc,
  covariates,
  outcome,
  dat_llodq,
  dic_steps,
  id_var,
  by_var
)

Arguments

dat

A dataframe containing the variables of interest. A dataframe.

dat_desc

An optional dataframe containing the same variables as dat, but holding codes for the type of each value (e.g., 2 for a value <LOD). This is necessary to distinguish between types of missing values (e.g., missing because <LOD or because no sample was available). A dataframe.

covariates

A dataframe containing additional variables. A dataframe.

outcome

A string indicating the outcome variable. A string.

dat_llodq

An optional dataframe containing the LOD/LOQ values (val) for the variables (var) of interest. A dataframe.

dic_steps

An ordered, nested named list of steps to perform. A list. It can include the following elements:

  • llodq, to handle values <LOD/LOQ. A named list with elements:

    • id_val, which values in dat_desc should be considered. A vector of integers.

    • method, the method to be used. Currently, only a replacement approach is supported. A string.

    • divide_by, .

    • creatinine_threshold, subjects for which the creatinine levels are below this value will not be processed. Currently not used. A double.

    • threshold_within, the threshold within each group for the fraction of values corresponding to id_val. An integer.

    • threshold_overall, the overall threshold for the fraction of values corresponding to id_val. An integer.

    • tune_sigma, currently not supported. A double.

  • missings, to handle missing values. Currently, variables can be imputed in a univariate way (univariate), using selected covariates (selected), or using all the covariates available in covariates (all). A named list with elements:

    • threshold_within, the missing value threshold within each group. An integer.

    • threshold_overall, the overall missing value threshold. An integer.

    • use_additional_covariates, .

    • selected_covariates, a vector of covariates' names. A vector.

    • method_imputation, the method to be used to impute values. A string.

    • k, the number of nearest neighbors to use for kNN. An integer.

    • path_save_res, path to the directory in which to save figures. A string.

  • creatinine, to handle confounding by dilution. A named list with elements:

    • method, the method to be used. Currently, only covariate-adjusted standardization is implemented (cas). A string.

    • method_fit_args, options for fitting the models. Currently, only the family to be used within glm. A list.

    • creatinine_covariates_names, .

    • creatinine_name, .

    • path_save_res, path to the directory in which to save figures.

  • transform, to transform variables. A named list with elements:

    • transformation_fun, the transformation function (e.g., log).

  • standardization, to standardize variables. A named list with elements:

    • center_fun, the centering function (e.g., median).

    • scale_fun, the scaling function (e.g., IQR).

  • bound, to bound the outcome variable. A named list with elements:
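
The nested structure of dic_steps can be easier to read as code. The following is only a sketch: the element names come from the list above, but every value (including the "replace" and "knn" labels, the thresholds, and the covariate names) is a hypothetical placeholder, not a documented default. Elements whose meaning is not described above are given placeholder values, and bound is left empty because its elements are not listed.

```r
# Hypothetical dic_steps configuration. Element names follow the
# documentation; all values are illustrative assumptions.
dic_steps <- list(
  llodq = list(
    id_val = c(2L),               # codes in dat_desc to treat as <LOD/LOQ
    method = "replace",           # assumed label for the replacement approach
    divide_by = 2,                # placeholder value
    creatinine_threshold = 0,     # documented as currently not used
    threshold_within = 40L,       # placeholder value
    threshold_overall = 30L,      # placeholder value
    tune_sigma = 1                # documented as currently not supported
  ),
  missings = list(
    threshold_within = 40L,       # placeholder value
    threshold_overall = 30L,      # placeholder value
    use_additional_covariates = FALSE,   # placeholder value
    selected_covariates = c("age", "sex"),  # hypothetical covariate names
    method_imputation = "knn",    # assumed label
    k = 5L,
    path_save_res = tempdir()
  ),
  creatinine = list(
    method = "cas",
    method_fit_args = list(family = gaussian()),
    creatinine_covariates_names = c("age", "sex"),  # hypothetical names
    creatinine_name = "creatinine",                 # hypothetical name
    path_save_res = tempdir()
  ),
  transform = list(transformation_fun = log),
  standardization = list(center_fun = median, scale_fun = IQR),
  bound = list()                  # elements not documented above
)

# The order of the elements is the order of the steps.
names(dic_steps)
```

Because the list is described as ordered, reordering its elements would presumably change the order in which the steps are applied.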

id_var

The variable name to be used to identify subjects. A string.

by_var

The variable name to group by. A string.
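
To make the relationship between dat, dat_desc, and dat_llodq concrete, here is a toy sketch in base R. The column names (id, cohort, chem1) and the codes 1 and 3 are hypothetical; only the code 2 for values <LOD is taken from the description of dat_desc. The hand-written replacement assumes that values <LOD are replaced by the LOD divided by divide_by, which the documentation does not spell out, so it should not be read as the function's actual behavior.

```r
# Toy inputs (hypothetical column names and codes).
dat <- data.frame(
  id = 1:4,
  cohort = c("a", "a", "b", "b"),
  chem1 = c(0.8, NA, 1.5, NA)
)
# Same layout as dat, but holding value-type codes:
# 1 = quantified (assumed), 2 = <LOD, 3 = no sample available (assumed).
dat_desc <- data.frame(
  id = 1:4,
  cohort = c("a", "a", "b", "b"),
  chem1 = c(1L, 2L, 1L, 3L)
)
# One row per variable, with its LOD/LOQ value.
dat_llodq <- data.frame(var = "chem1", val = 0.5)

# Replacement sketched by hand (assumption: LOD / divide_by).
divide_by <- 2
lod <- dat_llodq$val[dat_llodq$var == "chem1"]
is_below_lod <- dat_desc$chem1 %in% 2L
dat$chem1[is_below_lod] <- lod / divide_by

dat$chem1
# 0.80 0.25 1.50 NA  (the entry with no sample stays missing)
```

The remaining NA is the kind of value that dat_desc lets the function tell apart from a value <LOD.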

Value

A pre-processed dataframe. A dataframe.
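
As an illustration of two of the transformations listed at the top, the sketch below shows standardization with custom center and scale functions, and bounding of an outcome to the unit interval (a common preparation step for TMLE with continuous outcomes). The helpers standardize() and bound_outcome() are hypothetical names written for this example; they are not part of the package and may differ from what preproc_data does internally.

```r
# Standardize with arbitrary centering and scaling functions.
standardize <- function(x, center_fun = median, scale_fun = IQR) {
  (x - center_fun(x, na.rm = TRUE)) / scale_fun(x, na.rm = TRUE)
}

# Rescale an outcome to [0, 1] using its observed range
# (assumes the outcome is not constant).
bound_outcome <- function(y) {
  rng <- range(y, na.rm = TRUE)
  (y - rng[1]) / (rng[2] - rng[1])
}

x <- c(1, 2, 3, 4, 5)

standardize(x)
# median = 3, IQR = 2, so: -1.0 -0.5 0.0 0.5 1.0

bound_outcome(x)
# 0.00 0.25 0.50 0.75 1.00
```

Passing mean and sd instead of median and IQR would give the usual z-score.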