preproc_data.Rd
This function performs basic pre-processing of a given dataframe:
Data cleaning:
Handling values below the limit of detection (LOD) or quantification (LOQ).
Handling missing values.
Variable transformations:
Standardization (with custom center and scale functions).
Correcting for urine dilution with creatinine.
Outcome bounding (e.g., for TMLE).
preproc_data(
dat,
dat_desc,
covariates,
outcome,
dat_llodq,
dic_steps,
id_var,
by_var
)
dat: A dataframe containing the variables of interest. A dataframe.
dat_desc: An optional dataframe containing the same variables but with information on the type of values (e.g., 2 for a value <LOD). This is necessary to distinguish between types of missing values (e.g., missing because <LOD or because no sample was available). A dataframe.
covariates: A dataframe containing additional variables. A dataframe.
outcome: A string indicating the outcome variable. A string.
dat_llodq: An optional dataframe containing the LOD/LOQ values (val) for the variables (var) of interest. A dataframe.
dic_steps: An ordered, nested named list of steps to perform. A list. It can include the following elements:
llodq, to handle values <LOD/LOQ. A named list with elements:
  id_val, which values in dat_desc should be considered. A vector of integers.
  method, the method to be used. Currently, only a replacement approach is supported. A string.
  divide_by.
  creatinine_threshold, subjects whose creatinine levels are below this value will not be processed. Currently not used. A double.
  threshold_within, the threshold within each group for the fraction of values corresponding to id_val. An integer.
  threshold_overall, the overall threshold for the fraction of values corresponding to id_val. An integer.
  tune_sigma, currently not supported. A double.
missings, to handle missing values. A named list with elements:
  threshold_within, the missing value threshold within each group. An integer.
  threshold_overall, the overall missing value threshold. An integer.
  use_additional_covariates.
  selected_covariates, a vector of covariate names. A vector.
  method_imputation, the method to be used to impute values. A string.
  k, the number of nearest neighbors to use for kNN. An integer.
  path_save_res, the path to the directory in which to save figures.
  Currently, variables can be imputed in a univariate way (univariate), using selected covariates (selected), or using all the covariates available in covariates (all). A string.
creatinine, to handle confounding by dilution. A named list with elements:
  method, the method to be used. Currently, only covariate-adjusted standardization (cas) is implemented. A string.
  method_fit_args, options for fitting the models. Currently, only the family to be used within glm can be set. A list.
  creatinine_covariates_names.
  creatinine_name.
  path_save_res, the path to the directory in which to save figures.
transform, to transform variables. A named list with elements:
  transformation_fun, the transformation function (e.g., log).
standardization, to standardize variables. A named list with elements:
  center_fun, the centering function (e.g., median).
  scale_fun, the scaling function (e.g., IQR).
bound
, to bound the outcome variable. A named list with elements:
id_var: The variable name to be used to identify subjects. A string.
by_var: The variable name to group by. A string.
A pre-processed dataframe. A dataframe.
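As an illustration, the following is a minimal sketch of how a dic_steps list with the transform and standardization elements might be assembled, using the example functions named above (log, median, IQR). The toy vector x and the manual computation of x_std are assumptions made for demonstration: they show what a log transformation followed by median/IQR standardization amounts to in base R, not how preproc_data implements these steps internally. The commented call at the end uses hypothetical object and column names (my_covariates, "my_outcome", "subject_id", "cohort") and assumes that the optional arguments accept NULL.

```r
# Hypothetical step list: a log transformation followed by a robust
# standardization (median centering, IQR scaling). The element names
# follow the documentation above; the choice of steps is illustrative.
dic_steps <- list(
  transform = list(transformation_fun = log),
  standardization = list(center_fun = median, scale_fun = IQR)
)

# Toy exposure values (illustrative only).
x <- c(0.8, 1.2, 2.5, 0.4, 1.9)

# What the two steps amount to in base R, under the assumption that
# standardization means (x - center) / scale.
x_log <- dic_steps$transform$transformation_fun(x)
center <- dic_steps$standardization$center_fun(x_log)
scale <- dic_steps$standardization$scale_fun(x_log)
x_std <- (x_log - center) / scale

# Hypothetical call; not run here because it requires the package that
# provides preproc_data. All object and column names are assumptions.
# res <- preproc_data(
#   dat = dat,
#   dat_desc = NULL,
#   covariates = my_covariates,
#   outcome = "my_outcome",
#   dat_llodq = NULL,
#   dic_steps = dic_steps,
#   id_var = "subject_id",
#   by_var = "cohort"
# )
```

Because dic_steps is described as ordered, the position of each element in the list presumably determines the order in which the steps are applied, so the transformation is listed before the standardization here.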