BDcocolasso

An R package for high-dimensional errors-in-variables regression. It implements the CoCoLasso algorithm for covariates affected by additive error or missing data. It also implements a variant of CoCoLasso called Block-Descent CoCoLasso (BD-CoCoLasso), designed for settings where only a small percentage of the features are corrupted (by additive error or missing data).

This package is based on the CoCoLasso algorithm. CoCoLasso requires a positive semi-definite projection of the covariance matrix, which is computationally demanding for a high-dimensional feature set. In a very high-dimensional context with both corrupted and uncorrupted covariates, where the proportion of corrupted features is relatively small, we take advantage of block descent minimization to develop a more efficient algorithm called BD-CoCoLasso. In this alternating block minimization algorithm, the CoCoLasso corrections are used when updating the coefficients of the corrupted covariates, and a simple lasso is used for the coefficients of the uncorrupted covariates. Both sub-problems are convex, so a global solution can be obtained. Adapting the cross-validation step requires care, however, because this setting involves products of corrupted and uncorrupted matrices.
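To illustrate why a positive semi-definite projection is needed at all, here is a minimal base-R sketch for the additive error setting. It does not use this package. It assumes a known error standard deviation `tau`, and it substitutes the simple Frobenius-norm projection (clipping negative eigenvalues) for the element-wise maximum norm projection used by the published CoCoLasso method. The helper `project_psd` is hypothetical and exists only for this example.

```r
set.seed(1)
n <- 20; p <- 50; tau <- 0.5

X <- matrix(rnorm(n * p), n, p)            # true covariates (unobserved in practice)
Z <- X + tau * matrix(rnorm(n * p), n, p)  # observed covariates with additive error

# Bias-corrected surrogate for X'X / n under additive error with known tau
sigma_hat <- crossprod(Z) / n - tau^2 * diag(p)

# With p > n the surrogate has negative eigenvalues, so the corrected
# lasso objective built from it would not be convex
min_eig_before <- min(eigen(sigma_hat, symmetric = TRUE, only.values = TRUE)$values)

# Frobenius-norm projection onto the PSD cone: clip negative eigenvalues at zero
# (an assumption for illustration, not the projection the package performs)
project_psd <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  V <- e$vectors
  V %*% (pmax(e$values, 0) * t(V))
}

sigma_psd <- project_psd(sigma_hat)
min_eig_after <- min(eigen(sigma_psd, symmetric = TRUE, only.values = TRUE)$values)
```

The full eigendecomposition of a p x p matrix is what makes this step expensive when p is large, which is the motivation for restricting the correction to the block of corrupted covariates.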

Installation

Install the released version with:

install.packages("BDcocolasso")

Or install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("celiaescribe/BDcocolasso")

Vignette

See the online vignette for details about the BDcoco model and example usage of the functions.

Model input

The package can be used in two settings: the simple CoCoLasso version and the Block-Descent CoCoLasso version. The inputs vary according to the chosen algorithm setting and the chosen noise setting.

CoCoLasso setting: This method requires seven inputs (let n be the number of observations and p the number of X variables):

X: n x p matrix of covariates. Can be high-dimensional, i.e., p >> n. Features must be continuous or binary categorical. In the missing data setting, X may contain missing values coded as NA.
y: A continuous response of length n.
n: Number of samples.
p: Number of covariates.
noise: Type of noise setting, either additive or missing. In the additive setting, the tau parameter must also be specified; it corresponds to the standard deviation of the additive error matrix. In the missing setting, no additional parameter is needed.
block: Whether to use the block-descent version. Here, block must be FALSE.
penalty: Type of penalty, either lasso or SCAD.
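As a rough illustration of the inputs listed above, the following base-R sketch simulates data for the additive setting and collects the seven inputs, plus tau, in a named list. The element names follow this README; the argument names expected by the package's fitting function may differ, so check the package documentation before passing them on. The simulated coefficients and the value of tau are arbitrary choices made for this example.

```r
set.seed(42)
n <- 100; p <- 200; tau <- 0.3

# Simulated truth: only the first three covariates have an effect
X_true <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))
y <- as.numeric(X_true %*% beta + rnorm(n))

# Observed covariates, corrupted by additive error with standard deviation tau
X <- X_true + tau * matrix(rnorm(n * p), n, p)

# Inputs for the simple CoCoLasso setting (names as used in this README)
inputs <- list(
  X = X, y = y,
  n = nrow(X), p = ncol(X),
  noise = "additive", tau = tau,  # tau is required in the additive setting
  block = FALSE,                  # simple CoCoLasso, no block descent
  penalty = "lasso"
)

# Basic consistency checks before fitting
stopifnot(is.matrix(inputs$X), length(inputs$y) == inputs$n, !anyNA(inputs$X))
```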

BD-CoCoLasso setting: This method requires nine inputs (let n be the number of observations, p the number of X variables, p1 the number of uncorrupted variables and p2 the number of corrupted variables, with p1 + p2 = p):

X: n x p matrix of covariates. Can be high-dimensional, i.e., p >> n. Features must be continuous or binary categorical. In the missing data setting, X may contain missing values coded as NA. The first p1 columns must correspond to the uncorrupted covariates, and the last p2 columns must correspond to the corrupted covariates.
y: A continuous response of length n.
n: Number of samples.
p: Number of covariates.
p1: Number of uncorrupted covariates.
p2: Number of corrupted covariates.
noise: Type of noise setting, either additive or missing. In the additive setting, the tau parameter must also be specified; it corresponds to the standard deviation of the additive error matrix. In the missing setting, no additional parameter is needed.
block: Whether to use the block-descent version. Here, block must be TRUE.
penalty: Type of penalty, either lasso or SCAD.
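The column ordering requirement is easy to get wrong, so here is a small base-R sketch for the missing data setting. It assumes that corrupted columns can be recognised by the presence of NA values, which only makes sense when noise is missing; with additive error, the corrupted columns have to be known in advance. The data are simulated and the column names are made up for this example.

```r
set.seed(7)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))

# Knock out 10 values in each of three columns to mimic missing data
for (j in c(2, 5, 9)) X[sample(n, 10), j] <- NA

# A column counts as corrupted if it contains at least one NA
has_na <- colSums(is.na(X)) > 0

# Uncorrupted columns first, corrupted columns last
X_sorted <- cbind(X[, !has_na, drop = FALSE], X[, has_na, drop = FALSE])
p1 <- sum(!has_na)  # number of uncorrupted covariates
p2 <- sum(has_na)   # number of corrupted covariates
```

After this step, `X_sorted`, `p1` and `p2` satisfy the layout described above, with p1 + p2 = p.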

Three-block BD-CoCoLasso setting: This method handles a mixed error setting where both additive error and missing data occur. It is run with the function generalcoco. The required inputs are the same as in the BD-CoCoLasso setting, except that p2 is the number of corrupted covariates measured with additive error, and an additional parameter p3 is the number of corrupted covariates with missing data. In both block settings, the covariates must be sorted so that the uncorrupted covariates are in the first columns. In the three-block setting, the covariates with additive error must also precede the covariates with missing data.
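One way to arrange the columns for the three-block setting is sketched below in base R. The helper `sort_blocks` and the labels "uncorrupted", "additive" and "missing" are hypothetical and not part of the package; the sketch assumes the user already knows the corruption type of every column.

```r
# Reorder columns as uncorrupted | additive error | missing data
# and return the block sizes p1, p2, p3 alongside the sorted matrix
sort_blocks <- function(X, type) {
  type <- factor(type, levels = c("uncorrupted", "additive", "missing"))
  stopifnot(length(type) == ncol(X), !anyNA(type))
  idx <- c(which(type == "uncorrupted"),
           which(type == "additive"),
           which(type == "missing"))
  counts <- table(type)
  list(X  = X[, idx, drop = FALSE],
       p1 = counts[["uncorrupted"]],
       p2 = counts[["additive"]],
       p3 = counts[["missing"]])
}

set.seed(3)
X <- matrix(rnorm(6 * 5), 6, 5, dimnames = list(NULL, c("a", "b", "c", "d", "e")))
X[2, "b"] <- NA  # column b has missing data
type <- c("additive", "missing", "uncorrupted", "additive", "uncorrupted")

blocks <- sort_blocks(X, type)
colnames(blocks$X)  # "c" "e" "a" "d" "b"
```

Within each block the original column order is preserved, which makes it straightforward to map fitted coefficients back to the original variable names.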

Contact

Email: celia.escribe@polytechnique.edu; tianyuan.lu@mail.mcgill.ca; karim.oualkacha@uqam.ca; celia.greenwood@mcgill.ca

Credit

We were inspired by the following studies:
