Title: | Selection by Partitioning the Solution Paths |
---|---|
Description: | An implementation of the feature Selection procedure by Partitioning the entire Solution Paths (namely SPSP) to identify the relevant features rather than using a single tuning parameter. By utilizing the entire solution paths, this procedure can obtain better selection accuracy than the commonly used approach of selecting only one tuning parameter based on existing criteria, cross-validation (CV), generalized CV, AIC, BIC, and extended BIC (Liu, Y., & Wang, P. (2018) <doi:10.1214/18-EJS1434>). It is more stable and accurate (low false positive and false negative rates) than other variable selection approaches. In addition, it can be flexibly coupled with the solution paths of Lasso, adaptive Lasso, ridge regression, and other penalized estimators. |
Authors: | Xiaorui (Jeremy) Zhu [aut, cre], Yang Liu [aut], Peng Wang [aut] |
Maintainer: | Xiaorui (Jeremy) Zhu <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.0.9000 |
Built: | 2025-01-06 03:46:29 UTC |
Source: | https://github.com/xiaoruizhu/spsp |
An implementation of the feature Selection procedure by Partitioning the entire Solution Paths (namely SPSP) to identify the relevant features rather than using a single tuning parameter. By utilizing the entire solution paths, this procedure can obtain better selection accuracy than the commonly used approach of selecting only one tuning parameter based on existing criteria, cross-validation (CV), generalized CV, AIC, BIC, and EBIC (Liu, Y., & Wang, P. (2018)). It is more stable and accurate (low false positive and false negative rates) than other variable selection approaches. In addition, it can be flexibly coupled with the solution paths of Lasso, adaptive Lasso, ridge regression, and other penalized estimators.
This package includes two main functions, SPSP and SPSP_step, and several functions (fitfun.SP) that obtain the solution paths. The SPSP function lets users specify the penalized likelihood approach that generates the solution paths for the SPSP procedure, and then automatically partitions the entire solution paths. Its key idea is to classify variables as relevant or irrelevant at each tuning parameter and then to select all of the variables that have been classified as relevant at least once. The SPSP_step function applies only the partitioning step and needs the solution paths as its input. In addition, there are several functions that obtain the solution paths; they can be passed as the fitfun.SP argument.
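The key idea described above (classify the variables at each tuning parameter, then take the union of everything classified as relevant) can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper names classify_by_gap and union_relevant are hypothetical, the numbers are made up, and the largest-gap rule used to split the coefficients at each tuning parameter is a simplified stand-in for the partitioning criterion of Liu and Wang (2018).

```r
# Hypothetical helper: split the |coefficients| at one tuning parameter into
# "relevant" and "irrelevant" at the largest gap between sorted magnitudes.
# NOTE: the largest-gap rule is an assumption made for illustration only.
classify_by_gap <- function(beta) {
  a <- abs(beta)
  s <- sort(a)                        # magnitudes in ascending order
  gaps <- diff(c(0, s))               # adjacent distances, starting from zero
  if (all(gaps == 0)) {
    return(rep(FALSE, length(beta)))  # all-zero column: nothing is relevant
  }
  cut <- s[which.max(gaps)]           # smallest magnitude above the largest gap
  a >= cut
}

# Hypothetical helper: a variable is selected if it is classified as relevant
# at least once along the path (rows = variables, columns = tuning parameters).
union_relevant <- function(BETA) {
  relevant <- apply(BETA, 2, classify_by_gap)  # p x K logical matrix
  which(rowSums(relevant) > 0)
}

# Toy solution path: 5 variables, 3 tuning parameters (made-up numbers).
BETA <- cbind(c(0, 0, 0, 0, 0),
              c(2.0, 0, 0, 0.1, 0),
              c(2.5, 1.8, 0, 0.2, 0.1))
selected <- union_relevant(BETA)
print(selected)
```

In this toy path the first two variables are separated from the rest by a large gap at some tuning parameter, so they are the ones kept by the union step.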
Xiaorui (Jeremy) Zhu, [email protected],
Yang Liu, [email protected],
Peng Wang, [email protected]
Liu, Y., & Wang, P. (2018). Selection by partitioning the solution paths. Electronic Journal of Statistics, 12(1), 1988-2017. <doi:10.1214/18-EJS1434>
These functions can be passed as the fitfun.SP argument to obtain the solution paths for the SPSP algorithm. The user can also supply a custom function to generate the solution paths, as long as the custom function takes the arguments x, y, family, standardize, and intercept, and returns an object of class glmnet, ncvreg, or lars.
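As a sketch of such a custom function, the following wraps glmnet::glmnet() with an elastic-net penalty. The name enet.glmnet and the choice alpha = 0.5 are illustrative assumptions and are not part of the package; the function only follows the argument contract stated above and returns an object of class "glmnet".

```r
# Hypothetical custom fitting function for the fitfun.SP argument.
# It takes x, y, family, standardize, intercept and returns a "glmnet" object.
enet.glmnet <- function(x, y, family, standardize, intercept, ...) {
  glmnet::glmnet(x = x, y = y, family = family,
                 alpha = 0.5,               # elastic-net mixing (assumed value)
                 standardize = standardize,
                 intercept = intercept, ...)
}

# Possible use (requires the SPSP and glmnet packages to be installed):
# fit <- SPSP::SPSP(x = x, y = y, family = "gaussian", fitfun.SP = enet.glmnet)
```

Defining the wrapper does not itself call glmnet; the fit is only computed when the function is invoked with data.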
lasso.glmnet(x, y, family, standardize, intercept, ...)
lassoCV.glmnet(x, y, family, standardize, intercept, ...)
adalasso.glmnet(x, y, family, standardize, intercept, ...)
adalassoCV.glmnet(x, y, family, standardize, intercept, ...)
adalassoCVmin.glmnet(x, y, family, standardize, intercept, ...)
ridge.glmnet(x, y, family, standardize, intercept, ...)
lasso.lars(x, y, family, standardize, intercept, ...)
SCAD.ncvreg(x, y, family, standardize, intercept, ...)
MCP.ncvreg(x, y, family, standardize, intercept, ...)
x |
a matrix of the independent variables. The dimensions are (nobs) and (nvars); each row is an observation vector. |
y |
Response variable. Quantitative for |
family |
Response type. Either a character string representing one of the built-in families, or else a glm() family object. |
standardize |
logical. Should standardization be conducted before estimation? Default is TRUE. |
intercept |
logical. If x is a data.frame, this argument determines if the resulting model matrix should contain a separate intercept or not. Default is TRUE. |
... |
Additional optional arguments. |
An object of class "glmnet" is returned to provide solution paths for the SPSP algorithm.
An object of class "glmnet" is returned to provide solution paths for the SPSP algorithm.
An object of class "glmnet" is returned to provide solution paths for the SPSP algorithm.
An object of class "glmnet", using lambda.1se from the 10-fold cross-validation, is returned to provide solution paths for the SPSP algorithm.
An object of class "glmnet", using lambda.min from the 10-fold cross-validation, is returned to provide solution paths for the SPSP algorithm.
An object of class "glmnet", using ridge regression, is returned to provide solution paths for the SPSP algorithm.
An object of class "lars" is returned to provide solution paths for the SPSP algorithm.
An object of class "ncvreg" is returned to provide SCAD penalty solution paths for the SPSP algorithm.
An object of class "ncvreg" is returned to provide MCP penalty solution paths for the SPSP algorithm.
A dataset with 200 observations and 500 covariates, generated from a linear regression model in which only the first three coefficients are nonzero, equal to 3, 2, and 1.5 respectively. The covariates are correlated with an AR structure (rho = 0.3). The error term is normally distributed with mean zero and standard deviation 0.5.
data(HighDim)
# HighDim dataset is generated from the following process:
n <- 200; p <- 500; sigma <- 0.5
beta <- rep(0, p); nonzero <- c(1, 2, 3); zero <- setdiff(1:p, nonzero)
beta[nonzero] <- c(3, 2, 1.5)
Sigma <- 0.3^(abs(outer(1:p,1:p,"-")))
library(MASS)
X <- mvrnorm(n, rep(0,p), Sigma)
error <- rnorm(n, 0, sigma)
X <- apply(X, 2, scale) * sqrt(n)/sqrt(n-1)
error <- error - mean(error)
Y <- X %*% beta + error
HighDim <- data.frame(Y, X)
head(HighDim)
A user-friendly function to conduct the selection by Partitioning the Solution Paths (the SPSP algorithm). The user only needs to specify the matrix of independent variables, the response, the family, and fitfun.SP.
SPSP(
  x,
  y,
  family = c("gaussian", "binomial"),
  fitfun.SP = adalasso.glmnet,
  args.fitfun.SP = list(),
  standardize = TRUE,
  intercept = TRUE,
  ...
)
x |
A matrix with all independent variables, of dimension n by p; each row is an observation vector with p variables. |
y |
Response variable. Quantitative for |
family |
Response type. Either a character string representing one of the built-in families, or else a glm() family object. |
fitfun.SP |
A function to obtain the solution paths for the SPSP algorithm. This function takes the arguments
x, y, family as above, and additionally the standardize and intercept and others in
|
args.fitfun.SP |
A named list containing additional arguments that are passed to the fitting function;
see also argument |
standardize |
logical. Should standardization be conducted before estimation? Default is TRUE. |
intercept |
logical. If x is a data.frame, this argument determines if the resulting model matrix should contain a separate intercept or not. Default is TRUE. |
... |
Additional optional arguments. |
An object of class "SPSP" is a list containing at least the following components:
beta_SPSP |
the estimated coefficients of SPSP selected model; |
S0 |
the estimated relevant sets; |
nonzero |
the selected covariates; |
zero |
the covariates that are not selected; |
thres |
the boundaries for abs(beta); |
R |
the sorted adjacent distances; |
intercept |
the estimated intercept when |
This object has attributes containing:
mod.fit |
the fitted penalized regression within the input function |
family |
the family of fitted object; |
fitfun.SP |
the function to obtain the solution paths for the SPSP algorithm; |
args.fitfun.SP |
a named list containing additional arguments for the function |
data(HighDim)
library(glmnet)
# Use the high dimensional dataset (data(HighDim)) to test SPSP+Lasso and SPSP+AdaLasso:
x <- as.matrix(HighDim[,-1])
y <- HighDim[,1]
spsp_lasso_1 <- SPSP::SPSP(x = x, y = y, family = "gaussian", fitfun.SP = lasso.glmnet,
                           init = 1, standardize = FALSE, intercept = FALSE)
head(spsp_lasso_1$nonzero)
head(spsp_lasso_1$beta_SPSP)
spsp_adalasso_5 <- SPSP::SPSP(x = x, y = y, family = "gaussian", fitfun.SP = adalasso.glmnet,
                              init = 5, standardize = TRUE, intercept = FALSE)
head(spsp_adalasso_5$nonzero)
head(spsp_adalasso_5$beta_SPSP)
A function to select the relevant predictors by partitioning the solution paths (the SPSP algorithm) based on the user-provided solution paths BETA.
SPSP_step(
  x,
  y,
  family = c("gaussian", "binomial"),
  BETA,
  standardize = TRUE,
  intercept = TRUE,
  init = 1,
  R = NULL,
  ...
)
x |
independent variables as a matrix, of dimension nobs x nvars; each row is an observation vector. |
y |
response variable. Quantitative for |
family |
either a character string representing one of the built-in families, or else a glm() family object. |
BETA |
the solution paths obtained from a prespecified fitting step |
standardize |
logical. Whether standardization is needed. |
intercept |
logical. If x is a data.frame, this argument determines if the resulting model matrix should contain a separate intercept or not. |
init |
initial coefficients, starting from init-th estimator of the solution paths. The default is 1. |
R |
sorted adjacent distances. The default is NULL, in which case they are calculated internally. |
... |
Additional optional arguments. |
A list containing at least the following components:
beta_SPSP |
the estimated coefficients of SPSP selected model; |
S0 |
the estimated relevant sets; |
nonzero |
the selected covariates; |
zero |
the covariates that are not selected; |
thres |
the boundaries for abs(beta); |
R |
the sorted adjacent distances; |
intercept |
the estimated intercept when |
This object has attributes containing:
mod.fit |
the fitted penalized regression within the input function |
family |
the family of fitted object; |
fitfun.SP |
the function to obtain the solution paths for the SPSP algorithm; |
args.fitfun.SP |
a named list containing additional arguments for the function |
data(HighDim)
library(glmnet)
x <- as.matrix(HighDim[,-1])
y <- HighDim[,1]
lasso_fit <- glmnet(x = x, y = y, alpha = 1, intercept = FALSE)
# SPSP+Lasso method
K <- dim(lasso_fit$beta)[2]
LBETA <- as.matrix(lasso_fit$beta)
spsp_lasso_1 <- SPSP_step(x = x, y = y, BETA = LBETA,
                          init = 1, standardize = FALSE, intercept = FALSE)
head(spsp_lasso_1$nonzero)
head(spsp_lasso_1$beta_SPSP)
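Because the HighDim data are simulated with nonzero coefficients only at positions 1, 2, and 3, a selection result can be compared against the truth. The helper below is a hypothetical sketch in base R and is not part of the package; it assumes the selected covariates are given as integer column indices, which may differ from how the nonzero component is actually stored.

```r
# Hypothetical helper: false positive and false negative rates of a selection,
# given the selected indices, the truly nonzero indices, and the number of
# candidate covariates p.
selection_errors <- function(selected, truth, p) {
  false_pos <- setdiff(selected, truth)   # selected but truly zero
  false_neg <- setdiff(truth, selected)   # truly nonzero but missed
  c(FPR = length(false_pos) / (p - length(truth)),
    FNR = length(false_neg) / length(truth))
}

# Toy check with made-up selections (p = 500 as in HighDim).
truth <- c(1, 2, 3)
err_exact <- selection_errors(c(1, 2, 3), truth, p = 500)
err_extra <- selection_errors(c(1, 2, 3, 10), truth, p = 500)
print(err_exact)
print(err_extra)
```

With a fitted object, the first argument would be replaced by the selected indices, for example those reported in spsp_lasso_1$nonzero if they are stored as positions.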