Published by De Gruyter, September 20, 2019

Exploration of Heterogeneous Treatment Effects via Concave Fusion

  • Shujie Ma, Jian Huang, Zhiwei Zhang and Mingming Liu

Abstract

Understanding treatment heterogeneity is essential to the development of precision medicine, which seeks to tailor medical treatments to subgroups of patients with similar characteristics. One of the challenges of achieving this goal is that we usually do not have a priori knowledge of the grouping information of patients with respect to treatment effect. To address this problem, we consider a heterogeneous regression model which allows the coefficients for treatment variables to be subject-dependent with unknown grouping information. We develop a concave fusion penalized method for estimating the grouping structure and the subgroup-specific treatment effects, and derive an alternating direction method of multipliers algorithm for its implementation. We also study the theoretical properties of the proposed method and show that under suitable conditions there exists a local minimizer that equals the oracle least squares estimator based on a priori knowledge of the true grouping information with high probability. This provides theoretical support for making statistical inference about the subgroup-specific treatment effects using the proposed method. The proposed method is evaluated in simulation studies and illustrated with real data from an AIDS Clinical Trials Group study.

1 Introduction

Treatment effects are often heterogeneous, that is, the same treatment can have different effects on different patients [1, 2]. For instance, molecularly targeted cancer drugs are only effective for patients with tumors expressing targets [3], and the relative efficacy of antiretroviral drugs for treating human immunodeficiency virus infection sometimes depends on baseline viral load and CD4 count [4]. Understanding the heterogeneity of treatment effects (HTE) is critical to the eventual success of precision medicine, which seeks to tailor medical treatments to individual patients.

Possible HTE is usually assessed in a subgroup analysis [1, 5, 6, 7, 8] or, more generally, a regression analysis relating the outcome of interest to treatment and a collection of baseline covariates [9]. Such a regression model can incorporate HTE as interactions between treatment and baseline covariates, and can be used to estimate covariate-specific treatment effects indirectly. Alternatively, a covariate-specific treatment effect model can be specified and estimated directly without relying on a regression model for the outcome [10, 11]. There is also a growing literature on HTE estimation using machine learning methods (e. g. [12, 13, 14]). All of these methods are based on observed baseline covariates; they do not address possible HTE due to unmeasured covariates.

The collection of observed baseline covariates is often limited, and thus may be insufficient for characterizing the true HTE across individual patients. The true HTE structure is not empirically identifiable unless all effect modifiers are measured, but can be explored under appropriate assumptions. Zhang et al. [4, 15] use random effect models to conduct sensitivity analyses concerning the joint distribution of two potential outcomes (for two different treatments). Shen and He [16] propose a two-group logistic-normal mixture model for the true HTE and develop a model-based procedure for testing the equivalence of the two groups. The model of Shen and He [16] includes a normality assumption for the outcome and a logistic regression model relating the latent group membership to a collection of observed covariates.

In this article, we propose a general latent class model for exploring the true HTE. Different from the aforementioned existing models, our model allows an unspecified number of latent groups, without assuming normality for the outcome or a particular relationship between the latent group membership and the observed covariates. In our model, we assume that the treatment coefficients are subject-specific and belong to different groups with unknown grouping information. The subgroups, therefore, can be driven by observed covariates, unobserved covariates, or an arbitrary combination of the two. We recover the latent subgroups and estimate the model using concave fusion penalization, an unsupervised machine learning method that applies a concave penalty, such as the smoothly clipped absolute deviation penalty [SCAD, [17]] or the minimax concave penalty [MCP, [18]], to pairwise differences of the coefficients for treatment effects. The fusion penalized approach has been proposed for clustering data objects [19] and for grouping means of clinical outcomes with random errors [20], but not for exploring HTE. Ma and Huang [20] briefly mentioned potential applications of the fusion penalized approach to the estimation of subject-specific coefficient models, without providing theoretical justification or numerical studies.

We consider the present paper to be the first work that applies the fusion learning approach to the investigation of HTE through subject-specific treatment coefficients and provides a theoretical analysis of the resulting estimator for the proposed general latent class model with subject-specific treatment effects. The consistency and asymptotic distributional properties of the estimators follow from the fact that, under appropriate conditions, the oracle least squares estimator based on a priori knowledge of the true group membership is a local minimizer of the objective function with high probability. Moreover, we derive conditions on the number of subgroups and the number of covariates relative to the sample size, as well as a lower bound on the minimum difference in coefficient values between subgroups, under which the true subgroups of treatment effects can be identified. Computationally, we apply an alternating direction method of multipliers (ADMM) algorithm [21], which has been used widely in convex optimization problems, to implement the proposed approach, and we derive the convergence properties of the ADMM in the present setting. Another contribution is that, using the proposed fusion penalized estimator, we further propose a bootstrap procedure to test homogeneity of the treatment effects for confirmatory analysis.

The rest of this article is organized as follows. Section 2 introduces the proposed latent class model. Section 3 describes the concave fusion penalization method for model estimation. In Section 4 we establish the theoretical properties of the proposed estimator. Section 5 introduces a bootstrapping procedure for testing homogeneity. In Section 6 we evaluate the finite-sample properties of the proposed method via simulation studies, and apply the method to AIDS Clinical Trials Group Study 175. Concluding remarks are given in Section 7. The computational algorithm and the technical proofs are provided in the Appendix.

2 Model

Suppose the data consist of (yi, zi, xi), i = 1, …, n, where yi is an outcome variable, zi a q-vector of patient characteristics related to the outcome, and xi a p-vector of treatment or exposure characteristics. The vector xi may include discrete components (e.g. indicators of treatment groups or exposure status) and/or continuous components (e.g. dose of drug or radiation). Our goal is to understand how the causal effect of xi may vary across individual subjects. Let yi(x) denote the potential outcome for the ith subject under treatment x [22]. We assume that treatment assignment is strongly ignorable [23] in the sense that

(1) xi ⊥ yi(x) | zi,

where ⊥ denotes independence and x may be any treatment that could conceivably be given to the ith subject. From this assumption it follows that

E{yi(x)|zi}=E{yi(x)|zi,xi=x}=E{yi|zi,xi=x},

which is empirically identified for any possible treatment x for the ith subject. Assumption (1) is trivially true in a randomized clinical trial where xi is independent of all baseline variables; in this case, adjustment for zi is not strictly necessary but may improve efficiency.

Suppose the data generation mechanism can be described by the following linear regression model:

(2) yi = ziTη + xiTβi + εi,  i = 1, …, n,

where η and βi are unknown regression coefficients and the εi's are i.i.d. random errors with E(εi) = 0 and Var(εi) = σ². We assume that the first entry in each xi is 1 so the intercept is included in βi. Under this model and the strongly ignorable treatment assignment assumption, the subject-specific regression coefficient βi represents the causal effect of xi on the ith subject, and the distribution of βi across i represents the true HTE. With more unknown parameters than observations, model (2) is not estimable without additional assumptions.

A simple way to constrain model (2) and achieve identification is to assume that the βi are all equal. This leads to the following model:

(3) yi = ziTη + xiTβ + εi,  i = 1, …, n,

which is commonly used in practice. Model (3) is quite restrictive as it rules out any HTE with respect to observed and unobserved covariates. The model can be expanded by including interaction terms between some or all components of zi and xi, thereby allowing βi to be a specified function of zi. The expanded model can accommodate a particular form of observed HTE, but it does not account for any latent HTE. Moreover, even if the HTE is only driven by the observed covariates zi, it is unclear in what specific form zi causes the HTE. For instance, the common assumption that βi = β + Γti, where ti is a subvector of zi and β and Γ are unknown coefficients, implies a linear interaction structure that may be too stringent in practice. When many baseline variables are present, there can be many interaction terms. This increases the chance of false positive results in interaction tests and may lead to overstated and misleading conclusions [24].

Another approach is to treat βi as a subject-specific random vector following a specified conditional distribution given zi. A prominent example in this category is the common linear mixed model where βi is assumed to follow a normal distribution, independently of zi. Such a model may have identifiability issues in the present setting, where each subject contributes only one observation. Another example in this category, which achieves identifiability, is the two-group logistic-normal mixture model of Shen and He [16]. In the present notation, their model can be expressed as

(4) yi = xiTβi + εi,  where βi = α1 + α2wi and wi = I(ziTγ + ϵ > 0),

where ϵ follows the standard logistic distribution, so that P(wi = 1|zi) = exp(ziTγ)/{1 + exp(ziTγ)}. This model assumes that the βi in (2) arise from a finite mixture with two possible values (α1 and α1 + α2), that εi is normally distributed, and that the latent group membership is related to the observed covariates through a logistic regression model. Under these assumptions, model (4) can be estimated using a standard EM algorithm, and Shen and He [16] further develop an EM test for the null hypothesis α2 = 0, which indicates the absence of (observed or latent) HTE. This approach, as a model-based way to estimate the true HTE, is limited by the two-group assumption and the other modeling assumptions.

Here we consider a more flexible latent class model and propose a machine learning approach to recovering the subgroups. We assume that the βi in (2) arise from a mixture model with an unspecified number of groups. Specifically, let G = (G1, …, GK) be a mutually exclusive partition of {1, …, n}. Suppose βi = αk for all i ∈ Gk, where αk is the common value for the βi's in group Gk. In other words, we assume that

βi = α1wi1 + α2wi2 + ⋯ + αKwiK,

where αk = (αk1, …, αkp)T, and wik ∈ {0, 1} is the (latent) indicator for the kth group Gk, i.e. wik = 1 for i ∈ Gk and wik = 0 otherwise. Substituting this into model (2) yields the latent class model:

(5) yi = ziTη + xiT(α1wi1 + α2wi2 + ⋯ + αKwiK) + εi,

where the number of subgroups K is unknown and the group indicators wik are unobservable. Model (5) includes the case of no HTE as a special case (i.e. K = 1 and hence wi1 ≡ 1). Of note, we do not parameterize the distribution of εi, nor do we specify how the wik's may be related to zi. Thus, our model is considerably more flexible than the model of Shen and He [16].

Although the subgroups in model (5) are not (fully) ascertainable using observed covariates, an exploratory analysis based on model (5) can provide unique insights into the true HTE and helpful guidance for future research. For example, if the results indicate that a new treatment only benefits a small (and unidentified) proportion of the patient population, that finding might motivate scientists to collect more covariate data (e. g. biomarkers) in future studies and search for predictors of treatment benefit. Conversely, if there is no indication of clinically important HTE, that information would support a decision to (re)direct limited resources toward other, more promising areas of research.

3 Estimation

To identify the subgroups of the heterogeneous treatment effects, we first need to estimate model (5), i.e. we need to estimate the number of subgroups K, the coefficients η and α, and the group membership wik for each observation. Estimating model (5) is equivalent to estimating model (2), since they are the same model written in different notation. We propose a concave fusion method for model estimation, described below.

For any vector a, denote its L2 norm by ‖a‖ = (Σi |ai|²)1/2. Consider the criterion

(6) Qn(η, β; λ) = (1/2) Σi=1n (yi − ziTη − xiTβi)² + Σ1≤i<j≤n p(‖βi − βj‖, λ),

where β = (β1T, …, βnT)T, and p(·, λ) is a penalty function with a tuning parameter λ ≥ 0. For a given λ > 0, let

(7) (ηˆ(λ), βˆ(λ)) = argmin η∈IRq, β∈IRnp Qn(η, β; λ).

We use sparsity-inducing penalties (to be discussed below) in (6). For a sufficiently large λ, the penalty shrinks some of the differences ‖βi − βj‖ to zero. We then partition the treatment effects into subgroups according to the distinct values of βˆ. Specifically, let λˆ be the value of the tuning parameter on the path selected by a data-driven procedure such as the BIC. For simplicity, write (ηˆ, βˆ) ≡ (ηˆ(λˆ), βˆ(λˆ)). Let {αˆ1, …, αˆKˆ} be the distinct values of βˆ, where Kˆ is the number of distinct values; these are the estimates of the subgroup-specific treatment effects. The sample can then be divided into subgroups accordingly. Denote the set of indices of the kth subgroup by Gˆk = {i : βˆi = αˆk, 1 ≤ i ≤ n}, 1 ≤ k ≤ Kˆ. Accordingly, we have wˆik = 1 if i ∈ Gˆk and wˆik = 0 otherwise.

A popular sparsity-inducing penalty is the L1 or lasso penalty, p(t, λ) = λ|t| [25], but it tends to produce too many subgroups [20]. Thus, we focus on two concave penalty functions: the smoothly clipped absolute deviation penalty [SCAD, [17]] and the minimax concave penalty [MCP, [18]]. The SCAD penalty is

pγ(t, λ) = λ ∫0|t| min{1, (γ − x/λ)+/(γ − 1)} dx.

The MCP has the form

pγ(t, λ) = λ ∫0|t| (1 − x/(γλ))+ dx.

These penalties lead to nearly unbiased estimators of the parameters because their derivatives vanish at large (in magnitude) values of the parameter estimates. Moreover, they are more aggressive than the lasso in enforcing a sparse solution. They are therefore better suited to the current problem, where the number of subgroups is expected to be much smaller than the sample size.
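For concreteness, both penalties have simple closed forms obtained by evaluating the integrals above. The following minimal sketch (ours, not part of the paper) implements them in Python; the default values of γ are common choices, not prescriptions from the text.

```python
import numpy as np

def mcp(t, lam, gamma=3.0):
    """MCP: lam * integral_0^|t| (1 - x/(gamma*lam))_+ dx, in closed form."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2 * gamma),   # concave region
                    0.5 * gamma * lam ** 2)           # flat beyond gamma*lam

def scad(t, lam, gamma=3.7):
    """SCAD: lam * integral_0^|t| min{1, (gamma - x/lam)_+ / (gamma - 1)} dx."""
    t = np.abs(t)
    mid = (2 * gamma * lam * t - t ** 2 - lam ** 2) / (2 * (gamma - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= gamma * lam, mid, 0.5 * lam ** 2 * (gamma + 1)))
```

Both functions are constant for |t| ≥ γλ, which is exactly the near-unbiasedness property discussed above.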

We compute (ηˆ(λ), βˆ(λ)) given in (7) for λ in a given interval [λmin, λmax], where λmax is the value that forces a constant βˆ solution, i.e. βˆj(λmax) = βˆk(λmax) for all 1 ≤ j < k ≤ n; λmin is a small positive number. We are particularly interested in the path {βˆ(λ) : λ ∈ [λmin, λmax]}. The ADMM algorithm for computing the solution path on a grid of λ values is described in detail in Section A.1 of the Appendix. We also derive the convergence property of the ADMM algorithm with concave penalties. The result is given in Proposition A.1 of the Appendix.

Figure 1 illustrates the solution path for the estimates of the treatment coefficients (βˆ21(λ), …, βˆ2n(λ)) against λ using the MCP, SCAD and lasso penalties for simulated data in Example 1 of Section 6, which has two subgroups with "treatment effects" 0 and 2, respectively. The path is calculated using a "bottom up" approach starting from λmin. It looks similar to the dendrogram for agglomerative hierarchical clustering. However, unlike clustering algorithms, which form the clusters based on a direct measure of dissimilarity, the fusion of the coefficients is based on solving the optimization problems along the solution path. We shall refer to the solution path {βˆ(λ), λ ∈ [λmin, λmax]} as a fusiongram.

In Figure 1, the fusiongrams for SCAD and MCP look similar. They both include a segment containing nearly unbiased estimates of the treatment effects. When the λ value reaches around 0.6, the estimates of (β21, …, β2n) merge into two values that are close to the true values 0 and 2. When the λ value exceeds 1.0, the estimates shrink to one value. For the lasso, the estimates of (β21, …, β2n) merge to one value quickly at λ = 0.05 due to the overshrinkage of the L1 penalty.

Figure 1: Solution paths for (βˆ21(λ), …, βˆ2n(λ)) against λ with n = 200 for data from Example 1 in Section 6.

4 Theoretical properties

In this section, we study the theoretical properties of the proposed estimator. Specifically, we provide sufficient conditions under which there exists a local minimizer of the objective function equal to the oracle least squares estimator with a priori knowledge of the true groups with high probability. We also derive the lower bound of the minimum difference of coefficients between subgroups in order to be able to estimate the subgroup-specific treatment effects.

4.1 Notation and conditions

Let W˜ = {wik} be the n × K matrix with wik = 1 for i ∈ Gk and wik = 0 otherwise, and let W = W˜ ⊗ Ip. Let MG = {β ∈ IRnp : βi = βj for i, j ∈ Gk, 1 ≤ k ≤ K}. Every β ∈ MG can be written as β = Wα, where α = (α1T, …, αKT)T and αk is the p × 1 vector of kth subgroup-specific parameters, k = 1, …, K. A simple calculation shows

WTW = diag(|G1|, …, |GK|) ⊗ Ip,

where |Gk| denotes the number of elements in Gk. Denote the minimum and maximum group sizes by |Gmin| = min1≤k≤K |Gk| and |Gmax| = max1≤k≤K |Gk|, respectively. For any positive sequences an and bn, let an ≫ bn denote an−1bn = o(1). For any vector ζ = (ζ1, …, ζs)T ∈ IRs, let ‖ζ‖∞ = max1≤l≤s |ζl|. For any symmetric matrix As×s, denote its L2 norm by ‖A‖ = maxζ∈IRs, ‖ζ‖=1 ‖Aζ‖, and let λmin(A) and λmax(A) be the smallest and largest eigenvalues of A, respectively. For any matrix A = (Aij)i=1,…,s; j=1,…,t, denote ‖A‖∞ = max1≤i≤s Σj=1t |Aij|. Let y = (y1, …, yn)T, Z = (z1, …, zn)T, and X = diag(x1T, …, xnT). Denote X˜ = XW and U = (Z, X˜). Finally, denote the scaled penalty function by

ρ(t) = λ−1 pγ(t, λ).

We make the following basic assumptions.

  (C1) The function ρ(t) is symmetric, non-decreasing and concave on [0, ∞). It is constant for t ≥ aλ for some constant a > 0, and ρ(0) = 0. In addition, ρ′(t) exists and is continuous except for a finite number of values of t, and ρ′(0+) = 1.

  (C2) The noise vector ε = (ε1, …, εn)T has sub-Gaussian tails: P(|aTε| > ‖a‖x) ≤ 2 exp(−c1x²) for any vector a ∈ IRn and x > 0, where 0 < c1 < ∞.

  (C3) Assume Σi=1n zil² = n for 1 ≤ l ≤ q, Σi=1n xij² 1{i ∈ Gk} = |Gk| for 1 ≤ j ≤ p and 1 ≤ k ≤ K, λmin(UTU) ≥ C1|Gmin|, supi ‖xi‖ ≤ C2√p and supi ‖zi‖ ≤ C3√q for some constants 0 < C1 < ∞, 0 < C2 < ∞ and 0 < C3 < ∞.

Conditions (C1) and (C2) are common assumptions in penalized regression in high-dimensional settings. Concave penalties such as the MCP and SCAD satisfy (C1). In the literature, it is commonly assumed that the smallest eigenvalue of the Gram matrix of the design is bounded below by C1n, which may not hold for UTU. By some calculation and X˜ = XW, we have

X˜TX˜ = diag(Σi∈Gk xixiT, k = 1, …, K).

By assuming that λmin(Σi∈Gk xixiT) ≥ c|Gk| for some constant 0 < c < ∞, we have λmin(X˜TX˜) ≥ c|Gmin|. If ZTX˜ = 0 and λmin(ZTZ) ≥ Cn, we have

λmin(UTU) = min{λmin(ZTZ), λmin(X˜TX˜)} ≥ min(c|Gmin|, Cn),

and |Gmin| ≤ n/K. Therefore, we let the smallest eigenvalue in Condition (C3) be bounded below by C1|Gmin|.

4.2 Heterogeneous model

In this section, we study the theoretical properties of the proposed estimator under the heterogeneous model, in which there are at least two subgroups, that is, K ≥ 2. If the underlying groups G1, …, GK were known, the oracle estimator of (η, β) would be

(8) (ηˆor, βˆor) = argmin η∈IRq, β∈MG (1/2)‖y − Zη − Xβ‖².

Since β = Wα for β ∈ MG, the oracle estimators of the common coefficients α and of η are

(ηˆor, αˆor) = argmin η∈IRq, α∈IRKp (1/2)‖y − Zη − X˜α‖² = (UTU)−1UTy.

Let αk0 be the true common coefficient vector for group Gk, k = 1, …, K, and let α0 = ((α10)T, …, (αK0)T)T. Of course, the oracle estimators are not genuine estimators; they are theoretical constructs useful for stating the properties of the proposed estimators.

Theorem 1

Suppose |Gmin| ≫ √((q + Kp)n log n). Then under Conditions (C1)-(C3), we have with probability at least 1 − 2(Kp + q)n−1,

(9) ‖((ηˆor − η0)T, (αˆor − α0)T)T‖ ≤ ϕn,

and

‖βˆor − β0‖ ≤ √|Gmax| ϕn,  supi ‖βˆior − βi0‖ ≤ ϕn,

where

(10) ϕn = c1−1/2 C1−1 √(q + Kp) |Gmin|−1 √(n log n).

Moreover, for any vector an ∈ IRq+Kp with ‖an‖ = 1, we have, as n → ∞,

(11) σn(an)−1 anT((ηˆor − η0)T, (αˆor − α0)T)T →D N(0, 1),

where

(12) σn(an) = σ{anT(UTU)−1an}1/2.

Remark 1

Since |Gmin| ≤ n/K, the condition |Gmin| ≫ √((q + Kp)n log n) requires that q, K and p satisfy K√(q + Kp) = o{√(n(log n)−1)}.

Remark 2

By letting |Gmin| = δn/K for some constant 0 < δ ≤ 1, the bound in (9) is ϕn = c1−1/2 C1−1 δ−1 K√(q + Kp) √(log n / n). Moreover, if q, K and p are fixed quantities, then ϕn = C√(log n / n) for some constant 0 < C < ∞.

Let

bn = min i∈Gk, j∈Gk′, k≠k′ ‖βi0 − βj0‖ = min k≠k′ ‖αk0 − αk′0‖

be the minimal difference of the common values between two groups.

Theorem 2

Suppose the conditions in Theorem 1 hold. If bn > aλ and λ ≫ ϕn, for some constant a > 0, where ϕn is given in (10), then there exists a local minimizer (ηˆ(λ), βˆ(λ)) of the objective function Qn(η, β; λ) given in (6) satisfying

P{(ηˆ(λ), βˆ(λ)) = (ηˆor, βˆor)} → 1.

Remark 3

Theorem 2 shows that the oracle estimator (ηˆor, βˆor) is a local minimizer of the objective function with high probability, so the true groups can be recovered, with the estimated common value for group k given by αˆk(λ) = βˆior for i ∈ Gk. This result holds provided bn ≫ ϕn. As discussed in Remark 2, when K, p and q are finite and fixed and |Gmin| = δn/K for some constant 0 < δ ≤ 1, this requires bn ≫ C√(log n / n) for some constant 0 < C < ∞.

Let αˆ(λ) = (αˆ1(λ)T, …, αˆK(λ)T)T be the estimated treatment effects such that αˆk(λ) = βˆi(λ) for i ∈ Gk, k = 1, …, K, where βˆ(λ) = (βˆ1(λ)T, …, βˆn(λ)T)T is the local minimizer given in Theorem 2. Based on Theorems 1 and 2, we obtain the asymptotic distribution of (ηˆ(λ)T, αˆ(λ)T)T, given in the following corollary.

Corollary 1

Under the conditions in Theorem 2, we have for any an ∈ IRq+Kp with ‖an‖ = 1, as n → ∞,

σn(an)−1 anT((ηˆ(λ) − η0)T, (αˆ(λ) − α0)T)T →D N(0, 1),

with σn(an) given in (12). As a result, we have for any vectors an1 ∈ IRq with ‖an1‖ = 1 and an2 ∈ IRKp with ‖an2‖ = 1, as n → ∞,

σn1(an1)−1 an1T(ηˆ(λ) − η0) →D N(0, 1)  and  σn2(an2)−1 an2T(αˆ(λ) − α0) →D N(0, 1),

where

σn1(an1) = σ{an1T[ZTZ − ZTX˜(X˜TX˜)−1X˜TZ]−1an1}1/2,  σn2(an2) = σ{an2T[X˜TX˜ − X˜TZ(ZTZ)−1ZTX˜]−1an2}1/2.

Remark 4

From the oracle property in Theorem 2, we have P(Kˆ = K) → 1, where Kˆ is the estimated number of subgroups. Moreover, since βˆor = Wαˆor and βˆ(λ) = Wˆαˆ(λ), we have P(W˜ˆ = W˜) → 1, where W˜ˆ = {wˆik} with wˆik = 1 for i ∈ Gˆk and wˆik = 0 otherwise. Hence, the subgroup memberships can be recovered with high probability.

Remark 5

The asymptotic distribution of the penalized estimators provides a theoretical justification for further conducting statistical inference about heterogeneity. By the results in Corollary 1, for given an1 ∈ IRq and an2 ∈ IRKp, 100(1 − α)% confidence intervals for an1Tη0 and an2Tα0 are given by

an1Tηˆ(λ) ± zα/2 σˆn1(an1)  and  an2Tαˆ(λ) ± zα/2 σˆn2(an2),

respectively, where zα/2 is the 100(1 − α/2) percentile of the standard normal distribution, and σˆn1(an1) and σˆn2(an2) are estimates of σn1(an1) and σn2(an2) with σ² estimated by

σˆ² = (n − q − Kˆp)−1 Σi=1n (yi − ziTηˆ − xiTβˆi)².
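As an illustration of Remark 5, the following hypothetical snippet (all names ours) forms a Wald-type confidence interval for a linear contrast of the estimated coefficients once the fitted groups are in hand; it assumes the user supplies the residuals, the design matrix U = (Z, XWˆ), and the stacked estimate (ηˆT, αˆT)T.

```python
import numpy as np
from scipy import stats

def wald_ci(resid, U, theta_hat, a, K_hat, p, q, alpha=0.05):
    """Wald CI for a^T theta0, theta = (eta^T, alpha^T)^T, per Remark 5.

    resid: residuals y_i - z_i^T eta-hat - x_i^T beta_i-hat (length n)
    U: n x (q + K_hat*p) design (Z, X W-hat); a: contrast vector with ||a|| = 1
    """
    n = len(resid)
    sigma2_hat = resid @ resid / (n - q - K_hat * p)   # variance estimate of Remark 5
    se = np.sqrt(sigma2_hat * a @ np.linalg.inv(U.T @ U) @ a)
    z = stats.norm.ppf(1 - alpha / 2)
    est = a @ theta_hat
    return est - z * se, est + z * se
```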

4.3 Homogeneous model

When the true model is the homogeneous model yi = ziTη + xiTα + εi, i = 1, …, n, we have β1 = ⋯ = βn = α and K = 1. The penalized estimator (ηˆ(λ), βˆ(λ)) of (η, β), where β = (β1T, …, βnT)T, also has the oracle property, given as follows. We define the oracle estimator for (η, α) as

(ηˆor, αˆor) = argmin η∈IRq, α∈IRp (1/2)‖y − Zη − xα‖² = (UTU)−1UTy,

where x = (x1, …, xn)T and U = (Z, x). Let βˆor = ((βˆ1or)T, …, (βˆnor)T)T, where βˆior = αˆor for all i. Let η0 and α0 be the true coefficient vectors. We introduce the following condition.

  (C3*) Assume Σi=1n zil² = n for 1 ≤ l ≤ q, Σi=1n xij² = n for 1 ≤ j ≤ p, λmin(UTU) ≥ C1n, supi ‖xi‖ ≤ C2√p and supi ‖zi‖ ≤ C3√q for some constants 0 < C1 < ∞, 0 < C2 < ∞ and 0 < C3 < ∞.

Theorem 3

Suppose Conditions (C1), (C2) and (C3*) hold. If p = o(n(log n)−1) and q = o(n(log n)−1), the oracle estimator has the property that, with probability at least 1 − 2(p + q)n−1,

(13) ‖((ηˆor − η0)T, (αˆor − α0)T)T‖ ≤ ϕn,  supi ‖βˆior − βi0‖ ≤ ϕn,

where

ϕn = c1−1/2 C1−1 √((q + p) n−1 log n),

in which c1 and C1 are given in Conditions (C2) and (C3*), respectively. Moreover, for any vector an ∈ IRq+p with ‖an‖ = 1, as n → ∞,

(14) σn(an)−1 anT((ηˆor − η0)T, (αˆor − α0)T)T →D N(0, 1),

where

σn(an) = σ{anT(UTU)−1an}1/2.

Moreover, if λ ≫ ϕn, then there exists a local minimizer (ηˆ(λ), βˆ(λ)) of the objective function Qn(η, β; λ) given in (6) satisfying

(15) P{(ηˆ(λ), βˆ(λ)) = (ηˆor, βˆor)} → 1.

Remark 6

By Theorem 3, the local minimizer satisfies βˆi(λ) = αˆ(λ) = αˆor for all i with probability approaching 1. Then, for any vectors an1 ∈ IRq with ‖an1‖ = 1 and an2 ∈ IRp with ‖an2‖ = 1, as n → ∞,

σn1(an1)−1 an1T(ηˆ(λ) − η0) →D N(0, 1)  and  σn2(an2)−1 an2T(αˆ(λ) − α0) →D N(0, 1),

where

σn1(an1) = σ{an1T[ZTZ − ZTx(xTx)−1xTZ]−1an1}1/2,  σn2(an2) = σ{an2T[xTx − xTZ(ZTZ)−1ZTx]−1an2}1/2.

5 Testing of a homogeneous model

Next, we propose a residual bootstrap procedure to test homogeneity of the treatment effects, i.e. to test whether the model is the homogeneous model yi = ziTη + xiTα + εi, i = 1, …, n. We treat this model as the reduced model; the full model is given in (2). We estimate the reduced model by ordinary least squares (OLS) and denote the resulting estimators by ηˆR and αˆR. We then estimate the full model by our proposed method and denote the resulting estimators by ηˆF and βˆiF. Let the fitted values be μˆiR = ziTηˆR + xiTαˆR and μˆiF = ziTηˆF + xiTβˆiF for the reduced and full models, respectively. Borrowing the idea from [26], we use the integrated squared deviation between μˆiR and μˆiF as the test statistic, Tn = Σi=1n (μˆiF − μˆiR)²/n. Let the residuals from the full model be ϵˆi = yi − μˆiF for i = 1, …, n. We draw randomly resampled residuals ϵˆi* and create synthetic response variables yi* = μˆiR + ϵˆi*. Using (zi, xi, yi*) as bootstrap observations, we obtain the fitted values μˆiR* and μˆiF* by refitting the reduced and full models, and then compute the bootstrap version of the test statistic, Tn* = Σi=1n (μˆiF* − μˆiR*)²/n. Using Monte Carlo simulation to approximate the conditional distribution L(Tn*) = L(Tn* | (zi, xi)i=1n), we obtain the (1 − α)th quantile tˆα and reject the hypothesis of homogeneity at significance level α if Tn > tˆα. Alternatively, we can obtain the p-value pv by finding the (1 − pv)th quantile tˆpv that satisfies tˆpv = Tn.
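A minimal sketch of this procedure is given below (ours, not from the paper); `fit_reduced` and `fit_full` are user-supplied placeholders returning fitted values from OLS and from the concave fusion method, respectively.

```python
import numpy as np

def homogeneity_test(y, Z, X, fit_reduced, fit_full, B=500, seed=None):
    """Residual bootstrap test of Section 5; returns (T_n, Monte Carlo p-value)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mu_R = fit_reduced(y, Z, X)                 # fitted values, reduced (OLS) model
    mu_F = fit_full(y, Z, X)                    # fitted values, full (fusion) model
    T_n = np.mean((mu_F - mu_R) ** 2)           # integrated squared deviation
    resid = y - mu_F                            # residuals from the full model
    T_star = np.empty(B)
    for b in range(B):
        y_star = mu_R + rng.choice(resid, size=n, replace=True)   # synthetic responses
        T_star[b] = np.mean((fit_full(y_star, Z, X) - fit_reduced(y_star, Z, X)) ** 2)
    return T_n, np.mean(T_star >= T_n)
```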

6 Numerical studies

6.1 Simulation studies

We use the modified Bayesian information criterion (BIC) [27] for high-dimensional data settings to select the tuning parameter λ, minimizing

(16) BIC(λ) = log{Σi=1n (yi − ziTηˆ(λ) − xiTβˆi(λ))²/n} + Cn (log n / n) (Kˆ(λ)p + q),

where Cn is a positive number that can depend on n. When Cn = 1, the modified BIC reduces to the traditional BIC [28]. Following [29], we use Cn = log(np + q).
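A direct transcription of (16) (variable names ours) is shown below; in practice one evaluates it over the λ grid and keeps the minimizer.

```python
import numpy as np

def modified_bic(y, Z, X, eta_hat, beta_hat, K_hat):
    """Modified BIC (16) with C_n = log(np + q); X is n x p with rows x_i^T."""
    n, p = X.shape
    q = Z.shape[1]
    resid = y - Z @ eta_hat - np.sum(X * beta_hat, axis=1)  # y_i - z_i^T eta - x_i^T beta_i
    Cn = np.log(n * p + q)
    return np.log(resid @ resid / n) + Cn * (np.log(n) / n) * (K_hat * p + q)
```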

One important criterion for evaluating clustering methods is their ability to recover the true underlying cluster structure. We therefore use the Rand Index [30] to evaluate the accuracy of the clustering results. The Rand Index can be viewed as the percentage of correct decisions made by an algorithm and is computed by the formula:

(17) RI = (TP + TN) / (TP + FP + FN + TN),

where a true positive (TP) decision assigns two observations from the same ground truth group to the same cluster, a true negative (TN) decision assigns two observations from different groups to different clusters, a false positive (FP) decision assigns two observations from different groups to the same cluster, and a false negative (FN) decision assigns two observations from the same group to different clusters. The Rand Index lies between 0 and 1. Higher values of the Rand Index indicate better performance of the algorithm.
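Equation (17) can be computed directly from two label vectors, as in this short sketch (ours):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand Index (17): fraction of pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum((labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
                for i, j in pairs)   # TP + TN in the numerator of (17)
    return agree / len(pairs)
```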

Example 1

(Two subgroups). We simulate data from the heterogeneous model with two groups:

(18) yi = ziTη + xiTβi + εi,  i = 1, …, n,  with βi = α1wi1 + α2wi2,

where zi = (zi1, zi2, zi3)T ∼ N(0, Σ), in which Σ = {σjj′} with σjj = 1 and σjj′ = 0.3 for j ≠ j′, and xi = (1, xi)T, in which xi is simulated from a centered and standardized binomial distribution with probability 0.7 for one outcome. We simulate the error terms εi from independent N(0, 0.5²). Let η = (1, 1, 1)T, α1 = (2, 2)T and α2 = (0, 0)T. Moreover, let wi1 = I(zi1² + ui − 1 < 0) and wi2 = 1 − wi1, where ui ∼ N(0, 1).
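For reference, a minimal data-generating sketch for this example is given below (ours); it uses the group indicator as reconstructed above, which should be checked against the original article.

```python
import numpy as np

def simulate_example1(n=200, seed=0):
    """Simulate (y, Z, X, group) along the lines of Example 1 (two latent groups)."""
    rng = np.random.default_rng(seed)
    Sigma = 0.3 * np.ones((3, 3)) + 0.7 * np.eye(3)     # sigma_jj = 1, sigma_jj' = 0.3
    Z = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
    x = rng.binomial(1, 0.7, size=n).astype(float)
    x = (x - x.mean()) / x.std()                        # centered and standardized
    X = np.column_stack([np.ones(n), x])                # x_i = (1, x_i)^T
    u = rng.standard_normal(n)
    w1 = (Z[:, 0] ** 2 + u - 1 < 0).astype(float)       # latent group indicator (reconstructed)
    beta = np.outer(w1, [2.0, 2.0]) + np.outer(1 - w1, [0.0, 0.0])
    y = Z @ np.ones(3) + np.sum(X * beta, axis=1) + 0.5 * rng.standard_normal(n)
    return y, Z, X, w1
```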

Table 1:

The sample mean, median and standard deviation (s.d.) of Kˆ, the Rand Index (RI) value, and the percentage (per) of Kˆ equal to the true number of subgroups by MCP and SCAD, based on 500 replications with n = 200, 400 in Example 1.

             n = 200                               n = 400
        mean   median  s.d.   RI     per      mean   median  s.d.   RI     per
MCP     2.100  2.000   0.302  0.799  0.890    2.080  2.000   0.272  0.826  0.920
SCAD    2.100  2.000   0.303  0.799  0.890    2.080  2.000   0.273  0.825  0.920

We select the λ value by minimizing the modified BIC given in (16). Table 1 reports the sample mean, median and standard deviation (s.d.) of the estimated number of groups Kˆ, the average Rand Index (RI) defined in (17) for measuring clustering accuracy, and the percentage (per) of Kˆ equal to the true number of subgroups by the MCP and SCAD methods, based on 500 simulation realizations with n = 200, 400. The median of Kˆ is 2, the true number of subgroups, in all cases. As n increases, the mean gets closer to 2 and the standard deviation becomes smaller. Moreover, the Rand Index value and the percentage of correctly selecting the number of subgroups approach 1 as n increases.

Table 2:

The sample mean, median and asymptotic standard error (ASE) of the estimators αˆ1 and αˆ2 by MCP and SCAD, and of the oracle estimators αˆ1or and αˆ2or, based on 500 replications with n = 200, 400 in Example 1.

                  n = 200                    n = 400
             mean    median  ASE        mean    median  ASE
αˆ11   MCP    2.016   2.019   0.046      2.012   2.014   0.034
       SCAD   2.016   2.018   0.046      2.012   2.014   0.034
αˆ11or        1.996   1.997   0.047      2.009   2.009   0.033
αˆ12   MCP    1.927   1.929   0.045      1.947   1.945   0.034
       SCAD   1.927   1.929   0.045      1.947   1.946   0.034
αˆ12or        1.990   1.989   0.046      1.999   1.999   0.034
αˆ21   MCP    0.016   0.016   0.053      0.006   0.010   0.038
       SCAD   0.016   0.016   0.053      0.006   0.010   0.038
αˆ21or        0.006   0.010   0.055      0.009   0.009   0.039
αˆ22   MCP    0.087   0.086   0.054      0.086   0.084   0.038
       SCAD   0.087   0.086   0.054      0.086   0.084   0.038
αˆ22or       −0.011  −0.014   0.055      0.001   0.001   0.039

To further study the estimation accuracy and evaluate the asymptotic properties stated in Section 4, Table 2 presents the sample mean, median and asymptotic standard error (ASE), obtained according to Corollary 1, of the estimators αˆ1 = (αˆ11, αˆ12)T and αˆ2 = (αˆ21, αˆ22)T by the MCP and SCAD methods and of the oracle estimators αˆ1or = (αˆ11or, αˆ12or)T and αˆ2or = (αˆ21or, αˆ22or)T based on 500 replications with n = 200 and 400. The medians and means of αˆ1 and αˆ2 are close to the true values 2 and 0 in all cases. Moreover, the asymptotic standard errors of the penalized estimators αˆ1 and αˆ2 are close to those of the oracle estimators αˆ1or and αˆ2or. This supports the oracle property established in Theorem 2.

Next, we calculate the mean squared error (MSE) of the estimates ηˆ using the formula ‖ηˆ − η‖²/q. Figure 2 depicts the boxplots of the MSEs of ηˆ by MCP and SCAD, respectively, at n = 200 (white) and n = 400 (grey). MCP and SCAD result in similar MSEs of ηˆ, and the MSE values decrease as n increases for both methods.

Figure 2: Boxplots of the MSEs of ηˆ using MCP and SCAD, respectively, with n = 200 (white) and n = 400 (grey) in Example 1.

Lastly, we fit the two-group logistic-normal mixture model given in (4) using the R package "flexmix" and obtain average RI values of 0.516 and 0.518 for n = 200 and 400, respectively, based on 500 simulation realizations. Our method thus yields higher RI values than the structured mixture modelling approach, even though our method does not need to specify the true number of subgroups beforehand. Table 3 reports the sample mean, median and empirical standard deviation (ESD) of the estimators αˆ1 and αˆ2 obtained by using the clustering result from the mixture modelling approach to fit separate regressions for the two groups. Clearly, the estimates from our method in Table 2 are closer to the true values α1 and α2 than those from the mixture modelling approach in Table 3. The mixture modelling method requires the latent group indicator wi = I(ziTγ + ϵ > 0) in (4) to depend on zi through a parametric linear function, so it fits a mis-specified model in this example. As a result, it yields biased estimates of α1 and α2.

Table 3:

The sample mean, median and empirical standard deviation (ESD) of the estimators αˆ1 and αˆ2 by the mixture modelling approach based on 500 replications in Example 1.

            n = 200                    n = 400
       mean    median  ESD        mean    median  ESD
αˆ11    1.834   1.809   0.266      1.819   1.794   0.179
αˆ12    1.254   1.262   0.137      1.263   1.266   0.090
αˆ21    0.809   0.842   0.419      0.786   0.796   0.343
αˆ22    0.949   0.977   0.210      0.940   0.957   0.151

Example 2

(Three subgroups). We simulate data from the heterogeneous model with three groups:

(19) yi = ziTη + xiTβi + εi,  i = 1, …, n,  with βi = α1wi1 + α2wi2 + α3wi3,

where zi, εi and η are simulated in the same way as in Example 1. Let xi = (1, xi)T, where xi is generated from a centered and standardized binomial distribution with probability 0.5 for one outcome. Let α1 = (c, c)T, α2 = (0, 0)T and α3 = (−c, −c)T. Moreover, let wi1 = I(|zi1 + zi2 + zi3| < 0.9), wi2 = I(zi1 + zi2 + zi3 ≥ 0.9) and wi3 = 1 − wi1 − wi2. Let n = 200.

Figure 3 displays the fusiongram for (βˆ21(λ), …, βˆ2n(λ)), the second components of the βˆi(λ)'s, against λ with c = 2. The fusiongrams for SCAD and MCP look similar: they produce three subgroups for λ in a certain interval, over which the estimates of the treatment effects are nearly unbiased. For the lasso, the estimates merge to a single value quickly due to the overshrinkage of the L1 penalty.

Figure 3: Fusiongram for (βˆ21(λ), …, βˆ2n(λ)), the second components of the βˆi(λ)'s, against λ with c = 2 in Example 2.

Table 4:

The sample mean, median and standard deviation (s.d.) of Kˆ, the Rand Index (RI) value, and the percentage (per) of Kˆ equal to the true number of subgroups by MCP and SCAD, based on 500 replications for c = 2, 3 and 4 in Example 2.

           c = 2             c = 3             c = 4
           MCP     SCAD     MCP     SCAD     MCP     SCAD
mean       3.450   3.470    3.400   3.410    3.350   3.370
median     3.000   3.000    3.000   3.000    3.000   3.000
s.d.       0.775   0.765    0.769   0.758    0.748   0.726
RI         0.651   0.651    0.658   0.657    0.672   0.672
per        0.630   0.600    0.680   0.660    0.800   0.760

We next conduct the simulations by selecting λ via minimizing the modified BIC given in (16). Table 4 reports the sample mean, median and standard deviation (s.d.) of the estimated number of groups Kˆ, the Rand Index (RI) defined in (17) for measuring clustering accuracy, and the percentage (per) of Kˆ equal to the true number of subgroups by the MCP and SCAD methods, based on 500 simulation realizations for c = 2, 3 and 4. The median value of Kˆ is 3, the true number of subgroups, in all cases. Moreover, as c becomes larger, the Rand Index value and the percentage of correctly selecting the number of subgroups increase.

Example 3

(No treatment heterogeneity). We generate data from a model with homogeneous treatment effects, yi = ziTη + xiTβ + εi, i = 1, …, n, where zi, xi, εi and η are simulated in the same way as in Example 1. Set β = (β1, β2)T = (2, 2)T and n = 200. We use our proposed penalized estimation method to fit the model. The sample mean of the estimated number of groups Kˆ is 1.19 and 1.18 for the MCP and SCAD methods, respectively, and the sample median is 1 for both methods based on 500 replications.

Table 5:

The empirical bias (Bias) of the estimates of β and η, the average asymptotic standard error (ASE) calculated according to Corollary 1, and the empirical standard error (ESE) for the MCP and SCAD methods and the oracle estimator (ORACLE) in Example 3.

                β1       β2       η1       η2       η3
MCP     Bias   −0.005    0.003   −0.002    0.007    0.003
        ASE     0.035    0.037    0.036    0.037    0.037
        ESE     0.034    0.036    0.041    0.038    0.041
SCAD    Bias   −0.004    0.002   −0.001    0.007    0.003
        ASE     0.035    0.036    0.036    0.037    0.037
        ESE     0.034    0.036    0.040    0.037    0.041
ORACLE  Bias   −0.004    0.002   −0.001    0.006    0.004
        ASE     0.036    0.036    0.037    0.038    0.038
        ESE     0.036    0.037    0.039    0.034    0.039

To evaluate the asymptotic normality established in Corollary 1, Table 5 lists the empirical bias (Bias) of the estimates of β and η, the average asymptotic standard error (ASE) calculated according to Corollary 1, and the empirical standard error (ESE) for MCP, SCAD and the oracle estimator (ORACLE). The bias, ASE and ESE of the estimates of β by MCP and SCAD are calculated based on the replications in which the estimated number of groups equals one; the other quantities are calculated based on all 500 replications. The biases are small in all cases. The ESE and ASE are similar for both MCP and SCAD, and they are also close to the corresponding values for the oracle estimator. These results indicate that the proposed method works well for the homogeneous model.

6.2 Empirical example

We apply our method to the AIDS Clinical Trials Group Study 175, a randomized clinical trial comparing zidovudine with three other therapies (zidovudine plus didanosine, zidovudine plus zalcitabine, and didanosine alone) in adults infected with human immunodeficiency virus type I. We use the log-transformed CD4 counts at 20 ± 5 weeks as the responses yi [31]. For illustration of our method, we use didanosine as the treatment variable and represent it by a binary variable xi, so the coefficient of xi represents the difference in treatment effect between the two therapies, didanosine and zidovudine. We randomly select 500 patients from the study to form our dataset. Moreover, we include 5 baseline covariates in the model: age (years), weight (kg), Karnofsky score, log-transformed CD8 counts at baseline, and gender (0 = female, 1 = male).

To explore possible heterogeneity in treatment effects, we first fit the homogeneous linear model (3) using yi as the response and the 5 baseline covariates and the treatment variable as predictors, and obtain the residuals, so that the effects of the baseline covariates are controlled for. Figure 4(a) shows the kernel density plot of the residuals. The distribution has multiple modes, which indicates possible heterogeneous treatment effects.

Figure 4: Kernel density plots of the residuals (a) from fitting the homogeneous linear model (3) and (b) from fitting the heterogeneous linear model (2).

Next, we fit the heterogeneous model yi = ziTη + xiTβi + εi, i = 1, …, n, where βi = (β1i, β2i)T, zi = (zi1, …, zi5)T contains the 5 covariates described above, and xi = (1, xi)T, in which xi is the binary variable for the treatment didanosine. All predictors are centered and standardized before applying the regularization methods. Figure 5 displays the fusiongram for the estimated treatment coefficients (βˆ21(λ), …, βˆ2n(λ)) by MCP; SCAD yields similar patterns. The fusiongrams suggest the existence of heterogeneity in the treatment effects. In particular, the modified BIC criterion selects a λ value in the region that gives two subgroups with different treatment effects. Moreover, Figure 4(b) shows the kernel density plot of the residuals from fitting the heterogeneous linear model (2); the distribution is uni-modal.

Figure 5: Fusiongram for βˆ2(λ) = (βˆ21(λ), …, βˆ2n(λ)) against λ.

Table 6:

The estimates (Est.), their standard errors (s.e.) and p-values for testing the significance of α1 and α2 by the MCP and SCAD methods, and those values of β = α1 by the OLS method.

                 α11 (intercept)   α12 (trt)   α21 (intercept)   α22 (trt)
MCP    Est.      5.809            −0.065       5.735             1.687
       s.e.      0.016             0.041       0.055             0.110
       p-value  <0.001             0.112      <0.001            <0.001
SCAD   Est.      5.809            −0.065       5.737             1.688
       s.e.      0.016             0.042       0.055             0.110
       p-value  <0.001             0.122      <0.001            <0.001
OLS    Est.      5.787             0.077
       s.e.      0.019             0.040
       p-value  <0.001             0.054

Let αˆ1 = (αˆ11, αˆ12)T and αˆ2 = (αˆ21, αˆ22)T be the estimated coefficients for xi in the two identified groups Gˆ1 and Gˆ2, respectively, so that βˆi = αˆ1 for i ∈ Gˆ1 and βˆi = αˆ2 for i ∈ Gˆ2. In Table 6, we report the estimates (Est.) and their standard errors (s.e.) of α1 = (α11, α12)T and α2 = (α21, α22)T by the MCP and SCAD methods, and those values of β = α1 by the OLS method. Note that the first and second components represent the coefficients for the intercept and the treatment variable (trt), respectively. We also report the p-values for testing whether each coefficient is zero. By the OLS method, the p-value for testing whether the coefficient of the therapy (didanosine), α12, is zero is greater than 0.05, indicating that, under the assumption of a homogeneous treatment effect, the effect of didanosine is not significantly different from that of the standard treatment (zidovudine). In comparison, with the proposed concave fusion approach, the p-value for the treatment coefficient is greater than 0.05 in one group but less than 0.001 in the other. This implies that although the treatment has no statistically significant effect on one group of patients, its effect on the other group is prominent.

Next, we employ the bootstrap method given in Section 5 to test homogeneity of the treatment effect and obtain a p-value < 0.01; the heterogeneity of the treatment effect is thus further confirmed by the inference procedure. Lastly, we apply the method to model (2) with all three therapies by letting xi = (1, xi1, xi2, xi3)T, where xi1 = zidovudine and didanosine (trt1), xi2 = zidovudine and zalcitabine (trt2), and xi3 = didanosine (trt3). Let αˆ1 = (αˆ11, αˆ12, αˆ13, αˆ14)T and αˆ2 = (αˆ21, αˆ22, αˆ23, αˆ24)T be the estimated coefficients for xi in the two identified groups Gˆ1 and Gˆ2, respectively. In Table 7, we report the estimates (Est.) and their standard errors (s.e.) of α1 = (α11, α12, α13, α14)T and α2 = (α21, α22, α23, α24)T by the MCP and SCAD methods, and those values of β = α1 by the OLS method. Our proposed method again yields two subgroups. The MCP and SCAD results are similar, and the effects of the three therapies differ from the standard therapy more significantly in one group than in the other.

Table 7:

The estimates (Est.), their standard errors (s.e.) and p-values (P-value) for testing significance of α1 and α2 by MCP and SCAD, and those values obtained by OLS.

              α11     α12     α13     α14     α21     α22     α23     α24
MCP   Est.    3.985   0.114   0.104   0.126   5.525   1.116   1.200   1.214
      s.e.    0.086   0.049   0.051   0.052   0.074   0.103   0.106   0.107
      p-value <0.001  0.020   0.041   0.015  <0.001  <0.001  <0.001  <0.001
SCAD  Est.    3.985   0.114   0.105   0.127   5.524   1.116   1.199   1.214
      s.e.    0.086   0.047   0.051   0.052   0.074   0.102   0.106   0.107
      p-value <0.001  0.015   0.040   0.015  <0.001  <0.001  <0.001  <0.001
OLS   Est.    5.846   0.069   0.067   0.065
      s.e.    0.019   0.024   0.024   0.024
      p-value <0.001  0.004   0.005   0.007

7 Discussion

It will be of interest to extend the proposed method to a more general class of regression problems including generalized linear and Cox regression models. Moreover, it is also possible to relax the linearity assumption on the covariates zi by considering a nonparametric or semiparametric functional form of zi. For these more complicated models, the ADMM algorithm can still be employed with some modifications. However, further work is needed to study the theoretical properties. We refer to Section A.2 for detailed discussions on these extensions. Also, we have assumed that the number of treatment variables p and the number of confounding variables q are both smaller than the sample size n, although we allow p and q to diverge with n. For high-dimensional problems with p > n or q > n, a sparsity condition on the coefficients would be needed to ensure identifiability of the model. Further studies are needed to develop computational algorithms and theoretical results for the high-dimensional setting.

Funding statement: The research of Ma is supported in part by U.S. NSF grant DMS-1712558 and NIH grant R01 ES024732-03.

A Appendix

In the Appendix, we give the computational details, the convergence property of the ADMM algorithm, and the technical proofs of Theorems 1-3.

A.1 Computation

A.1.1 ADMM with concave penalties

We derive an ADMM algorithm for computing the solution in (7). The key idea is to introduce a new set of parameters δij = βi − βj. The problem of minimizing (6) can then be reformulated as minimizing

(20) L0(η, β, δ) = (1/2) Σi=1n (yi − ziTη − xiTβi)² + Σi<j pγ(‖δij‖, λ),  subject to βi − βj − δij = 0,

where δ = {δijT, i < j}T. Let ⟨a, b⟩ = aTb be the inner product of two vectors a and b of the same dimension. The augmented Lagrangian is

(21) L(η, β, δ, υ) = L0(η, β, δ) + Σi<j ⟨υij, βi − βj − δij⟩ + (ϑ/2) Σi<j ‖βi − βj − δij‖²,

where the dual variables υ={υijT,i<j}T are Lagrange multipliers and ϑ is a penalty parameter. We then compute the estimates of (η,β,δ,υ) through iterations using the ADMM.

For a given value of δm and υm at step m, the iteration goes as follows:

(22) (ηm+1, βm+1) = argmin η,β L(η, β, δm, υm),
(23) δm+1 = argmin δ L(ηm+1, βm+1, δ, υm),
(24) υijm+1 = υijm + ϑ(βim+1 − βjm+1 − δijm+1).

In (22), the problem is equivalent to the minimization of the function

f(η, β) = (1/2) Σi=1n (yi − ziTη − xiTβi)² + (ϑ/2) Σi<j ‖βi − βj − δijm + ϑ−1υijm‖² + C,

where C is a constant independent of (η,β). Some algebra shows that we can write f(η,β) as

(25) f(η, β) = (1/2)‖Zη + Xβ − y‖² + (ϑ/2)‖Aβ − δm + ϑ−1υm‖² + C,

where A = D ⊗ Ip. Here D = {(ei − ej), i < j}T with ei being the ith unit n × 1 vector whose ith element is 1 and whose remaining elements are 0, Ip is the p × p identity matrix, and ⊗ is the Kronecker product.

Thus for given δm and υm at the mth step, the updates βm+1 and ηm+1 are

(26) βm+1 = (XTQZX + ϑATA)−1[XTQZy + ϑAT(δm − ϑ−1υm)],  ηm+1 = (ZTZ)−1ZT(y − Xβm+1),

where QZ = In − Z(ZTZ)−1ZT. Since

AT(δm − ϑ−1υm) = (DT ⊗ Ip)(δm − ϑ−1υm) = vec((Δm − ϑ−1Υm)D),

where Δm = {δijm, i < j}p×n(n−1)/2 and Υm = {υijm, i < j}p×n(n−1)/2, we have

(27) βm+1 = (XTQZX + ϑATA)−1[XTQZy + ϑ vec((Δm − ϑ−1Υm)D)].

In (23), after discarding the terms independent of δ, we need to minimize

(28) (ϑ/2)‖ζijm − δij‖² + pγ(‖δij‖, λ)

with respect to δij, where ζijm = βim − βjm + ϑ−1υijm. This is a groupwise thresholding operator corresponding to pγ.

For the L1 penalty, the solution is

(29) δijm+1 = S(ζijm, λ/ϑ),

where S(z, t) = (1 − t/‖z‖)+ z is the groupwise soft thresholding operator, and (x)+ = x if x > 0 and 0 otherwise.

For the MCP with γ > 1/ϑ, the solution is

(30) δijm+1 = S(ζijm, λ/ϑ) / (1 − 1/(γϑ)) if ‖ζijm‖ ≤ γλ;  δijm+1 = ζijm if ‖ζijm‖ > γλ.

For the SCAD penalty with γ > 1/ϑ + 1, the solution is

(31) δijm+1 = S(ζijm, λ/ϑ) if ‖ζijm‖ ≤ λ + λ/ϑ;  δijm+1 = S(ζijm, γλ/((γ − 1)ϑ)) / (1 − 1/((γ − 1)ϑ)) if λ + λ/ϑ < ‖ζijm‖ ≤ γλ;  δijm+1 = ζijm if ‖ζijm‖ > γλ.

Finally, the update of υij is given in (24).
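The three updates (29)-(31) are easy to code directly; the following sketch (names ours) implements the groupwise operators.

```python
import numpy as np

def soft(z, t):
    """Groupwise soft thresholding S(z, t) = (1 - t/||z||)_+ z."""
    norm = np.linalg.norm(z)
    return max(0.0, 1.0 - t / norm) * z if norm > 0 else z

def update_delta_lasso(zeta, lam, vartheta, gamma=None):
    """L1 update (29)."""
    return soft(zeta, lam / vartheta)

def update_delta_mcp(zeta, lam, vartheta, gamma):
    """MCP update (30), valid for gamma > 1/vartheta."""
    if np.linalg.norm(zeta) <= gamma * lam:
        return soft(zeta, lam / vartheta) / (1.0 - 1.0 / (gamma * vartheta))
    return zeta

def update_delta_scad(zeta, lam, vartheta, gamma):
    """SCAD update (31), valid for gamma > 1/vartheta + 1."""
    nz = np.linalg.norm(zeta)
    if nz <= lam + lam / vartheta:
        return soft(zeta, lam / vartheta)
    if nz <= gamma * lam:
        return soft(zeta, gamma * lam / ((gamma - 1.0) * vartheta)) \
               / (1.0 - 1.0 / ((gamma - 1.0) * vartheta))
    return zeta
```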

We summarize the above analysis in Algorithm 1.

Algorithm 1 ADMM for concave fusion
Require: Initialize δ0, υ0.
 1: form=0,1,2,do
 2:  Compute βm+1 using (27)
 3:  Compute ηm+1 using (26)
 4:  Compute δm+1 using (29), (30) or (31)
 5:  Compute υm+1 using (24)
 6:  if convergence criterion is met, then
 7:   Stop and denote the last iteration by (ηˆ(λ),βˆ(λ)),
 8:  else
 9:   m = m + 1.
10:  end if
11: end for
Ensure: Output
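For small n, Algorithm 1 can be transcribed almost literally. The sketch below (ours; it assumes `update_delta_mcp` or `update_delta_scad` from the previous snippet is in scope) builds the matrices X = diag(x1T, …, xnT) and A = D ⊗ Ip explicitly, which is adequate for illustration but not for large-scale use.

```python
import numpy as np
from itertools import combinations

def admm_fusion(y, Z, X, lam, update_delta, vartheta=1.0, gamma=3.0,
                max_iter=200, tol=1e-5):
    """Sketch of Algorithm 1 for one lambda; X is n x p with rows x_i^T."""
    n, p = X.shape
    pairs = list(combinations(range(n), 2))
    Xd = np.zeros((n, n * p))                       # block design diag(x_1^T, ..., x_n^T)
    A = np.zeros((len(pairs) * p, n * p))           # pairwise difference matrix D kron I_p
    for i in range(n):
        Xd[i, i * p:(i + 1) * p] = X[i]
    for r, (i, j) in enumerate(pairs):
        A[r * p:(r + 1) * p, i * p:(i + 1) * p] = np.eye(p)
        A[r * p:(r + 1) * p, j * p:(j + 1) * p] = -np.eye(p)
    Qz = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    M = Xd.T @ Qz @ Xd + vartheta * A.T @ A         # matrix in update (26)
    delta = np.zeros(A.shape[0])
    ups = np.zeros(A.shape[0])
    for _ in range(max_iter):
        beta = np.linalg.solve(M, Xd.T @ Qz @ y + vartheta * A.T @ (delta - ups / vartheta))
        eta = np.linalg.solve(Z.T @ Z, Z.T @ (y - Xd @ beta))
        zeta = A @ beta + ups / vartheta            # zeta_ij^m of (28)
        delta = np.concatenate([update_delta(zeta[r * p:(r + 1) * p], lam, vartheta, gamma)
                                for r in range(len(pairs))])
        ups = ups + vartheta * (A @ beta - delta)   # dual update (24)
        if np.linalg.norm(A @ beta - delta) < tol:  # primal residual, cf. Remark 9
            break
    return eta, beta.reshape(n, p)
```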

Remark 7

Our algorithm yields δˆij = 0 for a sufficiently large λ. We put observations i and j in the same treatment-effect group if δˆij = 0. As a result, we have Kˆ estimated groups Gˆ1, …, GˆKˆ. The estimated treatment effect for the kth group is αˆk = |Gˆk|−1 Σi∈Gˆk βˆi, where |Gˆk| is the cardinality of Gˆk.

Remark 8

In the algorithm, we require the invertibility of XTQZX + ϑATA. It can be derived that ATA = nInp − (1n ⊗ Ip)(1n ⊗ Ip)T. For any nonzero vector a = (aij, 1 ≤ i ≤ n, 1 ≤ j ≤ p)T ∈ IRnp, we have aT(ϑATA)a ≥ 0 and aT(XTQZX)a ≥ 0. Note that aT(ϑATA)a = 0 if and only if aij = aj for all i. When aij = aj for all i, we have aT(XTQZX)a > 0 given that λmin(Σi=1n (xiT, ziT)T(xiT, ziT)) > 0, which is a common assumption on the design matrix in linear regression. Therefore, XTQZX + ϑATA is invertible.

Remark 9

We track the progress of the ADMM based on the primal residual rm+1 = Aβm+1 − δm+1. We stop the algorithm when rm+1 is close to zero, i.e. when ‖rm+1‖ < a for some small value a.

A.1.2 Initial value and computation of the solution path

To start the ADMM algorithm described above, it is important to find a reasonable initial value. For this purpose, we consider the ridge fusion criterion given by

LR(η, β) = (1/2)‖Zη + Xβ − y‖² + (λ*/2) Σ1≤i<j≤n ‖βi − βj‖²,

where λ* is the tuning parameter, set to a small value; we use λ* = 0.001 in our analysis. Then LR(η, β) can be written as

LR(η, β) = (1/2)‖Zη + Xβ − y‖² + (λ*/2)‖Aβ‖²,

where A is defined in (25). The solutions are

βR(λ*) = (βR,1(λ*)T, …, βR,n(λ*)T)T = (XTQZX + λ*ATA)−1XTQZy,  ηR(λ*) = (ZTZ)−1ZT(y − XβR(λ*)),

where QZ is given in (27). Next, we assign the subjects to K* groups by ranking the median values of the βR,i(λ*). We let K* = ⌊n1/2⌋ to ensure that it is sufficiently large, where ⌊a⌋ denotes the largest integer no greater than a. We then find the initial estimates η0 and β0 from least squares regression with K* groups. Let the initial estimates be δij0 = βi0 − βj0 and υ0 = 0.
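A sketch of this initializer (ours), with Xd and A as in the ADMM snippet above:

```python
import numpy as np

def ridge_init(y, Z, Xd, A, lam_star=1e-3):
    """Ridge fusion initializer of Section A.1.2; returns (eta0, beta0)."""
    Qz = np.eye(len(y)) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    beta0 = np.linalg.solve(Xd.T @ Qz @ Xd + lam_star * A.T @ A, Xd.T @ Qz @ y)
    eta0 = np.linalg.solve(Z.T @ Z, Z.T @ (y - Xd @ beta0))
    return eta0, beta0
```

One would then rank the medians of the βR,i(λ*), assign the subjects to K* = ⌊n1/2⌋ groups, and refit by least squares as described above.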

To compute the solution path of η and β along the λ values, we use the warm start and continuation strategy to update the solutions. Let [λmin, λmax] be the interval on which we compute the solution path, where 0 ≤ λmin < λmax < ∞. Let λmin = λ0 < λ1 < ⋯ < λK ≤ λmax be a grid of λ values in [λmin, λmax]. Compute (ηˆ(λ0), βˆ(λ0)) using (η0, β0) as the initial value; then compute (ηˆ(λk), βˆ(λk)) using (ηˆ(λk−1), βˆ(λk−1)) as the initial value for each k = 1, …, K.

Note that we start from the smallest λ value in computing the solution path. This is different from the coordinate descent algorithms for computing the solution path in penalized regression problems [32], where the algorithms start at the λ value that forces all the coefficients to zero.

A.1.3 Convergence of the algorithm

We next derive the convergence properties of the ADMM algorithm.

Proposition 1

Let rm = Aβm − δm and sm+1 = ϑAT(δm+1 − δm) be the primal residual and the dual residual of the ADMM described above, respectively. Then limm→∞ ‖rm‖² = 0 and limm→∞ ‖sm‖² = 0 for the MCP and SCAD penalties.

This proposition shows that the primal feasibility and the dual feasibility are achieved by the algorithm.

Proof. By the definition of δm+1, we have

L(ηm+1, βm+1, δm+1, υm) ≤ L(ηm+1, βm+1, δ, υm)

for any δ. Define

fm+1 = inf{Aβm+1 − δ = 0} {(1/2)‖y − Zηm+1 − Xβm+1‖² + Σi<j pγ(‖δij‖, λ)} = inf{Aβm+1 − δ = 0} L(ηm+1, βm+1, δ, υm).

Then

L(ηm+1, βm+1, δm+1, υm) ≥ fm+1.

Let t be an integer. Since υm+1 = υm + ϑ(Aβm+1 − δm+1), we have

υm+t−1 = υm + ϑ Σi=1t−1 (Aβm+i − δm+i),

and thus

L(ηm+t, βm+t, δm+t, υm+t−1) = (1/2)‖y − Zηm+t − Xβm+t‖² + (υm+t−1)T(Aβm+t − δm+t) + (ϑ/2)‖Aβm+t − δm+t‖² + Σi<j pγ(‖δijm+t‖, λ) = (1/2)‖y − Zηm+t − Xβm+t‖² + (υm)T(Aβm+t − δm+t) + ϑ Σi=1t−1 (Aβm+i − δm+i)T(Aβm+t − δm+t) + (ϑ/2)‖Aβm+t − δm+t‖² + Σi<j pγ(‖δijm+t‖, λ) ≥ fm+t.

Since the objective function L(η, β, δ, υ) is differentiable with respect to (η, β) and convex with respect to δ, by applying the results in Theorem 4.1 of [33], the sequence (ηm, βm, δm) has a limit point, denoted by (η*, β*, δ*). Then we have

f* = limm→∞ fm+1 = limm→∞ fm+t = inf{Aβ* − δ = 0} {(1/2)‖y − Zη* − Xβ*‖² + Σi<j pγ(‖δij‖, λ)},

and for all t ≥ 0

limm→∞ L(ηm+t, βm+t, δm+t, υm+t−1) = (1/2)‖y − Zη* − Xβ*‖² + Σi<j pγ(‖δij*‖, λ) + limm→∞ (υm)T(Aβ* − δ*) + (t − 1/2)ϑ‖Aβ* − δ*‖² ≥ f*.

Hence limm→∞ ‖rm‖² = ‖r*‖² = ‖Aβ* − δ*‖² = 0.

Since βm+1 minimizes L(ηm,β,δm,υm) by definition, we have that

∂L(ηm, β, δm, υm)/∂β = 0 at β = βm+1,

and moreover,

∂L(ηm, β, δm, υm)/∂β |β=βm+1 = XT(Zηm + Xβm+1 − y) + ATυm + ϑAT(Aβm+1 − δm) = XT(Zηm + Xβm+1 − y) + AT{υm + ϑ(Aβm+1 − δm)} = XT(Zηm + Xβm+1 − y) + AT{υm+1 − ϑ(Aβm+1 − δm+1) + ϑ(Aβm+1 − δm)} = XT(Zηm + Xβm+1 − y) + ATυm+1 + ϑAT(δm+1 − δm).

Therefore,

sm+1 = ϑAT(δm+1 − δm) = −{XT(Zηm + Xβm+1 − y) + ATυm+1}.

Since ‖Aβ* − δ*‖² = 0,

limm→∞ ∂L(ηm, β, δm, υm)/∂β |β=βm+1 = limm→∞ {XT(Zηm + Xβm+1 − y) + ATυm+1} = XT(Zη* + Xβ* − y) + ATυ* = 0.

Therefore, limm→∞ sm+1 = 0.

A.2 Extension to nonlinear models

In this paper, we focus on the linear regression model (2) to introduce our proposed method for exploring treatment heterogeneity. It is worth mentioning that our method can be readily extended to semi-parametric models by relaxing the linearity assumption on zi. Consider the model:

yi = m(zi) + xiTβi + εi,  i = 1, …, n,

where m(·) is an unknown function of zi. This model places no constraint on the functional form of zi. Following the approach in Ma et al. [34], we can estimate m(·) by tensor-product regression splines weighted by categorical kernel functions, where the splines are used for the continuous predictors and the categorical kernels for the discrete predictors. The objective function for obtaining the estimates then consists of two parts, similar to (6): the first part is a weighted least squares criterion as presented in eq. (2) of Ma et al. [34], and the second part contains the same penalty functions as in (6). As a result, the ADMM algorithm proposed in Section A.1 can be applied. Moreover, we can also assume semi-parametric structures on m(·) for further dimensionality reduction while retaining model flexibility. For example, when zi is a set of continuous variables, we can assume the additive structure:

m(zi) = m1(zi1) + ⋯ + mq(ziq),

where mk(·), k = 1, …, q, are unknown functions. The partially linear additive structure is another popular semi-parametric model, given as

m(zi) = m1(zi1) + ⋯ + mq1(ziq1) + zi2Tη,

where zi = (zi1T, zi2T)T, zi1 = (zi1, …, ziq1)T and zi2 = (zi,q1+1, …, ziq)T. If we use regression splines to approximate the unknown functions, the same ADMM algorithm given in Section A.1 can be applied to obtain the parameter estimators, with the components of zi replaced by their spline basis functions. We refer to Ma et al. [35] for details on using regression splines to estimate unknown functions in these settings.

It is also of interest to extend the proposed method to the case with discrete responses. For a general scenario, one may fit a generalized linear model:

E(yi|zi, xi) = μi = g−1(ziTη + xiTβi),  i = 1, …, n,

where g is a known monotone link function. To obtain the parameter estimates, we can consider the negative quasi-likelihood function

Q(μ, y) = ∫μy {(y − ξ)/V(ξ)} dξ,

where V( · ) is the conditional variance of y given (z,x). Then the parameter estimates can be obtained by minimizing

Σi=1n Q(g−1(ziTη + xiTβi), yi) + Σ1≤i<j≤n p(‖βi − βj‖, λ).

Because Q(g−1(ziTη + xiTβi), yi) is differentiable with respect to (η, βi), the ADMM algorithm given in Section A.1 can be applied straightforwardly.
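As a concrete special case, the sketch below (ours) evaluates the penalized objective for a Bernoulli outcome with a logit link, where the negative quasi-likelihood reduces to the negative log-likelihood; `penalty` could be the mcp or scad function from the Section 3 snippet.

```python
import numpy as np

def logistic_fusion_objective(y, Z, X, eta, beta, lam, penalty):
    """Penalized negative log-likelihood for y_i in {0, 1} with logit link."""
    linpred = Z @ eta + np.sum(X * beta, axis=1)            # z_i^T eta + x_i^T beta_i
    nll = np.sum(np.log1p(np.exp(linpred)) - y * linpred)   # Bernoulli negative log-lik.
    n = len(y)
    pen = sum(float(penalty(np.linalg.norm(beta[i] - beta[j]), lam))
              for i in range(n) for j in range(i + 1, n))
    return nll + pen
```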

A.3 Proof of Theorem 1

In this section we prove the results in Theorem 1. Every β ∈ MG can be written as β = Wα. Recall U = (Z, XW). We have

((ηˆor)T, (αˆor)T)T = argmin η∈IRq, α∈IRKp (1/2)‖y − Zη − XWα‖².

Thus

((ηˆor)T, (αˆor)T)T = [(Z, XW)T(Z, XW)]−1(Z, XW)Ty = (UTU)−1UTy.

Then

((ηˆor − η0)T, (αˆor − α0)T)T = (UTU)−1UTε.

Hence

(32) ‖((ηˆor − η0)T, (αˆor − α0)T)T‖ ≤ ‖(UTU)−1‖ ‖UTε‖.

By Condition (C3), we have

(33) ‖(UTU)−1‖ ≤ C1−1|Gmin|−1.

Moreover

P(‖UTε‖∞ > C√(n log n)) ≤ P(‖ZTε‖∞ > C√(n log n)) + P(‖(XW)Tε‖∞ > C√(n log n)),

for some constant 0 < C < ∞. Since XW = (xiT 1{i ∈ Gk})i=1,…,n; k=1,…,K, we have

‖(XW)Tε‖∞ = supj,k |Σi=1n xij εi 1{i ∈ Gk}|,

and by the union bound, Condition (C3) that Σi=1n xij² 1{i ∈ Gk} = |Gk|, and Condition (C2),

P(‖(XW)Tε‖∞ > C√(n log n)) ≤ Σj=1p Σk=1K P(|Σi=1n xij εi 1{i ∈ Gk}| > C√(n log n)) ≤ Σj=1p Σk=1K P(|Σi=1n xij εi 1{i ∈ Gk}| > √|Gk| C√(log n)) ≤ 2Kp exp(−c1C² log n) = 2Kp n−c1C².

By the union bound, Condition (C3) that ‖Zk‖ = √n, where Zk is the kth column of Z, and Condition (C2),

P(‖ZTε‖∞ > C√(n log n)) ≤ Σk=1q P(|ZkTε| > √n C√(log n)) ≤ 2q exp(−c1C² log n) = 2q n−c1C².

It follows that

P(‖UTε‖∞ > C√(n log n)) ≤ 2(Kp + q) n−c1C².

Since ‖UTε‖ ≤ √(q + Kp) ‖UTε‖∞, we have

(34) P(‖UTε‖ > C√(q + Kp) √(n log n)) ≤ 2(Kp + q) n−c1C².

Therefore, by (32), (33) and (34), we have with probability at least 1 − 2(Kp + q)n−c1C²,

‖((ηˆor − η0)T, (αˆor − α0)T)T‖ ≤ C C1−1 √(q + Kp) |Gmin|−1 √(n log n).

The result (9) in Theorem 1 follows by letting C = c1−1/2. Moreover,

‖βˆor − β0‖² = Σk=1K Σi∈Gk ‖αˆkor − αk0‖² ≤ |Gmax| Σk=1K ‖αˆkor − αk0‖² = |Gmax| ‖αˆor − α0‖² ≤ |Gmax| ϕn²,

and

supi ‖βˆior − βi0‖ = supk ‖αˆkor − αk0‖ ≤ ‖αˆor − α0‖ ≤ ϕn.

Let U = (U1, …, Un)T and Ξn = UTU. Then

anT((ηˆor − η0)T, (αˆor − α0)T)T = Σi=1n anTΞn−1Uiεi.

Hence

E{anT((ηˆor − η0)T, (αˆor − α0)T)T} = 0,

and for any vector an ∈ IRq+Kp with ‖an‖ = 1,

var{anT((ηˆor − η0)T, (αˆor − α0)T)T} = σn²(an) = σ² anT(UTU)−1an = σ² anTΞn−1an.

Moreover, for any ϵ > 0,

Σi=1n E[(anTΞn−1Uiεi)² 1{|anTΞn−1Uiεi| > ϵσn(an)}] ≤ Σi=1n {E(anTΞn−1Uiεi)⁴}1/2 [P{|anTΞn−1Uiεi| > ϵσn(an)}]1/2.

Since E(εi⁴) ≤ c for some constant c ∈ (0, ∞) by Condition (C2), we have

Σi=1n {E(anTΞn−1Uiεi)⁴}1/2 ≤ c1/2 anTΞn−1an.

Moreover,

maxi P{|anTΞn−1Uiεi| > ϵσn(an)} ≤ maxi E(anTΞn−1Uiεi)² / {ϵ²σn²(an)} ≤ c′ ϵ−2 (q + Kp) anTΞn−1Ξn−1an / (anTΞn−1an)

for some constant c′ ∈ (0, ∞). Therefore, by the above results, we have

σn−2(an) Σi=1n E[(anTΞn−1Uiεi)² 1{|anTΞn−1Uiεi| > ϵσn(an)}] ≤ {σ² anTΞn−1an}−1 c1/2 anTΞn−1an {c′ ϵ−2 (q + Kp) anTΞn−1Ξn−1an / (anTΞn−1an)}1/2 ≤ c1/2 c′1/2 C1−1/2 ϵ−1 σ−2 (q + Kp)1/2 |Gmin|−1/2 = o(1).

The result (11) then follows from the Lindeberg-Feller central limit theorem.

A.4 Proof of Theorem 2

In this section we show the results in Theorem 2. Define

$$L_n(\eta,\beta) = \frac{1}{2}\|y - Z\eta - X\beta\|^2, \qquad P_n(\beta) = \lambda\sum_{i<j}\rho(\|\beta_i - \beta_j\|),$$
$$L_n^G(\eta,\alpha) = \frac{1}{2}\|y - Z\eta - XW\alpha\|^2, \qquad P_n^G(\alpha) = \lambda\sum_{k<k'}|G_k||G_{k'}|\,\rho(\|\alpha_k - \alpha_{k'}\|),$$

and let

$$Q_n(\eta,\beta) = L_n(\eta,\beta) + P_n(\beta), \qquad Q_n^G(\eta,\alpha) = L_n^G(\eta,\alpha) + P_n^G(\alpha).$$

Let $T: \mathcal{M}_G \to \mathbb{R}^{Kp}$ be the mapping such that $T(\beta)$ is the $Kp \times 1$ vector consisting of $K$ subvectors of dimension $p$, whose $k$th subvector equals the common value of $\beta_i$ for $i \in G_k$. Let $T^*: \mathbb{R}^{np} \to \mathbb{R}^{Kp}$ be the mapping such that $T^*(\beta) = \{|G_k|^{-1}\sum_{i\in G_k}\beta_i^T,\ k = 1, \ldots, K\}^T$. Clearly, when $\beta \in \mathcal{M}_G$, $T(\beta) = T^*(\beta)$.

By calculation, for every $\beta \in \mathcal{M}_G$ we have $P_n(\beta) = P_n^G(T(\beta))$, and for every $\alpha \in \mathbb{R}^{Kp}$ we have $P_n(T^{-1}(\alpha)) = P_n^G(\alpha)$. Hence

$$Q_n(\eta,\beta) = Q_n^G(\eta, T(\beta)), \qquad Q_n^G(\eta,\alpha) = Q_n(\eta, T^{-1}(\alpha)). \tag{35}$$

Consider the neighborhood of $(\eta^0, \beta^0)$:

$$\Theta = \left\{\eta \in \mathbb{R}^q,\ \beta \in \mathbb{R}^{np}: \|\eta - \eta^0\| \le \phi_n,\ \sup_i\|\beta_i - \beta_i^0\| \le \phi_n\right\}.$$

By Theorem 1, there exists an event E1 in which

$$\|\hat\eta^{or} - \eta^0\| \le \phi_n, \qquad \sup_i\|\hat\beta_i^{or} - \beta_i^0\| \le \phi_n,$$

and $P(E_1^C) \le 2(q+Kp)n^{-1}$. Hence $(\hat\eta^{or}, \hat\beta^{or}) \in \Theta$ on the event $E_1$. For any $\beta \in \mathbb{R}^{np}$, let $\beta^* = T^{-1}(T^*(\beta))$. We show that $(\hat\eta^{or}, \hat\beta^{or})$ is a strict local minimizer of the objective function (6) with probability approaching 1 through the following two steps.

  (i) On the event $E_1$, $Q_n(\eta, \beta^*) > Q_n(\hat\eta^{or}, \hat\beta^{or})$ for any $(\eta^T, \beta^T)^T \in \Theta$ with $((\eta)^T, (\beta^*)^T)^T \ne ((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$.

  (ii) There is an event $E_2$ such that $P(E_2^C) \le 2n^{-1}$. On $E_1 \cap E_2$, there is a neighborhood of $((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$, denoted by $\Theta_n$, such that $Q_n(\eta,\beta) \ge Q_n(\eta, \beta^*)$ for any $((\eta)^T, (\beta)^T)^T \in \Theta_n \cap \Theta$ for sufficiently large $n$.

Therefore, by the results in (i) and (ii), we have $Q_n(\eta,\beta) > Q_n(\hat\eta^{or}, \hat\beta^{or})$ for any $((\eta)^T, (\beta)^T)^T \in \Theta_n \cap \Theta$ with $((\eta)^T, (\beta)^T)^T \ne ((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$ on $E_1 \cap E_2$, so that $((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$ is a strict local minimizer of $Q_n(\eta,\beta)$ in (6) on the event $E_1 \cap E_2$, where $P(E_1 \cap E_2) \ge 1 - 2(q+Kp+1)n^{-1}$ for sufficiently large $n$.

In the following we prove the result in (i). We first show that $P_n^G(T^*(\beta)) = C_n$ for any $\beta \in \Theta$, where $C_n$ is a constant that does not depend on $\beta$. Let $T^*(\beta) = \alpha = (\alpha_1^T, \ldots, \alpha_K^T)^T$. It suffices to show that $\|\alpha_k - \alpha_{k'}\| > a\lambda$ for all $k \ne k'$: then by Condition (C2), $\rho(\|\alpha_k - \alpha_{k'}\|)$ is a constant, and as a result $P_n^G(T^*(\beta))$ is a constant. Since

$$\|\alpha_k - \alpha_{k'}\| \ge \|\alpha_k^0 - \alpha_{k'}^0\| - 2\sup_k\|\alpha_k - \alpha_k^0\|,$$

and

$$\sup_k\|\alpha_k - \alpha_k^0\|^2 = \sup_k\Big\||G_k|^{-1}\sum_{i\in G_k}\beta_i - \alpha_k^0\Big\|^2 = \sup_k\Big\||G_k|^{-1}\sum_{i\in G_k}(\beta_i - \beta_i^0)\Big\|^2 = \sup_k|G_k|^{-2}\Big\|\sum_{i\in G_k}(\beta_i - \beta_i^0)\Big\|^2 \le \sup_k|G_k|^{-1}\sum_{i\in G_k}\|\beta_i - \beta_i^0\|^2 \le \sup_i\|\beta_i - \beta_i^0\|^2 \le \phi_n^2, \tag{36}$$

then for all $k \ne k'$,

$$\|\alpha_k - \alpha_{k'}\| \ge \|\alpha_k^0 - \alpha_{k'}^0\| - 2\sup_k\|\alpha_k - \alpha_k^0\| \ge b_n - 2\phi_n > a\lambda,$$

where the last inequality follows from the assumption that $b_n > a\lambda \gg \phi_n$. Therefore, we have $P_n^G(T^*(\beta)) = C_n$, and hence $Q_n^G(\eta, T^*(\beta)) = L_n^G(\eta, T^*(\beta)) + C_n$ for all $(\eta^T, \beta^T)^T \in \Theta$. Since $((\hat\eta^{or})^T, (\hat\alpha^{or})^T)^T$ is the unique global minimizer of $L_n^G(\eta,\alpha)$, we have $L_n^G(\eta, T^*(\beta)) > L_n^G(\hat\eta^{or}, \hat\alpha^{or})$ for all $(\eta^T, (T^*(\beta))^T)^T \ne ((\hat\eta^{or})^T, (\hat\alpha^{or})^T)^T$, and hence $Q_n^G(\eta, T^*(\beta)) > Q_n^G(\hat\eta^{or}, \hat\alpha^{or})$ for all $T^*(\beta) \ne \hat\alpha^{or}$. By (35), we have $Q_n^G(\hat\eta^{or}, \hat\alpha^{or}) = Q_n(\hat\eta^{or}, \hat\beta^{or})$ and $Q_n^G(\eta, T^*(\beta)) = Q_n(\eta, T^{-1}(T^*(\beta))) = Q_n(\eta, \beta^*)$. Therefore, $Q_n(\eta, \beta^*) > Q_n(\hat\eta^{or}, \hat\beta^{or})$ for all $\beta^* \ne \hat\beta^{or}$, and the result in (i) is proved.

Next we prove the result in (ii). For a positive sequence $t_n$, let $\Theta_n = \{\beta: \sup_i\|\beta_i - \hat\beta_i^{or}\| \le t_n\}$. For $(\eta^T, \beta^T)^T \in \Theta_n \cap \Theta$, a Taylor expansion gives

$$Q_n(\eta,\beta) - Q_n(\eta,\beta^*) = \Gamma_1 + \Gamma_2,$$

where

$$\Gamma_1 = -(y - Z\eta - X\beta^m)^T X(\beta - \beta^*), \qquad \Gamma_2 = \sum_{i=1}^n \frac{\partial P_n(\beta^m)}{\partial\beta_i^T}(\beta_i - \beta_i^*),$$

and $\beta^m = a\beta + (1-a)\beta^*$ for some constant $a \in (0, 1)$. Moreover,

$$\Gamma_2 = \lambda\sum_{\{j>i\}}\rho'(\|\beta_i^m-\beta_j^m\|)\|\beta_i^m-\beta_j^m\|^{-1}(\beta_i^m-\beta_j^m)^T(\beta_i-\beta_i^*) + \lambda\sum_{\{j<i\}}\rho'(\|\beta_i^m-\beta_j^m\|)\|\beta_i^m-\beta_j^m\|^{-1}(\beta_i^m-\beta_j^m)^T(\beta_i-\beta_i^*) = \lambda\sum_{\{j>i\}}\rho'(\|\beta_i^m-\beta_j^m\|)\|\beta_i^m-\beta_j^m\|^{-1}(\beta_i^m-\beta_j^m)^T(\beta_i-\beta_i^*) + \lambda\sum_{\{i<j\}}\rho'(\|\beta_j^m-\beta_i^m\|)\|\beta_j^m-\beta_i^m\|^{-1}(\beta_j^m-\beta_i^m)^T(\beta_j-\beta_j^*) = \lambda\sum_{\{j>i\}}\rho'(\|\beta_i^m-\beta_j^m\|)\|\beta_i^m-\beta_j^m\|^{-1}(\beta_i^m-\beta_j^m)^T\{(\beta_i-\beta_i^*) - (\beta_j-\beta_j^*)\}. \tag{37}$$

When $i, j \in G_k$, we have $\beta_i^* = \beta_j^*$ and $\beta_i^m - \beta_j^m = a(\beta_i - \beta_j)$. Thus,

$$\Gamma_2 = \lambda\sum_{k=1}^K\sum_{\{i,j\in G_k,\,i<j\}}\rho'(\|\beta_i^m-\beta_j^m\|)\|\beta_i^m-\beta_j^m\|^{-1}(\beta_i^m-\beta_j^m)^T(\beta_i-\beta_j) + \lambda\sum_{k<k'}\sum_{\{i\in G_k,\,j\in G_{k'}\}}\rho'(\|\beta_i^m-\beta_j^m\|)\|\beta_i^m-\beta_j^m\|^{-1}(\beta_i^m-\beta_j^m)^T\{(\beta_i-\beta_i^*) - (\beta_j-\beta_j^*)\}.$$

Moreover,

$$\sup_i\|\beta_i^* - \beta_i^0\|^2 = \sup_k\|\alpha_k - \alpha_k^0\|^2 \le \phi_n^2, \tag{38}$$

where the inequality follows from (36). Since $\beta_i^m$ lies between $\beta_i$ and $\beta_i^*$,

$$\sup_i\|\beta_i^m - \beta_i^0\| \le a\sup_i\|\beta_i - \beta_i^0\| + (1-a)\sup_i\|\beta_i^* - \beta_i^0\| \le a\phi_n + (1-a)\phi_n = \phi_n. \tag{39}$$

Hence, for $k \ne k'$, $i \in G_k$, $j \in G_{k'}$,

$$\|\beta_i^m - \beta_j^m\| \ge \min_{i\in G_k,\,j\in G_{k'}}\|\beta_i^0 - \beta_j^0\| - 2\max_i\|\beta_i^m - \beta_i^0\| \ge b_n - 2\phi_n > a\lambda,$$

and thus $\rho'(\|\beta_i^m - \beta_j^m\|) = 0$. Therefore,

$$\Gamma_2 = \lambda\sum_{k=1}^K\sum_{\{i,j\in G_k,\,i<j\}}\rho'(\|\beta_i^m-\beta_j^m\|)\|\beta_i^m-\beta_j^m\|^{-1}(\beta_i^m-\beta_j^m)^T(\beta_i-\beta_j) = \lambda\sum_{k=1}^K\sum_{\{i,j\in G_k,\,i<j\}}\rho'(\|\beta_i^m-\beta_j^m\|)\,\|\beta_i-\beta_j\|,$$

where the last step follows from $\beta_i^m - \beta_j^m = a(\beta_i - \beta_j)$. Furthermore, by the same reasoning as (36), we have

$$\sup_i\|\beta_i^* - \hat\beta_i^{or}\| = \sup_k\|\alpha_k - \hat\alpha_k^{or}\| \le \sup_i\|\beta_i - \hat\beta_i^{or}\|.$$

Then

$$\sup_{i,j}\|\beta_i^m - \beta_j^m\| \le 2\sup_i\|\beta_i^m - \beta_i^*\| \le 2\sup_i\|\beta_i - \beta_i^*\| \le 2\left(\sup_i\|\beta_i - \hat\beta_i^{or}\| + \sup_i\|\beta_i^* - \hat\beta_i^{or}\|\right) \le 4\sup_i\|\beta_i - \hat\beta_i^{or}\| \le 4t_n.$$

Hence $\rho'(\|\beta_i^m - \beta_j^m\|) \ge \rho'(4t_n)$ by the concavity of $\rho(\cdot)$. As a result,

$$\Gamma_2 \ge \sum_{k=1}^K\sum_{\{i,j\in G_k,\,i<j\}}\lambda\rho'(4t_n)\|\beta_i - \beta_j\|. \tag{40}$$

Let

$$Q = (Q_1^T, \ldots, Q_n^T)^T = -\left[(y - Z\eta - X\beta^m)^TX\right]^T.$$

Then

$$\Gamma_1 = Q^T(\beta - \beta^*) = \sum_{k=1}^K\sum_{\{i,j\in G_k\}}\frac{Q_i^T(\beta_i-\beta_j)}{|G_k|} = \sum_{k=1}^K\sum_{\{i,j\in G_k\}}\frac{Q_i^T(\beta_i-\beta_j)}{2|G_k|} + \sum_{k=1}^K\sum_{\{i,j\in G_k\}}\frac{Q_j^T(\beta_j-\beta_i)}{2|G_k|} = \sum_{k=1}^K\sum_{\{i,j\in G_k\}}\frac{(Q_j-Q_i)^T(\beta_j-\beta_i)}{2|G_k|} = \sum_{k=1}^K\sum_{\{i,j\in G_k,\,i<j\}}\frac{(Q_j-Q_i)^T(\beta_j-\beta_i)}{|G_k|}. \tag{41}$$

Moreover,

$$Q_i = -(y_i - z_i^T\eta - x_i^T\beta_i^m)x_i = -\left\{\varepsilon_i + z_i^T(\eta^0 - \eta) + x_i^T(\beta_i^0 - \beta_i^m)\right\}x_i,$$

and then

$$\sup_i\|Q_i\| \le \sup_i\left\{\|x_i\|\left(\|\varepsilon\|_\infty + \|z_i\|\|\eta^0 - \eta\| + \|x_i\|\|\beta_i^0 - \beta_i^m\|\right)\right\}.$$

By Condition (C1) that $\sup_i\|x_i\| \le C_2\sqrt{p}$ and $\sup_i\|z_i\| \le C_3\sqrt{q}$, and by (39) that $\sup_i\|\beta_i^0 - \beta_i^m\| \le \phi_n$ together with $\|\eta^0 - \eta\| \le \phi_n$, we have

$$\sup_i\|Q_i\| \le C_2\sqrt{p}\left(\|\varepsilon\|_\infty + C_3\sqrt{q}\,\phi_n + C_2\sqrt{p}\,\phi_n\right).$$

By Condition (C3),

$$P\left(\|\varepsilon\|_\infty > \sqrt{2c_1^{-1}\log n}\right) \le \sum_{i=1}^n P\left(|\varepsilon_i| > \sqrt{2c_1^{-1}\log n}\right) \le 2n^{-1}.$$

Thus there is an event $E_2$ such that $P(E_2^C) \le 2n^{-1}$, and on the event $E_2$,

$$\sup_i\|Q_i\| \le C_2\sqrt{p}\left(\sqrt{2c_1^{-1}\log n} + C_3\sqrt{q}\,\phi_n + C_2\sqrt{p}\,\phi_n\right).$$

Then

$$\left|\frac{(Q_j - Q_i)^T(\beta_j - \beta_i)}{|G_k|}\right| \le |G_{\min}|^{-1}\|Q_j - Q_i\|\,\|\beta_i - \beta_j\| \le |G_{\min}|^{-1}\,2\sup_i\|Q_i\|\,\|\beta_i - \beta_j\| \le 2C_2|G_{\min}|^{-1}\sqrt{p}\left(\sqrt{2c_1^{-1}\log n} + C_3\sqrt{q}\,\phi_n + C_2\sqrt{p}\,\phi_n\right)\|\beta_i - \beta_j\|. \tag{42}$$

Therefore, by (40), (41) and (42), we have

$$Q_n(\eta,\beta) - Q_n(\eta,\beta^*) \ge \sum_{k=1}^K\sum_{\{i,j\in G_k,\,i<j\}}\left\{\lambda\rho'(4t_n) - 2C_2|G_{\min}|^{-1}\sqrt{p}\left(\sqrt{2c_1^{-1}\log n} + C_3\sqrt{q}\,\phi_n + C_2\sqrt{p}\,\phi_n\right)\right\}\|\beta_i - \beta_j\|.$$

Let $t_n = o(1)$; then $\rho'(4t_n) \to 1$. Since $\lambda \gg \phi_n$, $p = o(n)$, and $|G_{\min}|^{-1}\sqrt{p} = o(1)$, it follows that $\lambda \gg |G_{\min}|^{-1}\sqrt{p\log n}$, $\lambda \gg |G_{\min}|^{-1}\sqrt{pq}\,\phi_n$ and $\lambda \gg |G_{\min}|^{-1}p\,\phi_n$. Therefore, $Q_n(\eta,\beta) - Q_n(\eta,\beta^*) \ge 0$ for sufficiently large $n$, so the result in (ii) is proved.
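The two-step argument can also be probed numerically. The sketch below (an illustrative sanity check under an assumed MCP penalty and a toy design with well-separated groups, so that $b_n > a\lambda$; it is not part of the proof) evaluates $Q_n$ at the oracle fit and at random perturbations of size $t_n$, where the penalty term should dominate any first-order decrease of the least squares term.

```python
import numpy as np

def mcp(t, lam, gamma=3.0):
    # MCP penalty; constant (zero slope) once t >= gamma * lam
    return np.where(t <= gamma * lam, lam * t - t**2 / (2 * gamma),
                    0.5 * gamma * lam**2)

def Qn(eta, beta, y, Z, X, lam):
    """Objective (6): 0.5*||y - Z eta - X beta||^2 + pairwise penalties."""
    resid = y - Z @ eta - np.einsum('ij,ij->i', X, beta)
    n = len(y)
    pen = sum(mcp(np.linalg.norm(beta[i] - beta[j]), lam)
              for i in range(n) for j in range(i + 1, n))
    return 0.5 * resid @ resid + pen

rng = np.random.default_rng(2)
n, p, q, lam = 100, 1, 1, 0.5
Z, X = rng.normal(size=(n, q)), rng.normal(size=(n, p))
groups = np.repeat([0, 1], n // 2)
beta0 = np.where(groups[:, None] == 0, 2.0, -2.0)    # separation b_n = 4 > a*lam
y = Z @ np.ones(q) + np.einsum('ij,ij->i', X, beta0) + 0.1 * rng.normal(size=n)
# oracle least squares with known grouping, as in Theorem 1
U = np.hstack([Z, X * (groups == 0)[:, None], X * (groups == 1)[:, None]])
theta, *_ = np.linalg.lstsq(U, y, rcond=None)
eta_or = theta[:q]
beta_or = np.where(groups[:, None] == 0, theta[q], theta[q + 1])
q0, t_n = Qn(eta_or, beta_or, y, Z, X, lam), 0.01
drops = sum(Qn(eta_or, beta_or + t_n * rng.normal(size=beta_or.shape),
               y, Z, X, lam) < q0 for _ in range(100))
print(f"Qn fell below its oracle value in {drops} of 100 perturbations")  # expect 0
```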

A.5 Proof of Theorem 3

In this section we prove the results in Theorem 3. The proofs of (13) and (14) follow the same arguments as the proof of Theorem 1 by letting $\tilde{X} = X$ and $|G_{\min}| = n$, and are thus omitted. Next, we show (15), following a procedure similar to the proof of Theorem 2; the details are given below. Define $\mathcal{M} = \{\beta \in \mathbb{R}^{np}: \beta_1 = \cdots = \beta_n\}$. For each $\beta \in \mathcal{M}$, we have $\beta_i = \alpha$ for all $i$ and some $\alpha \in \mathbb{R}^p$. Let $T: \mathcal{M} \to \mathbb{R}^p$ be the mapping such that $T(\beta)$ is the $p \times 1$ vector equal to the common vector $\alpha$. Let $T^*: \mathbb{R}^{np} \to \mathbb{R}^p$ be the mapping such that $T^*(\beta) = n^{-1}\sum_{i=1}^n\beta_i$. Clearly, when $\beta \in \mathcal{M}$, $T(\beta) = T^*(\beta)$. Consider the neighborhood of $(\eta^0, \beta^0)$:

$$\Theta = \left\{\eta \in \mathbb{R}^q,\ \beta \in \mathbb{R}^{np}: \|\eta - \eta^0\| \le \phi_n,\ \sup_i\|\beta_i - \beta_i^0\| \le \phi_n\right\},$$

where $\phi_n = c_1^{-1/2}C_1^{-1}\sqrt{q+p}\sqrt{n^{-1}\log n}$. By the result in (13), there exists an event $E_1$ such that on the event $E_1$,

$$\|\hat\eta^{or} - \eta^0\| \le \phi_n, \qquad \sup_i\|\hat\beta_i^{or} - \beta_i^0\| \le \phi_n,$$

and $P(E_1^C) \le 2(q+p)n^{-1}$. Hence $(\hat\eta^{or}, \hat\beta^{or}) \in \Theta$ on the event $E_1$. For any $\beta \in \mathbb{R}^{np}$, let $\beta^* = T^{-1}(T^*(\beta))$. We show that $(\hat\eta^{or}, \hat\beta^{or})$ is a strict local minimizer of the objective function (6) with probability approaching 1 through the following two steps.

(i). On the event $E_1$, $Q_n(\eta, \beta^*) > Q_n(\hat\eta^{or}, \hat\beta^{or})$ for any $(\eta^T, \beta^T)^T \in \Theta$ with $((\eta)^T, (\beta^*)^T)^T \ne ((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$.

(ii). There is an event $E_2$ such that $P(E_2^C) \le 2n^{-1}$. On $E_1 \cap E_2$, there is a neighborhood of $((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$, denoted by $\Theta_n$, such that $Q_n(\eta,\beta) \ge Q_n(\eta, \beta^*)$ for any $((\eta)^T, (\beta)^T)^T \in \Theta_n \cap \Theta$ for sufficiently large $n$.

Therefore, by the results in (i) and (ii), we have $Q_n(\eta,\beta) > Q_n(\hat\eta^{or}, \hat\beta^{or})$ for any $((\eta)^T, (\beta)^T)^T \in \Theta_n \cap \Theta$ with $((\eta)^T, (\beta)^T)^T \ne ((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$ on $E_1 \cap E_2$, so that $((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$ is a strict local minimizer of $Q_n(\eta,\beta)$ in (6) on the event $E_1 \cap E_2$, where $P(E_1 \cap E_2) \ge 1 - 2(q+p+1)n^{-1}$ for sufficiently large $n$.

We first prove (i). By the definition of $((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$, we have $\frac{1}{2}\|y - Z\eta - X\beta^*\|^2 > \frac{1}{2}\|y - Z\hat\eta^{or} - X\hat\beta^{or}\|^2$ for any $((\eta)^T, (\beta^*)^T)^T \in \Theta$ with $((\eta)^T, (\beta^*)^T)^T \ne ((\hat\eta^{or})^T, (\hat\beta^{or})^T)^T$. Moreover, since $p_\gamma(\|\hat\beta_i^{or} - \hat\beta_j^{or}\|, \lambda) = p_\gamma(\|\beta_i^* - \beta_j^*\|, \lambda) = 0$ for $1 \le i, j \le n$, we have $Q_n(\eta, \beta^*) = \frac{1}{2}\|y - Z\eta - X\beta^*\|^2$ and $Q_n(\hat\eta^{or}, \hat\beta^{or}) = \frac{1}{2}\|y - Z\hat\eta^{or} - X\hat\beta^{or}\|^2$. Therefore, $Q_n(\eta, \beta^*) > Q_n(\hat\eta^{or}, \hat\beta^{or})$.

Next we prove the result in (ii). For a positive sequence $t_n$, let $\Theta_n = \{\beta: \sup_i\|\beta_i - \hat\beta_i^{or}\| \le t_n\}$. For $(\eta^T, \beta^T)^T \in \Theta_n \cap \Theta$, a Taylor expansion gives

$$Q_n(\eta,\beta) - Q_n(\eta,\beta^*) = \Gamma_1 + \Gamma_2,$$

where

$$\Gamma_1 = -(y - Z\eta - X\beta^m)^T X(\beta - \beta^*), \qquad \Gamma_2 = \sum_{i=1}^n \frac{\partial P_n(\beta^m)}{\partial\beta_i^T}(\beta_i - \beta_i^*),$$
$P_n(\beta) = \lambda\sum_{i<j}\rho(\|\beta_i - \beta_j\|)$, and $\beta^m = a\beta + (1-a)\beta^*$ for some constant $a \in (0, 1)$. Moreover, by (37),
$$\Gamma_2 = \lambda\sum_{\{j>i\}}\rho'(\|\beta_i^m-\beta_j^m\|)\|\beta_i^m-\beta_j^m\|^{-1}(\beta_i^m-\beta_j^m)^T\{(\beta_i-\beta_i^*) - (\beta_j-\beta_j^*)\} = \lambda\sum_{\{j>i\}}\rho'(\|\beta_i^m-\beta_j^m\|)\,\|\beta_i-\beta_j\|,$$

where the second equality holds because $\beta_i^* = \beta_j^*$ and $\beta_i^m - \beta_j^m = a(\beta_i - \beta_j)$. Let $T^*(\beta) = \alpha$. Then, following the same argument as (38), we have

$$\sup_i\|\beta_i^* - \beta_i^0\|^2 = \|\alpha - \alpha^0\|^2 \le \sup_i\|\beta_i - \beta_i^0\|^2.$$

Then

$$\sup_{i,j}\|\beta_i^m - \beta_j^m\| \le 2\sup_i\|\beta_i^m - \beta_i^*\| \le 2\sup_i\|\beta_i - \beta_i^*\| \le 2\left(\sup_i\|\beta_i - \hat\beta_i^{or}\| + \sup_i\|\beta_i^* - \hat\beta_i^{or}\|\right) \le 4\sup_i\|\beta_i - \hat\beta_i^{or}\| \le 4t_n.$$

Hence $\rho'(\|\beta_i^m - \beta_j^m\|) \ge \rho'(4t_n)$ by the concavity of $\rho(\cdot)$. As a result,

$$\Gamma_2 \ge \sum_{\{i<j\}}\lambda\rho'(4t_n)\|\beta_i - \beta_j\|. \tag{43}$$

Let

$$Q = (Q_1^T, \ldots, Q_n^T)^T = -\left[(y - Z\eta - X\beta^m)^TX\right]^T.$$

By the same reasoning as the proof for (41), we have

$$\Gamma_1 = Q^T(\beta - \beta^*) = n^{-1}\sum_{\{i<j\}}(Q_j - Q_i)^T(\beta_j - \beta_i). \tag{44}$$

By the same argument as the proof for (42), there is an event $E_2$ such that $P(E_2^C) \le 2n^{-1}$, and on the event $E_2$,

$$n^{-1}\left|(Q_j - Q_i)^T(\beta_j - \beta_i)\right| \le 2C_2 n^{-1}\sqrt{p}\left(\sqrt{2c_1^{-1}\log n} + C_3\sqrt{q}\,\phi_n + C_2\sqrt{p}\,\phi_n\right)\|\beta_i - \beta_j\|. \tag{45}$$

Therefore, by (43), (44) and (45), we have

$$Q_n(\eta,\beta) - Q_n(\eta,\beta^*) \ge \sum_{\{i<j\}}\left\{\lambda\rho'(4t_n) - 2C_2 n^{-1}\sqrt{p}\left(\sqrt{2c_1^{-1}\log n} + C_3\sqrt{q}\,\phi_n + C_2\sqrt{p}\,\phi_n\right)\right\}\|\beta_i - \beta_j\|.$$

Let $t_n = o(1)$; then $\rho'(4t_n) \to 1$. Since $\lambda \gg \phi_n$, $p = o(n)$, and $n^{-1}\sqrt{p} = o(1)$, it follows that $\lambda \gg n^{-1}\sqrt{p\log n}$, $\lambda \gg n^{-1}\sqrt{pq}\,\phi_n$ and $\lambda \gg n^{-1}p\,\phi_n$. Therefore, $Q_n(\eta,\beta) - Q_n(\eta,\beta^*) \ge 0$ for sufficiently large $n$, so the result in (ii) is proved.

References

[1] Kravitz RL, Duan N, Braslow J. Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages. Milbank Q. 2004;82:661–87. doi:10.1111/j.0887-378X.2004.00327.x

[2] Sorensen T. Which patients may be harmed by good treatments? Lancet. 1996;348:351–2. doi:10.1016/S0140-6736(05)64988-4

[3] Simon R. Clinical trials for predictive medicine: new challenges and paradigms. Clin Trials. 2010;7:516–24. doi:10.1177/1740774510366454

[4] Zhang Z, Nie L, Soon G, Liu A. The use of covariates and random effects in evaluating predictive biomarkers under a potential outcome framework. Ann Appl Stat. 2014;8:2336–55. doi:10.1214/14-AOAS773

[5] Gail M, Simon R. Testing for qualitative interactions between treatment effects and patient subsets. Biometrics. 1985;41:361–72. doi:10.2307/2530862

[6] Lagakos SW. The challenge of subgroup analyses: reporting without distorting. N Engl J Med. 2006;354:1667–9. doi:10.1056/NEJMp068070

[7] Rothwell PM. Subgroup analysis in randomized clinical trials: importance, indications and interpretation. Lancet. 2005;365:176–86. doi:10.1016/S0140-6736(05)17709-5

[8] Russek-Cohen E, Simon R. Evaluating treatments when a gender by treatment interaction may exist. Stat Med. 1997;16:455–64. doi:10.1002/(SICI)1097-0258(19970228)16:4<455::AID-SIM382>3.0.CO;2-Y

[9] Cai T, Tian L, Wong PH, Wei LJ. Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics. 2011;12:270–82. doi:10.1093/biostatistics/kxq060

[10] Tian L, Alizadeh AA, Gentles AJ, Tibshirani R. A simple method for estimating interactions between a treatment and a large number of covariates. J Am Stat Assoc. 2014;109:1517–32. doi:10.1080/01621459.2014.951443

[11] Zhang Z, Qu Y, Zhang B, Nie L, Soon G. Use of auxiliary covariates in estimating a biomarker-adjusted treatment effect model with clinical trial data. Stat Methods Med Res. 2017;25:2103–19. doi:10.1177/0962280213515572

[12] Imai K, Ratkovic M. Estimating treatment effect heterogeneity in randomized program evaluation. Ann Appl Stat. 2013;7:443–70. doi:10.1214/12-AOAS593

[13] Su S, Meneses K, McNees P, Johnson WO. Interaction trees: exploring the differential effects of an intervention programme for breast cancer survivors. J R Stat Soc Ser C. 2011;60:457–74. doi:10.1111/j.1467-9876.2010.00754.x

[14] Su X, Pena AT, Liu L, Levine RA. Random forests of interaction trees for estimating individualized treatment effects in randomized trials. Stat Med. 2018;37:2547–60. doi:10.1002/sim.7660

[15] Zhang Z, Wang C, Nie L, Soon G. Assessing the heterogeneity of treatment effects via potential outcomes of individual patients. J R Stat Soc Ser C. 2013;62:687–704. doi:10.1111/rssc.12012

[16] Shen J, He X. Inference for subgroup analysis with a structured logistic-normal mixture model. J Am Stat Assoc. 2015;110:303–12. doi:10.1080/01621459.2014.894763

[17] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348–60. doi:10.1198/016214501753382273

[18] Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38:894–942. doi:10.1214/09-AOS729

[19] Chi EC, Lange K. Splitting methods for convex clustering. J Comput Graph Stat. 2015;24:994–1013. doi:10.1080/10618600.2014.948181

[20] Ma S, Huang J. A concave pairwise fusion approach to subgroup analysis. J Am Stat Assoc. 2016 (accepted for publication).

[21] Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn. 2011;3:1–122. doi:10.1561/2200000016

[22] Rubin D. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701. doi:10.1037/h0037350

[23] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55. doi:10.1093/biomet/70.1.41

[24] Wang R, Lagakos S, Ware J, Hunter D, Drazen J. Reporting of subgroup analyses in clinical trials. N Engl J Med. 2007;357:2189–94. doi:10.1056/NEJMsr077003

[25] Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58:267–88. doi:10.1111/j.2517-6161.1996.tb02080.x

[26] Härdle W, Mammen E. Comparing nonparametric versus parametric regression fits. Ann Stat. 1993;21:1926–47. doi:10.1214/aos/1176349403

[27] Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–68. doi:10.1093/biomet/asm053

[28] Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–4. doi:10.1214/aos/1176344136

[29] Lee ER, Noh H, Park BU. Model selection via Bayesian information criterion for quantile regression models. J Am Stat Assoc. 2014;109:216–29. doi:10.1080/01621459.2013.836975

[30] Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–50. doi:10.1080/01621459.1971.10482356

[31] Tsiatis AA, Davidian M, Zhang M, Lu X. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Stat Med. 2008;27:4658–77. doi:10.1002/sim.3113

[32] Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Ann Appl Stat. 2007;1:302–32. doi:10.1214/07-AOAS131

[33] Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl. 2001;109:475–94. doi:10.1023/A:1017501703105

[34] Ma S, Racine JS, Yang L. Spline regression in the presence of categorical predictors. J Appl Econom. 2015;30:705–17. doi:10.1002/jae.2410

[35] Ma S, Song Q, Wang L. Simultaneous variable selection and estimation in semiparametric modeling of longitudinal/clustered data. Bernoulli. 2013;19:252–74. doi:10.3150/11-BEJ386

Received: 2018-03-01
Revised: 2019-04-29
Accepted: 2019-07-26
Published Online: 2019-09-20

© 2020 Walter de Gruyter GmbH, Berlin/Boston
