{smcl}
{* *! version 2.1 Dec 16th, 2014 J. N. Luchman}{...}
{cmd:help chaid}
{hline}{...}
{title:Title}
{pstd}
Chi-square automated interaction detection (CHAID){p_end}
{title:Syntax}
{phang}
{cmd:chaid} {depvar} {ifin} {weight} [{cmd:,} {opt {ul on}minn{ul off}ode(integer)} {opt {ul on}mins{ul off}plit(integer)}
{opt {ul on}uno{ul off}rdered(varlist)} {opt {ul on}ord{ul off}ered(varlist)} {opt {ul on}noi{ul off}sily}
{opt {ul on}mis{ul off}sing} {opt {ul on}merga{ul off}lpha(pvalue)} {opt {ul on}respa{ul off}lpha(pvalue)}
{opt {ul on}splta{ul off}lpha(pvalue)} {opt {ul on}maxb{ul off}ranch(integer)} {opt {ul on}dvo{ul off}rdered}
{opt noadj} {opt nodisp} {opt {ul on}pred{ul off}icted} {opt {ul on}imp{ul off}ortance} {opt {ul on}xt{ul off}ile(varlist, xtile_opt)}
{opt {ul on}p{ul off}ermute} {opt svy} {opt {ul on}ex{ul off}haust}]
{phang}
{cmd:fweights} are allowed; see {help weight}{p_end}
{title:Description}
{pstd}
{cmd:chaid} is a recursive partitioning algorithm that searches for an optimal decision tree structure based on
the correspondence between the dependent/response variable and a set of independent/splitting variables. {cmd:chaid} is a part-prediction,
part-clustering estimation command that seeks to reduce uncertainty about (i.e., predict) the values of a response variable
while simultaneously partitioning the dataset into clusters of observations based on the set of splitting variables.
{pstd}
The {cmd:chaid} algorithm cycles through 2 (or, optionally, 3) processes recursively. First, {cmd:chaid} seeks to reduce overfitting the data
by optimally merging together categories/levels of a splitting variable {it:i}. Without the optimal merging step, the splits {cmd:chaid} uncovers
tend to be biased toward predictors with more categories. In the merging step, {cmd:chaid} chooses a splitting variable {it:i} of the {it:p} total
splitting variables and deciphers the total number of possible combinations {it:k} of 2 levels {it:j} of splitting variable {it:i} that are
"mergable" according to the splitting variable type (for unordered splitting variables, all combinations of two levles are possible; for ordered
splitting variables, only adjacent level value combinations are possible). {cmd:chaid} then takes combination {it:j} and conducts a statistical
association test (i.e., chi-square) using combination {it:j} (i.e., omitting all other combinations, using combination {it:j} as a dummy
code/indicator) on the response variable to determine the probability (i.e., p-value) of the association between the response variable and the
categories separated by combination {it:j} being 0. The process of obtaining p-values from a chi-square test continues for all {it:k} combinations
of splitting variable {it:i}. After all {it:k} combinations have obtained a p-value, {cmd:chaid} finds the combination {it:j} that has the highest
p-value/weakest association with the response variable. If the magnitude of the p-value is higher than a user-defined threshold, the categories in
combination {it:j} are merged into a single category. To add a little more detail, high p-values in the merging step indicate a high probability that the
categories separated by combination {it:j} are independent of the response variable, would not likely produce a meaningful split, and can
effectively be merged together. After merging the categories with the highest p-value, {cmd:chaid} then begins again, deciphering, again for
splitting variable {it:i}, the number of valid combinations {it:k} with the formerly separate categories now treated as a single category.
The merging process continues for splitting variable {it:i} until the combination {it:j} with the highest p-value does not pass the user-defined
threshold for merging or there are only 2 levels remaining in the optimally merged splitting variable. When the p-value for combination {it:j} does
not pass the threshold for merging, {cmd:chaid} remembers the optimal merging of levels associated with splitting variable {it:i} and moves to
the next splitting variable. The optimal merging stage continues until all {it:p} splitting variables have an optimally merged structure.
{cmd:chaid} then moves to step 2.
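{pstd}
As a rough illustration of a single merge test (a conceptual sketch only, not {cmd:chaid}'s internal code; the variables come from the
auto dataset used in example #1 below, and the pair of levels tested is chosen arbitrarily), the p-value for one candidate pair of levels of a
splitting variable could be obtained along the following lines:
{phang}{cmd:webuse auto, clear}{p_end}
{phang}{cmd:quietly mlogit foreign i.rep78 if inlist(rep78, 3, 4)}{p_end}
{phang}{cmd:display "merge p-value = " chi2tail(e(df_m), e(chi2))}{p_end}
{pstd}
A large p-value suggests that, with respect to {cmd:foreign}, levels 3 and 4 of {cmd:rep78} are statistically indistinguishable and would be
candidates for merging.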
{pstd}
Following the merging step, {cmd:chaid} moves to the second, splitting step. {cmd:chaid} uses the optimally merged structure to decide which variable,
if any, should be used to split the data into clusters/nodes. For all {it:p} optimally merged splitting variables, {cmd:chaid} conducts a chi-square
association test of that splitting variable {it:i} on the response variable and records the p-value. Owing to the multiple comparisons, the
p-value in the splitting step is Bonferroni adjusted according to splitting variable type. After obtaining all {it:p} adjusted p-values,
{cmd:chaid} finds the smallest and, if the p-value threshold and other splitting criteria are met, the data are split into clusters/nodes based on
the splitting variable with the smallest adjusted p-value. Besides the p-value threshold, {cmd:chaid} imposes several other criteria on splits.
In particular, the clusters/nodes must be of a certain size. For example, if the minimum cluster/node size is 100 and {cmd:chaid} finds a split
in the data that would produce 3 clusters of size 175, 120, and 80, the split will be prevented because one of the resulting clusters would be smaller than the
minimum. When {cmd:chaid} finds a split that would produce a cluster smaller than the minimum size, all the observations in the data associated with that potential
split are removed from consideration for any remaining splits and the algorithm moves to other observations that are still "splittable". If
{cmd:chaid} produces a split in the data, or decides to move along to other splittable observations, it begins again at Step 1 and forgets the
optimal merging it produced in the previous step. Moreover, {cmd:chaid} now searches {bf:within} each cluster for another optimal merger set and
potential split. Thus, when a split occurs, all subsequent merging and splitting are contingent on previous splits and the dataset is "partitioned"
conditional on the previous splits.
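{pstd}
As a sketch of the splitting step (again purely conceptual: the variables {cmd:m_rep78} and {cmd:m_xtlength} below are hypothetical stand-ins
for optimally merged recodes from Step 1, and the Bonferroni adjustment by splitting-variable type that {cmd:chaid} applies is omitted), the
p-values being compared could be computed as:
{phang}{cmd:webuse auto, clear}{p_end}
{phang}{cmd:recode rep78 (1 2 3 = 1) (4 5 = 2), generate(m_rep78)}{p_end}
{phang}{cmd:xtile m_xtlength = length, nquantiles(3)}{p_end}
{phang}{cmd:quietly mlogit foreign i.m_rep78}{p_end}
{phang}{cmd:display "m_rep78 p-value = " chi2tail(e(df_m), e(chi2))}{p_end}
{phang}{cmd:quietly mlogit foreign i.m_xtlength}{p_end}
{phang}{cmd:display "m_xtlength p-value = " chi2tail(e(df_m), e(chi2))}{p_end}
{pstd}
The splitting variable with the smallest (adjusted) p-value would be used for the split, provided it also passes the {opt spltalpha()} and
cluster-size criteria.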
{pstd}
The {cmd:chaid} algorithm stops when a] the minimum cluster/node size is not met for potential splits across all clusters, b] the minimum cluster/node
size is not met for splitting across all clusters (i.e., when {cmd:chaid} moves from Step 2 back to Step 1, the size of the cluster is below the
minimum size allowed to attempt a split), c] the cluster/node is pure (i.e., composed of a single value of the response variable), d] the maximum number
of branches/contingencies is met, or e] a cluster has a single value for each splitting variable (i.e., no splits are possible).
When at least one of these criteria is met for every cluster/node in the data, {cmd:chaid} stops and reports results.
{pstd}
Traditionally, a third step is attempted during Step 1 above, in which {cmd:chaid} tries to "re-split" a merged combination of the levels of a splitting
variable. Specifically, after a merger between combination {it:j} occurs, if combination {it:j} contains 3 or more levels of
splitting variable {it:i}, {cmd:chaid} attempts a "Step 2-like" split in which {cmd:chaid} searches through all potential binary splits (i.e.,
splits into 2 categories) of the 3 or more optimally merged categories. If the user-defined threshold is met, the split is carried out and the
newly split set of levels is substituted into the merging levels set in Step 1. The "re-split" step is implemented in {cmd:chaid} as an option
(when {opt respalpha()} receives a positive probability value). The re-splitting step is necessary to reach near-optimal results, but can greatly
increase run time and, as Kass (1980) notes, "In practice a merger is rarely split,..." (p. 121). Hence, the user can choose to invoke or ignore
this step.
{pstd}
The current implementation of {cmd:chaid} differs from the traditional use of contingency tables in that it uses logistic models to
estimate chi-square values and, as such, may require somewhat larger sample sizes than do other implementations of {cmd:chaid} for the
{cmd:ml} algorithm to converge. The default estimation method for {cmd:chaid} is {help mlogit}. The use of Stata's {cmd:ml} based commands
greatly increases the flexibility of the kinds of data {cmd:chaid} can accommodate. {cmd:chaid} also uses {help levelsof} to separate out levels
of the splitting and response variables. Consequently, all splitting variables and the response variable must be non-negative integers and, for the
current implementation, must have fewer than 21 levels or unique values.
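{pstd}
For example (a hedged sketch using a hypothetical variable {cmd:myvar}; the generated variable name is arbitrary), a splitting variable that is
not coded as non-negative integers can be recoded with {help egen}'s {cmd:group()} function and its number of levels verified before being
passed to {cmd:chaid}:
{phang}{cmd:egen int myvar2 = group(myvar)}{p_end}
{phang}{cmd:quietly tabulate myvar2}{p_end}
{phang}{cmd:assert r(r) <= 20}{p_end}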
{pstd}
{cmd:chaid} generates, by default, a variable named {cmd:_CHAID} after finishing execution that indicates each observation's
membership in a cluster as defined by {cmd:chaid}. If you have a variable named {cmd:_CHAID} already in your dataset, {cmd:chaid}
will overwrite it with the new value of {cmd:_CHAID} based on the current cluster membership.
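{pstd}
For example, after a run such as example #1 below, cluster membership can be profiled against the response variable with standard Stata
commands (a usage sketch):
{phang}{cmd:tabulate _CHAID foreign, row}{p_end}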
{title:Display}
{pstd}
The decision tree structure is returned in two forms: 1) as a matrix that is read from the top down, and 2) as a graph that shows the hierarchical
structure of the {cmd:chaid} tree, including the split contingencies. For example, consider the results from example #1 below (i.e., the below
matrix is exactly what {cmd:chaid} reports):
{res}Chi-Square Automated Interaction Detection (CHAID) Tree Branching Results
{txt}{hline 80}
{col 20}1{col 34}2{col 48}3{col 62}4
{txt}{col 6}{c TLC}{hline 57}{c TRC}
{col 4}1{col 6}{c |}{col 11}{res}xtlength@1{col 25}xtlength@2{col 39}xtlength@2{col 53}xtlength@3{col 64}{txt}{c |}
{col 4}2{col 6}{c |}{res}{col 24}rep78@1 3 2{col 40}rep78@4 5{col 64}{txt}{c |}
{col 4}3{col 6}{c |}{res}{col 11}{res}Cluster #1{col 25}Cluster #2{col 39}Cluster #4{col 53}Cluster #3{col 64}{txt}{c |}
{txt}{col 6}{c BLC}{hline 57}{c BRC}
{pstd}
The results show that 4 clusters were uncovered from the data (the clusters are out of order due to the way in which the algorithm
searches for splits) - each column represents a distinct cluster uncovered by {cmd:chaid}. The first split is represented in the first
row. Each column in the first row has a variable name, an at sign (@), and some valid values of the variable. The variable name represents
the variable on which the first split occurred for the {cmd:chaid} algorithm - in this case the {help xtile}d length variable from the auto
dataset. The values following the at signs refer to the values of the variable chosen through the optimal merging steps described above.
In this case, all valid values were deemed by {cmd:chaid} to be different from one another and were not merged. Thus, split #1 was on the xtlength
variable into levels 1, 2, and 3. The dataset was then partitioned based on that split and the remaining splits are then conditional on the first
split. All successful {cmd:chaid} runs will have a populated first row.
{pstd}
Notice that Clusters #2 and #4 have repeated values of xtlength@2. This is because another split occurred within xtlength@2, but not for xtlength@1
or xtlength@3 - which is why each of those only has an entry in the first row. The split on xtlength@2 is represented in the second row for the rep78
variable, splitting into the optimally merged 1 3 2 and 4 5 groups. Thus, given that an observation was in xtlength category 2, one additional split was
possible, putting observations into rep78 1, 3, and 2 versus 4 and 5. The rows "lower" in the matrix are then conditional on the rows "higher"
in the matrix. Furthermore, the number of rows represents the number of "branches" the decision tree has. Finally, each column represents each
cluster's unique "path" leading from the first split in row 1 to the final populated row of the column/cluster in question.
{pstd}
The graph that is returned by {cmd:chaid} can be cluttered with value labels. If unreadable, consider using Stata's
{help graph_editor:graph editor} to alter the size, angle, and location of the labels. Currently, graph options are not available to alter the
appearance of the graph using Stata syntax directly.
{marker options}{...}
{title:Options}
{phang}{opt minnode()} specifies the minimum number of observations allowed in a terminal cluster or "node." If an
optimally merged splitting variable passes to the point of splitting, but one or more of the clusters that would be created
by the optimally merged splitting variable is below the {opt minnode()} value, the split will not be carried out. Thus,
{opt minnode()} prevents {cmd:chaid} from carrying out Step 2. The default value for {opt minnode()} is 100.
{phang}{opt minsplit()} specifies the minimum number of observations across all levels of an optimally merged splitting
variable to allow any split to occur, irrespective of how many observations would be in each of the final clusters. Hence,
prior to clustering, if the number of observations does not meet the {opt minsplit()} minimum, no clustering will occur.
{opt minsplit()} prevents {cmd:chaid} from executing Step 1. The default value for {opt minsplit()} is 200.
{phang}{opt unordered()} treats the variable list included as "unordered." The unordered treatment affects both how
categories of the splitting variables are combined (i.e., any categories can be combined), and how final significance of
the optimally merged splitting variable is computed (i.e., unordered Bonferroni adjustment). Unordered predictors are
the most time intensive splitting variables for which to discern an optimal merging, as all categories are potentially
allowed to be merged.
{phang}{opt ordered()} treats the variable list included as "ordered." The ordered treatment affects both how
categories of the splitting variables are combined (i.e., only adjacent categories can be combined), and how final
significance of the optimally merged splitting variable is computed (i.e., ordered Bonferroni adjustment). When {opt missing}
is invoked and missing data are present, ordered variables treat the missing category as a "floating" category
and, consequently, the p-value for splitting is adjusted based on the floating category as opposed to the usual ordered
adjustment.
{phang}{opt noisily} turns on the tracing of the command through split-related data management and estimation runs. {opt noisily}
can reassure the user that {cmd:chaid} is running or can show the user the entire process of determining the splits arrived at by
{cmd:chaid}.
{phang}{opt missing} allows missing data in the response and splitting variables to be treated as another category. The
missing option does not affect unordered splitting variables, but does result in all ordered independent variables with missing
data having a "floating" missing data category. Note that with the {opt dvordered} option, missing data on the response
variable are not treated as another category but, rather, are treated as missing and marked out of the sample. Missing data with
ordered response variables must be imputed through some other means to be included in {cmd:chaid}. The {cmd:missing} option
can also result in missing (i.e., ".") as a predicted value when combined with {cmd:predicted}.
{phang}{opt mergalpha()} sets the alpha level at which an allowable pair of categories in a splitting variable can be
merged in the merging step. {opt mergalpha()} refers to the probability values the chi-square would take on that would
{it:allow} a merge. The default value is .95. Thus, p-values that are among the middle 95% of the probability
density for the chi-square distribution are merged. In other words, predictors that pass the p > .05 threshold are
allowed to be merged.
{phang}{opt respalpha()} sets the alpha level at which an already optimally merged set of 3 or more of an independent variable's
original categories will be allowed to de-couple into a binary split. How the splitting variables are split depends, of course,
on the splitting variable's type and whether the {opt missing} option is invoked. {opt respalpha()} refers to the probability
values the chi-square would take on that would {it:allow} a split. The default setting for this option is to a negative value, which
shuts this option off. If the user chooses to invoke the {opt respalpha()} option, the value for this option must be smaller than
the value of 1 - {opt mergalpha()}.
{phang}{opt spltalpha()} sets the adjusted alpha level at which an optimally merged predictor will be split following the
merging step. As compared to {opt mergalpha()}, {opt spltalpha()} works in the same way, save that small p-values will
produce a split - just as small p-values are deemed to be statistically significant. The default value is .05. Thus,
predictors that pass the p < .05 threshold are allowed to be split.
{phang}{opt maxbranch()} sets the maximum number of "branches" the {cmd:chaid} tree is allowed to take on. In other words,
{opt maxbranch()} sets a maximum on the number of splits on which a particular terminal cluster can be based. The default
is for there to be {it:no} maximum on the number of splits. Any negative value will turn off
{opt maxbranch()}.
{phang}{opt dvordered} changes the estimation procedure underlying {cmd:chaid} to be {help ologit} instead of the default
{help mlogit}. Hence, {opt dvordered} treats the response variable as being ordered.
{phang}{opt noadj} prevents {cmd:chaid} from Bonferroni adjusting the {opt spltalpha()} p-values to prevent potential Type I/"false positive"
inferential error. {opt noadj} is useful in cases where the user desires to be more lenient in terms of uncovering
relationships in the data. {opt noadj} is turned on by default when using {opt permute} as the Bonferroni adjustment is
intended to approximate exact/permutation p-values (see Kass, 1980).
{phang}{opt nodisp} prevents {cmd:chaid} from displaying the decision tree structure and graph produced by the algorithm.
{phang}{opt predicted} produces a predicted value for the response variable for each cluster. The predicted value is the mode of the response
variable for that cluster. When {opt predicted} is invoked, {cmd:chaid} produces a variable called {cmd:CHAID_predict} containing the predicted
value by default. Any variable named {cmd:CHAID_predict} in the dataset, including those from previous runs of {cmd:chaid}, will be overwritten.
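{pmore}
For example (a sketch assuming {cmd:foreign} is the response variable, as in example #1 below), classification accuracy can be computed from
{cmd:CHAID_predict}:
{pmore}{cmd:generate byte correct = (CHAID_predict == foreign) if !missing(CHAID_predict, foreign)}{p_end}
{pmore}{cmd:summarize correct}{p_end}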
{phang}{opt importance} produces a permutation importance matrix for rank ordering the splitting variables. The permutation importance approach
derives from the idea that if one randomly permutes the values of any one splitting variable there should be a decrease in fit if that splitting
variable predicts the response variable. Larger decreases in fit when a splitting variable is permuted indicate that the splitting
variable is better at predicting the response variable than splitting variables with smaller decreases in fit. All fit scores are computed based on Cram{c e'}r's V
(see {help tabulate_twoway:tabulate twoway}), which obtains perfect fit when all the clusters have a single value (i.e., are pure) or a similar
metric when combined with {opt svy}. The decreases in fit owing to permutation are normed out of 1 and displayed in the "raw" row of
{cmd:e(importance)}. Because it is possible to obtain negative decrements to fit (that is, increases in fit) due to permuting the values of a
splitting variable, and due to the random nature of the permutation, the values in the "raw" row are not particularly meaningful and only the
second row, "rank", of {cmd:e(importance)} should be interpreted. Note that {opt importance} creates a Mata matrix called "ifstmnts" that it uses
to conduct the permutation importance computations that the user can access.
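{pmore}
The ranking can be inspected directly after estimation, for example:
{pmore}{cmd:matrix list e(importance)}{p_end}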
{phang}{opt xtile()} is a convenience command to take variables that are otherwise continuous, or ordered with many categories (i.e., counts),
and generate a set of ordered categorical quantiles from those variables. The variables created by {opt xtile()} are automatically considered
as {opt ordered()} and are added to the dataset with the name "xt{it:varname}". Any variables called "xt{it:varname}" already in the dataset will
be overwritten when using {opt xtile()}. The {opt xtile()} option allows the user to specify the number of quantiles created as an option
(i.e., {opt nquantiles(#)}) following a comma. Not specifying the number of quantiles will result in 2 quantiles (i.e., the default
for {cmd:xtile}).
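{pmore}
For instance, {cmd:xtile(length, nquantiles(3))} adds a variable named {cmd:xtlength} to the dataset and treats it as {opt ordered()}; the
categorization itself corresponds roughly to this stand-alone command (a sketch, not {cmd:chaid}'s internal code):
{pmore}{cmd:xtile xtlength = length, nquantiles(3)}{p_end}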
{phang}{opt permute} changes the way in which p-values are computed for the splitting and merging steps from the traditional large-sample approximation
to a Monte Carlo permutation test-based approach (see {help permute} for details). Permutation tests are more appropriate for smaller sample sizes
(i.e., conditions under which large-sample approximations will not hold) and tend to be more conservative than the large-sample approximation and,
thus, tend to produce fewer splits than the same model not using {opt permute}. {opt permute}, owing to its Monte Carlo-based approach, greatly
increases {cmd:chaid}'s run time. Permutation tests also do not require Bonferroni adjustment and thus imply {opt noadj}. {opt permute} does not
work with {opt svy} or {help weight}s.
{phang}{opt svy} incorporates {help svyset} complex survey design characteristics into the p-value computation. As with other {help svy} commands,
the data must be {help svyset} prior to running {cmd:chaid}. Option {opt svy} and {help weight}s cannot be used together. Additionally,
option {opt svy} and {opt permute} cannot be used together.
{phang}{opt exhaust} implements the exhaustive CHAID algorithm described by Biggs, de Ville, and Suen (1991). The only difference between exhaustive
and traditional CHAID is that exhaustive CHAID continues the merging step until only 2 categories/a binary split remain. Thus, exhaustive
CHAID ignores {opt mergalpha()} and will continue to merge categories until 2 optimally merged categories of the original set of categories of
variable {it:i} remain. Option {opt exhaust} also changes the way the Bonferroni adjustment is computed, as the way it merges categories differs
from the traditional method.
{title:Remarks}
{phang}Because there is a component of randomness to the {cmd:chaid} results, the user should {help set seed} prior to using {cmd:chaid}.
Additionally, {cmd:chaid} utilizes Mata and saves a string matrix there that the user can access post estimation. One Mata matrix, called
"CHAIDsplit", contains information used to create the {cmd:e(split#)} and {cmd:e(path#)} macros. In some instances, most especially with complicated
and large CHAID trees, the strings contained in some of the {cmd:e(path#)} macros will be truncated. Users knowledgeable of Mata can obtain such
information from the "CHAIDsplit" Mata matrix. To be specific, "CHAIDsplit" includes, in the first row, information regarding all the splits in
the data. The first column is a label, "splits", and each column thereafter corresponds to a split made in the data. The labels are represented as
"{it:varname value}" with no space between them. Thus, the {it:rep78} variable from the auto dataset would have 5 labels: rep781, rep782, rep783,
rep784, and rep785 - corresponding to all 5 levels of the rep78 variable. Merged categories are separated in the "splits" row by commas. Thus, any
labels not separated by a comma have been "optimally merged" by {cmd:chaid}. The remaining rows correspond to the "path" represented by each cluster.
Thus "path1" is the set of contingencies (or "and" logical statements that resulted in "Cluster #1", "path2" is the set of contingencies that
resulted in "Cluster #2", etc... using the labels as outlined above (i.e., with "{it:varname value}" format).
{phang}As with other computationally intense programs (e.g., {stata findit gllamm}), collapsing the data over identical observations and
using a {cmd:fweight} is a way to speed up estimation time for larger datasets. Given that {cmd:chaid} requires categorical data, using the
{help collapse} command is a particularly useful approach - but, again, will not work with {opt svy} or {opt permute}.
{phang}Due to numerical issues related to storage of very small p-values, {cmd:chaid} uses the Akaike information criterion (AIC; see {help estat})
to decide on splits when p-values are identical (i.e., when both are effectively 0 at ~10^-300).
{title:Introductory examples}
#1: Basic CHAID analysis with altered {opt minsplit()} and {opt minnode()}
{phang}{cmd:set seed 1234567}{p_end}
{phang}{cmd:webuse auto}{p_end}
{phang}{cmd:chaid foreign, unordered(rep78) minnode(4) minsplit(10) xtile(length, n(3))}{p_end}
#2: Basic CHAID analysis as in #1 with permutation tests
{phang}{cmd:chaid foreign, unordered(rep78) minnode(4) minsplit(10) xtile(length, n(3)) permute}{p_end}
#3: Larger-scale CHAID with ordered response variable and permutation importance
{phang}{cmd:webuse nhanes2f, clear}{p_end}
{phang}{cmd:chaid health, dvordered unordered(region race) ordered(houssiz sizplace diabetes sex smsa heartatk) importance}{p_end}
#4: Larger-scale CHAID with ordered response variable; collapsed using fweight
{phang}{cmd:preserve}{p_end}
{phang}{cmd:generate byte fwgt = 1}{p_end}
{phang}{cmd:collapse (sum) fwgt, by(health region race houssiz sizplace diabetes sex smsa heartatk)}{p_end}
{phang}{cmd:chaid health [fweight = fwgt], dvordered unordered(region race) ordered(houssiz sizplace diabetes sex smsa heartatk)}{p_end}
{phang}{cmd:restore}{p_end}
#5: Exhaustive CHAID with complex survey design
{phang}{cmd:svyset psuid [pweight=finalwgt], strata(stratid)}{p_end}
{phang}{cmd:chaid health, dvordered unordered(region race) ordered(houssiz sizplace diabetes sex smsa heartatk) svy exhaust}{p_end}
{title:Saved results}
{phang}{cmd:chaid} saves the following results to {cmd:e()}:
{synoptset 16 tabbed}{...}
{p2col 5 15 19 2: scalars}{p_end}
{synopt:{cmd:e(N)}}number of observations{p_end}
{synopt:{cmd:e(N_clusters)}}number of clusters created by {cmd:chaid} and returned in {cmd:_CHAID} variable{p_end}
{synopt:{cmd:e(fit)}}purity of the clusters (extent to which each cluster has only a single value of the response variable);
based on Cram{c e'}r's V{p_end}
{p2col 5 15 19 2: macros}{p_end}
{synopt:{cmd:e(cmdline)}}command as typed{p_end}
{synopt:{cmd:e(cmd)}}{cmd:chaid}{p_end}
{synopt:{cmd:e(title)}}title in estimation output{p_end}
{synopt:{cmd:e(path#)}}displays the levels of each split leading to cluster #. Each split is separated by semicolons.
The splitting variable is first followed by an "at" sign (i.e., @) followed by the levels of the splitting variable that
describes cluster # at that split{p_end}
{synopt:{cmd:e(split#)}}displays the #th split by {cmd:chaid}. Splitting variable is displayed first followed
by the optimal mergers of levels of the splitting variable. Each merged set of splitting variable levels is separated
by parentheses{p_end}
{synopt:{cmd:e(depvar)}}name of dependent/response variable{p_end}
{p2col 5 15 19 2: matrices}{p_end}
{synopt:{cmd:e(importance)}}permutation importance matrix{p_end}
{synopt:{cmd:e(sizes)}}sample size of each cluster{p_end}
{synopt:{cmd:e(branches)}}number of branches from the root node for each cluster{p_end}
{p2col 5 15 19 2: functions}{p_end}
{synopt:{cmd:e(sample)}}marks estimation sample{p_end}
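{pstd}
As with other estimation commands, these results can be listed and reused after estimation, for example:
{phang}{cmd:ereturn list}{p_end}
{phang}{cmd:matrix list e(sizes)}{p_end}
{phang}{cmd:display e(N_clusters)}{p_end}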
{title:References}
{p 4 8 2}Biggs, D., de Ville, B., and Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. {it:Journal of Applied Statistics, 18}, 49-62.{p_end}
{p 4 8 2}Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. {it:Applied Statistics, 29, 2}, 119-127.{p_end}
{title:Author}
{p 4}Joseph N. Luchman{p_end}
{p 4}Behavioral Statistics Lead{p_end}
{p 4}Fors Marsh Group LLC{p_end}
{p 4}Arlington, VA{p_end}
{p 4}jluchman@forsmarshgroup.com{p_end}
{title:Acknowledgements}
{phang}Thanks to Florian Zettelmeyer, Paul Bergmann, and Kimberly Wilson for bug reporting.
Thanks also to Jonathan Mendelson and Luke Viera for comments on the functionality of and suggestions about features for {cmd:chaid}.
{title:Also see}
{psee}
{manhelp mlogit R}, {manhelp ologit R}, {manhelp levelsof R}, {manhelp tabulate_twoway R:tabulate twoway}, {manhelp svy R}, {manhelp permute R},
{manhelp xtile R}.
{p_end}