CHI-SQUARE AUTOMATIC INTERACTION DETECTION (CHAID)

  1. A popular decision tree technique.
  2. Originally used the chi-square test of independence to choose splits.
  3. Partitions the data into mutually exclusive, exhaustive subsets that best describe the dependent categorical variable.
  4. It is an iterative procedure that examines the predictors (classification variables) and uses them in order of their statistical significance.

Splitting Rules

  1. The test used for splitting depends on the type of the dependent (response/target) variable.
  2. For a continuous dependent variable, the F-test is used.
  3. For a categorical dependent variable, the chi-square test of independence is used.
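As a rough illustration of the categorical case, the sketch below computes the chi-square test of independence for a small contingency table using only the Python standard library. The function names (chi2_sf, chi_square_independence) and the example counts are hypothetical, chosen for illustration; in practice a library routine such as scipy.stats.chi2_contingency would normally be used instead.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term = total = 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction sum.
    z = math.sqrt(x)
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = z, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(z / math.sqrt(2)) + 2 * density * total

def chi_square_independence(table):
    """Return (statistic, degrees of freedom, p-value) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df)

# Hypothetical counts: rows are predictor categories, columns are outcome classes.
observed = [[30, 10],
            [20, 40]]
stat, df, p_value = chi_square_independence(observed)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.6f}")
```

A small p-value here means the predictor and the outcome are unlikely to be independent, which is what makes the predictor a candidate for splitting.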

 

CHAID PROCEDURE

  • Step 1: Examine each predictor (independent) variable for the statistical significance of its association with the dependent variable.
  • Step 2: a) Use the F-test if the dependent variable is continuous.
  •                 b) Use the chi-square test if the dependent variable is categorical (qualitative).
  • Step 3: Determine the most significant predictor
  •                 (the predictor with the smallest p-value after Bonferroni correction).
  • Step 4: Divide the data by the levels of the most significant predictor. Each of the resulting groups is then examined individually.
  • Step 5: For each sub-group, determine the most significant variable among the remaining predictors and divide the data again.
  • Step 6: Repeat step 5 until a stopping criterion is reached.
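Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not a full CHAID implementation: the raw p-values are hypothetical and assumed to come from the tests in step 2, the predictor names and records are made up, and the Bonferroni correction is the simplest form (multiply each p-value by the number of tests). CHAID implementations typically derive the multiplier from the number of ways the categories of a predictor could have been merged.

```python
from collections import defaultdict

def bonferroni(p_values):
    """Simple Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}

def choose_split(p_values, alpha_split=0.05):
    """Return the predictor with the smallest adjusted p-value, or None if none is significant."""
    adjusted = bonferroni(p_values)
    best = min(adjusted, key=adjusted.get)
    return best if adjusted[best] <= alpha_split else None

def partition(rows, predictor):
    """Step 4: group the records by the levels of the chosen predictor."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[predictor]].append(row)
    return dict(groups)

# Hypothetical raw p-values, one per predictor (assumed output of step 2).
raw_p = {"region": 0.004, "gender": 0.030, "age_band": 0.200}

# Hypothetical records.
rows = [
    {"region": "north", "gender": "f", "age_band": "18-34"},
    {"region": "north", "gender": "m", "age_band": "35-54"},
    {"region": "south", "gender": "f", "age_band": "18-34"},
    {"region": "south", "gender": "m", "age_band": "55+"},
    {"region": "south", "gender": "f", "age_band": "35-54"},
]

adjusted = bonferroni(raw_p)
best = choose_split(raw_p)
groups = partition(rows, best) if best is not None else {}
print(best, {level: len(members) for level, members in groups.items()})
```

Steps 5 and 6 would apply the same selection to each group in turn, using only the predictors that have not yet been used on that branch.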

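Besides splitting, CHAID also supports merging.

In merging, the groups (categories of a predictor) that are least significantly different with respect to the dependent variable are merged into one class. This is done to reduce the number of categories of the variable.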
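The merging idea can be sketched as below, under simplifying assumptions: the dependent variable is binary (so every pair of categories forms a 2x2 table with one degree of freedom), the predictor is nominal (any two categories may be merged), and the category names, counts, and the alpha_merge threshold of 0.05 are hypothetical.

```python
import math
from itertools import combinations

def pair_p_value(a, b):
    """Chi-square p-value (df = 1) for the 2x2 table formed by two categories.

    a and b are (count of outcome 1, count of outcome 2) for each category.
    """
    table = [a, b]
    row_totals = [sum(a), sum(b)]
    col_totals = [a[0] + b[0], a[1] + b[1]]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    # For one degree of freedom the chi-square upper tail is erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))

def merge_categories(counts, alpha_merge=0.05):
    """Repeatedly merge the least significantly different pair of categories."""
    groups = {(name,): tuple(c) for name, c in counts.items()}
    while len(groups) > 1:
        # Find the pair with the largest p-value (the most similar pair).
        (g1, g2), p = max(
            ((pair, pair_p_value(groups[pair[0]], groups[pair[1]]))
             for pair in combinations(groups, 2)),
            key=lambda item: item[1],
        )
        if p <= alpha_merge:
            break  # every remaining pair differs significantly
        merged = tuple(x + y for x, y in zip(groups.pop(g1), groups.pop(g2)))
        groups[g1 + g2] = merged
    return groups

# Hypothetical (yes, no) counts per category of a "region" predictor.
counts = {"north": (40, 60), "south": (42, 58), "east": (10, 90), "west": (12, 88)}
merged_groups = merge_categories(counts)
print(merged_groups)
```

With these counts, categories with similar outcome rates end up in the same class, so the predictor enters the split test with fewer levels.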

CHAID STOPPING CRITERIA

  1. The maximum tree depth is reached (the depth is pre-defined).
  2. A node has fewer cases than the minimum required to be a parent node (also pre-defined, e.g. 100), so it is not split.
  3. A split would produce a child node with fewer cases than the minimum required for a child node (e.g. 50).
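The three stopping checks can be expressed as a small predicate. This is a sketch only: the class and function names are hypothetical, the thresholds 100 and 50 are the example values from the list above, and the maximum depth of 3 is an assumed value for illustration.

```python
from dataclasses import dataclass

@dataclass
class StoppingRules:
    max_depth: int = 3            # assumed value, for illustration only
    min_parent_cases: int = 100   # example value from the notes
    min_child_cases: int = 50     # example value from the notes

def may_split(depth, n_cases, child_sizes, rules):
    """Return True only if none of the stopping criteria applies to this node."""
    if depth >= rules.max_depth:
        return False  # maximum tree depth reached
    if n_cases < rules.min_parent_cases:
        return False  # too few cases to act as a parent node
    if any(size < rules.min_child_cases for size in child_sizes):
        return False  # the split would create a child node that is too small
    return True

rules = StoppingRules()
print(may_split(depth=1, n_cases=400, child_sizes=[150, 250], rules=rules))
print(may_split(depth=1, n_cases=120, child_sizes=[90, 30], rules=rules))
```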

CHAID INPUT

  1. Significance level for splitting (partitioning) on a variable
  2. Significance level for merging categories of a variable
  3. Minimum number of records per cell (node)

Using SPSS

CHAID (Chi-square Automatic Interaction Detection) uses chi-square tests to identify the best splits while building a decision tree. It cross-tabulates each input field against the outcome (a contingency table) and uses the chi-square test of independence to determine whether there is a significant association between the two variables. Among the predictors with a significant relationship, the one with the smallest p-value is selected for the split. CHAID also merges categories that show no significant difference in the outcome, and continues this process until all remaining categories differ at the specified significance level (for example 0.05, i.e. 5%).

SPSS generates these classification and prediction models automatically. Decision trees provide clarity and transparency, making the model easier to explain to stakeholders.

CHAID is a predictive model: it can be used to forecast scenarios and draw conclusions. It is applied in a variety of fields, such as market segmentation, brand tracking, and new product development.

A statistically significant result from a chi-square test indicates that the two variables are not independent, and there is a relationship between them.

A decision tree is a classification model that works by recursively splitting data into subsets based on specific decision rules. Each internal node in the tree represents a “decision point,” and the branches indicate different outcomes based on the chosen decision rule. The terminal nodes (or leaves) represent the final classification or prediction outcomes.
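The structure described above can be sketched as a small data type with a lookup routine. The Node class, the predictor names, and the example tree are all hypothetical; the sketch assumes categorical predictors, with one branch per category (or merged group of categories).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    prediction: Optional[str] = None   # set on terminal nodes (leaves)
    split_on: Optional[str] = None     # predictor tested at an internal node
    children: Dict[str, "Node"] = field(default_factory=dict)  # one branch per level

def predict(node, record):
    """Follow the branches that match the record until a leaf is reached."""
    while node.split_on is not None:
        node = node.children[record[node.split_on]]
    return node.prediction

# Hypothetical tree: split on region first, then on gender within "north".
tree = Node(
    split_on="region",
    children={
        "north": Node(
            split_on="gender",
            children={"f": Node(prediction="buy"), "m": Node(prediction="no_buy")},
        ),
        "south": Node(prediction="no_buy"),
    },
)

print(predict(tree, {"region": "north", "gender": "f"}))
print(predict(tree, {"region": "south", "gender": "m"}))
```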