Feature Selection - Statistica General Discussion - Statistica - Dell Community

# Feature Selection

Hello

I would like to use Feature Selection in Data Mining, but I don't understand how the chi-square is calculated.

I have: 1 categorical dependent variable (X) and many continuous predictors (Y).

I think this function creates a contingency table with X and Y (for each predictor Y_i) and determines whether X and Y are independent (= null hypothesis).

But I'm not sure.

Can you explain this function, please?

Best

• Your summary lines up with the chi-square test algorithm used in feature selection. The chi-square statistic and p-value from the Pearson chi-square test are used to assess the predictive performance of each predictor in feature selection.

More computational details can be found in: Chi-Square Test and our textbook entry on the Pearson Chi-square.

All Replies
• Here are the online help details for Feature Selection and Variable Screening - Computational Details.

For classification problems, the program will compute a Chi-square statistic and p-value for each predictor variable. For continuous predictors, the program will divide the range of values of each predictor into k intervals (10 intervals by default; to "fine-tune" the sensitivity of the algorithm to different types of monotone and/or non-monotone relationships, this value can be changed by the user on the Feature Selection and Variable Screening Startup Panel). Categorical predictors will not be transformed in any way.
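As a rough illustration of the discretization step described above (this is a sketch of the general idea, not Statistica's actual implementation; the equal-width binning, the k = 10 default, and the simulated data are assumptions):

```python
import numpy as np

# Simulated data: a continuous predictor Y and a 3-class target X.
rng = np.random.default_rng(0)
y = rng.normal(size=200)          # continuous predictor
x = rng.integers(0, 3, size=200)  # categorical dependent variable (3 classes)

# Divide the range of Y into k equal-width intervals (k = 10 by default).
k = 10
edges = np.linspace(y.min(), y.max(), k + 1)
bins = np.clip(np.digitize(y, edges[1:-1]), 0, k - 1)  # interval index 0..k-1

# Contingency table: rows = classes of X, columns = intervals of Y.
table = np.zeros((3, k), dtype=int)
for xi, bi in zip(x, bins):
    table[xi, bi] += 1

print(table.shape)  # (3, 10)
print(table.sum())  # 200 — every observation lands in exactly one cell
```

The chi-square statistic is then computed on this r × k table exactly as for two categorical variables.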

Options are available on the FSL Results dialog to sort the list of Chi-square and p-values for each predictor, and to review the best predictors using either the Chi-square or the p-value as the criterion of predictor importance.

____________________________________________________________________________________________________________________

Suppose that Variable X has r levels (the categorical dependent variable),

and Variable Y (one of my continuous predictors) has k levels (the intervals computed by the Feature Selection function).

The null hypothesis states that knowing the level of Variable X does not help you predict the level of Variable Y. That is, the variables are independent.

 H0: Variable X and Variable Y are independent. H1: Variable X and Variable Y are not independent.

The test statistic is a chi-square random variable (Χ²) defined by the following equation:

Χ² = Σ [ (O_r,k − E_r,k)² / E_r,k ]

where O_r,k is the observed frequency count at level r of Variable X and level k of Variable Y, and E_r,k is the expected frequency count at level r of Variable X and level k of Variable Y.
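The expected counts under independence are E_r,k = (row total × column total) / N, so the statistic can be computed by hand from a contingency table. A minimal worked example (the table values are made up for illustration):

```python
import numpy as np

# Observed counts O_r,k: a 2x2 contingency table (made-up numbers).
O = np.array([[20.0, 30.0],
              [30.0, 20.0]])

row = O.sum(axis=1, keepdims=True)   # row totals
col = O.sum(axis=0, keepdims=True)   # column totals
N = O.sum()                          # grand total

E = row @ col / N                    # expected counts E_r,k = row*col/N

chi2 = ((O - E) ** 2 / E).sum()      # Χ² = Σ (O − E)² / E
print(chi2)  # 4.0 — here every E_r,k is 25, and each cell contributes 1.0
```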

Since the p-value is less than the significance level (0.05) — for example, P(Χ² > 3.84) = 0.05 with 1 degree of freedom — we reject the null hypothesis. Thus, we conclude that there is a relationship between X (r categories) and Y (my feature/predictor). So a higher chi-square value implies better performance of the predictor for identifying my target category X_r.
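This ranking logic can be sketched with `scipy.stats.chi2_contingency` (a generic chi-square implementation, not Statistica itself; the two example tables are made up so that Y1 is strongly associated with X and Y2 is not):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency tables for two predictors against the same target X.
tables = {
    "Y1": np.array([[40, 10], [10, 40]]),  # strong association with X
    "Y2": np.array([[26, 24], [24, 26]]),  # weak association with X
}

for name, table in tables.items():
    stat, p, dof, expected = chi2_contingency(table)
    print(f"{name}: chi2 = {stat:.2f}, p = {p:.4f}, dof = {dof}")

# Y1 comes out with the larger chi-square and the smaller p-value,
# so it would be ranked as the more important predictor.
```

Either the statistic (larger = better) or the p-value (smaller = better) gives the same ordering here, which matches the sorting options on the Results dialog.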

___________________________________________________________________________________________________________________