I would like to use Feature Selection in Data Mining, but I don't understand how the chi-square statistic is calculated.
I have: one categorical dependent variable (X) and many continuous predictors (Y).
I think this function creates a contingency table from X and Y (for each predictor Y_i) and determines whether X and Y are independent (= the null hypothesis).
But I'm not sure.
Can you explain this function, please?
Thanks for your response.
Your summary lines up with the algorithm of the chi-square test used in feature selection. The chi-square statistic and p-value from the Pearson chi-square test are used to assess the predictive performance of the predictors in feature selection.
More computational details can be found in: Chi-Square Test and our textbook chapter on the Pearson chi-square.
Here are our online help details for Feature Selection and Variable Screening - Computational Details.
For classification problems, the program will compute a Chi-square statistic and p value for each predictor variable. For continuous predictors, the program will divide the range of values in each predictor into k intervals [10 intervals by default; to "fine-tune" the sensitivity of the algorithm to different types of monotone and/or non-monotone relationships, this value can be changed by the user on the Feature Selection and Variable Screening Startup Panel]. Categorical predictors will not be transformed in any way.
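The procedure described above (bin each continuous predictor into k intervals, cross-tabulate against the categorical target, compute a chi-square statistic and p-value per predictor) can be sketched as follows. This is only an illustration of the general idea, assuming equal-width binning and scipy; the program's actual binning algorithm may differ.

```python
# Sketch of chi-square screening for one continuous predictor against a
# categorical target. Equal-width binning into k intervals is an assumption;
# the vendor's exact binning may differ.
import numpy as np
from scipy.stats import chi2_contingency

def chi2_screen(y_continuous, x_categorical, k=10):
    """Bin the predictor into k equal-width intervals, cross-tabulate
    against the target, and return the chi-square statistic and p-value."""
    # k - 1 internal cut points give k equal-width intervals (default k = 10)
    edges = np.linspace(y_continuous.min(), y_continuous.max(), k + 1)[1:-1]
    bins = np.digitize(y_continuous, edges)
    # Contingency table: rows = target categories, columns = predictor bins
    cats = np.unique(x_categorical)
    table = np.array([[np.sum((x_categorical == c) & (bins == b))
                       for b in range(k)] for c in cats])
    # Drop empty bins so every expected count is well defined
    table = table[:, table.sum(axis=0) > 0]
    chi2, p, dof, expected = chi2_contingency(table)
    return chi2, p

# Toy data: a predictor whose mean shifts with the binary target
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 500)                  # categorical target (2 levels)
y = rng.normal(loc=x, scale=1.0, size=500)   # continuous predictor
chi2, p = chi2_screen(y, x)                  # large chi2, tiny p expected
```

In a full screening run you would call `chi2_screen` once per predictor and then rank the predictors by their chi-square (or p) values.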
Options are available on the FSL Results dialog to sort the list of Chi-square and p values representing each predictor, to review the best predictors using either the Chi-square or p value as the criterion of predictor importance.
Does this answer your question?
I already read this online help, but I would like to know whether I understand how the chi-square is calculated.
Here is my understanding of what Feature Selection does:
Suppose that Variable X has r levels (the categorical dependent variable),
and Variable Y (one of my predictors, continuous) has k levels (the intervals computed by the Feature Selection function).
The null hypothesis states that knowing the level of Variable X does not help you predict the level of Variable Y. That is, the variables are independent.
H0: Variable X and Variable Y are independent. H1: Variable X and Variable Y are not independent.
The test statistic is a chi-square random variable (Χ2) defined by the following equation.
Χ² = Σ [ (O_{r,k} − E_{r,k})² / E_{r,k} ], summed over all r × k cells,
where O_{r,k} is the observed frequency count at level r of Variable X and level k of Variable Y, and E_{r,k} is the expected frequency count at level r of Variable X and level k of Variable Y.
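As a small worked example of this formula (the counts below are invented for illustration), the expected counts come from the row and column totals under independence, and the statistic sums (O − E)²/E over every cell:

```python
import numpy as np

# Hypothetical 2 x 3 contingency table: rows = levels of X (r = 2),
# columns = binned levels of Y (k = 3). Counts are made up.
O = np.array([[20, 30, 50],
              [30, 30, 40]])

row_totals = O.sum(axis=1, keepdims=True)   # [[100], [100]]
col_totals = O.sum(axis=0, keepdims=True)   # [[50, 60, 90]]
n = O.sum()                                 # 200

# Expected count under independence: E_{r,k} = (row total * column total) / n
E = row_totals * col_totals / n

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2 = ((O - E) ** 2 / E).sum()             # about 3.11 here

# Degrees of freedom: (r - 1) * (k - 1)
dof = (O.shape[0] - 1) * (O.shape[1] - 1)   # 2
```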
Since the p-value (P(Χ² > 3.84)) is less than the significance level (0.05), we reject the null hypothesis. Thus, we conclude that there is a relationship between X (r categories) and Y (my feature/predictor). So a higher chi-square value implies a better-performing predictor for identifying my target category X_r.
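The 3.84 cut-off used above is the 5% critical value of the chi-square distribution with 1 degree of freedom; for tables with more cells the critical value grows with (r − 1)(k − 1). A quick check (assuming scipy):

```python
from scipy.stats import chi2

# P(X^2 > 3.84) for 1 degree of freedom -- approximately the 5% level,
# since the exact 5% critical value is about 3.841
p = chi2.sf(3.84, df=1)
```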
Thank you in advance for your answer.
Thank you very much for your answer and for your help.