Feature Selection - Statistica General Discussion - Statistica - Dell Community

Feature Selection

Hello 


I would like to use Feature Selection in Data Mining, but I don't understand how the chi-square statistic is calculated.

I have one categorical dependent variable (X) and many continuous predictors (Y).

I think this function creates a contingency table from X and Y (for each predictor Y_i) and determines whether X and Y are independent (the null hypothesis).

But I'm not sure.

Can you explain this function, please?

Thanks for your response 

Best

Verified Answer
  • Your summary lines up with the algorithm of the chi-square test in feature selection. The chi-square statistic and p-value from the Pearson chi-square test are used to assess the predictive performance of predictors in feature selection.

    More computational details can be found in Chi-Square Test and our textbook section on the Pearson chi-square.

All Replies
  • Here is our online help details for Feature Selection and Variable Screening - Computational Details.

    For classification problems, the program computes a chi-square statistic and p-value for each predictor variable. For continuous predictors, the program divides the range of values of each predictor into k intervals (10 intervals by default; to "fine-tune" the sensitivity of the algorithm to different types of monotone and/or non-monotone relationships, this value can be changed by the user on the Feature Selection and Variable Screening Startup Panel). Categorical predictors are not transformed in any way.

    Options are available on the FSL Results dialog to sort the list of chi-square and p-values for each predictor, and to review the best predictors using either the chi-square or the p-value as the criterion of predictor importance.

    Does this answer your question?
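The binning-and-counting step described in the reply above is not spelled out in the documentation. Here is a minimal Python sketch of one way it could work, assuming equal-width intervals; the function names are illustrative, not Statistica's actual implementation:

```python
# Illustrative sketch (not Statistica's code) of discretizing a continuous
# predictor into k equal-width intervals before building a contingency
# table against a categorical target.

def bin_predictor(values, k=10):
    """Assign each continuous value to one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant predictor
    # Values equal to the maximum fall into the last interval.
    return [min(int((v - lo) / width), k - 1) for v in values]

def contingency_table(target, binned, n_classes, k):
    """Count co-occurrences of target class i and predictor interval j."""
    table = [[0] * k for _ in range(n_classes)]
    for cls, interval in zip(target, binned):
        table[cls][interval] += 1
    return table

# Example: 8 observations, 2 target classes, k = 4 intervals
y = [0, 0, 0, 0, 1, 1, 1, 1]
x = [1.0, 1.5, 2.0, 2.5, 7.0, 7.5, 8.0, 9.0]
bins = bin_predictor(x, k=4)
table = contingency_table(y, bins, n_classes=2, k=4)
# table → [[4, 0, 0, 0], [0, 0, 0, 4]]: class 0 occupies the low intervals,
# class 1 the high ones, so this predictor separates the classes cleanly.
```

The resulting r × k table is what the chi-square statistic is then computed on.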

  • I have already read this online help, but I would like to check whether I understand how the chi-square is calculated.

    Here is what I think Feature Selection does:

    ____________________________________________________________________________________________________________________

    Suppose that Variable X (the categorical dependent variable) has r levels,

    and Variable Y (one of my predictors, continuous) has k levels (the intervals computed by the Feature Selection function).

    The null hypothesis states that knowing the level of Variable X does not help you predict the level of Variable Y. That is, the variables are independent.

    H0: Variable X and Variable Y are independent. 
    H1: Variable X and Variable Y are not independent.

    The test statistic is a chi-square random variable (χ²) defined by the following equation:

    χ² = Σ_{i,j} [ (O_{i,j} − E_{i,j})² / E_{i,j} ],  with i = 1, …, r and j = 1, …, k

    where O_{i,j} is the observed frequency count at level i of Variable X and level j of Variable Y, and E_{i,j} is the expected frequency count at level i of Variable X and level j of Variable Y.

    If the p-value is less than the significance level (0.05), we reject the null hypothesis; for example, with 1 degree of freedom this happens when χ² > 3.84 (in general the test has (r − 1)(k − 1) degrees of freedom). Thus, we conclude that there is a relationship between X (r categories) and Y (my feature/predictor). So a higher chi-square value implies better performance of the predictor at identifying my target categories.

    ___________________________________________________________________________________________________________________
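The computation described above can be written out in plain Python. This is an illustrative sketch (not Statistica's code), worked on a small 2×2 table:

```python
# Pearson chi-square statistic for an r x k contingency table,
# following the formula above (illustrative, not Statistica's code).

def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for the table."""
    r, k = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(r)) for j in range(k)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(r):
        for j in range(k):
            # Expected count under independence: row total * column total / n
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    df = (r - 1) * (k - 1)
    return stat, df

# 2 x 2 example: X has r = 2 categories, Y was binned into k = 2 intervals
observed = [[30, 10],
            [10, 30]]
stat, df = chi_square(observed)
# Every expected count is 20, each cell contributes (10**2)/20 = 5,
# so stat = 20.0 with df = 1. Since 20.0 > 3.84 (the 5% critical value
# for 1 degree of freedom), we reject independence: this predictor
# discriminates between the two classes.
```

Ranking predictors by this statistic (or by its p-value) is exactly how the feature-selection screening orders them.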

    Thank you in advance for your answer. 

  • Your summary lines up with the algorithm of the chi-square test in feature selection. The chi-square statistic and p-value from the Pearson chi-square test are used to assess the predictive performance of predictors in feature selection.

    More computational details can be found in Chi-Square Test and our textbook section on the Pearson chi-square.

  • Thank you very much for your answer and for your help.

    Best regards