This technique measures the relationship between each feature and the target variable individually. Features with a high absolute correlation with the target variable are considered relevant for further analysis. The method identifies the features that have a strong relationship with the target variable, and it also considers multicollinearity among features to avoid redundancy. For example, a house price (target variable) may be highly correlated with features like square footage and number of rooms. But they have not been included. They have been removed.
Pearson Correlation: Used to measure the linear relationship between a continuous feature and the target variable.
Spearman Rank Correlation: Used when features are ordinal or when the relationship is monotonic but non-linear.
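To make the difference between the two coefficients concrete, here is a minimal sketch on synthetic data (the exponential relationship is an assumption chosen for illustration, not taken from the dataset used later). Spearman is computed as the Pearson correlation of the ranks, which is its definition.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): y grows monotonically
# with x, but the relationship is exponential rather than linear.
x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)

# Pearson measures the strength of the linear relationship.
pearson = x.corr(y)  # the default method is "pearson"

# Spearman is the Pearson correlation of the ranks, so it captures any
# monotonic relationship, linear or not.
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the ranks of x and y are identical here, Spearman reports a perfect monotonic relationship while Pearson is noticeably lower.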
Calculate the correlation coefficient between each feature and the target variable individually. Compare the absolute value of each correlation coefficient to a pre-defined threshold. Select the features whose absolute correlation coefficient is above the threshold.
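The three steps above can be sketched as a small helper. The function name `select_by_correlation`, the synthetic columns, and the noise levels are all hypothetical and exist only for illustration.

```python
import numpy as np


def select_by_correlation(X, y, feature_names, threshold):
    """Return the names of features whose |Pearson r| with y exceeds threshold."""
    selected = []
    for j, name in enumerate(feature_names):
        # Step 1: correlation coefficient between one feature and the target.
        r = np.corrcoef(X[:, j], y)[0, 1]
        # Steps 2 and 3: compare |r| with the threshold and keep the feature.
        if abs(r) > threshold:
            selected.append(name)
    return selected


# Synthetic data (an assumption): a binary target and three candidate features.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n).astype(float)
signal = y + rng.normal(scale=0.3, size=n)    # strongly positively related
inverse = -y + rng.normal(scale=0.3, size=n)  # strongly negatively related
noise = rng.normal(size=n)                    # unrelated to the target
X = np.column_stack([signal, inverse, noise])

chosen = select_by_correlation(X, y, ["signal", "inverse", "noise"], threshold=0.5)
print(chosen)
```

Using the absolute value means that a strongly negative relationship is kept as well, which is why the `inverse` column passes the filter alongside `signal`.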
Load Libraries
Let us use the Breast Cancer Wisconsin (Diagnostic) Data Set available on Kaggle and reduce the number of features. This dataset contains one target (dependent) variable and 31 features (independent/explanatory variables).
Read Data
Unique elements of Diagnosis
Malignant (M) and Benign (B)
Convert categorical into numeric using LabelEncoder
Once the data is loaded, preprocess it with sklearn by importing LabelEncoder. Convert the categorical column diagnosis into a numeric column (1 or 0) using the fit_transform function.
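A minimal sketch of this encoding step follows. The five rows are made up and only stand in for the real diagnosis column; the point is how `fit_transform` maps the two labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few made-up rows standing in for the real diagnosis column.
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})
print(df["diagnosis"].unique())

encoder = LabelEncoder()
df["diagnosis"] = encoder.fit_transform(df["diagnosis"])

# LabelEncoder assigns codes in sorted order of the labels: B -> 0, M -> 1.
print(list(encoder.classes_))
print(df["diagnosis"].tolist())
```

The mapping can be reversed later with `encoder.inverse_transform` if the original labels are needed for reporting.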
Shape
The data contains 569 records and 32 variables: 31 features and 1 target.
Separate features and target
Convert selected features into a data frame and display
Convert the numeric values into float using numpy
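The separation and type conversion might look like the sketch below. The rows are invented, and only the column names are borrowed from the dataset; `astype(np.float64)` is one reasonable way to do the cast, not necessarily the one used originally.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names are borrowed from the dataset.
df = pd.DataFrame({
    "diagnosis": [1, 0, 0, 1],
    "radius_mean": [18, 13, 12, 20],            # stored as integers here
    "texture_mean": [10.4, 17.8, 14.4, 21.3],
})

# Separate the target from the features.
y = df["diagnosis"]
X = df.drop(columns=["diagnosis"])

# Cast every feature column to a NumPy float type.
X = X.astype(np.float64)
print(X.dtypes)
```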
Find correlation between target and individual features and compare it with threshold
Selected features out of 31 features:
Out of the original 31 features, 15 have been selected based on the correlation metric. Features whose absolute correlation with the target exceeds the threshold of 0.5 have been selected for further analysis.
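As a sketch of this selection step, the code below applies the 0.5 threshold with `DataFrame.corrwith`. It assumes scikit-learn's bundled copy of the Wisconsin diagnostic data as a stand-in for the Kaggle CSV. That copy has 30 features (no id column), uses different column names, and encodes the target as 0 = malignant and 1 = benign, so signs and counts may not match the Kaggle file exactly.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the data replaces the Kaggle CSV.
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

threshold = 0.5

# Pearson correlation of every feature column with the target.
correlations = X.corrwith(y)

# Keep the features whose absolute correlation exceeds the threshold.
selected = correlations[correlations.abs() > threshold].index.tolist()

print(len(selected), "features selected")
print(correlations[selected].sort_values())
```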
After reducing the features, let us find the accuracy of a logistic regression model trained on them.
Accuracy: using Logistic Regression
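One way to run this check is sketched below, again assuming scikit-learn's bundled copy of the data. The 80/20 split, the random seed, the standard scaling, and computing the correlations on the training rows only are choices made for this sketch, not details from the original walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the data replaces the Kaggle CSV.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Split first so the feature selection only sees the training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Correlation-based selection with the 0.5 threshold.
corr = X_train.corrwith(y_train)
selected = corr[corr.abs() > 0.5].index

# Scale the selected features, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train[selected], y_train)

accuracy = accuracy_score(y_test, model.predict(X_test[selected]))
print(f"Accuracy: {accuracy:.3f}")
```

Fitting the same pipeline on all columns of `X_train` gives a baseline for judging how much accuracy, if any, is lost by dropping the weakly correlated features.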
Correlated Features – heatmap
Heatmap using seaborn
Correlation for 15 features
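A heatmap of the selected features can be drawn roughly as follows. The sketch assumes scikit-learn's bundled copy of the data, and the 0.9 cutoff used to list redundant pairs is a hypothetical value chosen for illustration. Listing those pairs is one simple way to act on the multicollinearity concern raised earlier.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the data replaces the Kaggle CSV.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Features whose absolute correlation with the target exceeds 0.5.
corr_with_target = X.corrwith(y)
selected = corr_with_target[corr_with_target.abs() > 0.5].index

# Pairwise correlations among the selected features.
corr_matrix = X[selected].corr()

fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation among selected features")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, save it with fig.savefig.

# Pairs of selected features that are strongly correlated with each other
# (the 0.9 cutoff is an assumption).
redundant_pairs = [
    (a, b, round(corr_matrix.loc[a, b], 2))
    for i, a in enumerate(selected)
    for b in selected[i + 1:]
    if abs(corr_matrix.loc[a, b]) > 0.9
]
print(redundant_pairs)
```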
Ordinary Least Squares Method (OLS)
OLS Summary