Introduction To Dimensionality Reduction
Almost all machine learning models will suffer from the curse of dimensionality, so today's post is dedicated to the technique of dimensionality reduction.
We will go over what exactly the curse of dimensionality is and how it affects the performance of your models as well as practical ways to eliminate it.
Contents
- Curse of dimensionality
- Feature engineering
- Feature selection
- Other techniques
Curse Of Dimensionality (COD)
To put it simply, COD occurs when the dataset you are using for your ML problem has so many features that the performance of the model drops.
Generally, this happens when you have 100 or more features; however, this is relative to the specific problem you're working on.
Later in the article, we will discuss the two main ways of eliminating the curse of dimensionality: feature engineering (feature extraction) and feature selection. These two techniques come under the umbrella of dimensionality reduction.
As the number of features increases, the trained model becomes more and more complex and the data becomes sparser, leading to longer training and prediction times as well as lower accuracy due to features that do not correlate with the target variable.
As you can see from the figure above, when there is only one feature it is very easy to find a pattern in the data, but when we move to three features the model becomes much more complex. You can't imagine how complex it gets with over a hundred features.
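To make the "data becoming less dense" point concrete, here is a minimal sketch (not from the original article, assuming NumPy and scikit-learn are installed) that samples the same number of random points in 1, 3, 10 and 100 dimensions and measures the average distance to each point's nearest neighbour. The distance grows steadily with the number of features, which is exactly the sparsity problem described above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
n_points = 1000

for n_features in (1, 3, 10, 100):
    # The same number of points spread over a unit hypercube of growing dimension.
    X = rng.random((n_points, n_features))
    # n_neighbors=2 because each point's closest neighbour is itself (distance 0).
    distances, _ = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    print(f"{n_features:>3} features: mean nearest-neighbour distance = {distances[:, 1].mean():.3f}")
```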
Feature Engineering
Feature engineering is a common term in ML that means taking multiple features and transforming them into a lower dimension (fewer features). The two most common algorithms for this are PCA and t-SNE, which are explained below.
Principal component analysis (PCA) is probably the most popular. It is a feature extraction technique that takes a set of possibly correlated features and converts them into uncorrelated linear combinations called principal components. A short Python sketch is shown below.
It's important to note that PCA works best when the relationships between your features are linear.
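Here is a minimal PCA sketch with scikit-learn; the iris dataset and the choice of two components are just for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# PCA is sensitive to scale, so standardise the features first.
X_scaled = StandardScaler().fit_transform(X)

# Reduce the four iris features to two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # proportion of variance kept by each component
```

The explained variance ratio is a handy way to decide how many components to keep: stop adding components once the cumulative ratio is high enough for your problem.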
Next, we have t-distributed stochastic neighbour embedding (t-SNE). This is a slightly less popular technique for feature engineering and data visualisation; it looks for similarities between data points and places similar points close together in the low-dimensional space, so they form clusters (seen in the figure below). A short Python sketch is shown below.
There is, however, a big problem with t-SNE: it is very computationally expensive, especially in high dimensions, and it can take a very long time to run. It is also worth mentioning that t-SNE can only reduce the data to two or three components.
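A minimal t-SNE sketch with scikit-learn, using the digits dataset purely for illustration; even with 1,797 samples it takes a noticeable amount of time to run, which is the computational cost mentioned above.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 features

# t-SNE can only target two or three components, so it is mostly used for
# visualisation rather than as a general preprocessing step.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)   # (1797, 2)
```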
Feature Selection
Feature selection does exactly what the name suggests: you use a technique to select which features you want to keep for your model and which ones you want to throw out. Some features clearly won't provide the model with any useful information, so they can be dropped manually; examples include names and IDs.
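For the manual part, here is a quick pandas sketch; the DataFrame and its "customer_id" and "name" columns are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 28, 45],
    "spend": [120.0, 80.5, 210.0],
})

# Identifiers and free-text names carry no generalisable signal, so drop them up front.
df = df.drop(columns=["customer_id", "name"])
print(df.columns.tolist())   # ['age', 'spend']
```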
Feature selection algorithms can be sectioned off into three types:
- Wrapper-based: these treat the choice of a feature subset as a search problem, training and evaluating a model on candidate subsets.
- Filter-based: the algorithm is given a metric, such as correlation or chi-squared, and a threshold value; the features that meet the threshold are then selected.
- Embedded: some models have feature selection built in; examples include Lasso and random forests (see the sketch after this list).
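Here is a minimal sketch of the embedded approach using Lasso with scikit-learn's SelectFromModel; the diabetes dataset and the alpha value are only for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Lasso drives the coefficients of unhelpful features to exactly zero;
# SelectFromModel then keeps only the features with non-zero coefficients.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(selector.get_support())          # boolean mask of the kept features
print(selector.transform(X).shape)     # reduced feature matrix
```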
Let's go over two of the more popular algorithms for feature selection.
- Recursive feature elimination (RFE) — this is a wrapper-based algorithm and one of the most popular. RFE works by searching for a subset of features: it starts with the whole training dataset and successively removes the weakest features until the desired number remains. A Python sketch covering both algorithms follows this list.
- Chi-squared — this is another popular algorithm, but this one is filter-based. In statistics, chi-squared is a test of independence between two variables; the statistic is the sum of (observed − expected)² / expected over all categories. The algorithm simply selects the features with the highest chi-squared scores when tested against the target variable, since a high score means the feature and the target are dependent on each other. The sketch below also covers chi-squared.
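Here is a combined sketch of both algorithms with scikit-learn; the breast cancer dataset (whose features are all non-negative, as chi-squared requires) and the choice of keeping five features are just for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

# Recursive feature elimination: wrap an estimator and successively drop the
# weakest feature until only the requested number remains.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
rfe.fit(X, y)
print("RFE keeps:", names[rfe.get_support()])

# Chi-squared filter: score each feature against the target and keep the
# k features with the highest scores.
chi2_selector = SelectKBest(score_func=chi2, k=5).fit(X, y)
print("Chi2 keeps:", names[chi2_selector.get_support()])
```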
Other Techniques
Here is a short list of some other techniques that I think should be mentioned (a combined Python sketch follows the list):
- Missing values — If a feature has a large proportion of missing values then it won't be very useful and should be removed.
- Low variance — when the values of a feature have very low variance (e.g. 95% of the values are 1 and only 5% are 0), the feature carries little information and should be removed.
- High correlation — If two features are very strongly correlated then removing one of them (or merging them) will reduce the dimensionality without loss of information.
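Here is a combined sketch of these three filters using pandas and scikit-learn; the synthetic DataFrame and the thresholds (50% missing, 0.05 variance, 0.95 correlation) are illustrative, not prescriptive.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# A small synthetic DataFrame that deliberately contains one mostly-missing
# column, one near-constant column and one highly correlated pair.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 6)), columns=list("abcdef"))
df.loc[df.sample(frac=0.6, random_state=0).index, "a"] = np.nan  # 60% missing
df["b"] = 0.0
df.loc[:4, "b"] = 1.0                                            # near-constant
df["c"] = df["d"] * 2 + 0.01                                     # correlated with d

# 1. Missing values: drop features where more than half the values are missing.
df = df.loc[:, df.isna().mean() < 0.5]

# 2. Low variance: drop near-constant features.
kept = df.columns[VarianceThreshold(threshold=0.05).fit(df).get_support()]
df = df[kept]

# 3. High correlation: drop one feature out of each strongly correlated pair.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(df.columns.tolist())   # the surviving features
```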
Summary
As you have learnt, dimensionality reduction is a very important concept that can drastically alter the performance of your model. This article was only an introduction and a reference point for the topic, which should most definitely be explored further.
To conclude I would like to list some disadvantages of dimensionality reduction:
- Some information may be lost.
- PCA only captures linear correlations between variables, which can be undesirable when the relationships are non-linear.
- We may not know the optimal number of features to keep in advance; it can only be found through experimentation.