Feature engineering is the process of creating relevant features (input variables) that are then used to train a model. The goal of feature engineering is to identify those factors that have an influence on the target variable. The quality of the input variables has a significant impact on the performance and quality of the machine learning models. There are several ways to create new input variables, including:
- The simple creation/addition of a completely new variable
- The modification of existing variables
- Drawing conclusions and information from existing variables
- Merging existing variables
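The four techniques above can be sketched in plain Python. The dataset and field names here are purely illustrative:

```python
# Minimal sketch of the four feature-creation techniques:
# new variable, modified variable, derived information, merged variables.
from datetime import date

# Hypothetical customer records (illustrative values only).
customers = [
    {"name": "Ada",  "birth_year": 1990, "income": 52000, "spend": 13000},
    {"name": "Alan", "birth_year": 1975, "income": 48000, "spend": 30000},
]

for c in customers:
    # 1. Completely new variable: flag high earners.
    c["high_income"] = c["income"] > 50000
    # 2. Modification of an existing variable: income rescaled to thousands.
    c["income_k"] = c["income"] / 1000
    # 3. Information drawn from an existing variable: age from birth year.
    c["age"] = date.today().year - c["birth_year"]
    # 4. Merging existing variables: spend as a share of income.
    c["spend_ratio"] = c["spend"] / c["income"]
```

Each derived field becomes a candidate input variable for the model; whether it is actually useful is exactly the judgment call that makes feature engineering a creative task.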
During a machine learning project, many of the steps can be automated. This is not the case for feature engineering, which is a creative process at its core: it requires knowledge of, and intuition for, the subject matter at hand. It is therefore the phase of a machine learning project that demands the most domain expertise.
Feature engineering plays a key role in machine learning and has a significant impact on the quality of the model and its predictions. The quality of the features used usually has a greater impact on the results of a model than the choice of model type.
Good features reduce the complexity required of the model. In practice, this means that with very good features you can achieve acceptable results even with a mediocre model, while the opposite is not true. In fact, the feature engineering phase is recognised as so important that applied machine learning is often described as essentially consisting of feature engineering, since the rest of a project, such as data preparation and model computation, is largely routine.
Feature selection is the process of choosing those features that are relevant for use in a particular model. In contrast to feature engineering, where features are produced, feature selection is about finding out which of the available features are the most relevant and should therefore become part of the model. The goal is to exclude irrelevant features from the model. This can be done manually or with the help of algorithms that automatically select the most relevant variables.
Just because you have a large number of variables does not mean you have to use them all. In fact, adding more variables often has a detrimental rather than a positive effect on the performance of a model. Limiting yourself to the most relevant variables reduces the likelihood of overfitting a model, of collinearity, and of running into the curse of dimensionality; it also increases the interpretability of a model.
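As a minimal sketch of algorithmic feature selection, assuming scikit-learn is available, a univariate filter can rank candidate features by a statistical score and keep only the top ones. The data here is synthetic and purely illustrative:

```python
# Automated feature selection with a univariate filter (ANOVA F-test).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 candidate features, only 3 of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Keep the 3 features that score highest on the F-test against the target.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # reduced feature matrix: (200, 3)
print(selector.get_support())  # boolean mask over the 10 candidates
```

Filters like this are fast but judge each feature in isolation; wrapper and embedded methods (e.g. recursive feature elimination or L1-regularised models) consider features jointly at a higher computational cost.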