Data analysis models vary in their characteristics and techniques, but it is worth noting that most advanced models are built on a few basic principles.
Which models should you learn when you want to start a career as a data scientist? In this article we present 6 models that are widely used in the industry.
There is a lot of hype around machine learning and artificial intelligence, and when you set out to build predictive models, it is easy to assume that only very advanced techniques can solve the problem.
But once you start programming yourself, you realize that this is not actually the case. As a data practitioner, many of the problems you face are solved by combining several models, and most of those models have been around for a long time.
And, even if you are going to use advanced models to solve a problem, learning the fundamentals will give you a head start in most cases. At the same time, understanding the strengths and weaknesses of these fundamental models will help you succeed in your data analysis projects.
Let's take a look at 6 specific predictive models that all data analysts should master.
01 Linear Regression
One of the most classic models, linear regression was used by the British scientist Francis Galton in the 19th century, and it remains one of the most effective models for representing linear relationships in data.
Linear regression is a staple of econometrics courses worldwide. Learning this linear model will give you a sense of direction in solving regression problems and an understanding of how mathematics can be used to predict phenomena.
There are other benefits to learning linear regression, especially when you study the two standard methods for estimating its weights, both sketched in code below.
- Closed-form solution: an almost magic formula that gives the weights of the variables through a simple algebraic equation (the normal equation).
- Gradient descent: an optimization method that iteratively moves the weights toward their best values, and that is also used to optimize many other types of algorithms.
In addition, we can visualize linear regression in practice with a simple two-dimensional diagram, which makes the model a good starting point for understanding algorithms in general.
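To make those two methods concrete, here is a minimal NumPy sketch that fits the same line with both the closed-form normal equation and a hand-rolled gradient descent loop. The synthetic data, learning rate, and iteration count are illustrative assumptions, not a recipe.

```python
# Minimal sketch: fitting a linear regression two ways (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 10, 100)]       # intercept column + one feature
y = X @ np.array([2.0, 3.0]) + rng.normal(0, 1, 100)   # true weights are [2, 3]

# 1) Closed-form solution (normal equation): solve (X^T X) w = X^T y.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Gradient descent: repeatedly step the weights against the loss gradient.
w = np.zeros(2)
learning_rate = 0.01
for _ in range(5000):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    w -= learning_rate * grad

print(w_closed, w)   # both estimates should land close to [2, 3]
```

Both routes reach essentially the same weights; the closed-form answer is exact, while gradient descent trades exactness for a procedure that scales to models with no closed form.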
02 Logistic Regression
Despite the word "regression" in its name, logistic regression is the best model for getting a handle on classification problems.
Learning logistic regression has several advantages (see the sketch after this list):
- An initial understanding of classification and multiclass problems, which make up a large share of machine learning tasks
- An understanding of function transformations, such as the Sigmoid function
- An understanding of how gradient descent carries over to other functions and how those functions are optimized
- An initial understanding of the Log-Loss function
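As a small, hedged illustration of two of the concepts above, here are the Sigmoid and the Log-Loss written out in plain NumPy (this is for intuition, not any library's actual implementation):

```python
# Sigmoid squashes any real number into (0, 1), so it can be read as a probability;
# Log-Loss punishes confident wrong predictions far more than hesitant ones.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p):
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1])
p_good = np.array([0.9, 0.1, 0.8])   # confident and correct -> loss ~ 0.14
p_bad  = np.array([0.2, 0.9, 0.3])   # confident and wrong  -> loss ~ 1.71
print(sigmoid(0.0), log_loss(y, p_good), log_loss(y, p_bad))
```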
What do you gain from learning logistic regression? You will be able to understand the mechanisms behind classification problems and how machine learning can be used to separate classes.
Problems that fall into this area include:
- Detecting whether a transaction is fraudulent
- Predicting whether a customer will churn
- Ranking loans by their probability of default
Just like linear regression, logistic regression is a linear algorithm. After studying both, you will understand the main limitations of linear algorithms and recognize that they cannot represent many of the complexities of the real world.
03 Decision Trees
The first nonlinear algorithm to study should be the decision tree. Decision trees are relatively simple and interpretable algorithms based on if-else rules that will give you a good grasp of nonlinear algorithms and their advantages and disadvantages.
Decision trees are the basis for all tree-based models, and by learning them you will also be prepared to learn other techniques, such as XGBoost or LightGBM.
Moreover, decision trees are applicable to both regression and classification problems, with minimal differences between the two: the fundamentals of choosing the variables that best split the data are roughly the same; only the splitting criterion differs.
Although you may have met the concept of hyperparameters in regression, for example the regularization parameter, in decision trees they are extremely important, as they often make the difference between a good model and a bad one.
Hyperparameters are crucial throughout machine learning, and decision trees are a good place to start experimenting with them, as the sketch below shows.
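As a quick, hedged demonstration with scikit-learn (the dataset is synthetic and the exact scores will vary), a single hyperparameter, max_depth, already separates a reasonable tree from an overfit one:

```python
# Minimal sketch: one hyperparameter deciding between a good and a bad tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (3, None):   # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
# The unrestricted tree tends to score near 1.0 on training data but worse on
# test data: the overfitting problem the next section picks up.
```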
04 Random Forest
Decision trees on their own give rather limited results, due to their sensitivity to hyperparameters and their simple assumptions. When you dig deeper, you will find that decision trees overfit easily, yielding models that generalize poorly to unseen data.
The concept behind random forests is very simple: train many diverse decision trees and combine them, thereby improving the robustness of the algorithm.
Just like decision trees, you can tune a large number of hyperparameters to improve the performance of these ensemble models. Bagging is a very important concept in machine learning, bringing stability to models by using averaging or voting mechanisms to turn the outputs of several models into a single prediction.
In practice, a random forest trains a fixed number of decision trees and averages the results of all of them. As with decision trees, there are random forests for both classification and regression. If you have heard of the "wisdom of the crowd," ensemble models are that concept applied to model training, as the sketch below illustrates.
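The sketch below, assuming scikit-learn and a synthetic dataset, compares a lone tree against a forest of 200 bagged trees; the exact numbers will differ, but the ensemble is typically the steadier performer:

```python
# Minimal sketch: a single tree versus a bagged ensemble of trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=10, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)  # 200 trees, votes combined

print("tree:  ", cross_val_score(tree, X, y).mean())
print("forest:", cross_val_score(forest, X, y).mean())
```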
05 XGBoost/LightGBM
Other models built on decision tree algorithms that bring stability are XGBoost and LightGBM. These boosting algorithms not only strengthen the underlying trees but also yield more robust models that generalize better.
Interest in boosting grew after Michael Kearns published his paper on weak learners and hypothesis boosting, which asked whether weak learners could be combined into a strong one. Boosted models have proven to be an excellent answer to the bias-variance trade-off that models suffer from, and they have been among the most popular choices in Kaggle competitions.
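For a feel of how little code this takes, here is a hedged sketch using XGBoost's scikit-learn-style interface on a synthetic dataset (LightGBM's LGBMClassifier is nearly a drop-in replacement; the parameter values are illustrative):

```python
# Minimal sketch of gradient boosting: each new tree corrects the errors of
# the previous ones, and learning_rate shrinks each tree's contribution.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```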
06 Artificial Neural Networks
Finally, there is the king of current predictive models: artificial neural networks (ANNs).
Artificial neural networks are among the best models available for finding nonlinear patterns in data and for modeling truly complex relationships between the independent and dependent variables. By learning them, you will be exposed to the concepts of activation functions, backpropagation, and neural network layers, which provide a good foundation for studying deep learning models.
In addition, neural networks come in many different architectures, and learning the most basic ones provides a foundation for moving on to other types, such as recurrent neural networks, which are used primarily for natural language processing, and convolutional neural networks, which are used primarily for computer vision.
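To ground those concepts, here is a deliberately tiny network written in plain NumPy rather than a deep learning framework, so that the layers, the activation function, and the backpropagation step are all visible (the XOR task, layer sizes, and learning rate are illustrative choices):

```python
# Minimal sketch: a one-hidden-layer network learning XOR, a pattern no
# linear model can fit. Not production code.
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Layers: 2 inputs -> 8 hidden units -> 1 output, with biases.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
for _ in range(20000):
    # Forward pass: each layer applies its weights, then the activation function.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backpropagation: the chain rule pushes the error back layer by layer.
    grad_out = (out - y) * out * (1 - out)            # MSE gradient through sigmoid
    grad_W2, grad_b2 = h.T @ grad_out, grad_out.sum(axis=0)
    grad_h = grad_out @ W2.T * h * (1 - h)
    grad_W1, grad_b1 = X.T @ grad_h, grad_h.sum(axis=0)

    # Gradient descent update, the same idea as in linear regression.
    W2 -= learning_rate * grad_W2; b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1; b1 -= learning_rate * grad_b1

print(out.round(2))   # predictions should approach [[0], [1], [1], [0]]
```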
Conclusion
That's all for today. Mastering these models should get you off to a good start in data analysis and machine learning.