The Chinese community for Scikit-learn, one of the world's leading machine learning libraries, hosted by CDA, is now online!

2021-08-17

As a well-known domestic full-stack data science education and certification brand, CDA has been committed to making quality education available to everyone. Scikit-learn, as an entry-level machine learning library, is well loved by beginners. However, because the official documentation is written in English, it is a barrier for many machine learning enthusiasts. A professional, standardized, and up-to-date Scikit-learn Chinese learning community has therefore long been an urgent need for domestic learners.

The CDA national teaching and research team has used Scikit-learn on a large scale as the main library for its Python machine learning courses since 2016. The CDA employment class series, the weekend training courses, and the Scikit-learn course series launched in 2018 have all been very popular among domestic data science enthusiasts.

Drawing on nearly five years of Scikit-learn course research and development by CDA's national teaching and research team, and responding to the learning needs of a growing number of data science enthusiasts, CDA spent more than a year translating and carefully proofreading the Scikit-learn documentation. With the close cooperation of CDA's R&D department, the Scikit-learn Chinese community is finally online. From the user guide to the API reference to the examples, the translation runs to more than one million words. Compared with the machine-translated Scikit-learn Chinese materials circulating on the Internet, the CDA Scikit-learn Chinese community translates the latest official version; its content is more comprehensive, its format more standardized, and its translation more professional and accurate, with the aim of giving machine learning enthusiasts a more convenient learning path.

Note: the official website of scikit-learn is www.scikit-learn.org , and the Chinese community website hosted by CDA is www.scikit-learn.org.cn . This also marks CDA's further integration with the world's top deep learning and machine learning frameworks, and greater recognition of CDA certification by the world's top technical frameworks.


Scikit-learn (also known as sklearn) is a free machine learning library for the Python programming language. It began in 2007 as a Google Summer of Code project, and it is now widely regarded as one of the most popular machine learning libraries.

Sklearn has many advantages:

· Supports four categories of machine learning algorithms: classification, regression, dimensionality reduction and clustering. It also includes three modules for feature extraction, data processing and model evaluation, with a rich API.

· The code style is clear and consistent, which makes the machine learning code easy to understand and reproduce, greatly reducing the entry barrier for machine learning.

· Supported by a large number of third-party tools, with very rich functions, suitable for various scenarios.

If you are learning and using machine learning, Scikit-learn may well be the best library to start with. It has complete documentation, is easy to use, offers a rich API, and is widely used by machine learning enthusiasts. It wraps a large number of machine learning algorithms and ships with many built-in data sets, which saves the time otherwise spent acquiring and organizing data.
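As a small illustration of the built-in data sets, the sketch below loads the bundled Iris data with `load_iris`. The choice of Iris here is only an example; other loaders in `sklearn.datasets` follow the same pattern.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris data set as a feature matrix X and a label vector y.
X, y = load_iris(return_X_y=True)

# X holds one row per flower and one column per measurement.
print(X.shape, y.shape)
```

No download or manual cleaning is needed, since the data ships with the library.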

The following introduces some of the most convenient features of the Scikit-learn library.

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

· Fitting and predicting: estimator basics

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.

A simple example is to fit an estimator to some very basic data.
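A minimal sketch of such an example, assuming a `RandomForestClassifier` as the estimator. The estimator choice and the toy values are illustrative assumptions, not something the text above specifies:

```python
from sklearn.ensemble import RandomForestClassifier

# Assumed toy data: 2 samples (rows), 3 features (columns).
X = [[1, 2, 3],
     [11, 12, 13]]
# One class label per sample.
y = [0, 1]

# random_state fixes the randomness so that runs are repeatable.
clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)  # fit returns the fitted estimator itself
```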



The fit method generally accepts two inputs:

· The sample matrix (or design matrix) X. The shape of X is typically (n_samples, n_features), which means that samples are represented as rows and features as columns.

· The target values y, which are real numbers for regression tasks, or integers (or any other discrete set of values) for classification. For unsupervised learning, y does not need to be specified.

Although some estimators can use other formats (such as sparse matrices), in general, both X and y are expected to be numpy arrays or equivalent array-like data types.

After the estimator is fitted, it can be used to predict target values for new data. There is no need to retrain the estimator, which is very convenient.
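Continuing with an assumed toy setup (a `RandomForestClassifier` and hand-written values, both illustrative), prediction on the training samples and on new samples might look like this:

```python
from sklearn.ensemble import RandomForestClassifier

# Assumed toy data: 2 samples, 3 features, one label per sample.
X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Predict classes for the training data.
train_pred = clf.predict(X)

# Predict classes for new, unseen samples without refitting.
new_pred = clf.predict([[4, 5, 6], [14, 15, 16]])
print(train_pred, new_pred)
```

The new samples must have the same number of features as the data used in fit.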


· Transformers and preprocessors

A machine learning workflow is usually composed of different parts. A typical pipeline consists of a preprocessing step that transforms or imputes the data, and a final predictor that predicts the target values.

In scikit-learn, preprocessors and transformers follow the same API as estimator objects (in fact they all inherit from the same BaseEstimator class). Transformer objects have no predict method; instead they have a transform method that outputs a newly transformed sample matrix X.
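As a sketch of the transform step, the example below uses `StandardScaler` on a hand-written matrix. Both the scaler and the values are assumptions chosen for illustration:

```python
from sklearn.preprocessing import StandardScaler

# Assumed toy data: 2 samples, 2 features on very different scales.
X = [[0, 15],
     [1, -10]]

# fit learns the per-column mean and standard deviation;
# transform then rescales each column to zero mean and unit variance.
X_scaled = StandardScaler().fit(X).transform(X)
print(X_scaled)
```

Calling `fit_transform(X)` is an equivalent shortcut when the same data is used for both steps.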


Sometimes you want to apply different transformations to different features: the ColumnTransformer is designed for these use cases.

· Pipelines: chaining preprocessors and estimators

Transformers and estimators (predictors) can be combined into a single unified object: a Pipeline. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict. As we will see later, using a pipeline also helps prevent data leakage, that is, disclosing some test data in the training data.

In the following example, we load the Iris data set, split it into a training set and a test set, and then compute the accuracy score of a pipeline on the test data:
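A minimal sketch of this workflow, assuming a pipeline made of a `StandardScaler` followed by `LogisticRegression`. The two steps and the `random_state` value are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Chain a preprocessor and a classifier into one estimator-like object.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Load Iris and hold out part of it as a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training data only, which avoids leakage.
pipe.fit(X_train, y_train)

# Score the whole pipeline on the held-out test data.
acc = accuracy_score(y_test, pipe.predict(X_test))
print(acc)
```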


· Model evaluation

Fitting a model to some data does not mean that it will predict well on unseen data. This needs to be evaluated directly. We have just seen that the train_test_split function splits a data set into a training set and a test set, but scikit-learn provides many other tools for model evaluation, in particular for cross-validation.

Here we briefly show how to perform a 5-fold cross-validation procedure with the cross_validate helper. Note that it is also possible to iterate over the folds manually, to use different data splitting strategies, and to use custom scoring functions. For more details, please refer to the user guide.
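A short sketch of the cross_validate helper, assuming a synthetic regression data set from `make_regression` and a `LinearRegression` model. Both are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Assumed synthetic data: noise-free and linear, so it is easy to fit.
X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

# cv=5 requests 5-fold cross-validation (also the default).
result = cross_validate(lr, X, y, cv=5)

# One R^2 score per fold; scores are high because the data set is easy.
print(result["test_score"])
```

The returned dictionary also holds the fit and score times for each fold.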


· Automatic parameter search

All estimators have parameters that can be tuned (often called hyperparameters in the literature). The generalization ability of an estimator often depends critically on a few of them. For example, in the random forest regressor RandomForestRegressor, the n_estimators parameter determines the number of trees in the forest, and the max_depth parameter determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters should be, because they depend on the data at hand.

Scikit-learn provides tools to automatically find the best parameter combination (through cross-validation). In the following example, we use a RandomizedSearchCV object to randomly search the parameter space of a random forest. When the search is over, the RandomizedSearchCV object behaves like a RandomForestRegressor that has been fitted with the best set of parameters. You can read more in the user guide:
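A minimal sketch of such a search. The synthetic data from `make_regression`, the parameter ranges, and the number of iterations are all assumptions made to keep the example small and self-contained:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Assumed synthetic regression data, split into training and test sets.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Distributions to sample from; randint(a, b) draws integers in [a, b).
param_distributions = {
    "n_estimators": randint(1, 5),
    "max_depth": randint(5, 10),
}

# Try 5 random parameter combinations, each scored by cross-validation.
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    random_state=0,
)
search.fit(X_train, y_train)

# The search object now acts like a forest fitted with the best parameters.
best_params = search.best_params_
test_r2 = search.score(X_test, y_test)
print(best_params, test_r2)
```

In practice the search is usually run over a pipeline rather than a single estimator, so that preprocessing is refitted inside each cross-validation fold.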






