CERTIFIED DATA ANALYST LEVEL III EXAMINATION OUTLINE

一、Overall Objectives

CDA (Certified Data Analyst), or "CDA Data Analyst", is a professional and authoritative international qualification for all industries, developed against the background of the digital economy and the era of artificial intelligence. It aims to improve the public's digital skills, support the digital transformation of enterprises, and promote the digital development of industry. [The Certified Data Analyst (CDA) Talent Industry Standard] is a scientific, professional, international skills guideline for data-related positions. The CDA Exam Outline defines the scope and key points of the examination. When preparing for the exam, candidates can refer to the outline to acquire the skills and knowledge needed to become a professional data analyst.

二、Exam Format and Structure

Exam method: held four times a year (on the last Saturday of March, June, September, and December) as offline written and computer-based exams

Exam question types: Objective choice questions (60 single-choice questions, 30 multiple-choice questions, and 10 content-related questions)

Case practical question (1 question)

Exam duration: 90 minutes (objective choice questions), 120 minutes (case practical question), 210 minutes in total

Exam scores: The final exam scores are classified into four grades: A, B, C, and D. Passing grades include A, B, and C, while D is the failing grade.

Examination requirements: The objective choice questions are a closed-book, computer-based exam. Candidates do not need to bring calculators or other unrelated supplies.

For the case practical question, candidates must bring their own computer with data mining software installed (such as Python, SQL, SPSS Modeler, R, SAS, or WEKA). The computer must support copying files via USB and have decompression software installed so that the case analysis can be carried out. The case data will be provided in a uniform CSV file.

三、Knowledge Requirements

The three different mastery levels that candidates need to obtain for different types of data analysis knowledge are comprehension, competency, and application. Exam candidates should proceed with their studies based on these different knowledge requirements.

1. Comprehension: Candidates should understand and grasp the key points of data analysis, understand the connotation and extension of these key points, distinguish the differences and relationships among them, and correctly explain each one.

2. Competency: Candidates should master important data analysis knowledge and understand and memorize the relevant theories and methods. They must be able to explain data analysis knowledge logically according to different requirements. Candidates' knowledge of and competency with different types of data analysis are the key focus of this exam.

3. Application: Candidates should be able to demonstrate their ability to apply data analysis theory in practice, combining it with related tools for commercial application, and to propose specific implementation procedures and strategies for problems based on specific requirements or conditions.

四、Exam subjects

PART 1 Introduction to Data Mining (15%)

a. Summary of data mining (3%)

b. Data mining methodology (3%)

c. Basic data mining technology (4%)

d. Advanced data mining technology (5%)

PART 2 Advanced data processing and feature engineering (25%)

a. Advanced data processing (5%)

b. Feature engineering summary (2%)

c. Feature construction (3%)

d. Feature selection (5%)

e. Feature conversion (5%)

f. Feature learning (5%)

PART 3 Natural language processing and text analysis (20%)

a. Summary of Natural Language Processing (2%)

b. Word segmentation and part-of-speech tagging (4%)

c. Summary of text mining (2%)

d. Keyword extraction (4%)

e. Converting unstructured text data to structured data (8%)

PART 4 Machine learning algorithm (40%)

a. Naive Bayes (4%)

b. Decision tree (classification tree and regression tree) (5%)

c. Neural network and deep learning (5%)

d. Support Vector Machine (4%)

e. Ensemble method (5%)

f. Cluster analysis (5%)

g. Association rules (4%)

h. Sequential patterns (3%)

i. Model evaluation (5%)

PART 5 Machine learning practice

(Examination method of this part is case practice, not included in the proportion of objective choice questions.)

a. Automatic machine learning

b. Class imbalance

c. Semi-supervised learning

d. Model optimization

五、Exam content

1、Summary of data mining
[Comprehension]
Candidates should understand the application of data mining in government departments and in the Internet, finance, retail, medicine, and other industries
[Competency]
Origin, definition and goal of data mining
Development history of data mining
[Application]
Candidates should be able to build a data mining project based on the given data
2、Data mining methodology
[Competency]
Data mining steps (field selection, data cleaning, field expansion, data coding, data mining, result presentation)
Industry standards for data mining technology (CRISP-DM and SEMMA)
[Application]
Candidates should be able to use data mining software to import data in different file formats and conduct preliminary data exploration. For numeric fields, the exploration includes descriptive statistical analysis, histograms (linked to the target field), and missing value analysis. For categorical fields, it includes descriptive statistical analysis, bar graphs (linked to the target field), and missing value analysis. The results of data exploration can be used for preliminary field screening.
3、Basic data mining technology
[Comprehension]
Visualization technology (Candidates should be able to use relevant tools to make visual data reports based on business problems)
[Competency]
Case-based Learning: KNN (K-Nearest Neighbor) principle
Data preparation
Calculation of distance between sample points (Manhattan Distance, City-Block Distance, Euclidean Distance)
[Application]
Candidates should be able to use the KNN algorithm in data mining for classification prediction, numeric prediction, and content recommendation. During the modeling process, candidates need to consider appropriate data conversion to obtain better analysis results.
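As an illustration of the KNN principle and the distance calculations listed above, the following is a minimal sketch in plain Python, not a reference implementation. The toy data, the function names, and the choice of k=3 are assumptions made for this example. Note that Manhattan distance and city-block distance are two names for the same measure.

```python
import math
from collections import Counter

def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda row: distance(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two numeric fields and a class label.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((6.0, 6.0), "high"), ((7.0, 6.5), "high"), ((6.5, 7.0), "high"),
]
prediction = knn_predict(train, (6.2, 6.1), k=3)
print(prediction)
```

Because KNN relies on distances, fields measured on very different scales usually need to be standardized first, which is one form of the data conversion mentioned above.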
4、Advanced data mining technology
[Competency]
Function classification of data mining technology
Descriptive data mining/unsupervised data mining (association rules, sequence patterns, cluster analysis)
Predictive data mining/supervised data mining (classification, prediction)
1、Advanced data processing
[Comprehension]

Data filtering (understand how to use data filtering to establish a segmentation model to improve the prediction effect of the model)

Expansion method of internal/external data

[Competency]

Advanced imputation techniques for missing values, including KNN imputation and XGBoost imputation

Advanced data conversion technology, including data generalization, data trend discretization

[Application]

Able to use advanced data preprocessing technology to filter data to establish a segmentation model

Able to use advanced data preprocessing technology to detect and fill missing values

Able to use advanced data preprocessing technology to process data generalization

Able to use advanced data preprocessing technology to process data trend discretization

Capable of evaluating the impact of the above-mentioned different data processing methods on model performance

2、Feature engineering summary
[Comprehension]

Importance of feature engineering

Feature understanding

Feature improvement (the impact of data cleaning on features)

[Competency]

Feature engineering coverage

Purpose of feature selection

Feature construction method

Feature conversion method

Automatic learning of features

Promote AI with AI

3、Feature construction
[Comprehension]

Preparation before feature construction

Feature null value processing

Standardization of features

[Competency]

Categorical feature coding

Encoding of sequential features

Binning of numerical features

Construct polynomial features

Construct interactive features

Feature normalization

[Application]

Candidates should be able to use data mining to properly construct features as input for feature selection in the next stage
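Two of the construction steps listed above, categorical feature coding and binning of numerical features, can be sketched as follows. This is a minimal illustration under stated assumptions: one-hot coding is only one of several possible coding schemes, equal-width binning is only one binning strategy, and the function names and sample values are hypothetical.

```python
def one_hot(values):
    """One-hot encode a categorical field; columns follow sorted category order."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

def equal_width_bins(values, n_bins):
    """Assign each numeric value to one of n_bins equal-width intervals.

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is placed in the last bin rather than a bin of its own.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)

ages = [18, 25, 40, 62, 78]
age_bins = equal_width_bins(ages, 3)

print(categories, encoded)
print(age_bins)
```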

4、Feature selection
[Competency]

Invalid variables (irrelevant variables, redundant variables)

Statistics-based feature selection (chi-square test, ANOVA test and T test)

Model-based variable selection (decision tree, logistic regression, random forest)

Selection of highly relevant features

Recursive feature selection

[Application]

Candidates should be able to use data mining software to select key features and to evaluate the impact of different feature selection methods on model performance.

5、Feature conversion
[Comprehension]

Linear feature transformation-principal component analysis (PCA)

[Competency]

Non-linear feature conversion-Kernel PCA

Feature transformation to maximize the separability between classes-linear discriminant analysis (LDA)

Characteristic transformation of matrix factorization-non-negative matrix factorization (NMF)

Perform feature transformation on sparse matrix-truncated singular value decomposition (TSVD)

[Application]

Candidates should be able to use data mining for feature conversion. At the same time, evaluate the impact of different feature conversion methods on model performance.

6、Feature learning
[Competency]

Association rule-based feature learning

Neural network-based feature learning

Deep learning-based feature learning

Text feature learning based on word embedding

[Application]

Candidates should be able to use data mining for automatic feature learning. At the same time, evaluate the impact of different feature learning methods on model effectiveness.

1、Summary of Natural Language Processing
[Comprehension]

BOSON's Chinese Semantic Platform

[Competency]

Research category of natural language processing

Word segmentation

Stemming (root reduction)

Part of speech tagging

Synonym tagging

Concept tagging

Role tagging

[Application]

Candidates should be able to use BOSON's Chinese semantic platform for language processing

2、Word segmentation and part-of-speech tagging
[Comprehension]

Types and meanings of parts of speech

[Competency]

N-Gram and words

Difficulties in word segmentation and part-of-speech tagging

Regular word segmentation

Statistical word segmentation

Part-of-speech tagging

[Application]

Candidates should be able to use Chinese word segmentation and part-of-speech tagging technology for word segmentation and part-of-speech tagging for multiple articles

3、Summary of text mining
[Comprehension]

Full-text scanning in information retrieval technology

Signature files in information retrieval technology

Inverted index in information retrieval technology

Controlled vocabulary

Keyword Index

[Competency]

Text mining applications

Vector Space Model of Information Retrieval Technology

The process of text mining

Text visualization

[Application]

Candidates should be able to convert multiple documents and queries into vector format, and calculate the similarity between queries and documents.

Candidates should be able to use text visualization technology to present the content of the file in a word cloud.
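The vector space model task above (converting documents and a query into vectors and computing their similarity) can be sketched as below. This is a minimal example that assumes raw term counts as vector weights, whitespace tokenization of English text, and cosine similarity as the similarity measure. The sample documents and function names are hypothetical.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Represent a text as term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

documents = [
    "data mining finds patterns in data",
    "neural networks learn features",
]
query = "mining data"

# Build one shared vocabulary so all vectors have the same dimensions.
vocabulary = sorted(set(" ".join(documents + [query]).lower().split()))
doc_vectors = [to_vector(d, vocabulary) for d in documents]
query_vector = to_vector(query, vocabulary)
scores = [cosine_similarity(query_vector, d) for d in doc_vectors]
print(scores)
```

Chinese text would first need word segmentation before this kind of counting is meaningful.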

4、Keyword extraction
[Competency]

TF, DF and IDF

Part of speech

Keyword extraction method

[Application]

For words in multiple documents and queries, candidates should be able to calculate TF, DF, IDF and parts of speech and extract important keywords.
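The TF, DF, and IDF calculations can be sketched as follows. Several variants of these formulas exist. This example assumes TF is the term count divided by document length, IDF is ln(N / DF), and keywords are ranked by TF multiplied by IDF. It also assumes non-empty documents. The sample documents and function names are hypothetical, and part-of-speech filtering is left out.

```python
import math
from collections import Counter

def tf_df_idf(documents):
    """Return per-document TF, corpus DF, and IDF = ln(N / DF) for each term."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    tf = [
        {term: c / len(tokens) for term, c in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return tf, df, idf

def top_keywords(documents, doc_index, k=2):
    """Rank the terms of one document by TF * IDF (ties broken alphabetically)."""
    tf, _, idf = tf_df_idf(documents)
    scores = {term: value * idf[term] for term, value in tf[doc_index].items()}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tf, df, idf = tf_df_idf(docs)
keywords = top_keywords(docs, 0, k=2)
print(df["the"], idf["the"], keywords)
```

A term such as "the" that appears in every document gets an IDF of zero under this formula, so it never ranks as a keyword.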

5、Converting unstructured text data to structured data
[Competency]

Bag of words model

Matrix decomposition

Word Embedding Model GloVe

Word Embedding Model Word2Vec (Skip-Gram & CBOW)

[Application]

Candidates should be able to train and use the word embedding model for multiple documents.

Candidates should be able to apply structured documents to text classification, sentiment analysis, text clustering and text summarization.

1、Naive Bayes
[Competency]

Naive Bayes (independence assumption, normalization of probability, Laplace smoothing, null value problem)

[Application]

Candidates should be able to use data mining software to build a naive Bayes model, interpret the model results, and evaluate the effectiveness of the model.
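The independence assumption, Laplace smoothing (add-one correction), and normalization of probabilities can be illustrated with a small categorical naive Bayes sketch. This is a simplified example under stated assumptions: all fields are categorical, the smoothing constant alpha defaults to 1, and the weather-style toy data and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies for each field."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # key: (field index, class label)
    field_values = defaultdict(set)       # distinct values seen per field
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            field_values[i].add(value)
    return class_counts, value_counts, field_values

def predict_proba(model, row, alpha=1.0):
    """Posterior class probabilities with Laplace smoothing, normalized to sum to 1."""
    class_counts, value_counts, field_values = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n_label in class_counts.items():
        log_p = math.log(n_label / total)  # prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            n_values = len(field_values[i])
            # Smoothing keeps unseen value/class combinations from giving probability 0.
            log_p += math.log((count + alpha) / (n_label + alpha * n_values))
        log_scores[label] = log_p
    max_log = max(log_scores.values())
    unnormalized = {l: math.exp(s - max_log) for l, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {l: p / z for l, p in unnormalized.items()}

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
proba = predict_proba(model, ("overcast", "mild"))
print(proba)
```

In this toy data "overcast" never occurs with class "no". Without smoothing that single zero count would force the posterior for "no" to zero regardless of the other fields.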

2、Decision tree (classification tree and regression tree)
[Comprehension]

PRISM decision rule algorithm

CHAID decision tree algorithm (CHAID field selection method)

[Competency]

ID3 decision tree algorithm (ID3 field selection method, how to use decision tree for classification prediction, the relationship between decision tree and decision rule, the disadvantages of ID3 algorithm)

C4.5 decision tree algorithm (C4.5 field selection method, C4.5 numerical field processing method, C4.5 null value processing method, C4.5 pruning methods (pre-pruning method, pessimistic pruning method))

CART decision tree algorithm (classification tree and regression tree, field selection method of CART classification tree, pruning method of CART classification tree)

CART regression tree algorithm (the field selection method of CART regression tree, how to use model tree to improve the performance of CART regression tree)

[Application]

Candidates should be able to use data mining software to build a classification tree model, interpret the model results, and evaluate the effectiveness of the model.

Candidates should be able to use data mining software to build regression tree models, interpret model results, and evaluate model effectiveness.
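The ID3 field selection method chooses the field with the largest information gain, that is, the largest reduction in entropy of the target field. The sketch below shows only this selection step, not tree construction or pruning. The toy records and function names are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, field):
    """Entropy of the labels minus the weighted entropy after splitting on `field`."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[field], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_field(rows, labels):
    """ID3-style selection: the field with the highest information gain."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))

rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["no", "no", "yes", "yes"]
print(best_field(rows, labels))
```

Information gain tends to favor fields with many distinct values, which is one of the ID3 disadvantages that the gain ratio in C4.5 is meant to address.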

3、Neural network and deep learning
[Comprehension]

Overview of BP neural network (understand the origin and development of neural network)

Convolutional Neural Networks (CNN) (Understand the origin and development of Convolutional Neural Networks)

Recurrent Neural Networks (RNN) (Understand the origin and development of RNN)

[Competency]

Perceptron and the limit of perceptron

Multi-Layer Perceptron

BP neural network architecture

Composition of neurons: combination function and activation function

How a BP neural network transmits information

Updating weight values and constant terms (biases)

Data preparation before training model (data preparation for classification model, data preparation for prediction model)

Relationship between BP neural network and logistic regression, linear regression and nonlinear regression

[Application]

Candidates should be able to use data mining software to build a BP neural network model, interpret the model results, and evaluate the effectiveness of the model
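The composition of a neuron and the weight update can be sketched for a single neuron. This example assumes a weighted-sum combination function, a sigmoid activation function, a squared-error loss, and one gradient descent step. The input values, learning rate, and function names are hypothetical, and a full BP network would propagate the error term back through several layers.

```python
import math

def combination(inputs, weights, bias):
    """Combination function: weighted sum of the inputs plus the constant term."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def sigmoid(z):
    """Activation function: squashes the combined input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, weights, bias):
    return sigmoid(combination(inputs, weights, bias))

def update(inputs, weights, bias, target, learning_rate=0.5):
    """One gradient descent step on the loss 0.5 * (output - target) ** 2."""
    output = forward(inputs, weights, bias)
    # Chain rule: dLoss/dz = (output - target) * sigmoid'(z), sigmoid' = out * (1 - out)
    delta = (output - target) * output * (1.0 - output)
    new_weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * delta
    return new_weights, new_bias

inputs, weights, bias, target = [1.0, 2.0], [0.5, -0.5], 0.0, 1.0
before = forward(inputs, weights, bias)
weights, bias = update(inputs, weights, bias, target)
after = forward(inputs, weights, bias)
print(before, after)
```

With a sigmoid activation and no hidden layer, this neuron has the same functional form as logistic regression, which is one way to see the relationship listed above.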

4、Support Vector Machine
[Comprehension]

Overview of support vector machine

Linearly separable

Best linear segmentation hyperplane

Decision boundary

[Competency]

Support vector

Linear support vector machine

Non-linear transformation

Kernel function (Polynomial Kernel, Gaussian Radial Basis Function, Sigmoid Kernel)

Non-linear support vector machine

The relationship between support vector machines and neural networks

[Application]

Candidates should be able to use data mining software to build a support vector machine model, interpret the model results, and evaluate the effectiveness of the model.

5、Ensemble method
[Comprehension]

Overview of ensemble methods

[Competency]

Sampling technique

Sampling method on training data

Sampling method on input variables

Bagging method (random forest)

Boosting method (AdaBoost, XGBoost, GBDT, LightGBM)

[Application]

Candidates should be able to use data mining software to build an ensemble model, interpret the model results, and evaluate the effectiveness of the model.

6、Cluster analysis
[Comprehension]

Concept of clustering

[Competency]

Similarity measurement (similarity measurement of binary variables, similarity measurement of mixed categorical variables and numerical variables)

Calculation of distance between sample points (Manhattan Distance, City-Block Distance, Euclidean Distance)

Clustering algorithm (Exclusive vs. Non-Exclusive (Overlapping) clustering algorithm, hierarchical clustering method, partition clustering method)

Hierarchical clustering algorithm (single link method, complete link method, average link method, center method, Ward’s method)

Partition clustering algorithm (K-Means method, EM method, K-Medoids method, neural network SOM method, two-step method)

Density Clustering Algorithm (DBSCAN)

Judgment of the number of groups (R-Squared (R2), Semi-Partial R-Squared, Root-Mean-Square Standard Deviation (RMSSTD), Silhouette Coefficient)

[Application]

Candidates should be able to use data mining software to build clustering models, interpret model results, and provide marketing advice.
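The K-Means method from the partition clustering algorithms above can be sketched as follows. This is a simplified version under stated assumptions: Euclidean distance, a fixed number of iterations, and the first k points as initial centroids so that the run is deterministic. Real software typically uses random or smarter initialization. The sample points and function name are hypothetical.

```python
import math

def kmeans(points, k, n_iter=10):
    """Basic K-Means: alternate between assigning points and recomputing centroids."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as seeds
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
    return centroids, labels

points = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]
centroids, cluster_labels = kmeans(points, k=2)
print(centroids, cluster_labels)
```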

7、Association rules
[Comprehension]

Concept of association rules

[Competency]

Evaluation indicators of association rules (support, confidence, lift)

Apriori algorithm (disadvantages of the brute force method, the theoretical basis of the Apriori algorithm, the generation of candidate itemsets, the pruning of candidate itemsets)

Support and confidence issues (lift index)

Association rule generation

Extension of association rules (addition of virtual goods, negative association rules, dependency network)

[Application]

Candidates should be able to use data mining software to build association rule models, interpret model results, and provide marketing suggestions.
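The three evaluation indicators for a rule "antecedent implies consequent" can be computed directly from a transaction list, as in the sketch below. It uses the standard definitions of support, confidence, and lift and does not implement candidate generation as in Apriori. The shopping-basket data and function names are hypothetical.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) = support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
rule_support = support(transactions, {"diapers", "beer"})
rule_confidence = confidence(transactions, {"diapers"}, {"beer"})
rule_lift = lift(transactions, {"diapers"}, {"beer"})
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests the antecedent makes the consequent more likely than its baseline frequency. A rule can have high confidence and still have a lift at or below 1 when the consequent is common anyway.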

8、Sequential patterns
[Comprehension]

Concept of sequential patterns

[Competency]

Evaluation indicators of sequential patterns (support, confidence)

AprioriAll algorithm (the problem of the brute force method, the theoretical basis of the AprioriAll algorithm, the generation of candidate sequences, the pruning of candidate sequences)

Extension of sequential patterns (state transfer network)

[Application]

Candidates should be able to use data mining software to build sequential pattern models, interpret model results, and provide marketing advice.

9、Model evaluation
[Competency]

Confusion matrix (Accuracy, Precision, Recall, F-Measure)

KS Chart

ROC Chart

GINI Chart

Response Chart

Gain Chart

Lift Chart

Profit Chart

Average Squared Error

[Application]

Candidates should be able to use data mining software to compare the pros and cons of different models
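The confusion matrix indicators can be computed from actual and predicted labels as in the sketch below. It assumes a binary problem with the positive class coded as 1, uses F1 (the balanced form of the F-Measure), and assumes at least one predicted and one actual positive so that no denominator is zero. The sample labels and function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification result."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall, and F1 derived from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
metrics = classification_metrics(actual, predicted)
print(counts, metrics)
```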

1、Automatic machine learning
[Comprehension]

Basic concepts of automatic machine learning

Automatic machine learning platform

[Competency]

Methods of automatic data preprocessing

Model building method for automatic machine learning

Automatic model evaluation method

[Application]

Candidates should be able to use automatic machine learning technology to quickly build models, interpret model results, and evaluate model effectiveness.

2、Class imbalance
[Comprehension]

Unbalanced data definition

Unbalanced data scenario

Limitations of traditional learning methods in unbalanced data

Problems caused by unbalanced categories

[Competency]

Detection methods for the class imbalance problem

Over-sampling

Under-sampling

Model penalty technique

[Application]

Candidates should be able to use imbalance handling techniques to improve the performance of the model
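Over-sampling can be illustrated with its simplest form, random over-sampling, which duplicates randomly chosen minority-class rows until every class is as large as the largest one. This is a minimal sketch: the function name, the seed, and the toy data are assumptions, and synthetic methods that generate new rows are not shown.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

rows = list(range(10))                 # hypothetical row identifiers
labels = ["neg"] * 8 + ["pos"] * 2     # 8:2 class imbalance
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Resampling is normally applied to the training data only, so that the test data still reflects the original class distribution.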

3、Semi-supervised learning
[Comprehension]

Relationship between supervised learning, unsupervised learning and semi-supervised learning

[Competency]

Basic idea of semi-supervised learning

Basic assumptions of semi-supervised learning

Semi-supervised classification

Semi-supervised regression

Semi-supervised clustering

Semi-supervised dimensionality reduction

Semi-supervised learning algorithm based on SVM

Semi-supervised learning algorithm based on kernel method

EM semi-supervised learning algorithm

[Application]

Candidates should be able to use semi-supervised learning to reduce the cost of developing decision-making models

4、Model optimization
[Comprehension]

Purpose of model parameter optimization

Purpose of modeling threshold optimization

[Competency]

Method of model parameter optimization

Method of optimizing modeling threshold

[Application]

Candidates should be able to use model parameter optimization to build more accurate data mining models

Candidates should be able to use modeling threshold optimization to build more accurate data mining models
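Modeling threshold optimization can be sketched as a search over candidate cut-off values for the predicted score. This example assumes the F1 score as the criterion, although profit or another business measure could be used instead. The scores, labels, candidate thresholds, and function names are hypothetical.

```python
def f1_at_threshold(actual, scores, threshold):
    """F1 score when every score >= threshold is predicted as the positive class (1)."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(actual, scores, candidates):
    """Return the candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: f1_at_threshold(actual, scores, t))

actual = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.35, 0.4, 0.55, 0.7, 0.9]   # predicted probabilities from some model
chosen = best_threshold(actual, scores, [0.3, 0.4, 0.5, 0.6])
print(chosen)
```

The threshold should be chosen on validation data rather than on the final test data, otherwise the reported performance is optimistic.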

六、Recommended Reading

Note: Some books in the recommended reading list are tied to specific software. The objective choice part of the exam does not test the use of software, while the case practical part requires candidates to use relevant software for modeling and analysis. Candidates can select reading material from the list according to their needs. They do not have to read every recommended book and can focus on the key points highlighted in the exam outline.

[1] Jiawei Han, Micheline Kamber, Jian Pei. Data Mining: Concepts and Techniques (3rd Edition of the Original Book) [M]. Translated by Fan Ming and Meng Xiaofeng. Mechanical Industry Press, 2012. (Required)
[2] Zhou Zhihua. Machine learning [M]. Tsinghua University Press, 2016. (Required)
[3] Chris Albon. Python Machine Learning Handbook: From Data Preprocessing to Deep Learning. Electronic Industry Press, 2019. (Required)
[4] Li Bo. Practical application of machine learning. Posts and Telecom Press, 2017. (Required)
[5] Alice Zheng, Amanda Casari. Proficient in Feature Engineering. Posts and Telecom Press, 2019. (Required)
[6] Dipanjan Sarkar. Python text analysis [M]. Mechanical Industry Press, 2018. (Required)
[7] Jingguanzhijia. SPSS Modeler+Weka data mining from entry to actual combat. Electronic Industry Press, 2019. (Optional)
[8] Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining (2nd Edition of the Original Book) [M]. Translated by Duan Lei and Zhang Tianqing. Mechanical Industry Press, 2019. (Optional)
[9] Zhao Weidong, Dong Liang. Python machine learning practical case. Tsinghua University Press, 2019. (Optional)
[10] Yoav Goldberg. Natural language processing based on deep learning [M]. Mechanical Industry Press, 2018. (Optional)
[11] Lu Wei. Deep Learning Notes. Peking University Press, 2020. (Optional)
[12] Data mining website: KDnuggets (https://www.kdnuggets.com/) (Extended learning)
[13] Data mining website: Kaggle (https://www.kaggle.com/) (Extended learning)
CDA Certification Exam Committee
CDA Institute