CDA (Certified Data Analyst), or “CDA Data Analyst”, is a professional, authoritative international qualification for practitioners across all industries, developed against the backdrop of the digital economy and the era of artificial intelligence. It aims to improve the public's digital skills, support the digital transformation of enterprises, and promote the digital development of industry. [The Certified Data Analyst (CDA) Talent Industry Standard] is a scientific, professional, international skills guideline for data-related positions. The CDA Exam Outline defines the scope and key points of the examination. When preparing for the exam, candidates can refer to the outline to acquire the skills and knowledge needed to become a professional data analyst.
Exam method: held four times a year (the last Saturday of March, June, September, and December), as an offline written exam and a computer-based exam
Exam question types: Objective questions (60 single-choice questions, 30 multiple-choice questions, 10 content-related questions)
Practical case question (1 question)
Exam duration: 90 minutes (objective choice questions), 120 minutes (case practical question), 210 minutes in total
Exam scores: The final exam scores are classified into four grades: A, B, C, and D. Passing grades include A, B, and C, while D is the failing grade.
Examination requirements: The objective questions are a closed-book, computer-based exam; candidates do not need to bring calculators or other unrelated supplies.
For the practical case question, candidates must bring their own computer with data mining software installed (such as Python, SQL, SPSS Modeler, R, SAS, or WEKA). The computer must support copying files via USB and have decompression software installed in order to carry out the case analysis. The case data will be provided in a unified CSV file.
The three different mastery levels that candidates need to obtain for different types of data analysis knowledge are comprehension, competency, and application. Exam candidates should proceed with their studies based on these different knowledge requirements.
1. Comprehension: Candidates should understand and grasp key points in data analysis regulations, understand the connotation and extension of these key points, distinguish the differences and relation of these key points, and correctly elaborate each key point.
2. Competency: Candidates should master important data analysis knowledge, and understand and memorize the relevant theories and methods. They must be able to explain data analysis knowledge logically according to different requirements. Candidates' knowledge of and competency with different types of data analysis is the focus of this exam.
3. Application: Candidates should be able to apply data analysis theory in practice, combining it with related tools for commercial applications, and to propose specific implementation procedures and strategies for solving problems under given requirements or conditions.
a. Summary of data mining (3%)
b. Data mining methodology (3%)
c. Basic data mining technology (4%)
d. Advanced data mining technology (5%)
a. Advanced data processing (5%)
b. Feature engineering summary (2%)
c. Feature construction (3%)
d. Feature selection (5%)
e. Feature conversion (5%)
f. Feature learning (5%)
a. Summary of Natural Language Processing (2%)
b. Word segmentation and part-of-speech tagging (4%)
c. Summary of text mining (2%)
d. Keyword extraction (4%)
e. Converting unstructured text data to structured data (8%)
a. Naive Bayes (4%)
b. Decision tree (classification tree and regression tree) (5%)
c. Neural network and deep learning (5%)
d. Support Vector Machine (4%)
e. Ensemble methods (5%)
f. Cluster analysis (5%)
g. Association rules (4%)
h. Sequential patterns (3%)
i. Model evaluation (5%)
(This part is examined through the practical case and is not included in the weighting of the objective questions.)
a. Automated machine learning (AutoML)
b. Class imbalance
c. Semi-supervised learning
d. Model optimization
Data filtering (understand how to use data filtering to establish a segmentation model to improve the prediction effect of the model)
Expansion method of internal/external data
Advanced filling techniques for missing values, including KNN filling and XGBoost filling
Advanced data conversion technology, including data generalization, data trend discretization
Able to use advanced data preprocessing technology to filter data to establish a segmentation model
Able to use advanced data preprocessing technology to detect and fill missing values
Able to use advanced data preprocessing technology to process data generalization
Able to use advanced data preprocessing technology to process data trend discretization
Capable of evaluating the impact of the above-mentioned different data processing methods on model performance
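The KNN filling mentioned above can be illustrated with a minimal sketch. It assumes numeric rows in which only the target column may be missing, uses unscaled Euclidean distance, and fills with the mean of the k nearest complete rows. The function name `knn_impute` and the sample data are hypothetical.

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill None in column target_idx with the mean of the k nearest complete rows."""
    complete = [r for r in rows if r[target_idx] is not None]
    filled = []
    for r in rows:
        if r[target_idx] is not None:
            filled.append(list(r))
            continue

        # Distance uses every column except the one being filled.
        def dist(other, r=r):
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(r, other))
                                 if i != target_idx))

        neighbours = sorted(complete, key=dist)[:k]
        new_row = list(r)
        new_row[target_idx] = sum(n[target_idx] for n in neighbours) / len(neighbours)
        filled.append(new_row)
    return filled

# Hypothetical data: [age, income]; the third row is missing income.
data = [[25, 3000.0], [27, 3200.0], [26, None], [60, 9000.0]]
imputed = knn_impute(data, target_idx=1, k=2)
print(imputed[2])
```

In practice the features would usually be standardized first, so that no single column dominates the distance.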
Importance of feature engineering
Feature understanding
Feature improvement (the impact of data cleaning on features)
Feature engineering coverage
Purpose of feature selection
Feature construction method
Feature conversion method
Automatic learning of features
Promote AI with AI
Preparation before feature construction
Feature null value processing
Standardization of features
Categorical feature coding
Encoding of ordinal features
Binning of numerical features
Construct polynomial features
Construct interaction features
Feature normalization
Candidates should be able to use data mining to properly construct features as input for feature selection in the next stage
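Three of the construction steps listed above (categorical encoding, binning, and polynomial/interaction features) can be sketched as follows. The helper names, category list, and bin edges are illustrative assumptions, not part of the outline.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]

def bin_index(value, edges):
    """Return the index of the first bin whose upper edge is >= value."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)  # value lies above the last edge

def degree2_features(x1, x2):
    """Original features, their squares, and the interaction term x1 * x2."""
    return [x1, x2, x1 * x1, x1 * x2, x2 * x2]

colour_vec = one_hot("red", ["blue", "green", "red"])
age_bin = bin_index(37, edges=[18, 35, 60])
poly = degree2_features(2.0, 3.0)
print(colour_vec, age_bin, poly)
```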
Invalid variables (irrelevant variables, redundant variables)
Statistics-based feature selection (chi-square test, ANOVA, and t-test)
Model-based variable selection (decision tree, logistic regression, random forest)
Selection of highly relevant features
Recursive feature selection
Candidates should be able to use data mining to select key features and to evaluate the impact of different feature selection methods on model performance.
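As an illustration of statistics-based selection, the chi-square statistic for a categorical feature against a categorical target can be computed from a contingency table. This is a minimal sketch with a hypothetical 2x2 table; it returns only the statistic and assumes every expected count is non-zero.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of feature and target.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are feature values, columns are target classes.
related = [[30, 10], [10, 30]]
unrelated = [[20, 20], [20, 20]]
print(chi_square(related), chi_square(unrelated))
```

A larger statistic suggests a stronger dependence between the feature and the target, which is why features are commonly ranked by it.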
Linear feature transformation: principal component analysis (PCA)
Non-linear feature transformation: Kernel PCA
Feature transformation to maximize the separability between classes: linear discriminant analysis (LDA)
Feature transformation by matrix factorization: non-negative matrix factorization (NMF)
Feature transformation on sparse matrices: truncated singular value decomposition (TSVD)
Candidates should be able to use data mining for feature conversion. At the same time, evaluate the impact of different feature conversion methods on model performance.
Association rule-based feature learning
Neural network-based feature learning
Deep learning-based feature learning
Text feature learning based on word embedding
Candidates should be able to use data mining for automatic feature learning. At the same time, evaluate the impact of different feature learning methods on model effectiveness.
BOSON's Chinese Semantic Platform
Research category of natural language processing
Word segmentation
Stemming (root reduction)
Part of speech tagging
Synonym tagging
Concept tagging
Role tagging
Candidates should be able to use BOSON's Chinese semantic platform for language processing
Types and meanings of parts of speech
N-Gram and words
Difficulties in word segmentation and part-of-speech tagging
Rule-based word segmentation
Statistical word segmentation
Part-of-speech tagging
Candidates should be able to use Chinese word segmentation and part-of-speech tagging technology for word segmentation and part-of-speech tagging for multiple articles
Information retrieval technology: full-text scanning
Information retrieval technology: signature files
Information retrieval technology: inverted index (inversion)
Controlled vocabulary
Keyword Index
Text mining applications
Information retrieval technology: vector space model
The process of text mining
Text visualization
Candidates should be able to convert multiple documents and queries into vector format, and calculate the similarity between queries and documents.
Candidates should be able to use text visualization technology to present the content of the file in a word cloud.
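The vector space model requirement above (convert documents and a query to vectors, then compute similarity) can be sketched with raw term counts and cosine similarity. The documents, the query, and the whitespace tokenizer are simplifying assumptions; Chinese text would need a word segmentation step first.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Represent a text as term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocabulary]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["data mining finds patterns in data", "text mining processes documents"]
query = "data mining"
vocabulary = sorted(set(" ".join(docs + [query]).lower().split()))

q_vec = to_vector(query, vocabulary)
scores = [cosine(q_vec, to_vector(d, vocabulary)) for d in docs]
print(scores)
```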
TF, DF and IDF
Part of speech
Keyword extraction method
For words in multiple documents and queries, candidates should be able to calculate TF, DF, IDF and parts of speech and extract important keywords.
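A minimal sketch of the TF, DF, and IDF calculation described above. It assumes whitespace tokenization, TF as relative frequency within a document, and IDF defined as log(N / DF); other IDF variants (for example with smoothing) are also common.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return document frequencies, IDF values, and per-document TF-IDF weights."""
    tokenised = [d.lower().split() for d in documents]
    n_docs = len(tokenised)
    # DF: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log(n_docs / df[t]) for t in df}
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({t: (tf[t] / len(tokens)) * idf[t] for t in tf})
    return df, idf, weights

docs = ["data mining data", "text mining", "data analysis"]
df, idf, weights = tf_idf(docs)

# The highest-weighted term of a document is a simple keyword candidate.
top_keyword = max(weights[1], key=weights[1].get)
print(df, top_keyword)
```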
Bag of words model
Matrix decomposition
Word Embedding Model GloVe
Word Embedding Model Word2Vec (Skip-Gram & CBOW)
Candidates should be able to train and use the word embedding model for multiple documents.
Candidates should be able to apply the structured representation of documents to text classification, sentiment analysis, text clustering, and text summarization.
Naive Bayes (independence assumption, normalization of probability, Laplace smoothing, null value problem)
Candidates should be able to use data mining software to build a naive Bayes model, interpret the model results, and evaluate the effectiveness of the model.
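The independence assumption, probability normalization, and Laplace smoothing named above can be seen together in a small categorical naive Bayes sketch. The toy weather data and function names are hypothetical, and the smoothing constant `alpha=1.0` is an assumption.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def predict_proba(model, row, alpha=1.0):
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        p = n_label / total  # prior
        for i, v in enumerate(row):
            # Laplace smoothing: unseen values get a small non-zero probability.
            p *= (feature_counts[(i, label)][v] + alpha) / (n_label + alpha * len(values[i]))
        scores[label] = p
    norm = sum(scores.values())  # normalize so the probabilities sum to 1
    return {label: s / norm for label, s in scores.items()}

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
proba = predict_proba(model, ("sunny", "cool"))
print(proba)
```

Without smoothing, the class "no" would receive probability zero here, because "cool" never occurs with it in the training rows.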
PRISM decision rule algorithm
CHAID decision tree algorithm (CHAID field selection method)
ID3 decision tree algorithm (ID3 field selection method, how to use decision tree for classification prediction, the relationship between decision tree and decision rule, the disadvantages of ID3 algorithm)
C4.5 decision tree algorithm, including the C4.5 field selection method, numerical field processing method, null value processing method, and pruning methods (pre-pruning, pessimistic pruning)
CART decision tree algorithm (classification tree and regression tree, field selection method of CART classification tree, pruning method of CART classification tree)
CART regression tree algorithm (the field selection method of CART regression tree, how to use model tree to improve the performance of CART regression tree)
Candidates should be able to use data mining software to build a classification tree model, interpret the model results, and evaluate the effectiveness of the model.
Candidates should be able to use data mining software to build regression tree models, interpret model results, and evaluate model effectiveness.
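The field selection criteria behind ID3 and CART can be illustrated with a short sketch: ID3 selects the field with the highest information gain, while the CART classification tree is usually described in terms of Gini impurity. The toy fields below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

outlook = ["sunny", "sunny", "rainy", "rainy"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(outlook, play)
gain_windy = information_gain(windy, play)
print(gain_outlook, gain_windy, gini(play))
```

Here `outlook` separates the classes perfectly while `windy` carries no information, so ID3 would split on `outlook`.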
Overview of BP neural network (understand the origin and development of neural network)
Convolutional Neural Networks (CNN) (Understand the origin and development of Convolutional Neural Networks)
Recurrent Neural Networks (RNN) (Understand the origin and development of RNN)
Perceptron and the limit of perceptron
Multi-Layer Perceptron
BP neural network architecture
Composition of neurons: combination function and activation function
How a BP neural network propagates information
Updating the weights and constant (bias) terms
Data preparation before training model (data preparation for classification model, data preparation for prediction model)
Relationship between BP neural network and logistic regression, linear regression and nonlinear regression
Candidates should be able to use data mining software to build a BP neural network model, interpret the model results, and evaluate the effectiveness of the model
Overview of support vector machine
Linearly separable
Optimal linear separating hyperplane
Decision boundary
Support vector
Linear support vector machine
Non-linear transformation
Kernel function (Polynomial Kernel, Gaussian Radial Basis Function, Sigmoid Kernel)
Non-linear support vector machine
The relationship between support vector machines and neural networks
Candidates should be able to use data mining software to build a support vector machine model, interpret the model results, and evaluate the effectiveness of the model.
Overview of ensemble methods
Sampling technique
Sampling method on training data
Sampling method on input variables
Bagging method (random forest)
Boosting methods (AdaBoost, XGBoost, GBDT, LightGBM)
Candidates should be able to use data mining software to build an ensemble model, interpret the model results, and evaluate the effectiveness of the model.
Concept of clustering
Similarity measurement (similarity measurement of binary variables, similarity measurement of mixed categorical variables and numerical variables)
Calculation of distance between sample points (Manhattan (City-Block) Distance, Euclidean Distance)
Clustering algorithm (Exclusive vs. Non-Exclusive (Overlapping) clustering algorithm, hierarchical clustering method, partition clustering method)
Hierarchical clustering algorithm (single linkage, complete linkage, average linkage, centroid method, Ward’s method)
Partition clustering algorithm (K-Means method, EM method, K-Medoids method, neural network SOM method, two-step method)
Density Clustering Algorithm (DBSCAN)
Judgment of the number of groups (R-Squared (R2), Semi-Partial R-Squared, Root-Mean-Square Standard Deviation (RMSSTD), Silhouette Coefficient)
Candidates should be able to use data mining software to build clustering models, interpret model results, and provide marketing advice.
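As one example of the partition clustering methods above, K-Means alternates between assigning points to the nearest centroid and recomputing centroids. The sketch below assumes Euclidean distance, fixed starting centroids (so the result is deterministic), and a fixed number of iterations instead of a convergence test.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain K-Means; returns the final centroids and the last cluster assignment."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```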
Concept of association rules
Evaluation indicators of association rules (support, confidence, lift)
Apriori algorithm (disadvantages of the brute force method, the theoretical basis of the Apriori algorithm, the generation of candidate itemsets, the pruning of candidate itemsets)
Limitations of support and confidence (the lift index)
Association rule generation
Extension of association rules (addition of virtual items, negative association rules, dependency network)
Candidates should be able to use data mining software to build association rule models, interpret model results, and provide marketing suggestions.
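The three evaluation indicators for a rule A -> B can be computed directly from transactions, as in this minimal sketch. The basket data are hypothetical, and the code assumes both the antecedent and the consequent occur at least once.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(transactions, set(antecedent) | set(consequent))
    s_a = support(transactions, antecedent)
    s_c = support(transactions, consequent)
    confidence = s_both / s_a
    lift = confidence / s_c
    return s_both, confidence, lift

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
sup, conf, lift = rule_metrics(transactions, {"bread"}, {"milk"})
print(sup, conf, lift)
```

In this example the confidence looks reasonable, yet the lift is below 1, meaning bread buyers are slightly less likely than average to buy milk. This is the kind of issue with support and confidence that the lift index is meant to expose.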
Concept of sequential patterns
Evaluation indicators of sequential patterns (support, confidence)
AprioriAll algorithm (the problem of the brute force method, the theoretical basis of the AprioriAll algorithm, the generation of candidate sequences, the pruning of candidate sequences)
Extension of sequential patterns (state transition network)
Candidates should be able to use data mining software to build sequential pattern models, interpret model results, and provide marketing advice.
Confusion matrix (Accuracy, Precision, Recall, F-Measure)
KS Chart
ROC Chart
GINI Chart
Response Chart
Gain Chart
Lift Chart
Profit Chart
Average Squared Error
Candidates should be able to use data mining software to compare the pros and cons of different models
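The confusion matrix indicators listed above follow from four counts. A minimal sketch for binary labels, assuming 1 is the positive class and that at least one positive is predicted and present (so no denominator is zero):

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = confusion_metrics(y_true, y_pred)
print(metrics)
```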
Basic concepts of automatic machine learning
Automatic machine learning platform
Methods of automatic data preprocessing
Model building method for automatic machine learning
Automatic model evaluation method
Candidates should be able to use automatic machine learning technology to quickly build models, interpret model results, and evaluate model effectiveness.
Unbalanced data definition
Unbalanced data scenario
Limitations of traditional learning methods in unbalanced data
Problems caused by imbalanced classes
Detection methods for the class imbalance problem
Over-sampling
Under-sampling
Model penalty technique
Candidates should be able to use unbalanced processing technology to improve the performance of the model
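Of the techniques above, random over-sampling is the simplest to sketch: minority-class rows are duplicated at random until every class matches the size of the largest one. The function name and data are hypothetical, and the fixed seed is only there to make the example repeatable.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == label]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels

rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Over-sampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.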
Relationship between supervised learning, unsupervised learning and semi-supervised learning
Basic idea of semi-supervised learning
Basic assumptions of semi-supervised learning
Semi-supervised classification
Semi-supervised regression
Semi-supervised clustering
Semi-supervised dimensionality reduction
Semi-supervised learning algorithm based on SVM
Semi-supervised learning algorithm based on kernel method
EM semi-supervised learning algorithm
Candidates should be able to use semi-supervised learning to reduce the cost of developing decision-making models
Purpose of model parameter optimization
Purpose of modeling threshold optimization
Method of model parameter optimization
Method of optimizing modeling threshold
Candidates should be able to use model parameter optimization to build more accurate data mining models
Candidates should be able to use modeling threshold optimization to build more accurate data mining models
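Modeling threshold optimization can be illustrated by searching for the cut-off on predicted scores that maximizes a chosen metric. The sketch below assumes F1 as the metric and uses each observed score as a candidate threshold; the scores and labels are hypothetical.

```python
def f1_at(y_true, scores, threshold):
    """F1 when every score >= threshold is predicted as the positive class (1)."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, scores):
    """Try each distinct score as a threshold and keep the one with the highest F1."""
    return max(sorted(set(scores)), key=lambda th: f1_at(y_true, scores, th))

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.7, 0.9]
threshold = best_threshold(y_true, scores)
print(threshold, f1_at(y_true, scores, threshold), f1_at(y_true, scores, 0.5))
```

The threshold should be chosen on validation data, since a cut-off tuned on the training set tends to look better than it really is.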
Notes: Some books in the recommended study bibliography are tied to specific software. The objective part of the exam does not test the use of software, whereas the practical case part requires candidates to use relevant software for modeling and analysis. Candidates can select reading material from the list of recommended books based on their needs. They do not have to read all the recommended books and can instead focus on the key points highlighted in the exam outline.