Interviewer: Hello everyone. Today we have invited Guo Yinjiao to take part in an online interview with a CDA certificate holder. Guo Yinjiao is currently in her third year of graduate school and, more surprisingly, is already the head of a data mining project at a listed company. Welcome, Guo Yinjiao. Could you say hello to everyone?
Guo Yinjiao: Hello, my name is Guo Yinjiao, I am currently studying in the School of Statistics and Mathematics at Yunnan University of Finance and Economics, majoring in Applied Statistics. I'm in my third year of study, and I'm also in charge of a data mining project for a listed company in Yunnan.
Question 1.
I've read that you won many competitions during your studies. Can you tell us about that experience?
Guo Yinjiao.
I have participated in many mathematical modeling competitions from undergraduate to graduate school, both on campus and off campus.
The first time I participated in a modeling competition was when I was a sophomore, and it was a warm-up for the national competition called Mathorcup.
Like other undergraduate modeling competitions, the competition required me to build a model and submit a paper based on a given topic within three days, and I participated in the competition with a team of two students from the same class. Since we had no previous experience in modeling competitions, we did not plan what to do before the competition.
However, I soon realized that it is not easy to solve a specific problem and submit a good paper in a limited time.
First, it tests problem-solving ability, which covers both the accumulation and application of knowledge and hands-on ability, meaning the programming skills needed for modeling. These cannot be picked up in the days or weeks before a competition. Second, it tests the team's division of labor. A mathematical modeling competition has three major parts, namely modeling, programming and writing, and each needs a core person in charge. We did not divide the work well, which seriously slowed our progress. Last is teamwork. For first-time participants, three days is very tight, so team members need to keep the same pace and a unified line of thinking, and a harmonious team atmosphere greatly improves efficiency.
The result of this competition was predictable: we did not win an award. On the other hand, we gained experience. In the later National Student Mathematical Modeling Competition and the Asia-Pacific Student Mathematical Modeling Competition, we developed good preparation habits, such as reading an excellent modeling paper several days in advance, exchanging ideas about the paper, and learning how to program the models. Although we occasionally stayed up all night in the lab during the competitions, we won a very good prize in the end, and it became a very special and valuable experience for me.
Question 2.
You studied statistics in the science field and then applied statistics in the economics field in graduate school. How do the two differ?
Guo Yinjiao.
In statistics, most of the courses are related to mathematics and statistical theory, such as mathematical analysis, advanced algebra, differential equations, stochastic processes, advanced mathematical statistics, time series analysis, multivariate statistical analysis and non-parametric statistics. Applied statistics places more emphasis on practice. In addition to statistical theory, economic statistics requires some economics-related courses, and the focus is on applying theoretical knowledge to practical problems. If your direction is data mining, you also need to learn machine learning and other widely applicable models and theory. To sum up, statistics is more academic, while applied statistics is more software- and domain-specific, more practical, and better suited to the workplace.
Question 3.
You are already a project leader at work, so you must have a vivid memory of your first real-world project. Can you share it with us?
Guo Yinjiao.
Yes. I have been involved in my mentor's data mining project since my first semester of graduate school. When I found out I was going to work on this project, I spent my free time during the winter break learning SQL, including Oracle and MySQL.
I went through the video course once and wrote everything out by hand. I thought I had mastered it at the time, but when I got into the actual production environment, I realized how complex the problem was. First, enterprise data is no longer just a few tables but hundreds of thousands, with closely interlinked logic. Second, the analysis has to be combined with specific business needs rather than simply writing SQL.
When I first started the data mining project, I subconsciously assumed it would be similar to modeling at school and would mainly test the ability to model and solve practical problems, but I found that it is not.
Data mining includes the steps of information collection, data integration, data cleaning, feature engineering, and modeling analysis and evaluation.
In information collection, the first and most important thing is to understand the company's business comprehensively, because the data is generated by the business. Understanding the business is how you learn what data the company already has and how the data in different modules is connected. It is also the point at which to ask whether the goal can be achieved and, if so, to what effect. Data integration means bringing data from different sources and in different formats together, logically or physically, which tests the analyst's ability to use tools.
Personally, I did not pay much attention to data preprocessing when I was in school, but in a real environment I found that data cleaning and feature engineering take up most of the time in a data mining project. How should outliers and missing values be handled? How should features be constructed?
Solving these problems tests both the ability to apply theoretical knowledge and programming ability, and the answers cannot be separated from the business context. Another difficulty is that I have to keep learning things I have never touched before. Besides new models, there is knowledge outside statistical modeling. For example, to deliver results to the company's system through an interface, I need to learn back-end development, and to evaluate my model efficiently from different dimensions, I need to learn visual analysis tools such as Power BI. All of this has given me a great deal of practice and improvement without my noticing.
All in all, going through a complete data mining project teaches you a great deal and makes your career direction clearer.
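To make the data cleaning questions raised in this answer more concrete, here is a minimal sketch in Python of one common way to handle missing values, outliers and a derived feature. It is only an illustration: the toy records, the field names, the median imputation and the 1.5 x IQR clipping rule are all assumptions chosen for the example, not details of the project described in the interview.

```python
from statistics import median, quantiles

# Hypothetical toy records; None marks a missing value.
orders = [
    {"customer": "A", "amount": 120.0, "items": 3},
    {"customer": "B", "amount": None, "items": 2},
    {"customer": "C", "amount": 95.0, "items": 1},
    {"customer": "D", "amount": 110.0, "items": 2},
    {"customer": "E", "amount": 5000.0, "items": 4},  # suspiciously large
    {"customer": "F", "amount": 130.0, "items": 5},
]


def clean_amounts(rows):
    """Return new rows with missing amounts imputed, outliers clipped,
    and a derived feature added. The input rows are not modified."""
    observed = [r["amount"] for r in rows if r["amount"] is not None]

    # Missing values: fill with the median, which is robust to outliers.
    fill_value = median(observed)

    # Outliers: clip to the 1.5 * IQR fences (an assumed, common rule of thumb).
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    cleaned = []
    for r in rows:
        amount = fill_value if r["amount"] is None else r["amount"]
        amount = min(max(amount, low), high)
        # Feature construction: average spend per item in the order.
        cleaned.append({**r, "amount": amount, "amount_per_item": amount / r["items"]})
    return cleaned


cleaned_orders = clean_amounts(orders)
```

Whether to impute, clip or drop a record is exactly the kind of decision that depends on the business context: a very large order may be a data entry error in one company and a key customer in another.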
Question 4.
As a CDA holder, what was your best preparation strategy for passing the certification exam? How did you arrange your preparation time? Did you encounter any difficulties?
Guo Yinjiao.
From the start of my preparation to the exam took about half a month. As I had not taken the CDA before, I set aside one morning to read the statistics-related textbook systematically, and then spent 1 to 2 hours every day doing practice questions and analyzing my mistakes.
I do not recommend starting by blindly reading the textbook and watching videos. Instead, read the syllabus and do two sets of questions first, to understand the exam content and question types and get a general grasp of what is covered. While doing the questions for the first time, write down the knowledge points you do not know or are not familiar with, and focus on them later when reading the textbook. In addition, study the two sets of mock questions provided by the CDA teacher thoroughly, and do not look only at the questions you got wrong. For the questions you got right, you also need to understand what each option means and take good notes, because most exam questions stay close to this material.
For SQL, students who are working on a data analysis project can use it as an opportunity to practice, because the fastest way to master knowledge is to apply it. If you have no real practice environment, then when studying from videos, write the solution yourself first and then watch the teacher's explanation. You can also find a question bank online to practice with.
The difficult material is concentrated in content that has to be combined with business analysis, such as reporting tools and table join relationships. I had little contact with these in my coursework and projects, so they needed more time.
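For readers without a database at hand, the SQL practice and table join relationships mentioned in this answer can be tried with nothing more than Python's built-in `sqlite3` module. The following is a minimal sketch under stated assumptions: the two tables, their columns and the sample rows are hypothetical, and SQLite stands in for the Oracle or MySQL systems named in the interview, whose syntax differs in places.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (
            id INTEGER PRIMARY KEY,
            region TEXT NOT NULL
        );
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount REAL NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO customers (id, region) VALUES (?, ?)",
        [(1, "north"), (2, "south"), (3, "north")],
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(1, 100.0), (1, 50.0), (2, 80.0)],
    )
    return conn


def revenue_by_region(conn):
    """Join orders to customers and total the order amounts per region."""
    query = """
        SELECT c.region, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(query).fetchall()


conn = build_demo_db()
totals = revenue_by_region(conn)
```

Because this is an inner join, customer 3, who has no orders, contributes nothing to the totals. Changing `JOIN` to `LEFT JOIN` and observing how the result changes is a quick way to practice the join relationships the exam asks about.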
Question 5.
Is it too late to start learning data mining in graduate school with zero foundation?
Guo Yinjiao.
You have probably heard the saying "the best time to plant a tree was ten years ago, and the second-best time is now". If you are starting from zero, then once you have learned the basic theory, I recommend learning through practice by starting a project related to data mining. This is much more efficient than learning from books alone, and practice is the best teacher.
Conclusion.
The best time to plant a tree was ten years ago, and the second-best time is now. A saying circulated on the Internet for a while: your future self will thank you for the hard work you put in now. I hope more people will join the field of data analysis and data mining, so that data can deliver greater value and data professionals can have unlimited possibilities.