Data Analysis Text Mining Tutorial

2022-11-17

1、 Text Mining Definition
Text mining is the process of extracting valuable information and knowledge from text data, and it is one branch of data mining. Its most important and basic applications are text classification and text clustering: classification is a supervised method, while clustering is an unsupervised one.
2、 Text mining steps
1) Read text from a database or a local external file
2) Segment the text into words
2.1) Define a user dictionary
2.2) Define custom stop words
2.3) Run word segmentation
2.4) Repeat steps 2.1, 2.2 and 2.3 as needed, checking which words are segmented inaccurately and which words are meaningless
3) Build the document-term matrix and convert it into a data frame
4) Build statistical and mining models on the data frame
5) Report the results
3、 Tools required for text mining
This text mining exercise is implemented in R. Several R packages need to be loaded: tm, tmcn, Rwordseg and wordcloud. The tmcn and Rwordseg packages cannot be downloaded from a CRAN mirror. For how to obtain these two packages, see the following>>>
4、 Hands-on example
The data set used in this article comes from Sogou Labs and can be downloaded from the link>>>
The data set has been consolidated so that the news under each topic is collected into a single CSV table. The data format is shown in the following figure:
Specific data can be found in the link at the end of the article.
Next, the news content needs to be segmented into words. Before segmentation, some user-defined dictionaries should be imported to improve accuracy. Because the text covers military, medical, financial, sports and other topics, Sogou dictionaries need to be added to the dictionary set for this analysis.
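To illustrate why a user dictionary changes the segmentation result, here is a minimal sketch in Python (the tutorial itself works in R). It uses greedy forward maximum matching, which is a deliberately simple stand-in and not the algorithm Rwordseg uses. The function name, the toy dictionary and the sample sentence (including the Chinese rendering of "golden week") are assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation against a set of known words."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real one would hold many thousands of words.
base_dict = {"黄金", "旅游", "收入"}
sentence = "黄金周旅游收入"

before = forward_max_match(sentence, base_dict)
after = forward_max_match(sentence, base_dict | {"黄金周"})

print(before)  # ['黄金', '周', '旅游', '收入']
print(after)   # ['黄金周', '旅游', '收入']
```

Without the custom entry, the term is split into a shorter known word plus a leftover character; once the entry is added, it survives as one token.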
If you need to unload some imported dictionaries, you can use the uninstallDict() function.
Before word segmentation, remove all English letters from the Chinese text.
The words circled in the figure carry no practical meaning for the subsequent analysis, so they need to be removed; that is, the stop words need to be deleted.
Once the stop-word list has been created, how are those words removed from the 76 news articles? The following function is used to delete stop words.
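A stop-word filter of this kind can be sketched in a few lines of Python (the tutorial itself works in R, so this is an illustration of the idea rather than the article's function). The function name, the sample documents and the stop-word list are assumptions; each document is taken to be a list of already segmented tokens.

```python
def remove_stop_words(segmented_docs, stop_words):
    """Return the documents with every stop-word token dropped."""
    stop_set = set(stop_words)  # set lookup keeps the filter fast
    return [[tok for tok in doc if tok not in stop_set]
            for doc in segmented_docs]


# Hypothetical segmented documents and stop words.
docs = [["这", "是", "军事", "新闻"],
        ["股市", "的", "行情", "是", "利好"]]
stops = ["这", "是", "的", "到"]

filtered = remove_stop_words(docs, stops)
print(filtered)  # [['军事', '新闻'], ['股市', '行情', '利好']]
```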
Compared with the previous segmentation results, the output is much leaner: meaningless words such as "yes", "de", "to" and "this" have been eliminated.
The fastest way to judge whether a segmentation result is good or bad is to draw a word cloud, which makes it easy to see which words should not appear and which words were segmented inaccurately.
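A word cloud is driven by term frequencies, so the same check can be approximated by listing the most frequent tokens. The Python sketch below does only that counting step and draws nothing (the tutorial itself works in R with the wordcloud package). The function name and the sample tokens are assumptions for this example.

```python
from collections import Counter


def top_terms(segmented_docs, n=5):
    """Count tokens across all documents and return the n most frequent."""
    counts = Counter(tok for doc in segmented_docs for tok in doc)
    return counts.most_common(n)


# Hypothetical segmented documents.
docs = [["经济", "增长", "说"],
        ["经济", "政策", "说"],
        ["比赛", "说"]]

ranking = top_terms(docs, n=2)
print(ranking)  # [('说', 3), ('经济', 2)]
```

A filler word at the top of the ranking is a sign that the stop-word list needs another entry.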
Some meaningless words remain (such as "say", "day", "ge" and "go"), as do inaccurately segmented words (for example, "golden week" cut down to "gold" and "medical" cut down to "medicine"). For reasons of space, custom words and stop words will not be added again here.
At this point, the segmentation results of the 76 news articles are stored in the corpus.
As the figure shows, the document-term matrix has 76 rows and 7939 columns, with rows representing the 76 news articles and columns representing the 7939 words. The matrix is sparse: it has 11655 non-zero elements and 591709 zero elements, a sparsity of 98%. Finally, of the 7939 words, the most frequent one appears in 49 news articles.
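To make the structure of this matrix and the meaning of the sparsity figure concrete, here is a small Python sketch (the tutorial itself builds the matrix in R with the tm package). The function names and the three toy documents are assumptions; the last lines only re-check the arithmetic of the numbers quoted above.

```python
def build_dtm(segmented_docs):
    """Build a document-term count matrix: one row per document,
    one column per distinct term."""
    vocab = sorted({tok for doc in segmented_docs for tok in doc})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in segmented_docs]
    for i, doc in enumerate(segmented_docs):
        for tok in doc:
            matrix[i][column[tok]] += 1
    return vocab, matrix


def sparsity(matrix):
    """Share of cells in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Hypothetical segmented documents.
docs = [["经济", "增长"], ["经济", "政策"], ["比赛"]]
vocab, dtm = build_dtm(docs)
print(len(dtm), len(vocab), round(sparsity(dtm), 2))  # 3 4 0.58

# Re-check the figures quoted in the text.
cells = 76 * 7939
print(cells == 11655 + 591709)     # True
print(round(591709 / cells, 2))    # 0.98
```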
Because the sparsity of the matrix is so high, words that occur extremely rarely are removed here.
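One way to picture this pruning is a per-term filter: drop every column whose share of zero entries exceeds a threshold. The Python sketch below follows that idea (the tutorial itself works in R). The function name, the toy matrix, the threshold value and the choice of a non-strict comparison are all assumptions, not the settings used in the article.

```python
def remove_sparse_terms(vocab, matrix, max_sparsity):
    """Keep only the terms whose share of documents with a zero count
    is at most max_sparsity."""
    n_docs = len(matrix)
    keep = []
    for j in range(len(vocab)):
        zero_docs = sum(1 for row in matrix if row[j] == 0)
        if zero_docs / n_docs <= max_sparsity:
            keep.append(j)
    new_vocab = [vocab[j] for j in keep]
    new_matrix = [[row[j] for j in keep] for row in matrix]
    return new_vocab, new_matrix


# Hypothetical 4-document, 4-term count matrix.
vocab = ["经济", "增长", "政策", "比赛"]
matrix = [
    [1, 1, 0, 0],
    [2, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
]

small_vocab, small_matrix = remove_sparse_terms(vocab, matrix, max_sparsity=0.5)
print(small_vocab)   # ['经济']
print(small_matrix)  # [[1], [2], [1], [0]]
```

Only the term present in most documents survives, which is why the column count can fall so sharply after this step.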
As a result, the number of columns drops sharply: the matrix now contains only 116 columns, that is, 116 words.
To facilitate further statistical modeling, the matrix needs to be converted into a data frame.
Summary
In practice, the most difficult and time-consuming part of text mining is word segmentation. Segmenting words accurately and eliminating meaningless words is the main challenge for text miners.
