A processing exemplar
We tested the SC-LIWC on a sample of 30 texts about depressive experiences collected from the Internet.
Chinese texts must first be segmented, with spaces inserted between words, before LIWC can process them. The Stanford Word Segmenter (SWS) is suitable for simplified Chinese text, so we use it for word segmentation.
Punctuation also requires careful handling. The LIWC program was designed for an English environment, in which punctuation marks are half-width, whereas Chinese texts usually use full-width punctuation. In addition, to be processed correctly, each punctuation mark must follow the preceding word with no space in between and must be separated from the following word by a space.
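The punctuation rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the utility mentioned on this page: the function name `normalize_punctuation` and the mapping table `FULL_TO_HALF` are hypothetical, the mapping covers only a handful of common marks, and it assumes that full-width marks should be converted to their half-width counterparts before the texts are passed to LIWC.

```python
import re

# Hypothetical, incomplete mapping from full-width to half-width punctuation.
# Extend it to cover whichever marks actually occur in your texts.
FULL_TO_HALF = str.maketrans({
    ",": ",",
    "。": ".",
    "!": "!",
    "?": "?",
    ";": ";",
    ":": ":",
})

# Half-width marks that should attach to the preceding word.
MARKS = r",.!?;:"


def normalize_punctuation(segmented: str) -> str:
    """Fix punctuation in one line of space-segmented Chinese text."""
    # 1. Convert full-width marks to half-width ones.
    text = segmented.translate(FULL_TO_HALF)
    # 2. Remove any whitespace between a word and the mark that follows it.
    text = re.sub(rf"\s+([{MARKS}])", r"\1", text)
    # 3. Insert a space after a mark that is directly followed by a word
    #    (but not between two consecutive marks such as "?!").
    #    Note: this also splits decimals such as "3.5", which is left
    #    unhandled in this sketch.
    text = re.sub(rf"([{MARKS}])(?=[^\s{MARKS}])", r"\1 ", text)
    return text.strip()


if __name__ == "__main__":
    print(normalize_punctuation("我 今天 很 难过 , 不 想 说话 。"))
```

The three steps are kept separate so that each rule can be checked against the segmenter output on its own; the result should still be inspected by hand, as recommended on this page.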
In sum, we suggest that researchers first read through all the texts themselves to correct mistyped words. The SWS can then be used to segment the sentences. The output files from the SWS, which should be saved in UTF-8 format, should then be checked to correct the positioning of punctuation. We have developed a utility that automatically processes the output files from the SWS. Users are welcome to download it for free.
We then feed the pre-processed texts into the LIWC program, using the SC-LIWC as its dictionary. The total tag rate reached 82.94%, which is comparable to what is usually found in analyses of traditional Chinese texts with the TC-LIWC. Moreover, the average proportion of function words was 57.67%, which is also comparable to LIWC versions in other languages. Thus, the SC-LIWC achieves a satisfactory tag rate when applied to simplified Chinese texts. We continue to examine the SC-LIWC’s tag rates and validity across various topics. Please refer to the related links for more information.
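For readers unfamiliar with the measure, the tag rate can be illustrated with a short Python sketch. It assumes that the tag rate is the percentage of word tokens in a text that match at least one dictionary entry; the function name `tag_rate` and the toy word set are hypothetical and are not taken from the SC-LIWC, and the sketch ignores features of real LIWC dictionaries such as wildcard entries and categories.

```python
import string


def tag_rate(text: str, dictionary: set[str]) -> float:
    """Percentage of word tokens in `text` that appear in `dictionary`.

    `text` is assumed to be space-segmented, with half-width punctuation
    attached to the preceding word.
    """
    # Split on whitespace, then strip punctuation attached to each token.
    tokens = [tok.strip(string.punctuation) for tok in text.split()]
    words = [tok for tok in tokens if tok]
    if not words:
        return 0.0
    matched = sum(1 for word in words if word in dictionary)
    return 100.0 * matched / len(words)


if __name__ == "__main__":
    # Toy dictionary for illustration only.
    toy_dictionary = {"我", "很", "难过", "不", "想"}
    rate = tag_rate("我 今天 很 难过, 不 想 说话.", toy_dictionary)
    print(f"{rate:.2f}%")
```

In this toy case five of the seven words are in the dictionary, so the rate is 5/7, or about 71.43%.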