Construction Procedures
We did not develop the SC-LIWC from scratch, for several reasons. First, simplified and traditional Chinese share very similar terminology; the differences occur mostly in nouns and verbs, which have little impact on LIWC results. In fact, right after the traditional CLIWC was developed two years ago, we tested it on simplified Chinese texts without any revision and found that even a directly translated dictionary reached a tag rate above 80% and showed a quite similar category profile on depression patients’ blogs. Nonetheless, there are real differences between simplified and traditional Chinese, and users have expressed concerns about possibly missing tags.
Below, we summarize the work involved in constructing the simplified Chinese dictionary.
1. The traditional Chinese dictionary was first translated into simplified Chinese using the Microsoft translator.
Conversion between simplified and traditional Chinese characters poses three problems: one-to-many ambiguity (one simplified character maps to more than one traditional character), many-to-one ambiguity (more than one simplified character maps to one traditional character), and differences in term usage. When text is converted between the two systems with currently available software, the one-to-many and many-to-one ambiguities might place words in the wrong categories. This must be taken care of in the revision process.
2. Corrections for differences in terminology
Several sources explain how terms are used differently in China and Taiwan and provide matching tables. We collected these lists from the Taiwan Tourism Bureau, the Taiwan Strait Tourism Association, Wikipedia, and several master's theses focusing on this specific issue. We combined the words into a single list, excluding duplicates. The research team then went through the list word by word, looked up each definition in the simplified Chinese dictionary, and examined potential problems. Some words were excluded because they could not be found in the official dictionary or were not equivalent in meaning. There are also a few cases where one simplified Chinese word can represent several traditional Chinese words, so we had to examine these words and make sure they were not placed in the wrong categories. A computer program was also developed to check whether any redundancy remained.
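The collision and redundancy checks described above can be sketched roughly as follows. This is a minimal illustration, not the program the research team actually used: the conversion table `T2S`, the category labels, and the function name `convert_and_check` are all hypothetical. Only the character facts are real (for instance, both 髮 and 發 are written 发 in simplified Chinese).

```python
from collections import Counter, defaultdict

# Hypothetical traditional-to-simplified word table (illustrative only).
T2S = {"髮": "发", "發": "发", "後": "后", "快樂": "快乐"}

# Hypothetical traditional entries; the category labels are made up.
TRADITIONAL_ENTRIES = [
    ("髮", {"body"}),
    ("發", {"motion"}),
    ("後", {"time"}),
    ("快樂", {"posemo"}),
    ("快樂", {"posemo"}),  # deliberate redundant entry
]


def convert_and_check(entries, t2s):
    """Convert entries to simplified Chinese and flag items for manual review.

    Returns (merged, collisions, redundant):
      merged     - simplified word -> union of categories
      collisions - simplified word -> distinct traditional sources (if > 1)
      redundant  - traditional words listed more than once
    """
    merged = defaultdict(set)
    sources = defaultdict(set)
    for word, categories in entries:
        simplified = t2s.get(word, word)  # unknown words pass through as-is
        merged[simplified] |= categories
        sources[simplified].add(word)

    collisions = {s: sorted(src) for s, src in sources.items() if len(src) > 1}
    counts = Counter(word for word, _ in entries)
    redundant = sorted(w for w, n in counts.items() if n > 1)
    return dict(merged), collisions, redundant


merged, collisions, redundant = convert_and_check(TRADITIONAL_ENTRIES, T2S)
```

In this sketch, a simplified word that appears in `collisions` has inherited the categories of several traditional words, so it is a candidate for the manual examination described above rather than something to resolve automatically.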
The SC-LIWC now contains 71 categories and approximately 7450 words. This simplified CLIWC works well for tagging simplified Chinese texts. For example, we used the SC-LIWC to analyze 30 text files focusing on depressive experiences. The resulting 82.9% tag rate suggests that the SC-LIWC has reached a satisfactory level.
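For readers unfamiliar with the metric, a tag rate can be read as the percentage of tokens in a text that match at least one dictionary entry. The sketch below assumes that definition and assumes the text has already been segmented into words; the function name `tag_rate`, the sample tokens, and the tiny dictionary are hypothetical, and the actual LIWC matching rules may differ.

```python
def tag_rate(tokens, dictionary):
    """Return the percentage of tokens found in the dictionary.

    Assumes `tokens` is an already-segmented list of words and
    `dictionary` is a set (or mapping) of dictionary entries.
    """
    if not tokens:
        return 0.0
    tagged = sum(1 for token in tokens if token in dictionary)
    return 100.0 * tagged / len(tokens)


# Hypothetical pre-segmented sentence and a toy dictionary.
sample_tokens = ["我", "今天", "很", "难过", "啊"]
toy_dictionary = {"我", "今天", "很", "难过"}

rate = tag_rate(sample_tokens, toy_dictionary)  # 4 of 5 tokens match
```

Word segmentation is left out on purpose: the source does not say which segmenter was used, and the choice of segmenter can change the resulting rate.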