Construction Procedures

The Chinese LIWC dictionary was developed from the LIWC2007 dictionary. Its construction proceeded in six stages.

Stage one: Direct translation

The English LIWC dictionary contains 64 categories covering approximately 4500 words or word stems. Words are categorized by their linguistic or psychological meanings, and the categories are organized hierarchically. Moreover, a single word may belong to multiple categories because it can carry multiple meanings or characteristics. This has a profound impact on the translation process: because each word can carry multiple meanings, it can be translated into different Chinese words. To cover most of these possible translations, we translated each word separately under each of its categories, so that the translations reflect its meanings in different contexts. Consequently, there were more than 10600 to-be-translated words in total. However, words in the article, past tense, present tense, and future tense categories were deleted because Chinese has no such usage.
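The expansion described above, in which each word is translated once per category it belongs to, can be sketched as follows. This is only an illustration: the entries and category labels are made-up examples rather than items checked against the LIWC2007 dictionary, and the sketch assumes that "deleted" means dropping the word-category pairs for the four excluded categories.

```python
from typing import Dict, List, Set, Tuple

# Categories with no counterpart in Chinese (labels are illustrative).
DROPPED: Set[str] = {"article", "past", "present", "future"}


def to_translation_items(
    entries: Dict[str, List[str]], dropped: Set[str] = DROPPED
) -> List[Tuple[str, str]]:
    """Turn each word into one (word, category) item per retained category."""
    items = []
    for word, categories in entries.items():
        for category in categories:
            if category not in dropped:
                items.append((word, category))
    return items


# Hypothetical entries, for illustration only.
entries = {
    "the": ["funct", "article"],
    "happy": ["affect", "posemo"],
    "went": ["verb", "past"],
}
items = to_translation_items(entries)
print(items)
```

Because a word is counted once for every category it keeps, the number of items to translate grows well beyond the number of distinct words, which is consistent with the jump from roughly 4500 entries to more than 10600 items.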

These words were then divided evenly into five groups, and two translators independently translated all the words assigned to each group. The translators were undergraduate research assistants who were thoroughly briefed before starting their work. They were asked first to fully understand the meaning of each category and then to come up with as many translations of each word as possible that best matched the meaning of that category. They could consult dictionaries or online resources during the process if necessary. By the end of this stage, more than 18000 Chinese words had been obtained.

Stage two: Category identification

The purpose of this stage was to confirm that the translated words were appropriately categorized. Inter-judge consensus was used as the criterion for appropriateness. First, a total of nine judges were asked to make a binary yes-or-no decision on whether each word was appropriately assigned to its corresponding category. The researcher fully explained the meaning of each category and the related criteria for judgment. Each word was rated by three judges, and more than 13000 words reached complete consensus across the three ratings. These words were included in or excluded from their corresponding categories according to the consensual ratings.
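The complete-consensus rule can be illustrated with a short sketch. The data layout (a mapping from a word-category pair to three yes/no ratings) and the example words are assumptions made for illustration; they are not the format or the data used in the study.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (Chinese word, category label)


def split_by_consensus(ratings: Dict[Pair, List[bool]]):
    """Sort pairs into included, excluded, and unresolved by unanimity."""
    included, excluded, unresolved = [], [], []
    for pair, votes in ratings.items():
        if all(votes):            # every judge said yes
            included.append(pair)
        elif not any(votes):      # every judge said no
            excluded.append(pair)
        else:                     # mixed ratings go to the next round
            unresolved.append(pair)
    return included, excluded, unresolved


# Hypothetical ratings from three judges per pair.
ratings = {
    ("快樂", "posemo"): [True, True, True],
    ("桌子", "posemo"): [False, False, False],
    ("驚訝", "posemo"): [True, False, True],
}
included, excluded, unresolved = split_by_consensus(ratings)
print(included, excluded, unresolved)
```

Only the unresolved pairs need further ratings; the unanimous ones are settled in either direction.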

Six more judges were recruited to rate the remaining non-consensual words, about 6500 in all. Each judge rated about 3000 words. Their ratings were pooled with the ratings from the previous step. As a result, about 5000 words could be included in or excluded from the corresponding categories because a clear majority of judges had given the same rating.
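Pooling the two rounds of ratings and taking the majority can be sketched as below. The text does not define what counts as a "clear" majority, so this sketch assumes a simple strict majority of the pooled votes and flags exact ties for later review; the function name and return labels are hypothetical.

```python
from collections import Counter
from typing import List


def pooled_decision(first_round: List[bool], second_round: List[bool]) -> str:
    """Combine both rounds of yes/no ratings and return the majority outcome."""
    votes = Counter(first_round + second_round)
    yes, no = votes[True], votes[False]
    if yes > no:
        return "include"
    if no > yes:
        return "exclude"
    return "tie"  # left for a further round of review


# Hypothetical examples: three first-round ratings plus later ratings.
print(pooled_decision([True, False, True], [True, True]))    # include
print(pooled_decision([True, False, False], [False, False]))  # exclude
print(pooled_decision([True, False, True], [False, False, True]))  # tie
```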

However, about 1500 words remained tied. Three of the authors and two other research assistants who were familiar with the current research went through these words independently. When more than three of the five people agreed on the categorization, the word was assigned to that category. By the end of this stage, about 6600 words in total had been included.

Stage three: Word segmentation

In some cases, two or more Chinese words are needed to fully express an English word. For example, the preposition “on” has to be translated as “on top of” (three Chinese words). However, a phrase of more than two Chinese characters may well be split apart by the Chinese Word Segmentation System (CWSS, http://ckipsvr.iis.sinica.edu.tw/). The CWSS was developed by Academia Sinica in Taiwan and is the most widely used such system in Taiwan (Ma & Chen, 2003).

We checked all of the roughly 6600 words from the previous stages with the CWSS to make sure that every word in the dictionary could be accurately identified. Where, as in the case mentioned above, a phrase could not be treated as a single word unit in the dictionary, we discussed as a group how the words in the phrase should be categorized. We examined all such cases and resolved the discrepancies. By the end of this stage, about 6500 words were kept.

Stage four: Reconfirmation of word categorization

To ensure that the words were appropriately categorized, the research team again went through the definition of each category and examined the appropriateness of assigning the words to each specific category. The research team met in groups of at least five members, always including three authors of this paper, and all members were familiar with this research project and the related analyses. In all of these group discussions, a word could be deleted from or added to a category only after group consensus had been reached.

Another important task in this stage was to determine whether the asterisk was used appropriately in word stems. LIWC allows a word stem followed by an asterisk to represent a group of words sharing that stem. For example, “infinit*” causes all words that begin with “infinit” to be counted. The same principle applies to Chinese. However, the asterisk must be used with extreme caution, because the same word stem might generate words whose meanings fall entirely outside the current category. The research team used various dictionaries, word databases, and the CWSS to verify each use of the asterisk.
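The risk of over-matching can be made concrete with a small sketch that lists every word a stem entry would capture from a vocabulary, so that each match can be inspected by hand. The helper name, the vocabulary, and the Chinese stem "快*" are hypothetical illustrations, not entries from the dictionary.

```python
from typing import Iterable, List


def expand_stem(entry: str, vocabulary: Iterable[str]) -> List[str]:
    """Return the vocabulary words matched by a dictionary entry.

    An entry ending in "*" matches any word that starts with the stem;
    any other entry matches only the identical word.
    """
    if entry.endswith("*"):
        stem = entry[:-1]
        return [word for word in vocabulary if word.startswith(stem)]
    return [word for word in vocabulary if word == entry]


# English example from the text.
print(expand_stem("infinit*", ["infinite", "infinity", "finite"]))

# Hypothetical Chinese stem: it captures 快樂 (happy) but also
# 快速 (fast) and 快遞 (express delivery), which are not emotion words.
print(expand_stem("快*", ["快樂", "快速", "快遞", "愉快"]))
```

A list like the second one shows why each stem needs manual checking before it is placed in a category such as positive emotion.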

Stage five: Adding categories

During the group discussions in the previous stage, we realized that some categories are unique to Chinese. For example, Chinese uses different units to count different objects. In addition, although Chinese has no verb tense, some words are used as time markers, such as “in the past”, “already”, or “now”. The authors referred to the Word List with Accumulated Word Frequency in Sinica Corpus 3.0 (1998), which surveyed 5 million words and provided 150 thousand words with word category and frequency information. As a result, eleven categories unique to Chinese were added to the dictionary: second person plural pronoun, preposition phrase end, specifying article, quantity unit, inter-junction, multiple functions, tense marker, past tense marker, present tense marker, future tense marker, and continuation marker. For detailed explanations of these categories, please refer to our C-LIWC website.

To ensure that the most frequently used words were largely covered by the dictionary, the authors took the 2000 most frequently used words from the Word List with Accumulated Word Frequency in Sinica Corpus 3.0 (1998) and tested the tagged rate of the current dictionary. The results showed that 78.4% of these words were tagged. Most of the untagged words were nouns. We added 106 untagged words that in fact fit our defined function word categories.
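A tagged rate of this kind can be computed as the share of frequent words matched by at least one dictionary entry. The sketch below assumes that definition, treats a trailing asterisk as a prefix match, and uses a tiny made-up word list and entry list; none of it reproduces the actual corpus or dictionary.

```python
from typing import Iterable, List, Tuple


def is_tagged(word: str, entries: Iterable[str]) -> bool:
    """True if the word matches any entry (exactly, or by stem with '*')."""
    for entry in entries:
        if entry.endswith("*"):
            if word.startswith(entry[:-1]):
                return True
        elif word == entry:
            return True
    return False


def tagged_rate(frequent_words: List[str], entries: List[str]) -> Tuple[float, List[str]]:
    """Return the proportion of tagged words and the list of untagged words."""
    if not frequent_words:
        raise ValueError("frequent_words must not be empty")
    untagged = [w for w in frequent_words if not is_tagged(w, entries)]
    rate = (len(frequent_words) - len(untagged)) / len(frequent_words)
    return rate, untagged


# Hypothetical data: five frequent words and four dictionary entries.
frequent = ["的", "我們", "已經", "桌子", "快樂"]
entries = ["的", "我*", "已經", "快樂"]
rate, untagged = tagged_rate(frequent, entries)
print(f"{rate:.1%}", untagged)
```

Returning the untagged words alongside the rate makes it easy to review them and decide which ones belong in an existing category.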

Stage six: Examining the category structure

The categories in the LIWC dictionary are not totally independent; there are hierarchical relationships among certain categories. It was possible that when we added a word to a subordinate category, we did not also place it in the superordinate category to which it belonged, or the other way around. Two kinds of examination were therefore needed. The first checks that every word in a subordinate category is also included in its superordinate category. For this part, we used a computer program to do the checking.
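The program used for this check is not described, so the following is only a minimal sketch of how such a check could work. It assumes the dictionary is held as a mapping from each word to its set of categories and the hierarchy as a mapping from each subordinate category to its superordinate one; the category labels and words are illustrative.

```python
from typing import Dict, Set, Tuple


def find_missing_parents(
    word_categories: Dict[str, Set[str]], parent_of: Dict[str, str]
) -> Set[Tuple[str, str]]:
    """Return (word, category) pairs where an ancestor category is missing."""
    missing = set()
    for word, categories in word_categories.items():
        for category in categories:
            parent = parent_of.get(category)
            while parent is not None:      # walk up the hierarchy
                if parent not in categories:
                    missing.add((word, parent))
                parent = parent_of.get(parent)
    return missing


# Illustrative hierarchy: anx -> negemo -> affect, posemo -> affect.
parent_of = {"posemo": "affect", "negemo": "affect", "anx": "negemo"}
word_categories = {
    "擔心": {"anx", "negemo", "affect"},  # complete
    "害怕": {"anx", "affect"},            # missing negemo
    "快樂": {"posemo"},                   # missing affect
}
missing = find_missing_parents(word_categories, parent_of)
print(missing)
```

The sketch assumes the hierarchy contains no cycles; each reported pair names a superordinate category that the word should be added to.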

The second examination is more difficult. A word in a superordinate category may not fit into any subordinate category. For example, a neutral emotional word should remain in the emotion word category but does not have to be in the positive or negative emotion categories. Our research team, with 3 to 5 members present in a series of discussion sessions, went through these words and decided their appropriate placement in the hierarchy. In effect, we went through and re-examined the whole word list again.
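Although the placement decisions in this second examination were made by discussion, the candidate words can be listed automatically. The sketch below, under the same assumed data layout of word-to-category sets, flags words that sit in a superordinate category without belonging to any of its subordinate categories; the names and data are hypothetical.

```python
from typing import Dict, List, Set


def needs_manual_review(
    word_categories: Dict[str, Set[str]], children_of: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Map each word to the parent categories it holds with no child category."""
    flagged: Dict[str, List[str]] = {}
    for word, categories in word_categories.items():
        for parent, children in children_of.items():
            if parent in categories and not (categories & children):
                flagged.setdefault(word, []).append(parent)
    return flagged


# Illustrative data: 驚訝 (surprised) is an emotion word in no valence subcategory.
children_of = {"affect": {"posemo", "negemo"}}
word_categories = {
    "快樂": {"affect", "posemo"},
    "驚訝": {"affect"},
}
flagged = needs_manual_review(word_categories, children_of)
print(flagged)
```

A flagged word is not necessarily an error, since a neutral emotion word may correctly stay in the superordinate category only; the list simply narrows down what the reviewers need to discuss.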

After the completion of these six stages, the first Taiwanese LIWC dictionary became available for further analysis. The dictionary includes 30 linguistic categories and 42 psychological categories, with 6862 words in total. Table 1 presents some major features of these categories. The tagged rate is 83.5% for the 1000 most frequently used words in Taiwan, 76.2% for the first 2000 words, and 58.7% for the first 5000 words. These tagged rates are satisfactory.