Pinyin Orthographical Rules for Libraries:
a Follow-Up

Victor H. Mair, Professor
Dept. of Asian & Middle Eastern Studies
University of Pennsylvania
Philadelphia, PA 19104-6305
USA
vmair@sas.upenn.edu


As promised in an interim announcement of November 1, this message is being posted by way of comment on the tremendous number of responses that were forwarded to me after my original October 29, 2000 essay concerning the disregard of pinyin orthographical rules by the Library of Congress. First of all, I wish once again to thank each and every one of you who voiced your opinions on this very important matter. I deeply appreciate your thoughtful consideration of the momentous consequences that would result from flouting conventions for Chinese word division that already exist. I regret that the sheer quantity of responses makes it impossible for me to reply individually to all of you, especially many old friends whom I had not heard from for many years. However, in this message, I will endeavor to provide answers that cover most of the issues raised.

By far the largest number of messages that came to me consisted either entirely or partially of a request for the names of responsible individuals at the Library of Congress with whom protests could be registered. Although I have not been able to determine who in LC is in direct charge of the Pinyin Conversion that began in October, I think that if we send our concerns to the following two addresses, they will get to the right people:

Beacher J. E. Wiggins
Director for Cataloging
Library of Congress
Washington, DC 20540-4300
Email: bwig@loc.gov

Philip Melzer
(he is the Consultant of the Committee on Technical Processing, The Council on East Asian Libraries)
Korean/Chinese Team
Regional and Cooperative Cataloging Division
Library of Congress
Washington, DC 20540-4380
Phone: 202-707-7961
FAX: 202-707-2824
Email: pmel@loc.gov

As noted in my brief interim acknowledgement of the responses that had come to me within the first week of my original posting, virtually all sinologists, linguists, historians, and information processing specialists were strongly in favor of the rational aggregation of Sinitic syllables into words. The only exceptions are a minority (perhaps 40 percent or even less) of East Asian librarians who declared that they would follow LC's current policy of separating all syllables without dissent. Their main reason for following such a policy of least resistance is basically that most of the Wade-Giles records that have been accumulated during the past century do not group syllables into words and that it would simply be "too difficult" for them to take on the burden of aggregation.

While I can understand the reluctance of some librarians to expend energy and time thinking about when to aggregate and when not to aggregate, ignoring the problem is not going to make it go away. Also, as I pointed out in my previous posting, it will ultimately be less efficient and less cost-effective to take the easy course of ignoring pinyin orthography for the time being. Eventually, the demands for speedy searches will require the application of rules for aggregation, so why not institute them now rather than go through another horrendous and costly conversion of disaggregated (the current policy) to aggregated records later? Furthermore, I already provided reliable resources for rational word division in my first posting. Here are some more:

RLIN already has a system for recognizing the grouping of characters into words. It uses a character called an "aggregator" between syllables of the same word. RLIN files are abundant and offer a ready-made fund of aggregated records that could be consulted during conversion.

Linguistic researchers in China, Taiwan, and Hong Kong have amassed databases consisting of millions of words of actual texts and analyzed their contents in terms of both word frequencies and character frequencies. Some of the word frequency lists are already in electronic form and, with minimal tailoring, could be put to use almost immediately by librarians and other information specialists.

Aggregated pinyin is reported in use at the National University of Singapore (http://linc.nus.edu.sg/) and at Paris 8 Universite (http://www-bu.univ-paris8.fr/absys.html). The Singaporeans and the French are smart!

I am pleased to report some exciting new developments concerning my favorite portable Chinese-Chinese dictionary, the Xinhua zidian, which is actually a character dictionary, not a dictionary of words (although one can use it to see how the characters it includes are formed into words) and has been printed in many millions of copies. First of all, there now exists an edition of this dictionary entitled Xinhua zidian Hanyu pinyin ban (Taiyuan: Shanxi Jiaoyu Chubanshe, 1999), prepared by Yi Ken'ichiroo, Dong Jingru, and Yamada Ruriko, that gives the complete contents of the dictionary both in characters and in very carefully aggregated / segmented pinyin. An equal cause for jubilation is the splendid new Han-Ying shuangjie Xinhua zidian (Xinhua Dictionary with English Translation) issued in the year 2000 by the Commercial Press (the original publishers of the Xinhua zidian). This dictionary not only translates the complete contents of the Xinhua zidian into English, it also provides scrupulously orthographical pinyin for all definitions and illustrative sentences. Incidentally, one may find on pp. 20-21 of this gem a synopsis of the official orthographical rules (GB/T 16159-1996).

The venerable Gwoyeu tsyrdean (i.e., Guoyu cidian) embodies the vast amount of wisdom that was brought to bear in the design and implementation of National Romanization by China's most outstanding linguists during the first half of the twentieth century. It merits consultation by those who care about proper aggregation of written Mandarin.

The great Chinese-Japanese dictionary of Mandarin entitled Chuugoku daijiten (Kadokawa, 1994) has over 200,000 entries, all of them with the rules for aggregation and segmentation carefully applied.

For the last decade, I have been preparing a single-sort alphabetical index to all of the 350,000 or so entries of the Hanyu da cidian. From the very beginning, I insisted that the pinyin entries should be segmented according to the orthographical rules, but my Chinese collaborators balked, saying that it was "too hard." (Of course, it was even "too hard" in many cases to figure out how the entries were pronounced[!], but that is something we had to do no matter what. Determining the pronunciations of thousands of entries required extensive research, but we have finished that part of the work.) Now, "too hard" is not an adequate excuse, especially when the benefits to be accrued from correct segmentation so far outweigh the disadvantages of no segmentation. Consequently, the 350,000 entries of the alphabetical index to the Hanyu da cidian, when it comes out in a year or two, will serve as another major resource for librarians who need to make decisions about aggregation.

Those who are accustomed to believing that Sinitic languages lack words should perhaps begin by glancing through Xiandai Hanyu Cidian (Commercial Press), which is probably in even the smallest libraries that have Chinese collections. They would also do well to consult the marvelous Concise Dictionary of Spoken Chinese by Yuen Ren Chao and Lien Sheng Yang (Cambridge: Harvard University Press, 1947) which gives very precise information about which characters can be used alone (i.e., are "free") and which must be used in combination with other characters (i.e., are "bound" in various ways and to various degrees).

I now give a few specific instances of difficulties raised by East Asian catalogers:

Should it be Falun Gong or Falungong? I realize that in the media and perhaps even in the literature of the religion itself Falun Gong is being widely used. However, according to the General Guidelines 0.2 of the "Basic Rules for Hanyu Pinyin Orthography" adopted by the State Language Commission of the PRC, structures of two or three syllables which express an integral concept are to be written together as one word. In contrast, according to guideline 0.3, terms of four or more syllables which express an integral concept are to be divided on the basis of word boundaries or juncture, hence Falun Dafa. Falungong is a type of "qigong", but when the "qi" of that word is elided, the remaining "gong" joins directly to a bisyllabic term which modifies it. "Gong" is a bound syllable and normally does not stand alone in Mandarin.

With regard to the famous novel known in English as Dream of the Red Chamber or The Story of the Stone, one respondent asks whether we should write Honglou meng or Hongloumeng. Here it should be Honglou meng rather than Hongloumeng. There are many reasons why this is so, chief among them the fact that "meng" is a free syllable.

One Chinese cataloger actually thought that it is desirable to retain ambiguity by not linking syllables together into words! She cited the three syllables "xi mao yang" and stated that, by not joining them, we leave open the question of whether it is a "thin wool-sheep" that is intended or a "sheep with thin wool." First of all, I must observe that "maoyang" is not a sanctioned word in Mandarin for "wool-bearing sheep." The correct term is "mianyang". Secondly, we would not refer to a thin sheep, regardless of whether or not it bore wool, as "xi", but rather as "shou". Be that as it may, let us assume that "xi mao yang" is an acceptable collocation of Mandarin syllables (while knowing full well that it is not!). For the life of me, I cannot fathom the utility of wanting to confuse one's reader about whether one intends a "thin wool-sheep" or a "sheep with thin wool." Surely the purpose of communication would be better served if we were emphatic about whether we meant a "ximao yang" or a "xi maoyang"!!

A couple of Chinese librarians who agreed that words are important for Sinitic languages nonetheless claimed that, since library records are "merely for information retrieval," it would be all right to ignore words in library records. Such a position is beyond my ability to comprehend. Wouldn't it be easier, both for patrons and for the machines they employ, to search for "bianwen" ("transformation text") than for "bian wen" (who knows what that might mean?) and to search for "zhenjiu" ("acupuncture and moxibustion") rather than for "zhen jiu" ("true wine"? "truly a long time"? or who knows what else?)?
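
The retrieval point can be illustrated with a minimal sketch. The three records below, their identifiers, and the function name phrase_search are hypothetical, invented purely for illustration; they are not taken from any actual catalog or library system. The sketch merely suggests how a search on a whole word can be more precise than a search on a string of separated syllables.

```python
# Hypothetical records: each title is stored as a list of orthographic words.
AGGREGATED = {
    "rec1": ["Zhongguo", "zhenjiu"],   # "Chinese acupuncture and moxibustion"
    "rec2": ["zhen", "jiu"],           # two separate words, e.g. "true wine"
    "rec3": ["Dunhuang", "bianwen"],   # "Dunhuang transformation texts"
}

# The same hypothetical titles with every syllable written separately.
DISAGGREGATED = {
    "rec1": ["Zhong", "guo", "zhen", "jiu"],
    "rec2": ["zhen", "jiu"],
    "rec3": ["Dun", "huang", "bian", "wen"],
}


def phrase_search(catalog, query):
    """Return ids of records that contain the query tokens contiguously.

    Matching is case-insensitive and works on whole tokens only.
    """
    q = [t.lower() for t in query.split()]
    hits = []
    for rec_id, tokens in catalog.items():
        toks = [t.lower() for t in tokens]
        if any(toks[i:i + len(q)] == q for i in range(len(toks) - len(q) + 1)):
            hits.append(rec_id)
    return sorted(hits)


# Searching the word retrieves only the acupuncture record.
print(phrase_search(AGGREGATED, "zhenjiu"))       # ['rec1']
# Searching the separated syllables also retrieves the unrelated record.
print(phrase_search(DISAGGREGATED, "zhen jiu"))   # ['rec1', 'rec2']
```

In the disaggregated file, nothing distinguishes the record about acupuncture from the record about wine; in the aggregated file, the distinction is carried by the spelling itself.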

A very serious misunderstanding raised by a few librarians who admitted to having no expertise in linguistics is that romanization is a "stopgap" that will eventually fade away and that the ideal -- as much as possible -- is to dispense with it entirely. According to these individuals, romanization is an "unwelcome" excrescence in the application of computers to East Asian languages. Quite the contrary, romanization is essential for the maximally efficient utilization of advanced information technology by the lay patron and its role is growing daily. Despite the massive government support for "wubi zixing" in China, more and more people, especially young people, opt for romanized inputting (often using sophisticated programs that convert more or less automatically to characters and take pinyin orthography into consideration). Why? It is much easier to learn; it can be used for touch typing and composition rather than merely for copying; for all but a very small percentage of the most advanced, full-time professional "wubi zixing" typists, romanized inputting is much faster. The existence of literally hundreds of shape-based inputting systems similar to "wubi zixing", with more being devised every month, demonstrates the unsatisfactory nature of this approach. Library administrators should realize that libraries exist for patrons to use and that the vast majority of patrons have neither the time nor patience to master an inputting system like "wubi zixing". Only romanized (or other phonetic) inputting, with its direct correlation to natural speech, constitutes a workable, user-friendly machine-human interface for library patrons. This is as true of China as it is of North America. (Remember, no one ever spoke a character [except in cartoon balloons], but hundreds of millions of people from all walks of life -- both literates and illiterates -- speak the sounds of Chinese words every day.)
When we couple the public convenience of romanized inputting with the enhanced search capabilities of properly orthographical pinyin files, information technology works for the patron instead of throwing formidable stumbling blocks in his or her way.

I ask those librarians who hope that romanization will swiftly disappear, "How, pray tell, are your patrons going to gain access to the records in your catalogs and databases?"

Assuming that those who possess an innate dislike of romanization were victorious and succeeded in banning its use in connection with Chinese library holdings, on what principles and employing what methods would Chinese characters be entered into, ordered within, and retrieved from computer catalogs and databases? How would Chinese characters mysteriously float into and out of computers if romanization were barred? Which of the hundreds of shape-based inputting and ordering systems would be chosen to replace the entering, ordering, and retrieving functions of romanization?

These questions are especially critical for employees and patrons (particularly our students) whose command of the 50,000+ characters is, shall we say, something short of perfect -- which means most of our libraries' employees and nearly all of our libraries' patrons who are, after all, LEARNERS of Chinese. Our library institutions and their facilities are among the most important elements of the educational process. It would be a terrible betrayal of the sacred mission of our libraries if their TEACHING potential were diminished / destroyed by failing to capitalize upon the tremendous potential of romanization for helping students and other patrons to work with Chinese characters and sources before they have fully mastered them.

If the authorities at the Library of Congress do not believe in the utility of pinyin (and I mean pinyin in the fullest sense of the word), why put everybody to the tremendous labor and formidable expense of converting millions of records from Wade-Giles to pinyin? If librarians really think that romanization is so worthless, why not just keep the old records in Wade-Giles (or strip it out, leaving only Chinese characters) and enter only Chinese characters for the new records? Of course, despite their scorn for pinyin, they know better. First of all, they know that their patrons would revolt because they would be able to find things in character-only catalogs only with utmost inconvenience. How would searches be carried out? How would catalogs be organized? Radicals? Which set of radicals (there are so many out there nowadays -- the comforting 214 Kangxi radicals that we all used to memorize are pretty much a memory of the past)? The Rosenberg system? Four corners (I'm one of only three people I know who can use that system)? Lin Yutang's system? There are endless other possibilities.

No matter how much some librarians may despise romanization, they are aware that, without it, their collections would fall into chaos or nearly nil circulation -- insofar as public access is concerned.

To be sure, people (and machines) will be confused by and resent series of unsegmented syllables such as the following: "wu si yun dong bu shi yi ge ou fa shi jian zhe ge yun dong de fa sheng you li shi shang de bi ran xing ta dui zhong guo de jin dai li shi chan sheng le ji da de ying xiang". As soon as we aggregate them, however, these disjointed syllables turn into words and become instantly intelligible: "Wusi Yundong bu shi yige oufa shijian; zhege yundong de fasheng you lishishang de biranxing. Ta dui Zhongguo de jindai lishi chanshengle jida de yingxiang". ("The May Fourth Movement was not a fortuitous event; the outbreak of this movement was historically inevitable. It exerted an enormous influence upon recent Chinese history.") Speakers of Mandarin do not drone on like this: "wu si yun dong bu shi yi ge ou fa shi jian ..." or this: "wusiyundongbushiyigeoufashijian ...". Rather, one can hear Chinese speakers grouping syllables into words and inserting brief pauses between words like this: "Wusi Yundong bu shi yige oufa shijian ..." and one can actually *see* these pauses / breaks / spaces between words on an acoustic wave graph.
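
For readers who wonder how a machine might be assisted in this task, here is a minimal sketch of one simple and well-known technique, greedy longest-match lookup against a word list. It is offered only as an illustration under stated assumptions: the tiny LEXICON and the function name aggregate are hypothetical, the technique is not the procedure of any particular library or program, and it handles neither capitalization of proper nouns nor genuinely ambiguous strings. Real aggregation would draw on the dictionaries and frequency lists mentioned earlier, together with human judgment.

```python
# Hypothetical mini-lexicon of multi-syllable words, stored as syllable tuples.
LEXICON = {
    ("wu", "si"), ("yun", "dong"), ("yi", "ge"), ("ou", "fa"),
    ("shi", "jian"), ("zhe", "ge"), ("fa", "sheng"),
    ("li", "shi", "shang"), ("bi", "ran", "xing"),
}
MAX_LEN = max(len(word) for word in LEXICON)


def aggregate(syllables, lexicon=LEXICON, max_len=MAX_LEN):
    """Join syllables into words by greedy longest match against the lexicon.

    At each position the longest listed word is taken; a syllable that
    begins no listed word is left standing alone.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            chunk = tuple(syllables[i:i + n])
            if n == 1 or chunk in lexicon:
                words.append("".join(chunk))
                i += n
                break
    return " ".join(words)


print(aggregate("wu si yun dong bu shi yi ge ou fa shi jian".split()))
# wusi yundong bu shi yige oufa shijian
```

The quality of the result depends entirely on the quality of the word list, which is precisely why reliable lexical resources matter.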

The alphabetical writing of Sinitic texts is neither new nor strange. I cannot go into the entire history of this subject here, but it can be traced back to at least the Tang Dynasty. Those who are interested in learning more about the writing of Sinitic languages in phonetic (alphabetic or syllabic) scripts should consider -- among many other pertinent examples that could be named -- Dungan (a northwest Mandarin topolect) literature written in Cyrillic, Nyushu (Women's Script) from south central China, Y. R. Chao's books full of "sayable Chinese" all typed out in the roman alphabet, the tens of thousands of Hoklo speakers who were literate only in romanized Minnan Baihua, the rich assortment of romanized materials written in Latinxua (Bei La, Sin Wenz) for combatting illiteracy in the 1930s, and so forth. And today, thousands of persons every day are writing e-mail messages on all manner of subjects to each other in pinyin. Who says that romanized Chinese is unwelcome and unworkable? Only someone who is ignorant of history and ill-informed about the present.

Most (nearly all) Japanese language computer inputting in Japan is done using either "kana" (phonetic syllabary) or romanization. The Japanese experience with romanized writing is extensive. For example, the Japanese poet Ishikawa Takuboku (1886-1912) wrote a powerfully revealing Romaji Nikki (Diary in Romanization). And all Japanese know that a library is a "toshokan", not a "to sho kan". Do the folks at the Library of Congress want to start romanizing Japanese syllable by syllable in the infantile manner they have instituted for Chinese? Mandarin "tushuguan" is essentially the same word as Japanese "toshokan". Indeed, Mandarin "tushuguan" was borrowed from Japanese "toshokan", probably during the 1880s. So why should the Library of Congress permit the Japanese to romanize this word as "toshokan", yet force the Chinese to romanize the same word in the opaque sequence of three disconnected syllables as "tu shu guan"? Does LC look down upon the Chinese, treating them as though they cannot put three syllables together into a single word? If so, this is a demeaning attitude to adopt. If not, what then is the reason for treating the same word differently in Chinese and in Japanese? I think that all Chinese people and all China specialists deserve an explanation from the Library of Congress for the difference in approach to these two languages.

The urgent need for aggregation is underscored by the fact that many of China's very best computer specialists, such as Feng Zhiwei of the State Language Commission, have seriously proposed the adoption of spaces between words even for purely Chinese character texts. Indeed, I recall reading an article by the well-known classical scholar, Chow Tse-tsung, calling for word separation in Chinese character texts. This article, which must have been written at least twenty or thirty years ago (long before the advent of computers), was shown to me by the author himself in his office at the University of Wisconsin (Madison). Professor Chow was very proud of his proposal and thought that it would lead the way to less ambiguity in the reading and writing of Chinese. Be that as it may, only by adopting rational word separation can maximum efficiency be attained in database searching and other information processing operations.

Furthermore, logical segmentation of words in Chinese texts decreases the amount of ambiguity in reading, whether of titles or of whole sentences and texts. Let us take the following series of syllables: "fa zhan zhong guo jia chan quan". Regardless of whether those syllables are written out separately in pinyin or in characters, they cause the reader to halt in an effort to parse them for meaning. As soon as they are segmented (whether in pinyin or in characters), however, they yield sense: a. "fazhan Zhongguo jiachanquan" ("develop Chinese family property rights"); b. "fazhanzhong guojia chanquan" ("property rights [in] developing countries").
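
The two readings can be recovered mechanically, which shows that the ambiguity resides in the unsegmented string and not in the reader. The following is a minimal sketch that enumerates every way of dividing the syllables into words found in a word list. The six-entry word list and the function name segmentations are hypothetical, chosen only to cover the example above.

```python
# Hypothetical word list covering only the example, as syllable tuples.
WORDS = {
    ("fa", "zhan"),            # fazhan, "develop"
    ("fa", "zhan", "zhong"),   # fazhanzhong, "developing"
    ("zhong", "guo"),          # Zhongguo, "China"
    ("guo", "jia"),            # guojia, "country"
    ("jia", "chan", "quan"),   # jiachanquan, "family property rights"
    ("chan", "quan"),          # chanquan, "property rights"
}


def segmentations(syllables, lexicon):
    """Yield every division of the syllables into words listed in the lexicon."""
    if not syllables:
        yield []
        return
    for n in range(1, len(syllables) + 1):
        head = tuple(syllables[:n])
        if head in lexicon:
            for rest in segmentations(syllables[n:], lexicon):
                yield ["".join(head)] + rest


for reading in segmentations("fa zhan zhong guo jia chan quan".split(), WORDS):
    print(" ".join(reading))
# fazhan zhongguo jiachanquan
# fazhanzhong guojia chanquan
```

A disaggregated record forces every reader and every program to repeat this guessing; a segmented record states once and for all which reading the author intended.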

It is because of such obvious utility in information processing that China's best applied linguists, such as Wang Jun, Su Peicheng, Liu Yongquan, and many others, have been strenuously urging the adoption of a system of digraphia, with properly segmented pinyin playing a key role in information processing and characters being used for more traditional purposes. In fact, there already exists such a de facto digraphia in China, so there is no harm in our adopting it outside of China too.

Even the Vietnamese, who have been writing their language for a century in disaggregated romanized form, have begun to experiment with aggregation for the purposes of greater clarity and to facilitate electronic information processing. The results so far have been encouraging.

For those who still doubt that Sinitic languages have words, I would urge that you read the following two books:

Jerome L. Packard. The Morphology of Chinese: A Linguistic and Cognitive Approach. Cambridge: Cambridge University Press, 2000.

Jerome L. Packard. New Approaches to Chinese Word Formation: Morphology, Phonology and the Lexicon in Modern and Ancient Chinese. Trends in Linguistics, Studies and Monographs 105. Berlin, New York: Mouton de Gruyter, 1998.

I do not deny that librarians will have to think a bit more and work a bit harder in the early stages of converting unnaturally unsegmented / disaggregated strings of Chinese syllables into orthographically correct pinyin because they may not be used to performing this task. In that sense, it will be "harder" for them to create aggregated catalog records at first than to create disaggregated records. But nearly all worthy and important achievements of humanity are difficult -- putting a man on the moon certainly wasn't easy, but it was worth doing. I humbly submit that the overall public benefits from aggregated library records will far outweigh the personal discomfort of librarians while they are gaining familiarity with linguistic verities.

Apparently, my worst fears about the lack of ability in Chinese on the part of the individuals at the Library of Congress who were responsible for rejecting aggregation have been borne out, since I have been informed on good authority that the current conversion from Wade-Giles to pinyin is in the hands of a Korean specialist, not a Chinese specialist. It is not too late to rectify the situation and press for the appointment of individuals well versed in Chinese to oversee this momentous conversion.

Please forgive me for the length of this message; I promise not to fill your undoubtedly overflowing mailboxes with comparably lengthy remarks in the future. The only reason I have written two such elaborate missives is that the stakes are so high. Thank you for taking them into consideration.


Copyright © 2000 Victor H. Mair.
Submitted to CLIEJ on 30 November 2000.