Pinyin Orthographical Rules for Libraries:
A Recent Literature Review

Victor H. Mair, Professor
Dept. of Asian & Middle Eastern Studies
University of Pennsylvania
Philadelphia, PA 19104-6305

It has been quite a while since my last communication concerning pinyin orthography for libraries and other organizations that must rely heavily on information processing. While I do not wish to continue forever to function as a clearing house for materials relating to this topic, I have recently become aware of several important, relevant sources, so I thought that I should share them with you.

The first, and most impressive, is the Ph.D. dissertation of Clement (acute accent over the first "e") Arsenault, Assistant Professor in the Graduate School of Library and Information Sciences at Simmons College, Boston. Because I do not wish to distort Professor Arsenault's findings in the slightest, I shall simply copy verbatim here the abstract of his thesis:

"Word Division in the Transcription of Chinese Script in the Title Fields of Bibliographic Records"

Doctor of Philosophy, 2000

Clement Arsenault

Faculty of Information Studies
University of Toronto

Thesis Supervisor: Lynne C. Howarth

In online bibliographic databases, Romanization of Chinese script can enhance access by facilitating filing, searching, and browsing of records. Recently the Library of Congress announced the replacement of the Wade-Giles Romanization system with the pinyin Romanization system for transcribing Chinese data in its bibliographic records. This decision will have a great impact throughout the North American library community. In its canonical form, pinyin, as opposed to Wade-Giles, aggregates Chinese "words" into single linguistic units. Since Chinese characters represent monosyllabic morphemes rather than words, Chinese text, in its original form, does not provide visual cues as to where a word starts or ends, and, therefore, does not provide guidance for joining syllables when the script is Romanized. In this respect, pinyin entries in bibliographic records could be constructed following either a monosyllabic, or a polysyllabic pattern. Although the former is easier and less costly to implement, the latter method is potentially more beneficial for end-users, since combining single syllables into linguistic units greatly reduces ambiguity, and generates a much larger variety of indexable terms, thus improving precision in online retrieval. The goal of the current study was to investigate if following the polysyllabic method significantly improves retrieval efficiency and effectiveness in item-specific searching within online bibliographic databases. Analysis of the results revealed that aggregation of monosyllables does improve efficiency significantly (_p_ < .05), especially during keyword-based searches, and that effectiveness is unaffected by the inconsistencies observed in the aggregation format between cataloguer-generated records, and user-input queries.

Professor Arsenault's dissertation directly addresses in a dispassionate, objective, scientific manner precisely the question that I raised in my previous messages about the advisability of using proper pinyin orthography in bibliographic records. This is an empirical investigation based upon masses of data gathered through unbiased observation. Having read through the dissertation, I find that it only reinforces my conviction that the Library of Congress (and all other libraries that follow its lead) will one day have to redo all of their pinyin records to take into account the linguistic reality of segmentation and aggregation.

Because of the tremendous value of Professor Arsenault's research in making clear exactly how the use of pinyin orthography in library databases increases the efficiency with which records are ordered, sorted, and retrieved, I have made several extra copies of his dissertation. I will be happy to share these copies with interested readers, provided that they promise to return them to me within two weeks of receipt. If you wish to read Professor Arsenault's dissertation, please send me your postal address and I will mail a copy to you promptly.

Next I would like to call your attention to several recent articles in Chinese linguistics journals that have a direct bearing on the subject of our discussion:

1. ZHOU Xinping, "Hanyu pinyin zai Beimei diqu de tuixing: ji Beimei tushuguan you Weishi yinbiao xiang Hanyu pinyin de zhuanhuan [The Application of Hanyu Pinyin in North America: On the Conversion of North American Libraries from Wade-Giles Romanization to Hanyu Pinyin]," _Zhongguo yuwen [Chinese Language and Script]_, 1 (cum. 280) (2001), 40-44.

2. LIU Yongquan, "'Hanyu pinyin zhengcifa jiben guize' yinggai jinyibu wanshan ['The Basic Rules for Hanyu Pinyin Orthography' Should Be Further Perfected]," _Yuwen jianshe tongxun [Newsletter for the Construction of Language and Script]_, 65 (October, 2000), 1-6.

3. HU Baihua, "Ciyu pinyin dayoutiandi [Pinyin Divided into Words Has Great Scope]," _Yuwen jianshe tongxun [Newsletter for the Construction of Language and Script]_, 65 (October, 2000), 7-8.

All three of these articles are by prominent Chinese scholars and all three specifically espouse the correct use of pinyin aggregated into words. Items 2 and 3 appeared in a respected Hong Kong journal with close connections to language and script reform authorities in mainland China. Item 1, which appeared in what is arguably China's most influential linguistics journal, pointedly criticizes the Library of Congress for failing to make linguistic sense by not adhering to the established conventions for pinyin orthography.

Finally, just how vital the issue of pinyin in computers is became evident even to the layperson on February 1, 2001, with the publication in the _New York Times_ ("Circuits" section, G1, 8) of the article by Jennifer 8. (_sic_) Lee entitled "Where the PC Is Mightier Than the Pen: In China, Computer Use Erodes Traditional Handwriting, Stirring a Cultural Debate." While the title speaks for itself, a glance at the article makes even clearer that the use of pinyin in China is growing by leaps and bounds precisely because of computers. Despite the PRC government's wrongheaded espousal of the incredibly hard Wubi Zixing code for entering characters, the overwhelming majority of the Chinese people simply ignore it and opt for the vastly easier pinyin method of inputting. Ms. Lee states that over 97% of the populace use pinyin for computer inputting. This accords with my own impression gained on frequent trips to China. It is apparent that even those who have been extensively trained in Wubi Zixing and other shape-based codes frequently resort to pinyin to call up characters and words whose codes they do not recall. Furthermore, Ms. Lee graphically illustrates how much faster it is to produce the correct characters for "Beijing" when one types the name in its aggregated form rather than as _bei_ (search among homophones) followed by _jing_ (search again among even more homophones).

Incidentally, think how much harder it would be for you to make sense of the titles of the three articles in Chinese that I cited above had I not aggregated them into words but typed them thus: "Han yu pin yin zai bei mei di qu de tui xing: ji bei mei tu shu guan you wei shi yin biao xiang han yu pin yin de zhuan huan," _Zhongguo yu wen_. "'Han yu pin yin zheng ci fa ji ben gui ze' ying gai jin yi bu wan shan," _Yu wen jian she tong xun_. "Ci yu pin yin da you tian di," _Yu wen jian she tong xun_. Considering how simple it is to perform the correct aggregations and how much easier it is to read and search for Mandarin titles when strings of syllables are properly aggregated, I submit that it makes very poor sense to obstinately refuse to create orthographically correct bibliographical records. Furthermore, as I have pointed out repeatedly, even without taking into consideration the immediate benefits that would accrue to end-users from the application of properly aggregated pinyin, it is inevitable that pinyin records -- for the sake of efficiency in ordering, sorting, and searching (not to mention according with officially prescribed PRC practice) -- will eventually have to be aggregated. In the long run, it would be less costly to do the aggregation now than to wait and be forced to undertake a revision of hundreds of thousands of records later on.
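The effect on searching can be made concrete with a small experiment. What follows is only a minimal sketch in Python, not a description of how any library system actually indexes its records: the helper names (`tokens`, `build_index`, `search`) and the simple lowercase, space-delimited tokenization are my own assumptions for illustration. The three titles are the ones cited above, entered once in aggregated form and once syllable by syllable.

```python
import re

# The three article titles cited above, in properly aggregated pinyin.
AGGREGATED = [
    "Hanyu pinyin zai Beimei diqu de tuixing: ji Beimei tushuguan "
    "you Weishi yinbiao xiang Hanyu pinyin de zhuanhuan",
    "Hanyu pinyin zhengcifa jiben guize yinggai jinyibu wanshan",
    "Ciyu pinyin dayoutiandi",
]

# The same titles typed as separate syllables.
MONOSYLLABIC = [
    "Han yu pin yin zai bei mei di qu de tui xing: ji bei mei tu shu guan "
    "you wei shi yin biao xiang han yu pin yin de zhuan huan",
    "Han yu pin yin zheng ci fa ji ben gui ze ying gai jin yi bu wan shan",
    "Ci yu pin yin da you tian di",
]


def tokens(text):
    """Split text into lowercase alphabetic keywords (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(titles):
    """Map each keyword to the set of record numbers whose title contains it."""
    index = {}
    for record_number, title in enumerate(titles):
        for token in set(tokens(title)):
            index.setdefault(token, set()).add(record_number)
    return index


def search(index, query):
    """Keyword AND search: return records containing every keyword in the query."""
    result = None
    for token in tokens(query):
        hits = index.get(token, set())
        result = hits if result is None else result & hits
    return sorted(result or [])


agg_index = build_index(AGGREGATED)
mono_index = build_index(MONOSYLLABIC)

# Looking for the word "ciyu" (item 3, record number 2).
print(search(agg_index, "ciyu"))    # [2]
print(search(mono_index, "ci yu"))  # [1, 2]
```

In the aggregated index the query "ciyu" retrieves only item 3. In the monosyllabic index the query "ci yu" also retrieves item 2, because that title happens to contain the unrelated syllables _ci_ (from "zheng ci fa") and _yu_ (from "Han yu"). With only three records the false match is a curiosity; across hundreds of thousands of records built from a few hundred distinct syllables, such accidental matches would multiply.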

Copyright 2001 Victor H. Mair.
Submitted to CLIEJ on 22 February 2001.