On Pinyin Conversion

James K. LIN
Harvard University

Word Division

Many people have commented on the merits of word division, yet few have provided a clear definition of what constitutes Chinese "word division". The purpose of the "RLG Chinese Aggregation Guidelines" now used by RLIN is to "parse strings of Chinese characters into semantic units. These aggregated units will then become terms indexed by the RLIN system". RLG was very clever in avoiding the term "word" in the definition cited above. They knew that the Guidelines are not "word division" guidelines; rather, they are "phrase division" guidelines.

What, then, is the definition of Chinese word division? There are many answers to the question, yet no single definition is clear and acceptable enough to satisfy everyone. Here I'll provide one example from Paul Kratochvil's "The Chinese Language Today": "An MSC (modern standard Chinese) [word] is the smallest unit which may function as an immediate constituent of MSC segmental sentences." He continued: "That words in MSC can be established only on the basis of actual structural features of MSC utterances, and that it is next to impossible to set up MSC words, or rather to find words in MSC by using the gauge of the set of features popularly associated with words in European languages." He further indicated that "The point is not that features of meaning have nothing to do with the status of MSC words, but that meaning is only a partial aspect of the total behavior of the smallest operational units called words, and as such it is not a property sufficient for establishing these units."

A popular myth about Pinyin is that Pinyin and word division are twins: you cannot have one without the other. An LC policy statement dated January 3, 1991, "LC Position on the Use of Pinyin Romanization", stated: "A major difficulty impeding the adoption of Pinyin is the lack of standards for word division." This is a typical example of the common, mistaken notion about the relationship between Pinyin and word division.

Both Pinyin and Wade-Giles are systems of symbols for representing the pronunciation of Chinese characters. When we talk about one phonetic system or the other, we are in the sphere of "phonetics", in much the same way that the Webster's or International Phonetic Alphabet systems relate to the pronunciation of English. Word division is the study of combining Chinese characters into the equivalent of "words" in European languages. It falls into the sphere of "morphology", the study of words. There are at least four or five popular systems for representing the pronunciation of Chinese; Pinyin and Wade-Giles are two of them. American East Asian libraries have used the Wade-Giles system for decades, without word division.

The People's Republic of China adopted the Pinyin system in 1958. A guideline on the "proper ways to write (i.e., link)" Chinese names in Pinyin form was introduced in 1976. It basically stipulates, among other specifications, that the surname and forename should be separated by a space and that the syllables of the forename should be joined together. In the same year, a guideline for writing place names was also published. It was not until 1982 that a guideline on how to write Pinyin in running text was announced.

The discussion on how to link various romanized forms can be traced back as far as the 1920s, long before Pinyin's time.

I want to emphasize here that the guidelines use the term "Zheng4 Ci2 Fa3", which literally means "orthography." In other words, they are guidelines on when and how to write the Pinyin forms. The guidelines take syntactic structures into consideration, and so usually link elements larger than what we would normally call a "word". Because syntactic structures vary from one utterance to another, the guidelines leave a lot of room for personal interpretation.

Most people fail to mention that the guidelines are applicable only to Modern Common Chinese. They are not intended to govern Classical Chinese, whose syntactic structures are vastly different from those of Modern Common Chinese.

My key points are as follows:

  1. It is possible to adopt the Pinyin system without word division, especially when Chinese characters are provided in parallel fields.

  2. Chinese personal and place names are proper nouns that are readily identifiable and are naturally bound units in their own right. Linking their syllables will not cause confusion.

  3. Do not forget that it took us a long time to get Chinese characters into the bibliographic records. Chinese characters ought to be the center of attention, not the auxiliary romanization systems. Karen Smith-Yoshimura of RLG made a fine remark on this point in her E-mail to East Asian libraries earlier this year.

  4. For the sake of computer indexing, RLIN may continue to use the "Aggregation Guidelines", while OCLC may want to strengthen its "proximity searching" capacity. They have cooperated very well so far. There is no reason to believe the same cannot be done in the Pinyin environment.

  5. The statements made by the National Library of Australia (NLA) regarding the syllable aggregation practice for Pinyin are the best I have read so far. They did not mention "word division", thus eliminating from the outset any confusion as to what word division is. Meanwhile, they recognized the distinct bound features of personal and place names.

"That, where a cataloguer inputs Pinyin data into the National CJK system, each Chinese character should be input as one Pinyin syllable, except for proper and geographic names, where the syllables should be joined."

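The National Library of Australia rule quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `romanize_field` and the `proper_name` flag are hypothetical, the capitalization of joined names is my own choice rather than part of the quoted rule, and the sketch does not handle the surname/forename space from the 1976 guideline or apostrophes at ambiguous syllable boundaries.

```python
def romanize_field(syllables, proper_name=False):
    """Format Pinyin syllables following the quoted NLA rule.

    Each Chinese character is input as one separate Pinyin syllable,
    except for proper and geographic names, whose syllables are joined.
    Capitalizing the joined name is an assumption made for readability.
    """
    # Normalize input and drop empty entries.
    cleaned = [s.strip().lower() for s in syllables if s.strip()]
    if proper_name:
        # Proper or geographic name: join the syllables into one unit.
        return "".join(cleaned).capitalize()
    # Ordinary text: one syllable per character, separated by spaces.
    return " ".join(cleaned)


print(romanize_field(["zhong", "guo", "li", "shi"]))
print(romanize_field(["bei", "jing"], proper_name=True))
```

The point of the sketch is that the rule needs only one decision per field (is this a proper or geographic name?) and no judgment about where "words" begin and end.
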
Authority Files

Name and subject authority records must be converted to Pinyin prior to the initiation day for Pinyin conversion. Since LC maintains these two files, LC has the obligation to carry out the conversion of both. These are absolute prerequisites for Day 1 operation. Any attempt to "shortcut" these requirements (e.g., changing authority records only when they are related to the works being cataloged) will definitely bring catastrophic consequences to East Asian libraries around the world. Some people have compared this practice to the conversion from AACR I to AACR II. I believe those who entertain such an idea have underestimated the complexity of Pinyin conversion.

Two observations:

  1. Conventional names in the name authority file have to be identified and excluded from conversion. Machine conversion will have difficulty dealing with such cases. The experience of the National Library of Australia is a good example.

  2. Subject heading formats are based on the "sources found" (H202, H203). In earlier days, the majority of the sources were in Wade-Giles. Therefore, most of the vernacular subject headings are in Wade-Giles form, with variations in diacritical marks. Other forms also slipped in for the same reason. In recent years, some headings have been formulated based on Pinyin sources, for example: China--History--Tiananmen Square Incident, 1989. In addition, there are conventional vernacular terms that have been accepted as part of English usage, even though they are not in Pinyin form. To sum up, LCSH contains at least the following types:

    1. Pure W-G forms (or combined with English)
    2. W-G forms with no diacritical marks
    3. Conventional form of personal and place names
    4. W-G form of terms being accepted as English
    5. Pinyin forms (or combined with English)
    6. Other forms that are neither W-G nor Pinyin

The ambiguity of the rules governing vernacular subject heading proposals contributes to this "mixed bag" situation. Catalogers have been confused by and concerned about the situation.

Two questions are posed here:

  1. How will LC deal with the rules governing vernacular subject heading proposals in the future? Since Wade-Giles forms still exist in many important reference books, will we still adopt those Wade-Giles vernacular terms after the conversion?

  2. How should we deal with existing headings that contain Wade-Giles and other forms? The National Library of Australia did not do well in this regard.

Bibliographic Records

Ideally, it would be best to have all past bibliographic records converted to Pinyin. But for the sake of implementing Pinyin, I personally could live with split files for the time being, until either LC or the two utilities convert the older records. It would be ineffective and inefficient to expect individual institutions to convert bibliographic records by themselves, one at a time.

Classification Schedules

How to deal with the classification schedules that are derived exclusively from Wade-Giles is another matter of great concern. We have to arrive at agreed solutions prior to the conversion.

We have to determine whether to freeze the call numbers or not.

If we do not freeze the old numbers, then we will have to resort to "see references" all the time, and will end up relying on Pinyin and Wade-Giles side by side forever. The cataloging staff will thereby have the added burden of dealing with two systems.

If freezing the old numbers is the way to go, and no references will be made to the old numbers generated from Wade-Giles forms, then new schedules will need to be established in areas where the schedules are arranged by Wade-Giles forms, especially the numbers for 20th century authors in PL2735.5-PL2929.5. Every existing author will probably end up with two author numbers, one based on Wade-Giles and the other on Pinyin. Not many names or subjects can stay in the same numbers. Cuttering is determined not only by the first letter of a word; the second and third letters also influence the outcome. Nor are surnames the only determining factor in Cuttering; forenames also play an important role. The determining factors differ from caption to caption.

Two areas that will be greatly affected by the conversion are:

  1. Chinese history schedule: DS731 (individual elements in the population), DS793 (Provinces, dependencies, etc.), DS796 (Cities), and individual biographies in more than 35 areas in the Chinese history schedule alone. Similar changes will also apply to the Taiwan history schedules.

  2. Chinese literature schedules: Chinese literary genres and terms, topics related to places, names, and all individual literary author numbers and works (PL2663-PL3079) will be affected. As I mentioned earlier, the numbers for 20th century Chinese authors are arranged not only according to the Wade-Giles forms, but also according to the frequency with which those Wade-Giles forms occur. They are not a simple A-Z arrangement. For example: PL2834 Cha, PL2835 Chai, PL2836 Chan, PL2843 Chia, PL2851 Chou. Cha will become Zha in Pinyin, Chai will become Zhai, Chan will become Zhan, and Chou will become Zhou. Other examples: Yen is in PL2925; as Yan, it will be in PL2921.5. PL2924 is currently for Yeh; it will have no use at all. There is no provision for the initial "X", which will be in great demand under Pinyin.
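The effect on alphabetical arrangement can be illustrated with a small Python sketch. It uses only the Wade-Giles and Pinyin surname pairs mentioned in this article; the names `WG_TO_PINYIN`, `filing_order`, and `initial_letters` are hypothetical, and plain alphabetical sorting stands in for the actual LC schedule and Cutter tables, which are not reproduced here.

```python
# Wade-Giles -> Pinyin surname pairs taken from the examples in this article.
WG_TO_PINYIN = {
    "Cha": "Zha",
    "Chai": "Zhai",
    "Chan": "Zhan",
    "Chou": "Zhou",
    "Yen": "Yan",
    "Hsiao": "Xiao",
}


def filing_order(names):
    """Return names in simple case-insensitive alphabetical order.

    This is a simplification: the real schedules also weigh how
    frequently each form occurs.
    """
    return sorted(names, key=str.lower)


def initial_letters(names):
    """Return the sorted set of initial letters used by the names."""
    return sorted({name[0].upper() for name in names})


wg_sorted = filing_order(WG_TO_PINYIN)               # Wade-Giles forms
pinyin_sorted = filing_order(WG_TO_PINYIN.values())  # Pinyin forms

print("Wade-Giles order:", wg_sorted, initial_letters(wg_sorted))
print("Pinyin order:    ", pinyin_sorted, initial_letters(pinyin_sorted))
```

Even in this tiny sample, none of the names keeps its initial letter except Yen/Yan, the names crowd into X and Z, and Yen moves from the end of the sequence toward the front, which is why the existing number ranges cannot simply be relabeled.
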

In addition to the DS and PL schedules, other schedules will also be affected when names and vernacular terms are used for Cuttering purposes.

Further, LC needs to replace the old numbers and captions based on Wade-Giles with new numbers and captions based on Pinyin in all the affected areas, and have them printed in the classification schedules before Day 1. Otherwise, outside libraries will have nothing to work with except blank A-Z captions. Only LC can determine the class numbers and subject Cutter numbers that replace the old ones. Subject Cutter numbers differ from book Cutter numbers: the former are part of the class number, while the latter can be assigned by outside libraries.

Linkage Issues

For literary authors, LC's practice has been to link the AACR II names to the numbers derived from the AACR I names. For example, Hsiao, Hung, 1911-1942, the AACR II name, is classed in PL2740.N3, the number derived from the AACR I name, Chang, Nai-ying, 1911-1942. The Pinyin form of the author's name will be Xiao, Hong. Should the author be retained in PL2740.N3, or reclassed according to the Pinyin form?

Commentaries on individual persons or works are supposed to class with the original call numbers (F570), which in most cases are based on Wade-Giles. If we freeze the old numbers, we will have to establish new numbers based on the Pinyin forms for authors and works before we can proceed to catalog any commentaries.

New Shelflisting Methods

We understand that LC has been looking for a new shelflisting procedure, which will be very different from current practice. We would like to know to what extent it will affect call numbers. If we carry out the Pinyin conversion ahead of the new shelflisting procedure, the Pinyin-based call numbers will certainly suffer another major setback. We cannot afford to keep abandoning and changing our call numbers every so often.

Other Priorities

At a time when library budgets and cataloging staff are facing deep cuts throughout U.S. libraries (LC included), we certainly cannot afford to do everything we want to do at the same time. We have to sort out our priorities and plan very carefully and wisely how to allocate our limited resources.

In an April 3rd E-mail to Philip Melzer, the Chairperson of the CEAL Technical Processing Committee, Mr. Eugene Wu, the Librarian of the Harvard-Yenching Library, listed several top priorities facing U.S. East Asian libraries today: retrospective conversion, a CJK local OPAC system, and CJK characters in the Name Authority File. I can add another one: an integrated library online system capable of handling CJK scripts.


The National Library of Australia is to be complimented for its commitment and professionalism, as evidenced by its fine statements about Pinyin syllable linkage and its conversion of all its Chinese bibliographic records to Pinyin form. If there is any lesson to be learned, I believe it is this: say only what you can deliver, and deliver what you have already said.

This article was originally posted on eastlib on June 17, 1997. These are the author's personal viewpoints on Pinyin conversion. They do not represent the institutional position on the issue.
Copyright © 1997 James K. Lin
Submitted to CLIEJ on 17 June 1997.