History and Prospect of Chinese Romanization

Benjamin AO
First Byte

I. History of Chinese Romanization

Chinese romanization refers to the phonetic representation of Chinese language material in the Roman alphabet. Sporadic romanization of Chinese words started way before the Renaissance, when westerners like Marco Polo came into contact with the Chinese culture and brought back Chinese goods like silk, tea, porcelain, etc. as well as stories about China's people, places and natural wonders.

Systematic Chinese romanization was begun in the early 17th century by Jesuit priests such as Matteo Ricci (1552-1610, Italian), Nicolas Trigault (1577-1628, French) and others, who came to China to learn the Chinese language and to promote Christianity. Their efforts were later joined by other westerners. In 1867, Thomas F. Wade, the Chinese language secretary in the British embassy to China, published a book called Yuyan Zi Er Ji (语言自迩集) ["Teach Yourself Chinese"], in which he used a romanization system adapted from the 1815 system by the English priest R. Morrison. Forty-five years later, H. A. Giles published his Chinese-English Dictionary, in which he used Thomas Wade's romanization system with slight modifications. This is how the famous Wade-Giles system came about.

Realizing that the traditional Chinese writing system hampered China's mass education, many Chinese intellectuals launched a Chinese romanization movement of their own early in this century. Fine scholars like Lu Xun (鲁迅), Hu Shi (胡适), Li Jinxi (黎锦熙), Lin Yutang (林语堂), Zhao Yuanren (赵元任), Qian Xuantong (钱玄同), etc., were all ardent supporters of and participants in this movement. Numerous designs were proposed from 1897 onward. This effort culminated in 1928, when the Education Ministry of the nationalist government announced the Gwoyeu Romatzyh Pinin Faashyh, which was designed by Zhao Yuanren, the best linguist China has ever produced, and decreed that it would be used as the standard for Chinese romanization. Gwoyeu Romatzyh (国语罗马字) does not use any diacritic tone marks, but employs an ingeniously designed set of tonal spellings to mark tones. It has been very popular among teachers and students of the Chinese language in the US, though because of decades of war and turmoil and a general lack of foresight, it has never enjoyed much popular usage inside China.

At the same time, many communist revolutionaries were also sympathetic to the idea of romanization. In 1931, Qu Qiubai's Latinhua xinwenz (瞿秋白的拉丁化新文字) was adopted at a conference by Chinese nationals in the Soviet far east. This scheme has no provision for marking tones, and is therefore much simpler than Gwoyeu Romatzyh. It was used to educate workers in the Soviet far east, and was also extensively used in Yan'an (延安), the base and stronghold of the Chinese communist revolution. After 1949, the communist government set up a committee to study the issue of Chinese romanization. After years of deliberation, a new scheme called Hanyu Pinyin Fang'an (汉语拼音方案) was announced in 1958. For many years, Hanyu Pinyin was used mainly to replace Zhuyin Fuhao (注音符号) as a pronunciation guide in elementary education and in dictionaries. In the meantime, Wade-Giles continued to be used in English language publications until 1979, when the International Organization for Standardization (ISO) passed a resolution adopting Hanyu Pinyin as the international standard for Chinese romanization.

In addition to the four romanization systems mentioned above, there is another Chinese romanization system called the Yale system. This system was created in 1948 and first used in a textbook called Speak Chinese. It was popular in some American institutions where Chinese was taught due to its relative simplicity.

In 1986, Taiwan's Education Ministry announced a simplified version of Gwoyeu Romatzyh. This system abandons the tonal spellings of Gwoyeu Romatzyh, and instead uses the same tone marks as Pinyin and Zhuyin Fuhao to mark tones.

None of the above-mentioned romanization systems has any provision for distinguishing homophonous characters. But in 1991, this author published a simple Chinese word processor called CWP, which uses a modified version of Pinyin called CWP Pinyin. In addition to simplifying Pinyin and removing all the diacritical marks, this Pinyin system is capable of "spelling out" every single character.

Appendix I is a comparison of the seven romanization systems discussed above.

II. Nature of Romanization

In order to evaluate the various romanization systems by comparison, a little understanding of the nature of romanization is in order here.

Human speech consists of time-varying complex sound waves which can be divided into pieces called segments based on their acoustic properties. For example, there are two segments in the word "ba" and three in the word "ban". In addition, each segment may have a different pitch, length or intensity. These are called suprasegmentals. Segments and suprasegmentals that are different enough to cause a difference in meaning are called phonemes. For example, in Chinese, "ba" and "pa" have different meanings, which is caused by the difference between [b] and [p]. Therefore, [b] and [p] are two separate phonemes. Similarly, since "bi", "bo" and "ba" mean different things, the vowels [i], [o] and [a] are different segmental phonemes; and since "bi1" and "bi2" also mean different things, tone one and tone two are different suprasegmental phonemes as well.

Phonemes are what make human language understandable, and they are truly the building blocks of human language. Every human language in the world has no more than a few dozen phonemes to carry out its communicative functions. When each of these phonemes is represented with a symbol or combination of symbols drawn from an alphabet, the result is an alphabetic writing system. And if the alphabet happens to be the Roman alphabet, then the result is a Roman or romanized alphabetic writing system.

In comparison, the Chinese writing system takes the syllable as its basic unit of representation. Since every syllable consists of a vowel with a certain number of consonants, there are many more syllables than phonemes in any language. In standard Chinese, there are well over 1200 syllables. In order to represent them in writing, at least 1200 writing symbols are needed. Since each syllable can often be represented with different characters indicating different meanings, the actual number of Chinese characters is a lot larger. The huge number of writing symbols one must learn in order to read and write Chinese is the chief cause of the difficulty in learning and using the Chinese writing system.

III. Evaluation of Chinese Romanization Systems

If visually representing the phonemic system of a language with the Roman alphabet is the minimum requirement of any romanization system, then in principle any Chinese romanization system is as good as any other as long as it gives a truthful representation of the Chinese phonemic system. Thus, of the seven Chinese romanization systems listed in Appendix I, Beila is the only one that's inadequate, because by design it lacks representation for the tonal phonemes. All the other systems are theoretically sound, because they all have representations for all the Chinese phonemes. (The CWP Pinyin system distinguishes the contrast between zh, ch, sh and z, c, s not in the basic syllable form, but through the homonym identifiers.)

On the practical side, however, there is a difference in the way these systems represent the Chinese phonemic system.

When diacritic marks are used to distinguish phonemes, e.g. [u:] vs [u] in Wade-Giles and Pinyin, [ch'] vs [ch] in Wade-Giles, etc., they are often omitted either because it's troublesome or because it's impossible to use them. As a result, phonemic distinctions are lost, so that Lu Xiaojie (鲁) and Lu: Xiaojie (吕) are both spelled "Lu Hsiao Chieh", and Zhu Taitai (朱太太), Ju Taitai, Chu Taitai (褚太太) and Qu Taitai (曲太太) are all spelled "Chu Tai Tai", etc. When it comes to marking tones in Pinyin, it's the same situation, so that when tone marks are omitted, Shi1 Xiansheng, Shi2 Xiansheng (石先生), Shi3 Xiansheng (史先生) and Shi4 Xiansheng (侍先生) all become an identical "Shih Hsien Sheng". In library science, this means that a lot more irrelevant information will turn up when a subject or keyword search is requested.
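The loss of distinctions described above can be illustrated with a minimal Python sketch. It uses this paper's ASCII notation (an apostrophe for aspiration, "u:" for the umlauted vowel); the function names and the small surname table are assumptions made for illustration, not part of any cataloguing standard.

```python
def strip_diacritics(wade_giles_syllable):
    """Drop the aspiration sign (') and the umlaut (written ':' here),
    as happens when the marks are troublesome or impossible to type."""
    return wade_giles_syllable.replace("'", "").replace(":", "")


# Hypothetical sample: Pinyin surname -> fully marked Wade-Giles form.
SURNAMES = {"zhu": "chu", "ju": "chu:", "chu": "ch'u", "qu": "ch'u:"}


def collapsed_groups(mapping):
    """Group the Pinyin forms by what remains once the marks are dropped."""
    groups = {}
    for pinyin, wade_giles in mapping.items():
        groups.setdefault(strip_diacritics(wade_giles), []).append(pinyin)
    return groups


print(collapsed_groups(SURNAMES))
```

All four surnames end up under the single key "chu", which is why a search on the unmarked form retrieves so many irrelevant records.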

The use of such diacritic marks also poses serious technical challenges. There are at least three variants of the aspiration sign in Wade-Giles, i.e. [ ' ], [ ‘ ] and [ ’ ], and at least five alternative ways of writing u:, i.e. u:, yu, v, U or simply u. As for tone marks in Pinyin, so far there is no standard or widely used coding scheme for them. Big5 has no provisions for them whatsoever, nor does any existing ASCII standard. GB defines the lower case characters but not the upper case ones. Unicode has only recently agreed to expand its international character set to include Pinyin characters with tone marks. Furthermore, even if there were a standard coding scheme for those characters, there is still no easy way to type them in.

Faced with these difficulties, people sometimes use numerals in lieu of tone marks to indicate tones. This is also the de facto tone marking method in Wade-Giles and Yale. However, since the numerals don't blend in well with the rest of the romanized text, they are easily perceived as external to the syllable and are prone to error or omission.

From the perspective of representational accuracy and ease of technical processing, Pinyin is only slightly better than Wade-Giles, while Gwoyeu Romatzyh and CWP Pinyin are by far superior to both standard Pinyin and Wade-Giles. Unfortunately, Gwoyeu Romatzyh is not as well-known as standard Pinyin, and neither is CWP Pinyin.

If Pinyin must be used as the new standard for romanized cataloguing of Chinese language materials, it should be used with tones marked. To get around the problem of diacritic marks, we may use numerals to mark tones, and we need to handle the diacritic mark in [u:].
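One possible way to carry out this suggestion is sketched below in Python: tone marks are replaced by a trailing numeral and the umlaut is written as "u:", following the notation used in this paper. The function name and the placement of the numeral at the end of the syllable are assumptions; the sketch relies only on the standard `unicodedata` module and handles one syllable at a time.

```python
import unicodedata

# Combining marks used for the four Pinyin tones.
TONE_MARKS = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute accent
    "\u030c": "3",  # caron
    "\u0300": "4",  # grave accent
}
DIAERESIS = "\u0308"  # the two dots of the umlauted u


def pinyin_to_numbered(syllable):
    """Convert one tone-marked Pinyin syllable to plain ASCII,
    e.g. a third-tone 'lü' becomes 'lu:3'."""
    tone = ""
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]   # remember the tone, drop the mark
        elif ch == DIAERESIS:
            letters.append(":")     # keep the u / u: distinction
        else:
            letters.append(ch)
    return "".join(letters) + tone


print(pinyin_to_numbered("sh\u00ed"))  # second-tone shi
print(pinyin_to_numbered("l\u01da"))   # third-tone lu:
```

A syllable without a tone mark is returned without a numeral, so omitted tones stay visibly omitted instead of being guessed.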

IV. Technical Difficulties in Converting the Wade-Giles System

Once we decide to go ahead with converting Wade-Giles to Pinyin, we must be prepared to deal with many technical challenges.

First of all, we need to deal with mistaken or sloppy usage of Wade-Giles. This problem is especially difficult when such usage results in ambiguity. For example, if the syllable ch'u: is written as chu, then it will be quite difficult to correctly convert it to qu instead of zhu. Of course, if such usage is consistent, then we can still correct it relying on contextual information, but if it's occasional and not consistent, then it's very difficult to deal with.

Secondly, we need to identify and convert forms that are regularly used in Wade-Giles texts but do not conform to the Wade-Giles standard, such as Soochow, Peking, etc. We need an exception dictionary where all these irregular forms and their correct Pinyin correspondents are listed. To catch such forms that we fail to foresee, the converter must set a flag whenever it comes across a form that doesn't conform to the Wade-Giles standard and is not listed in the exception dictionary. There can be quite a few such forms, and the amount of post-editing work can be tremendous, depending on how successfully we deal with the next challenge.

And the next challenge is to identify non-Wade-Giles forms that must not be converted. The problem here is that some such forms, including common English forms like China or Chinese and many Pinyin forms, may appear to be legitimate Wade-Giles forms to the automatic converter. Such forms need to be identified and skipped. This can be quite difficult, especially when there are not enough cues in the context. For example, if for whatever reason, an author's name 李匡 is entered in Pinyin form as Li Kuang, then how can one stop the automatic converter from assuming it's a Wade-Giles form and converting it into Li Guang?

Finally, if it's agreed that tone marking will be included in the Pinyin text converted from Wade-Giles, then getting that done can be quite a challenge. It's certainly possible to build an automatic converter with enough artificial intelligence to retrieve the lost tonal information from the context, especially if there are machine-readable accompanying texts in Chinese characters. However, to ensure a hundred percent accuracy in the conversion, a significant amount of time must be spent training the system.
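When accompanying Chinese characters are available, the simplest form of tone recovery is a table lookup, sketched below in Python. This is only an illustration under stated assumptions: the tiny `CHAR_READINGS` table and the function name are hypothetical, and a syllable is left marked with "?" whenever the table cannot settle the tone on its own.

```python
# Hypothetical reading table: character -> possible toned Pinyin readings.
CHAR_READINGS = {
    "石": ["shi2", "dan4"],
    "史": ["shi3"],
}


def add_tones(toneless_syllables, characters):
    """Pair each toneless syllable with its character and look up the tone.
    Anything the table cannot resolve is flagged with '?' for post-editing."""
    result = []
    for syllable, char in zip(toneless_syllables, characters):
        matches = [
            reading
            for reading in CHAR_READINGS.get(char, [])
            if reading.rstrip("1234") == syllable
        ]
        if len(matches) == 1:
            result.append(matches[0])
        else:
            result.append(syllable + "?")
    return result


print(add_tones(["shi", "shi"], "石史"))
print(add_tones(["shi"], "侍"))  # not in the table, so it is flagged
```

Characters with several readings that share the same toneless spelling would still need contextual judgment, which is where the training effort mentioned above comes in.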

Thus, a good Wade-Giles to Pinyin converter should be able to convert every Wade-Giles syllable to the corresponding Pinyin syllable, preferably with tones marked, and should keep a dictionary of exceptions so that forms that would otherwise be wrongly converted are skipped and forms that would otherwise be skipped are properly converted. Some degree of human intervention is needed in achieving that goal.
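The overall shape of such a converter might look like the following Python sketch. The three tables are tiny hypothetical samples (a real converter would need the full Wade-Giles syllabary), the names are invented for illustration, and syllables again use the apostrophe and "u:" notation of this paper.

```python
# Hypothetical sample tables; none of them is complete.
WG_TO_PINYIN = {
    "li": "li", "kuang": "guang", "k'uang": "kuang",
    "chu": "zhu", "ch'u": "chu", "chu:": "ju", "ch'u:": "qu",
}
EXCEPTIONS = {"peking": "Beijing", "soochow": "Suzhou"}  # irregular forms
DO_NOT_CONVERT = {"china", "chinese"}                    # forms to skip


def convert_token(token):
    """Return (output form, status) for a single token."""
    key = token.lower()
    if key in DO_NOT_CONVERT:
        return token, "skipped"
    if key in EXCEPTIONS:
        return EXCEPTIONS[key], "exception"
    if key in WG_TO_PINYIN:
        pinyin = WG_TO_PINYIN[key]
        if token[0].isupper():
            pinyin = pinyin.capitalize()
        return pinyin, "converted"
    # Neither standard Wade-Giles nor a listed exception: flag for a human.
    return token, "flagged"


def convert_text(text):
    return [convert_token(token) for token in text.split()]


print(convert_text("Peking China Li Kuang Hsiao"))
```

Note that "Li Kuang" is converted to "Li Guang" here, which is exactly the failure described earlier when the name was already in Pinyin; only an entry in the skip list or a human check can prevent it. "Hsiao" is flagged simply because the sample table is incomplete.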

In addition to difficulties in the conversion from Wade-Giles, another thorny issue in Pinyin application is whether a Pinyin text should be written syllable by syllable, or with syllables aggregated into words. In principle, it's possible to write Pinyin texts syllable by syllable. The problem is that this practice would render keyword search very ineffective. For example, a search for "Dong Han" (东汉) would retrieve titles like "Yuan Dong Han Ying Da Ci Dian" (远东汉英大词典), which has nothing to do with "Dong Han". Admittedly, word segmentation is not an easy task. Although several manuals for word segmentation have been published [including RLG Chinese Aggregation Guidelines (The Research Libraries Group, Inc., 1987), Basic Rules of Chinese Pinyin Orthography (State Education Commission and State Language Commission, 1988), and Contemporary Chinese Language Word Segmentation Specification for Information Processing (National Standard, PRC, 1992)], none of them covers every single detail in this process, due to the complex nature of modern Chinese morphology and syntax. Weighed against the improvement in searching effectiveness, the effort made in word segmentation is worthwhile. Actually, in most cases, different people can make identical decisions in word segmentation with the help of published manuals and reliable dictionaries. An automatic word segmenter using a reliable and editable dictionary can also be very useful.
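The effect on keyword search can be shown with a short Python sketch. The segmenter below uses greedy forward longest matching against a small editable lexicon, which is just one simple technique chosen for illustration and not a method prescribed by any of the manuals cited above; the lexicon entries and function names are assumptions.

```python
# Hypothetical, editable lexicon of aggregated Pinyin words.
LEXICON = {"yuandong", "hanying", "cidian", "donghan"}


def segment(syllables, lexicon, max_len=4):
    """Greedy forward longest match: at each position take the longest
    run of syllables found in the lexicon, else a single syllable."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


def contains_phrase(tokens, query):
    """True if the query tokens occur consecutively in the token list."""
    n = len(query)
    return any(tokens[i:i + n] == query for i in range(len(tokens) - n + 1))


title = "yuan dong han ying da ci dian".split()
words = segment(title, LEXICON)

print(words)
print(contains_phrase(title, ["dong", "han"]))  # syllable-by-syllable search
print(contains_phrase(words, ["donghan"]))      # word-based search
```

Searched syllable by syllable, the dictionary title is a false hit for "dong han"; once the title is aggregated into words, a search for "donghan" no longer retrieves it.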

V. Future of the Chinese Writing System and Chinese Romanization

As long as there are writing systems using the Roman alphabet in the world, there is always a need for Chinese romanization in areas like teaching Chinese as a foreign language, Chinese information cataloguing, etc. Whether a romanized writing system will replace the current Chinese writing system is something for our future generations to decide, perhaps hundreds of years down the road. Right now, it is very unlikely that this will happen any time soon.

However, the current status of the Chinese language is quite worrisome. It always takes a tremendous amount of time for anyone to master the Chinese language, yet as school curriculums change to accommodate the ever growing wealth of human knowledge, students are forced to spend less and less time on this subject. As a result, younger generations' proficiency in the Chinese language will continue to deteriorate, and the inheritance of the great Chinese cultural heritage is in serious trouble.

In the arena of teaching Chinese as a foreign language, the amount of time required to master the Chinese writing system scares away many potential learners of Chinese, and makes it difficult to make Chinese an international language.

Even more saddening is the status of Chinese in computer and information sciences. Because of the huge number of basic symbols in the Chinese character set, and because of the intrinsic incompatibility between Chinese characters and the keyboard, it's always easier to type English than Chinese even for people who are barely fluent in English. On the Internet, Chinese people discuss Chinese affairs or even the Chinese language in English. When Chinese speaking people are forced to write personal e-mail messages in English for their private communication inside China, something is seriously wrong.

There is no question that with the continuously growing wealth of English publications in science and technology worldwide, and in particular with the advent of the Internet, English is quickly becoming the global lingua franca. As a result, it's understandable that people want to learn and practice English or even send their children to English-only private schools, as is already happening in China now. However, if this signals a nationwide exodus from learning Chinese into using English, then the Chinese language and ultimately the preservation and inheritance of the Chinese culture are in grave danger.

When asked why Vietnamese instead of English is overwhelmingly used in soc.culture.vietnam, people maintaining this news group explained in their FAQ page:

Vietnamese is an integral part of the Vietnamese culture. To promote the Vietnamese language is to promote the Vietnamese culture. Besides, Vietnamese is the only medium that can connect all the Vietnamese people living in different parts of the world, and many foreign participants in this group join in order to see the Vietnamese language in action.

In contrast, Chinese participants in the soc.culture.china news group have been arguing with each other and calling each other names for almost ten years, always in English. This is truly shameful.

In order to better protect the Chinese cultural heritage, and in order to encourage Chinese speaking people to use the Chinese language as much as possible, I'm confident that the cause started by Hu Shi, Zhao Yuanren and their contemporaries in the 1920's will be carried on, and that our future generations will be able to choose between the traditional Chinese writing system and a romanized Chinese writing system in the best interest of the great Chinese nation and Chinese culture.

Appendix I

Note: The following table is best viewed with Internet Explorer. It seems the Netscape browser can't display some of the diacritics correctly.

WG (1912)  BL (1931)  YL (1948)  PY (1958)  GR2 (1986)  CWP (1991)  GR (1928)
o          o          o          o          o           o           o

WG (1912)  BL (1931)  YL (1948)  PY (1958)  GR2 (1986)  CWP (1991)  GR (1928)
1__-hm-/n-/l-/r- = mh-/nh-/lh-/rh-
2//-xi=y/_V, i=y/_n, i=yi, u=w/_V, u=wu V=Vr (exc.=m-/n-/l-/r-)
3vv-via=yea/-ea, io=yeo/-eo, iu=yeu/-eu, ie=yee/-iee,=ua=woa/-oa, ue=woe/-oe, uo=woo/-uoo, u=wuu/-uu, au=ao, ai=ae, i=yii/-ii,=ou=oou, ei=eei, V=VV
Vng=Vnq, Vn=Vnn, V=Vh, Vi=Vy, Vu=Vw, iV=yV/-iV, i=yi/-i, uV=wV/-uV, u=wu/-u


Note: This paper was originally presented at the CALA Transliteration Task Force meeting in San Francisco on June 28, 1997.
Copyright © 1997 Benjamin Ao
Submitted to CLIEJ on 17 July 1997.