@

CURRENT STATUS OF VIETNAMESE LANGUAGE PROCESSING

AND MULTILINGUAL PROCESSING

Country Report - Vietnam

@

Revised Version (September 28, 1997)

Global Information Infrastructure (GII)

for Equal Language Opportunity (MLIT '97)

Tokyo, 07-08 November 1997

@

@

Tran Luu Chuong, Nguyen Hoang, Ngo Thanh Nhan,

Do Ba Phuoc, Ngo Trung Viet

ITSC

@

@

@

1 Diffusion of Computers, especially Personal Computers, in Vietnam

Since 1990, working practice in Vietnamese offices has changed: reports and documents are now created electronically on individual personal computers with keyboard and mouse. Only from 1995 did PCs start to be connected into networks under various projects of the National Program on Information Technology. Work on a nationwide information network began in 1996 and is due to be finished by the end of 1997. In general, Vietnam has no information technology legacy because of its underdevelopment and its isolation from the world during the period 1960-1990. Vietnam has no mainframes, only a few old minicomputers. Most computers in use now are PCs.

Currently there are no official statistics on the installed base of computers in Vietnam, and different sources give very different figures. As reported by the Vietnam National Steering Committee on IT, by the end of 1996 there were 240,000 personal computers in Vietnam, half of which were made by reputable firms (Compaq, IBM, Apple, Hewlett-Packard, Acer, NEC, Fujitsu) and accounted for 78% of the total value. The other 22% of the value came from computers assembled locally with components manufactured in Southeast Asia. Computer density reached 3.2 units per 1,000 persons, equivalent to that of Indonesia a few years ago (3/1,000), yet far behind Malaysia (11/1,000) and Singapore (80/1,000).

The Ministry of Trade estimates the value of hardware at US$100 million in 1995 and US$150 million in 1996; the corresponding figures from International Data Corporation (IDC) Asia Pacific are US$100 million and US$140 million. An estimated 70,000 computers in 1995 and 140,000 in 1996 were assembled in Vietnam.

As of early 1997, computers are distributed as follows:

- State-owned bodies and businesses: 75%

- Institutes and non-civil branches: 10%

- Administrative units: 10%

- Households: 5%

The share of PCs connected to a network is estimated at 10% in 1995, 25% in 1996 and 70% by the end of 1997.

During 1980fs Vietnam has implemented 2 consecutive national IT application programs. The first, 1980-1985 concentrated on developing its own 8-bit microcomputer prototypes. The second, 1986-1990 concentrated on developing IT applications based on microcomputers and began its researches on IT standardisation. The current programme is called National Programme on Information Technology, promulgated in 1993 and with its Steering Committee belonging directly to the Primer Ministry. The programme promotes IT applications in the governmental sector as well as private sector and to develop Vietnam IT infrastructure and to build the base for IT industry.

2 Status of Vietnamese Language Solutions

@

Vietnam is a country of more than 50 ethnic groups. Many of them have their own scripts: for example, the Thai people have a script similar to that of Thailand but with some differences, the Cham people have a script of the Indic type, the Tay people have an ideographic script, and so on. With the development of new technology, especially microcomputers, many ethnic groups are interested in using their native language and script on computers.

Vietnam used Chu Nom, an ideographic script, for nearly 10 centuries. At the beginning of this century Vietnam changed to Quoc ngu, a Latin-based script, after more than 300 years of development by Western missionaries and Vietnamese intellectuals. In 1945 Quoc ngu became the official script of Vietnam, but the government also encourages minorities to use their native languages.

Today there is a new movement to restore the ancient treasury of literature and history. Almost all Vietnamese literary and historical works before 1920 were written in ideographs, and with the harsh climate and the passage of time these books are deteriorating. Vietnam therefore intends to transfer all of this ancient treasury into computers for better preservation. It is hoped that with Chu Nom available on computers, many people can use Chu Nom again in their work, for example in studying ancient literature, compiling genealogies, or performing rituals at pagodas.

More than 2 million Vietnamese also now live around the world, and they need to use Vietnamese to communicate with one another. Vietnam would therefore like to be able to exchange Vietnamese texts mixed with other languages, especially on computers and through computer networks such as the Internet.

@

@

3 Status of Application Software in Vietnam

@

Currently not all user interfaces of application software are localised for the Vietnamese language. Since 1980 many Vietnamese IT companies have emerged and contributed to the localisation of some foreign software. But the lack, at that time, of a national standard for a Vietnamese character code set made localisation much harder. In fact, every domestic company made its own code table for its localisation work, and by 1991 there were about 30 code sets for Vietnamese. Moreover, many US-based Vietnamese companies also created their own code tables, implemented in their word-processing software, which were accepted and used inside the country, especially in South Vietnam.

The above code tables can be divided into two groups:

(a) One-byte code tables: 3C30 (3C Corporation), Cyrillix, VW1, VSCII, Daisy, 2FONT, Vietkey, ACAD, ABC, etc. These code tables are based on encoding every visible glyph, regardless of linguistic structure.

(b) Two-byte code tables: 3C25, VW2, ATM2, VNij, VNI, etc. These code tables are based on encoding the generic component parts of the language.

This proliferation has made the situation chaotic and critical.

Faced with this situation, the National IT Programs paid great attention to IT standardisation activities, first of all to the Vietnamese code set. Several conferences on Vietnamese code sets have been held since 1985 with support from the National IT Programs and UNESCO. In 1991 a task force on a "Standard Vietnamese code table" was set up by the Ministry of Science and Technology. As a result, a draft Vietnamese Standard Code Set for Information Interchange was developed and then approved in 1993 as the first national IT standard, TCVN 5712:1993.

Also in 1993, to meet the growing need for IT standardisation, the General Department for Standardisation, Metrology and Quality Control (GDSMQC) of the Ministry of Science, Technology and Environment (MoSTE) established the Technical Committee on Information Technology, TCVN/JTC1, responsible for IT standardisation. TCVN/JTC1 has a two-year term and includes Vietnamese computer scientists from the academic and industrial sectors, inside and outside the country.

In 1995, with the establishment of the Steering Committee of the National Program on IT (SCNPIT), the Information Technology Standard Sub-Committee (ITSC) was also established by this Steering Committee; it works closely with TCVN/JTC1.

TCVN/JTC1 focuses on strategic, long-term IT standardisation issues, while ITSC deals with concrete, short-term issues and with IT standardisation problems related to the implementation of the National Program on IT. The separation and co-ordination of the work of TCVN/JTC1 and ITSC are important both for addressing the public's immediate need for standard-conforming software and for the long-term continuity of data stored and interchanged in national databases.

TCVN/JTC1 and ITSC participate in international standardisation activities such as SC2, SC2/WG2, SC2/WG2/IRG and SC2/WG3. They also maintain good relations with Microsoft, IBM, Oracle, Adobe and others in their localisation activities in Vietnam.

Following ITSC recommendations adopted by SCNPIT and MoSTE, all software used in the Vietnamese administration must be localised with Vietnamese language capability, although the user interface may remain in English. That means users can process their information in Vietnamese (Quoc ngu) under the Windows environment with Word, Excel, PowerPoint, Access and so on. All application software produced in the country, however, must have a user interface in Vietnamese.

Although software localisation must officially be based on TCVN 5712, much software is still based on code pages other than TCVN 5712. MoSTE is now taking the necessary measures to put an end to this chaotic situation.

@

@

4 Strategies on IT localisation

We understand that there are usually differences between formal and de facto standards, and that this is not good for the development of IT. So we try to keep formal and de facto standards compatible in our environment. This means that in all standardisation activities the role of people from industry is considered as important as that of academics, and IT standardisation staff should include both.

TCVN/JTC1 and ITSC develop good co-operation with IT companies in localisation; this is one of our strategies on IT localisation. Through official government decisions on the use of standards, domestic IT companies are required to conform to standards first in the governmental sector, and other sectors are then encouraged to follow. Between 1993, when the national standard on the Vietnamese code set TCVN 5712:1993 was promulgated, and 1995, some standard-conforming software for Vietnamese word processing was developed: ABC, CADPRO, etc.

Besides that, TCVN/JTC1 and ITSC keep close relations with major international IT companies such as Microsoft, IBM and Adobe. TCVN/JTC1 and ITSC have held meetings with these companies to discuss our support for their localisation activities. Some fundamental work had to be done before localisation could be realised:

- a code page for Vietnamese

- a Vietnamese keyboard layout

- a set of Vietnamese localisation algorithms, covering date, time, and length, volume and weight measurements

- a conversion between the Gregorian (Western) and lunar (Vietnamese) calendars

- a set of Vietnamese icon names and dialogue boxes in Vietnamese, together with a glossary of Vietnamese and English terms for all graphical user interfaces (GUIs)

- a sorting algorithm for Vietnamese, which is now universal for Latin-based scripts with accents

- a spell-checking algorithm and spell-checking dictionary for monosyllabic Latin Vietnamese

As a result of these activities, Microsoft developed a code page for Vietnamese, cp 1258, in October 1996. Localisation has now been completed for the official releases of the Vietnamese editions of Windows 95 and Office 97, available since July this year. Microsoft's next localisations for Vietnamese will be Memphis and Office 98.

We discussed with IBM Vietnam and IBM Canada the preparation of an IBM code page for Vietnamese for use in the AIX operating system. IBM has adopted a new code page for Vietnamese, cp 1129, released in March 1997. It differs from Microsoft's code page cp 1258 at 3 code points in order to include all French characters. By the end of this year IBM will release a Vietnamese version of the AIX operating system, and IBM plans to go on to localise the AS/400 for Vietnamese.

Much work has been done with Adobe to correct our national standards on Chu Nom and to make outline fonts for Chu Nom. Chu Nom is our ideographic script and is being incorporated into ISO/IEC 10646.

@

@

5 Standards for Vietnamese and Multi-lingual processing

@

Character code set for Quoc ngu (Latin based characters) - TCVN 5712:1993

Vietnamese language processing on computers is of the greatest interest to IT specialists in Vietnam and elsewhere in the world. Before 1990, all proposals for a Vietnamese code set were based on a precomposed encoding scheme in an 8-bit environment. This makes it very difficult to keep Vietnamese compatible with English and with the other requirements of the 8-bit ISO standards. To represent all Vietnamese visible glyphs, 134 additional precomposed forms need to be encoded, while only 128 code points remain in the upper half of an 8-bit code set. Of course, in a 16-bit environment there is no problem, because all Vietnamese characters have been encoded in Unicode and ISO/IEC 10646, although scattered across several blocks. After much discussion, the conclusion is that with a precomposed encoding scheme there is no way to keep Vietnamese compatible with ISO 8859. A local alternative with a precomposed encoding scheme could be used, but it does not meet our strategy of keeping Vietnamese compatible with international and regional standardisation.
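The figure of 134 can be checked with a short count. The sketch below is only an illustration: it assumes the modern Quoc ngu inventory (12 vowel letters, 5 tone marks and the consonant đ) and uses Unicode combining characters from Python's standard unicodedata module, none of which is taken from the standards discussed here. It builds every precomposed letter in both cases and keeps those that fall outside 7-bit ASCII.

```python
import unicodedata

# Assumed inventory: 12 vowel letters, normalised so each is one code point.
VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
# Assumed tone marks: grave, hook above, tilde, acute, dot below.
TONE_MARKS = ["\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]


def precomposed_letters():
    """Return every vowel, with and without a tone mark, plus đ and Đ."""
    letters = set()
    for vowel in VOWELS + VOWELS.upper():
        letters.add(vowel)
        for tone in TONE_MARKS:
            # NFC folds vowel + combining tone into one precomposed character.
            letters.add(unicodedata.normalize("NFC", vowel + tone))
    letters.update("đĐ")
    return letters


def needed_beyond_ascii():
    """Precomposed forms that do not already exist in 7-bit ASCII."""
    return sorted(ch for ch in precomposed_letters() if ord(ch) > 127)


extra = needed_beyond_ascii()
print(len(extra), "precomposed forms needed; 128 code points available")
```

Under these assumptions the inventory has 146 letters, 12 of which (a, e, i, o, u, y in both cases) are already in ASCII, which leaves the 134 forms quoted above.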

The solution adopted in the national standard TCVN 5712:1993 leans in theory towards the elegance of the combining method, but it actually consists of three tables (VN1, VN2 and VN3) that address backward compatibility with all current encoding schemes and software in public use. TCVN 5712:1993 is now recognised by MoSTE as the official code table for use in all government sectors. MoSTE also encourages the private sector to follow this code table.

Draft of new TCVN standard, Vietnam's proposal for 8859-V

Currently, TCVN/JTC1 is considering an ITSC proposal for a new 8-bit Vietnamese code set. The code set makes full use of the combining method, offers no backward compatibility with other encoding schemes, and lets Vietnamese coexist with English, French, German, Spanish, Swedish and so on (in comparison, TCVN 5712:1993 lets Vietnamese coexist only with English). To obtain national standard status, the draft must pass through the standardisation process. A technical group has been established to focus on implementing new technologies for the combining method and encouraging their application.

The main advantage of this proposal is that every Vietnamese character with a tone mark is represented by composing the base character and its tone mark; the other characters remain precomposed. With the combining method, representing Vietnamese completely requires only 14 additional codes for 6 vowels and 1 consonant (in both upper and lower case) and 5 codes for tone marks. All of these characters can therefore be placed in G1, leaving room for characters from other languages such as English, French, German and Spanish. The whole code table complies with the architecture of ISO 8859-x.
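A minimal sketch of the combining idea follows. It assumes that Unicode combining characters stand in for the 5 tone-mark codes; the actual code positions of the draft are not given in this report, so nothing here reproduces the proposed table. Each toned letter is split into a base letter (still precomposed with its breve, circumflex or horn) followed by one tone mark, and the distinct non-ASCII codes needed for the whole alphabet are then counted.

```python
import unicodedata

# Assumed stand-ins for the 5 tone-mark codes:
# grave, hook above, tilde, acute, dot below.
TONE_MARKS = {"\u0300", "\u0309", "\u0303", "\u0301", "\u0323"}


def split_tones(text):
    """Rewrite text so that each tone mark follows its base letter."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        parts = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in parts if c not in TONE_MARKS)
        tone = "".join(c for c in parts if c in TONE_MARKS)
        # Recompose the base so that â, ă, ê, ô, ơ, ư stay single codes.
        out.append(unicodedata.normalize("NFC", base) + tone)
    return "".join(out)


def codes_beyond_ascii(text):
    """Distinct non-ASCII codes used once tone marks are split off."""
    return {c for c in split_tones(text) if ord(c) > 127}


# Whole alphabet: 12 vowels in both cases, each with no tone or one tone, plus đ and Đ.
vowels = unicodedata.normalize("NFC", "aăâeêioôơuưy")
alphabet = "".join(
    v + t for v in vowels + vowels.upper() for t in [""] + sorted(TONE_MARKS)
) + "đĐ"
print(len(codes_beyond_ascii(alphabet)))
```

With these assumptions the count is 19: the 14 letter codes plus the 5 tone marks described above, which fits easily into the 96 positions that ISO 8859 reserves for G1.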

At the same time, Vietnam has submitted its proposal, called 8859-V, to ISO/IEC/SC2/WG3 for consideration as a new international standard for an 8-bit character set, based on ISO 8859 but using the combining method. The reason is that Vietnam believes other languages, such as Thai or Arabic, may also need an 8-bit code set with the combining method. Vietnam has been invited to present its proposal on this matter to SC2.

Character code set for Chu Nom (ideographic characters)

Two 16-bit standard code sets for Chu Nom have been promulgated, TCVN 5773:1993 and TCVN 6056:1995. Both cover about 6,000 characters, of which nearly 1,700 are Nom-proper characters. The two standards are being revised in conjunction with the work of the IRG. A new standard for Chu Nom is being developed that covers 3,000 more characters.

Character code set for Cham script

At present this is a proposal to ISO/IEC/SC2/WG2. In practice, however, Cham script has already been implemented on microcomputers and used to print some textbooks and dictionaries.

@

Keyboard

Vietnam has developed a national standard for the Vietnamese keyboard, TCVN 6064:1995. This keyboard layout is based on the international keyboard layout ISO 9995 and overloads some keystrokes to produce Vietnamese-specific characters (6 additional vowels and 1 consonant, plus 5 tone marks). Vietnamese-specific characters replace some traditional characters in the ISO 9995 international layout, which are moved to other levels (the third and fourth levels). A proposal from IBM for a Vietnamese keyboard further refines the layout for use with French characters.

Besides the national standard keyboard layout, the most popular input methods for Quoc ngu are TELEX and VNI. TELEX is an input method that was accepted in TCVN 6064:1995. It makes no change to the US keyboard layout but applies typing rules to produce Vietnamese characters; for example, â is produced by typing a twice in succession. VNI is the name of a software package that is very popular for using Vietnamese on microcomputers. Because it has been in use for a long time, people are accustomed to its method of entering Vietnamese characters.
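The TELEX typing rule can be sketched as a small keystroke rewriter. Only the rule "aa gives â" comes from this report; the other letter rules (aw, ee, oo, ow, uw, dd) and the tone keys (s, f, r, x, j) follow the commonly used TELEX convention and are assumptions here, not a transcription of TCVN 6064:1995. The function name `telex` is hypothetical, and tone handling is deliberately simplified: a tone key acts only on the vowel typed immediately before it.

```python
import unicodedata

# Letter rules: the previous letter plus the current key give one Vietnamese letter.
LETTER_RULES = {"aa": "â", "aw": "ă", "ee": "ê", "oo": "ô",
                "ow": "ơ", "uw": "ư", "dd": "đ"}
# Tone keys mapped to combining marks: acute, grave, hook above, tilde, dot below.
TONE_KEYS = {"s": "\u0301", "f": "\u0300", "r": "\u0309",
             "x": "\u0303", "j": "\u0323"}
VOWELS = set(unicodedata.normalize("NFC", "aăâeêioôơuưy"))


def telex(keys):
    """Convert a sequence of US-keyboard keystrokes into Quoc ngu text."""
    out = []
    for key in keys:
        prev = out[-1] if out else ""
        pair = prev.lower() + key.lower()
        if pair in LETTER_RULES:
            # Replace the previous letter, keeping its case.
            letter = LETTER_RULES[pair]
            out[-1] = letter.upper() if prev.isupper() else letter
        elif key.lower() in TONE_KEYS and prev.lower() in VOWELS:
            # Attach the tone mark to the vowel just typed.
            out[-1] = unicodedata.normalize("NFC", prev + TONE_KEYS[key.lower()])
        else:
            out.append(key)
    return "".join(out)


print(telex("Vieejt Nam"))
```

A production input method would also need an escape for literal key sequences (English words such as "as" would otherwise be rewritten) and would let the tone key be typed at the end of the syllable; both are left out of this sketch.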

For Chu Nom, the input method uses Quoc ngu as an intermediary. Vietnamese words are typed in Quoc ngu, and a list of Nom characters is then displayed so that the user can choose the suitable one. This method is still being implemented, however, and is not yet widely used.
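The two-step method described above (type Quoc ngu, then pick from a candidate list) can be sketched as a dictionary lookup. The reading table below is a tiny hypothetical sample with a few well-known ideographs chosen purely for illustration; it is not taken from TCVN 5773:1993 or TCVN 6056:1995, and the function names are assumptions.

```python
import unicodedata


def _nfc(text):
    return unicodedata.normalize("NFC", text)


# Hypothetical sample table: Quoc ngu syllable -> candidate ideographs.
# A real system would load a full reading dictionary instead.
CANDIDATES = {
    _nfc("nam"): ["南", "男"],
    _nfc("việt"): ["越"],
    _nfc("nôm"): ["喃"],
}


def nom_candidates(syllable):
    """Return the candidate list shown to the user for one typed syllable."""
    return CANDIDATES.get(_nfc(syllable.strip().lower()), [])


def choose(syllable, index=0):
    """Return the selected candidate, or the Quoc ngu text if there is none."""
    options = nom_candidates(syllable)
    if not options:
        return syllable
    return options[min(index, len(options) - 1)]


print(nom_candidates("Nam"), choose("nam", 1))
```

The design point is that the user never types an ideograph code directly: the Quoc ngu reading is the key, and ambiguity is resolved interactively by the selection index.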

Font

Fonts for Quoc ngu: Vietnamese fonts are based on international fonts for English. Vietnamese font producers add Vietnamese tone marks and make some modifications to existing fonts. In fact, there is no professional font producer.

Fonts for Chu Nom: at present fonts for Chu Nom exist in bitmap form for TCVN 5773:1993 and TCVN 6056:1995. Outline fonts for these standards are being developed in co-operation with Adobe.

Fonts for Cham script: a font exists only in bitmap form and as a prototype.

Ordering for Quoc ngu

Ordering for Quoc ngu is defined in the national standard TCVN 5712:1993; it is based not only on the letters but also on the tone marks.
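A two-level ordering of this kind can be sketched as a sort key that compares the letters first and the tone marks second. The letter order (the 29-letter Quoc ngu alphabet) and the tone order (no tone, grave, hook above, tilde, acute, dot below) used below follow common Vietnamese dictionary practice and are assumptions; this report does not reproduce the exact rules of TCVN 5712:1993.

```python
import unicodedata

# Assumed letter order and tone order (see the note above).
ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêghiklmnoôơpqrstuưvxy")
LETTER_RANK = {letter: rank for rank, letter in enumerate(ALPHABET)}
TONE_RANK = {"": 0, "\u0300": 1, "\u0309": 2, "\u0303": 3, "\u0301": 4, "\u0323": 5}


def split_tone(ch):
    """Split one character into (letter without tone, tone mark or '')."""
    parts = unicodedata.normalize("NFD", ch)
    tone = "".join(c for c in parts if c in TONE_RANK)
    base = "".join(c for c in parts if c not in TONE_RANK)
    return unicodedata.normalize("NFC", base), tone


def sort_key(word):
    """Primary key: the letters. Secondary key: the tone marks."""
    letters, tones = [], []
    for ch in unicodedata.normalize("NFC", word).lower():
        base, tone = split_tone(ch)
        letters.append(LETTER_RANK.get(base, len(ALPHABET)))
        tones.append(TONE_RANK.get(tone, 0))
    return (letters, tones)


words = ["má", "mơ", "ma", "mạ", "me", "mà", "mã", "mả"]
print(sorted(words, key=sort_key))
```

Because the tone ranks form the second element of the key, words that differ only in tone sort next to each other, while a difference in letters always takes precedence.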

Bi-lingual and multi-lingual environment

In practice, a bi-lingual environment exists for Vietnamese and English. Every application used in Vietnam must work for both Vietnamese and English. The traditional approach is to make drivers for Vietnamese (keyboard and fonts) and attach them to the English operating system with a switch to turn them on and off. The standard approach, taken by international IT manufacturers, is the localisation process. Every localised system is intended to meet the characteristics of Vietnamese and the bi-lingual requirements. But the problem of a multi-lingual environment is not yet fully solved. Chu Nom is not yet a component of the computing environment, although there is real demand for it. For various reasons, the problem is not yet of interest to either domestic or international IT companies in Vietnam. In Vietnam's present situation, we cannot overcome this difficulty unless we join the efforts of many countries in the same situation to address and solve the problem in a larger and broader manner.

A working group including leading IT companies in the country, such as Microsoft, IBM, Oracle, FPT, Batin, LacViet, TD&T and VietSoft, was set up under ITSC in September to push forward efforts to create a multi-lingual environment. The first task of the group is to define clearly, and then address, the problems of Vietnamese and information technology and, more generally, to specify the requirements of a multi-lingual environment and set objectives for the future development of IT activities in Vietnamese and in other languages. A draft specification on Vietnamese and Information Technology has been proposed for discussion, part of which is dedicated to creating a multi-lingual environment.

@

6 Multilingual Information Technology for Vietnam and other countries

@

On the one hand, a multi-script and multi-lingual information processing environment is in fact an urgent need for Vietnam. Until now we have had great difficulty in creating such an environment. IT companies in the country are interested only in the current writing system, Quoc ngu. Although the need to use Chu Nom in the historical and literary fields is great, and although the government strongly supports restoring and preserving the ancient treasury, Vietnam lacks the techniques and environment for multi-script and multi-lingual information processing. That also limits the impact of information technology on many ethnic groups in Vietnam. There are some proposals and projects on the matter, such as a project proposal from the Toyota Foundation on creating a multi-lingual environment including Vietnamese, Japanese and Chinese. But because of the scale of the problem, it cannot be resolved by the isolated efforts of individual companies without deeper and broader thinking about the fundamental issues, such as those addressed by MLIT.

On the other hand, the need for information interchange between countries in the region will certainly increase in the near future in step with economic development. A multi-lingual information technology environment is therefore necessary for development and for equality between countries. A good starting point would be standardisation for native languages in harmony with other countries in the region, in order to find a rational framework for other activities. An organisation like the MLIT symposium is necessary and useful.

The Multi-lingual Information Technology symposium is a good opportunity for Vietnam to share in and join the regional interest in IT standardisation, and also to meet Vietnam's urgent needs. Vietnam highly appreciates the initiative and will participate actively in these standardisation activities.

Hanoi, revised in September 1997

@


___

___