NLS(7) Miscellaneous Information Manual NLS(7)
NAME
NLS - Native Language Support Overview
DESCRIPTION
Native Language Support (NLS) provides commands for a single worldwide operating system base. An internationalized system has no built-in assumptions or dependencies on language-specific or cultural-specific conventions such as:
 
Character classifications
Character comparison rules
Character collation order
Numeric and monetary formatting
Date and time formatting
Message-text language
Character sets
 
All information pertaining to cultural conventions and language is obtained at program run time.
 
“Internationalization” (often abbreviated “i18n”) refers to the operation by which system software is developed to support multiple cultural-specific and language-specific conventions. This is a generalization process by which the system is untied from calling only English strings or other English-specific conventions. “Localization” (often abbreviated “l10n”) refers to the operations by which the user environment is customized to handle its input and output appropriate for specific language and cultural conventions. This is a specialization process, by which generic methods already implemented in an internationalized system are used in specific ways. The formal description of cultural conventions for some country, together with all associated translations targeted to the native language, is called the “locale”.
 
NetBSD provides extensive support to programmers and system developers to enable internationalized software to be developed. NetBSD also supplies a large variety of locales for system localization.
Localization of Information
All locale information is accessible to programs at run time so that data is processed and displayed correctly for specific cultural conventions and language.
 
A locale is divided into categories. A category is a group of language-specific and culture-specific conventions as outlined in the list above. NetBSD supports the following six standard categories; LC_MESSAGES is specified by POSIX and the other five by ISO C:
 
LC_COLLATE
string-collation order information
LC_CTYPE
character classification, case conversion, and other character attributes
LC_MESSAGES
the format for affirmative and negative responses
LC_MONETARY
rules and symbols for formatting monetary numeric information
LC_NUMERIC
rules and symbols for formatting nonmonetary numeric information
LC_TIME
rules and symbols for formatting time and date information
 
Localization of the system is achieved by setting appropriate values in environment variables to identify which locale should be used. The environment variables have the same names as their respective locale categories. Additionally, the LANG, LC_ALL, and NLSPATH environment variables are used. The NLSPATH environment variable specifies a colon-separated list of directory names where the message catalog files of the NLS database are located. The LC_ALL and LANG environment variables also determine the current locale.
 
The values of these environment variables are strings of the form:
 
language[_territory][.codeset][@modifier]
 
Valid values for the language field come from the ISO 639 standard, which defines two-character codes for many languages. Some common language codes are:
 
Language Name     Code  Language Family
ABKHAZIAN         AB    IBERO-CAUCASIAN
AFAN (OROMO)      OM    HAMITIC
AFAR              AA    HAMITIC
AFRIKAANS         AF    GERMANIC
ALBANIAN          SQ    INDO-EUROPEAN (OTHER)
AMHARIC           AM    SEMITIC
ARABIC            AR    SEMITIC
ARMENIAN          HY    INDO-EUROPEAN (OTHER)
ASSAMESE          AS    INDIAN
AYMARA            AY    AMERINDIAN
AZERBAIJANI       AZ    TURKIC/ALTAIC
BASHKIR           BA    TURKIC/ALTAIC
BASQUE            EU    BASQUE
BENGALI           BN    INDIAN
BHUTANI           DZ    ASIAN
BIHARI            BH    INDIAN
BISLAMA           BI
BRETON            BR    CELTIC
BULGARIAN         BG    SLAVIC
BURMESE           MY    ASIAN
BYELORUSSIAN      BE    SLAVIC
CAMBODIAN         KM    ASIAN
CATALAN           CA    ROMANCE
CHINESE           ZH    ASIAN
CORSICAN          CO    ROMANCE
CROATIAN          HR    SLAVIC
CZECH             CS    SLAVIC
DANISH            DA    GERMANIC
DUTCH             NL    GERMANIC
ENGLISH           EN    GERMANIC
ESPERANTO         EO    INTERNATIONAL AUX.
ESTONIAN          ET    FINNO-UGRIC
FAROESE           FO    GERMANIC
FIJI              FJ    OCEANIC/INDONESIAN
FINNISH           FI    FINNO-UGRIC
FRENCH            FR    ROMANCE
FRISIAN           FY    GERMANIC
GALICIAN          GL    ROMANCE
GEORGIAN          KA    IBERO-CAUCASIAN
GERMAN            DE    GERMANIC
GREEK             EL    LATIN/GREEK
GREENLANDIC       KL    ESKIMO
GUARANI           GN    AMERINDIAN
GUJARATI          GU    INDIAN
HAUSA             HA    NEGRO-AFRICAN
HEBREW            HE    SEMITIC
HINDI             HI    INDIAN
HUNGARIAN         HU    FINNO-UGRIC
ICELANDIC         IS    GERMANIC
INDONESIAN        ID    OCEANIC/INDONESIAN
INTERLINGUA       IA    INTERNATIONAL AUX.
INTERLINGUE       IE    INTERNATIONAL AUX.
INUKTITUT         IU
INUPIAK           IK    ESKIMO
IRISH             GA    CELTIC
ITALIAN           IT    ROMANCE
JAPANESE          JA    ASIAN
JAVANESE          JV    OCEANIC/INDONESIAN
KANNADA           KN    DRAVIDIAN
KASHMIRI          KS    INDIAN
KAZAKH            KK    TURKIC/ALTAIC
KINYARWANDA       RW    NEGRO-AFRICAN
KIRGHIZ           KY    TURKIC/ALTAIC
KURUNDI           RN    NEGRO-AFRICAN
KOREAN            KO    ASIAN
KURDISH           KU    IRANIAN
LAOTHIAN          LO    ASIAN
LATIN             LA    LATIN/GREEK
LATVIAN           LV    BALTIC
LINGALA           LN    NEGRO-AFRICAN
LITHUANIAN        LT    BALTIC
MACEDONIAN        MK    SLAVIC
MALAGASY          MG    OCEANIC/INDONESIAN
MALAY             MS    OCEANIC/INDONESIAN
MALAYALAM         ML    DRAVIDIAN
MALTESE           MT    SEMITIC
MAORI             MI    OCEANIC/INDONESIAN
MARATHI           MR    INDIAN
MOLDAVIAN         MO    ROMANCE
MONGOLIAN         MN
NAURU             NA
NEPALI            NE    INDIAN
NORWEGIAN         NO    GERMANIC
OCCITAN           OC    ROMANCE
ORIYA             OR    INDIAN
PASHTO            PS    IRANIAN
PERSIAN (farsi)   FA    IRANIAN
POLISH            PL    SLAVIC
PORTUGUESE        PT    ROMANCE
PUNJABI           PA    INDIAN
QUECHUA           QU    AMERINDIAN
RHAETO-ROMANCE    RM    ROMANCE
ROMANIAN          RO    ROMANCE
RUSSIAN           RU    SLAVIC
SAMOAN            SM    OCEANIC/INDONESIAN
SANGHO            SG    NEGRO-AFRICAN
SANSKRIT          SA    INDIAN
SCOTS GAELIC      GD    CELTIC
SERBIAN           SR    SLAVIC
SERBO-CROATIAN    SH    SLAVIC
SESOTHO           ST    NEGRO-AFRICAN
SETSWANA          TN    NEGRO-AFRICAN
SHONA             SN    NEGRO-AFRICAN
SINDHI            SD    INDIAN
SINGHALESE        SI    INDIAN
SISWATI           SS    NEGRO-AFRICAN
SLOVAK            SK    SLAVIC
SLOVENIAN         SL    SLAVIC
SOMALI            SO    HAMITIC
SPANISH           ES    ROMANCE
SUNDANESE         SU    OCEANIC/INDONESIAN
SWAHILI           SW    NEGRO-AFRICAN
SWEDISH           SV    GERMANIC
TAGALOG           TL    OCEANIC/INDONESIAN
TAJIK             TG    IRANIAN
TAMIL             TA    DRAVIDIAN
TATAR             TT    TURKIC/ALTAIC
TELUGU            TE    DRAVIDIAN
THAI              TH    ASIAN
TIBETAN           BO    ASIAN
TIGRINYA          TI    SEMITIC
TONGA             TO    OCEANIC/INDONESIAN
TSONGA            TS    NEGRO-AFRICAN
TURKISH           TR    TURKIC/ALTAIC
TURKMEN           TK    TURKIC/ALTAIC
TWI               TW    NEGRO-AFRICAN
UIGUR             UG
UKRAINIAN         UK    SLAVIC
URDU              UR    INDIAN
UZBEK             UZ    TURKIC/ALTAIC
VIETNAMESE        VI    ASIAN
VOLAPUK           VO    INTERNATIONAL AUX.
WELSH             CY    CELTIC
WOLOF             WO    NEGRO-AFRICAN
XHOSA             XH    NEGRO-AFRICAN
YIDDISH           YI    GERMANIC
YORUBA            YO    NEGRO-AFRICAN
ZHUANG            ZA
ZULU              ZU    NEGRO-AFRICAN
 
For example, the locale for the Danish language spoken in Denmark using the ISO 8859-1 character set is da_DK.ISO8859-1. The da stands for the Danish language and the DK stands for Denmark. The short form of da_DK is sufficient to indicate this locale.
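
The locale name format described above can be illustrated with a small parser. This is only a sketch: struct locale_name, copy_field() and parse_locale_name() are hypothetical names invented for this example and are not part of the C library, and the field sizes are arbitrary assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the four fields of a locale name. */
struct locale_name {
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[32];
};

/* Copy at most dstlen - 1 bytes of src[0..len) and NUL-terminate. */
static void
copy_field(char *dst, size_t dstlen, const char *src, size_t len)
{
    if (len >= dstlen)
        len = dstlen - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/*
 * Split "language[_territory][.codeset][@modifier]" into its fields.
 * Absent fields are left as empty strings.  Returns 0 on success and
 * -1 if the mandatory language field is empty.
 */
int
parse_locale_name(const char *name, struct locale_name *out)
{
    const char *p = name;
    size_t len;

    memset(out, 0, sizeof(*out));

    len = strcspn(p, "_.@");
    if (len == 0)
        return -1;
    copy_field(out->language, sizeof(out->language), p, len);
    p += len;

    if (*p == '_') {
        p++;
        len = strcspn(p, ".@");
        copy_field(out->territory, sizeof(out->territory), p, len);
        p += len;
    }
    if (*p == '.') {
        p++;
        len = strcspn(p, "@");
        copy_field(out->codeset, sizeof(out->codeset), p, len);
        p += len;
    }
    if (*p == '@') {
        p++;
        copy_field(out->modifier, sizeof(out->modifier), p, strlen(p));
    }
    return 0;
}
```

Given "da_DK.ISO8859-1", this sketch yields the language "da", the territory "DK", the codeset "ISO8859-1", and an empty modifier. It does not check the fields against ISO 639 or against the locales actually installed.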
 
The environment variable settings are queried by their priority level in the following manner:
 
If the LC_ALL environment variable is set, all six categories use the locale it specifies.
If the LC_ALL environment variable is not set, each individual category uses the locale specified by its corresponding environment variable.
If the LC_ALL environment variable is not set, and a value for a particular LC_* environment variable is not set, the value of the LANG environment variable specifies the default locale for all categories. Only the LANG environment variable should be set in /etc/profile, since this makes it easiest for the user to override the system default using the individual LC_* variables.
If the LC_ALL environment variable is not set, a value for a particular LC_* environment variable is not set, and the value of the LANG environment variable is not set, the locale for that specific category defaults to the C locale. The C or POSIX locale assumes the ASCII character set and defines information for the six categories.
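
The priority rules above can be sketched as a small lookup. This is for illustration only: setlocale(3) applies these rules itself when it is passed an empty locale name, so programs normally do not need such code. The names nonempty(), resolve_locale() and locale_from_environment() are hypothetical, and treating an empty value like an unset one is an assumption that follows the POSIX convention.

```c
#include <stdlib.h>

/* Treat an empty value like an unset one (assumed POSIX convention). */
static const char *
nonempty(const char *value)
{
    return (value != NULL && value[0] != '\0') ? value : NULL;
}

/*
 * Apply the priority rules to values that were already fetched:
 * LC_ALL first, then the category's own variable, then LANG, and
 * finally the "C" locale.
 */
const char *
resolve_locale(const char *lc_all, const char *lc_category, const char *lang)
{
    if (nonempty(lc_all) != NULL)
        return lc_all;
    if (nonempty(lc_category) != NULL)
        return lc_category;
    if (nonempty(lang) != NULL)
        return lang;
    return "C";
}

/* Convenience wrapper; category_var is a name such as "LC_TIME". */
const char *
locale_from_environment(const char *category_var)
{
    return resolve_locale(getenv("LC_ALL"), getenv(category_var),
        getenv("LANG"));
}
```

For instance, locale_from_environment("LC_TIME") returns the value of LC_ALL if that is set, otherwise the value of LC_TIME, otherwise the value of LANG, and otherwise "C".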
Character Sets
A character is any symbol used for the organization, control, or representation of data. A group of such symbols used to describe a particular language make up a character set. It is the encoding values in a character set that provide the interface between the system and its input and output devices.
 
The following character sets are supported in NetBSD:
ASCII
The American Standard Code for Information Interchange (ASCII) standard specifies 128 Roman characters and control codes, encoded in a 7-bit character encoding scheme.
ISO 8859 family
Industry-standard character sets specified by the ISO/IEC 8859 standard. The standard is divided into 15 numbered parts, each covering a group of languages with broadly similar scripts. Examples include Western European, Central European, Arabic, Cyrillic, Hebrew, Greek, and Turkish. The character sets use an 8-bit character encoding scheme which is compatible with the ASCII character set.
Unicode
The Unicode character set is the full set of known abstract characters of all real-world scripts. It can be used in environments where multiple scripts must be processed simultaneously. Unicode is compatible with ISO 8859-1 (Western European) and ASCII. Many character encoding schemes are available for Unicode, including UTF-8, UTF-16 and UTF-32. These are multi-byte encoding schemes. UTF-8 is a variable-width encoding built from 8-bit units and is compatible with ASCII. UTF-16 is a variable-width encoding built from 16-bit units. UTF-32 is a fixed-width encoding that uses 32 bits for every character.
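
The variable width of UTF-8 and its compatibility with ASCII can be seen in a minimal encoder. This is a sketch, not a library routine: utf8_encode() is a hypothetical name, and real programs would normally rely on the multibyte conversion functions of the C library in a UTF-8 locale.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8 into buf, which must have
 * room for at least 4 bytes.  Returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
    if (cp < 0x80) {
        /* ASCII range: one byte, identical to the ASCII code. */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;    /* surrogates are not valid scalar values */
    if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, U+0041 (A) occupies one byte, U+00E9 two bytes, U+20AC three bytes, and U+1F600 four bytes.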
Font Sets
A font set contains the glyphs to be displayed on the screen for a corresponding character in a character set. A display must support a suitable font to display a character set. If suitable fonts are available to the X server, then X clients can include support for different character sets. xterm(1) includes support for Unicode with UTF-8 encoding. xfd(1) is useful for displaying all the characters in an X font.
 
The NetBSD wscons(4) console provides support for loading fonts using the wsfontload(8) utility. Currently, only fonts for the ISO8859-1 family of character sets are supported.
Internationalization for Programmers
To facilitate translations of messages into various languages and to make the translated messages available to the program based on a user's locale, it is necessary to keep messages separate from the programs and provide them in the form of message catalogs that a program can access at run time.
 
Access to locale information is provided through the setlocale(3) and nl_langinfo(3) interfaces. See their respective man pages for further information.
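
As a rough illustration of these two interfaces, the following sketch selects a locale and prints a few of its properties. The helper name print_locale_info() is hypothetical; CODESET, RADIXCHAR and D_FMT are standard nl_langinfo(3) items. Each nl_langinfo(3) result is consumed before the next call, because the returned string may be overwritten by a later call.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/*
 * Select the locale called name for all categories and write a few
 * of its properties to fp.  Passing "" as name selects the locale
 * from the environment.  Returns 0 on success, or -1 if the locale
 * cannot be selected.
 */
int
print_locale_info(const char *name, FILE *fp)
{
    if (setlocale(LC_ALL, name) == NULL)
        return -1;

    /* Query the current LC_CTYPE locale without changing it. */
    fprintf(fp, "LC_CTYPE locale: %s\n", setlocale(LC_CTYPE, NULL));

    /* One nl_langinfo(3) call per statement; see the note above. */
    fprintf(fp, "codeset:         %s\n", nl_langinfo(CODESET));
    fprintf(fp, "radix character: %s\n", nl_langinfo(RADIXCHAR));
    fprintf(fp, "date format:     %s\n", nl_langinfo(D_FMT));
    return 0;
}
```

A program would typically call print_locale_info("", stdout) once at startup; the exact strings printed depend on the system and on the selected locale.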
 
Message source files containing application messages are created by the programmer and converted to message catalogs. These catalogs are used by the application to retrieve and display messages, as needed.
 
NetBSD supports two message catalog interfaces: the X/Open catgets(3) interface and the Uniforum gettext(3) interface. The catgets(3) interface has the advantage that it belongs to a standard which is well supported. Unfortunately, the interface is complicated to use and maintenance of the catalogs is difficult. The implementation also doesn't support different character sets. The gettext(3) interface has not been standardized yet; however, it is supported by an increasing number of systems. It also provides many additional tools which make programming and catalog maintenance much easier.
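
A minimal sketch of the catgets(3) interface follows. The helper lookup_message() and its parameters are hypothetical; catopen(3), catgets(3), catclose(3) and NL_CAT_LOCALE come from <nl_types.h>. The message is copied into a caller-supplied buffer because the string returned by catgets(3) should not be relied on after the catalog is closed.

```c
#include <nl_types.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Look up message msg_id of set set_id in the named catalog and copy
 * it into out.  If the catalog or the message cannot be found, the
 * fallback text is copied instead.  Returns 1 if the message came
 * from the catalog and 0 if the fallback was used.
 */
int
lookup_message(const char *catalog, int set_id, int msg_id,
    const char *fallback, char *out, size_t outlen)
{
    nl_catd catd;
    const char *msg;
    int found;

    /* NL_CAT_LOCALE makes the search honour LC_MESSAGES. */
    catd = catopen(catalog, NL_CAT_LOCALE);
    if (catd == (nl_catd)-1) {
        snprintf(out, outlen, "%s", fallback);
        return 0;
    }

    /* catgets(3) returns its last argument when the lookup fails. */
    msg = catgets(catd, set_id, msg_id, fallback);
    found = (msg != fallback);
    snprintf(out, outlen, "%s", msg);

    catclose(catd);
    return found;
}
```

Opening and closing the catalog on every lookup keeps the sketch short; a real program would more likely open the catalog once and keep it open for its whole lifetime.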
Support for Multi-byte Encodings
Some character sets with multi-byte encodings may be difficult to decode, or may contain state (i.e., adjacent characters are dependent). ISO C specifies a set of functions using 'wide characters' which can handle multi-byte encodings properly. The behaviour of these functions is affected by the LC_CTYPE category of the current locale.
 
A wide character is specified in ISO C as being a fixed number of bits wide and is stateless. There are two types for wide characters: wchar_t and wint_t. wchar_t is a type which can contain one wide character and behaves as the 'char' type does for one character. wint_t can contain one wide character or WEOF (wide EOF).
 
There are functions that operate on wchar_t, and substitute for functions operating on 'char'. See wmemchr(3) and towlower(3) for details. There are some additional functions that operate on wchar_t. See wctype(3) and wctrans(3) for details.
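
As a small illustration of the wctype(3) interface, the following sketch counts the characters of a wide string that belong to a named character class. The helper count_class() is a hypothetical name; wctype(3) and iswctype(3) are the ISO C functions from <wctype.h>, and their results depend on the LC_CTYPE category of the current locale.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Count the wide characters in ws that belong to the class named by
 * class_name (for example "alpha" or "digit").  Returns (size_t)-1
 * if the class name is not valid in the current locale.
 */
size_t
count_class(const wchar_t *ws, const char *class_name)
{
    wctype_t desc = wctype(class_name);
    size_t n = 0;

    if (desc == (wctype_t)0)
        return (size_t)-1;

    for (; *ws != L'\0'; ws++) {
        if (iswctype((wint_t)*ws, desc))
            n++;
    }
    return n;
}
```

In the C locale, count_class(L"ab12c", "alpha") counts three characters and count_class(L"ab12c", "digit") counts two.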
 
Wide characters should be used for all I/O processing which may rely on locale-specific strings. There are two primary ways of using wide characters for I/O:
All I/O is performed using multibyte characters. Input data is converted into wide characters immediately after reading, and data for output is converted from wide characters to multi-byte encoding immediately before writing. Conversion is performed by the mbstowcs(3), mbsrtowcs(3), wcstombs(3), wcsrtombs(3), mblen(3), mbrlen(3), and mbsinit(3) functions.
Wide characters are used directly for I/O, using getwchar(3), fgetwc(3), getwc(3), ungetwc(3), fgetws(3), putwchar(3), fputwc(3), putwc(3), and fputws(3). The formatted I/O functions for wide characters, such as fwscanf(3), wscanf(3), swscanf(3), fwprintf(3), wprintf(3), swprintf(3), vfwprintf(3), vwprintf(3), and vswprintf(3), can also be used, as can the wide-character conversion specifiers %lc, %C, %ls, and %S in the conventional formatted I/O functions.
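
The first approach, converting multibyte data to wide characters right after input and back again right before output, might look like the following sketch. The helper upcase_multibyte() and its fixed 256-element buffer are assumptions made for this example; mbstowcs(3), towupper(3) and wcstombs(3) are the standard functions, and the caller is expected to have selected a locale with setlocale(3) beforehand.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

#define WBUF_LEN 256    /* arbitrary limit chosen for this sketch */

/*
 * Upper-case the multibyte string in, storing the multibyte result in
 * out.  Returns 0 on success, or -1 if the input is not valid in the
 * current locale or one of the buffers is too small.
 */
int
upcase_multibyte(const char *in, char *out, size_t outlen)
{
    wchar_t wbuf[WBUF_LEN];
    size_t n, i;

    /* Multibyte to wide, immediately after the data is read. */
    n = mbstowcs(wbuf, in, WBUF_LEN);
    if (n == (size_t)-1 || n == WBUF_LEN)
        return -1;    /* invalid sequence, or no room for L'\0' */

    /* All processing happens on fixed-width wide characters. */
    for (i = 0; i < n; i++)
        wbuf[i] = (wchar_t)towupper((wint_t)wbuf[i]);

    /* Wide to multibyte, immediately before the data is written. */
    n = wcstombs(out, wbuf, outlen);
    if (n == (size_t)-1 || n == outlen)
        return -1;    /* unconvertible character, or no room for NUL */
    return 0;
}
```

Working on wchar_t in the middle step means the loop never has to know how many bytes each character occupies in the external encoding.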
SEE ALSO
BUGS
This man page is incomplete.