IDN basics

Please note: The IDNwiki contains a number of articles that are targeted to readers who are unfamiliar with the technical details of IDNs and the applications in which they are used. This material is intended to be readily translatable into all of the languages supported here. It has therefore deliberately been kept as free as possible of English-based abbreviations and acronyms that may lack equivalents in the other languages. All of the language facets are presented on an equal footing and links to documents available only in English have been avoided.

Additional material on any level of technical detail is no less welcome and, again, is anticipated in all of the IDNwiki languages. There is no expectation of all such articles being cross translated. Authors are free to illustrate them as appropriate with technical terminology and external references in any languages that serve the intended purpose.

Names and domain names

The word name is normally used to designate a far broader concept than the one meant by the term domain name. Although it is possible for a domain name to appear to be identical to a word in the dictionary of some language, or to a proper name, it is equally possible for a domain name deliberately to be an otherwise meaningless sequence of letters and digits. A domain name is only intended to be a readily memorable identifier in the Domain Name System (DNS). These identifiers can have word-like properties, but it does not follow from this that any word can be used in a domain name. This has not changed through the vast extension of the character repertoire that can be displayed in IDNs. Although there is a large degree of new latitude, it has not become possible for any word in any language to appear in a domain name, nor is every character that is used in every writing system available.

There are many reasons for this, some of which are completely external to the DNS and the IDN protocol, but the overriding concern is Internet security. As the issues that need further clarification become apparent in the general discussion on the IDNwiki, detailed articles on pivotal topics will be added.

Language and script

A written language is expressed using a collection of symbols called a script. A single script can be used for many languages, and may be designated by the name of one of them — the Arabic script is used for writing the Arabic language (and many others). A script may also have a name of its own — the Cyrillic script is used for writing the Russian language (among many others), but there is no language called Cyrillic.

The Unicode Character Code Chart, from which the IDN character repertoire has been selected, is based entirely on script. Although it includes some annotations about language-specific detail in the way these scripts are used, these do not provide a basis for recording and conveying the language identity of an IDN. This is a further source of potential confusion about what can be expected of the new facility. As noted above, as with all other domain names, IDNs are sequences of characters and digits that may be used as mnemonics, and may appear to be words, but have no intrinsic language identity and can therefore not be measured against the contents of a dictionary.
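The script-based organization described above can be illustrated with a rough sketch. It assumes Python's standard unicodedata module and uses a hypothetical helper, script_hint, that simply takes the first word of a character's Unicode name. This is only a heuristic: it gives a script name for ordinary letters, but not for digits or punctuation, and it says nothing at all about language.

```python
import unicodedata


def script_hint(ch: str) -> str:
    """Return the first word of a character's Unicode name.

    Hypothetical helper for illustration only. For ordinary letters the
    first word of the name is the script (e.g. CYRILLIC), but this is a
    naming convention, not a formal script lookup.
    """
    return unicodedata.name(ch).split()[0]


# A Cyrillic letter, an Arabic letter, and a basic Latin letter.
for ch in "пعa":
    print(ch, script_hint(ch))
```

The Cyrillic letter is reported as Cyrillic whether it appears in a Russian, Bulgarian, or Serbian word, which is why an IDN carries a script identity but no language identity.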

The example.test names are intended to test the response of the networking environment to the appearance of a number of scripts, selected to cover the full range of detail that requires evaluation. The languages in which they are instantiated were selected in extension of this. The largest language written with a given script is not necessarily the one used here, if a smaller one illustrates a greater degree of technical complexity.

It is important to note that the evaluation domains will be terminated at the end of the evaluation period and that no registration will be possible in them while they are in operation. Any IDN TLDs that subsequently appear in ordinary production will be based on the scripts chosen by their prospective operators. The scripts used for the evaluation will have absolutely no effect on the latter selection, unless some technical frailty with one or more of the test scripts is revealed during the evaluation. It is for this reason that the test environment has deliberately been made as complex as possible.

The two forms of an IDN label

A label in a conventional domain name only contains letters from the basic Latin alphabet (a-z), digits (0-9), and the hyphen (-). Although the DNS supports several additional characters, top-level domain (TLD) registries normally enforce the letter-digit-hyphen (LDH) restriction. An "LDH-label" is stored in the DNS in precisely the same form as it is displayed to users; a domain name that appears on a computer screen as example.test will also be registered and stored as example.test.
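As a minimal sketch, the LDH restriction can be expressed as a simple pattern check. The function name is_ldh_label is hypothetical. The limit of 63 characters per label and the ban on a leading or trailing hyphen are not stated in this article; they are assumed here from the conventional host name rules.

```python
import re

# Letters a-z (either case), digits 0-9, and the hyphen.
# Assumed from conventional host name rules, not from this article:
# 1 to 63 characters, and no hyphen at the start or end of the label.
_LDH_PATTERN = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_label(label: str) -> bool:
    """Return True if a single label satisfies the letter-digit-hyphen rule."""
    return _LDH_PATTERN.fullmatch(label) is not None


print(is_ldh_label("example"))       # conventional label
print(is_ldh_label("xn--e1afmkfd"))  # stored form of an IDN label
print(is_ldh_label("пример"))        # displayed form of an IDN label
```

Note that the stored form of an IDN label passes this check while its displayed form does not, which is the distinction the next section describes.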

An IDN label may contain a far larger range of additional characters, but this is done through special encoding that separates the displayed and stored forms of a label. For example, a Cyrillic representation of example.test, displayed as пример.испытание, is stored as xn--e1afmkfd.xn--80akhbyknj4f. The displayed form is termed a "U-label" (from "Unicode") and the stored form is an "A-label" (from the "ASCII Compatible Encoding" scheme on which IDNs are based).
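The conversion from displayed to stored form can be tried with a short sketch. It assumes Python's built-in "idna" codec, which implements an older revision of the IDN protocol and may treat some characters differently from current registry rules; the function name to_stored_form is hypothetical.

```python
def to_stored_form(name: str) -> str:
    """Convert each label of a domain name from U-label to A-label form.

    Uses Python's built-in "idna" codec (an assumption of this sketch).
    Labels that are already plain LDH labels pass through unchanged.
    """
    return name.encode("idna").decode("ascii")


# The Cyrillic example from the text.
print(to_stored_form("пример.испытание"))
# A conventional name has identical displayed and stored forms.
print(to_stored_form("example.test"))
```

Applied to the Cyrillic example above, this produces the stored form quoted in the text, xn--e1afmkfd.xn--80akhbyknj4f.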

IDN-aware software

In the context under consideration here, the DNS itself deals only with A-labels (which conform fully to the LDH rule and thus require no modification to the underlying platform), and the conversion to and from U-labels is performed by applications software. If an application is "IDN aware" it will do this transparently to the user. If not, it is likely that A-labels will be displayed exclusively. Although this should not interfere with the underlying functionality, the appearance of cryptic labels instead of the expected legible ones has clear potential for causing both disappointment and confusion. Some IDN-aware applications also deliberately display A-labels to call attention to situations that the software developer feels require some form of warning. There is no uniform approach to this deliberate display of A-labels and details about specific implementations can be found through the sidebar link labeled "Software".
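What an IDN-aware application does before showing a name to the user can be sketched roughly as follows. The sketch again assumes Python's built-in "idna" codec, and the function name to_display_form is hypothetical. Falling back to the unconverted name when conversion fails is a design choice of this sketch, not a rule taken from any particular application.

```python
def to_display_form(name: str) -> str:
    """Convert A-labels in a domain name to U-labels for display.

    If the name cannot be converted, it is returned unchanged, so the
    user sees the stored form rather than an error. This fallback is an
    assumption of the sketch.
    """
    try:
        return name.encode("ascii").decode("idna")
    except UnicodeError:
        return name


print(to_display_form("xn--e1afmkfd.xn--80akhbyknj4f"))
print(to_display_form("example.test"))
```

An application that is not IDN-aware skips this step entirely, which is why it shows the A-label form.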

Communication issues

There has been extensive discussion about whether the A-label or the U-label is the primary registered object. It is true that only the former actually appears in the DNS, but it is equally true that a name holder's intention is based on the latter, as are user expectations. It therefore seems entirely reasonable to treat both labels as fully equivalent in all administrative regards. In actual practice, however, there are contexts in which one form may be used in preference to the other, and a name holder should be prepared to indicate either, as appropriate to the immediate purpose of that communication.

There are three closely related situations where this is likely to be of particular concern. The first is when registering an IDN through an interface that does not support the script in which the U-label is written, or places some other limitation on conversion to A-label format. The second is when needing to enter an IDN via a keyboard on which the U-label script does not appear, for example, in an Internet café in a region with local keyboards that do not support it. The third is when communicating an IDN to someone who will have difficulty in keyboarding the U-label in their own working environment. None of these problems affects the utility of IDNs in a click path, but all suggest a need for the holder of an IDN to be able to indicate it in A-label form for fallback purposes.

Indicating IDNs in HTML documents

One of the most common applications of IDNs is in the identification of Web pages, where they frequently appear as "clickable links". The full array of available characters may be used in the display of a link, but its "anchor" — which determines the destination to which the user will be taken — can only be validly represented in A-label form in an HTML document. Although this may cause the name to be displayed in an unexpected manner in the address and status lines of a browser, the format has the clear advantage of not requiring the browser to be IDN-aware in order to retrieve the intended resource:

  • <a href="http://xn--hxajbheg2az3al.xn--jxalpdlp">http://παράδειγμα.δοκιμή</a>
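A link of this kind can be generated with a minimal sketch. It assumes Python's built-in "idna" codec and the standard html module; the function name idn_anchor and its scheme parameter are hypothetical.

```python
from html import escape


def idn_anchor(u_name: str, scheme: str = "http") -> str:
    """Build an HTML link with the A-label form in href and the U-label form as text."""
    # Stored (A-label) form for the destination of the link.
    a_name = u_name.encode("idna").decode("ascii")
    href = f"{scheme}://{a_name}"
    # Displayed (U-label) form for the visible link text.
    text = f"{scheme}://{u_name}"
    return f'<a href="{escape(href, quote=True)}">{escape(text)}</a>'


print(idn_anchor("παράδειγμα.δοκιμή"))
```

For the Greek name used above, this yields the same markup as the example in the text.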