Talk:IDN basics
From IDNwiki
Indicating IDNs in HTML documents
In the IDN basics article you wrote:
- One of the most common application of IDNs is the identification of Web pages. These identifiers frequently appear embedded in other Web pages, where they can be indicated in two ways. The first requires a fully IDN-aware browser and places U-labels in both the anchor and display portions of the URL, for example as,
-
- <a href="http://παράδειγμα.δοκιμή">http://παράδειγμα.δοκιμή</a>
- This is the form used in the second column of the table on the main page of the IDNwiki.
That's not correct: the construct shown is not a URL as specified in RFC 3986 (also known as STD 66). It's an "IRI" (Internationalized Resource Identifier), as specified in RFC 3987.
Such "unencoded" IRIs (in UTF-8 or a legacy charset) cannot be used on ordinary HTML pages, because the well-known HTML document types allow only URIs. Likewise, they can't be used on valid XHTML 1.0 pages.
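To make the IRI versus URI distinction concrete, here is a minimal Python sketch of mapping an IRI to an ASCII-only URI. It is an illustration under stated assumptions, not the full RFC 3987 procedure: the function name `iri_to_uri` is hypothetical, the host is converted with Python's built-in `idna` codec (which implements the original IDNA RFCs, not IDNAbis), userinfo is ignored, and non-ASCII characters elsewhere are simply percent-encoded as UTF-8.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched outside the host: RFC 3986 punctuation plus "%",
# so that percent-escapes already present are not encoded a second time.
_SAFE = "/?:@!$&'()*+,;=~-._%"


def iri_to_uri(iri: str) -> str:
    """Return an ASCII-only URI for the given IRI (simplified sketch)."""
    parts = urlsplit(iri)

    # Host: turn each U-label into its A-label ("xn--..." form).
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    # Path, query and fragment: percent-encode the UTF-8 bytes of
    # every non-ASCII character.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)

    return urlunsplit((parts.scheme, host, path, query, fragment))


# The href value would carry the converted form, while the visible link
# text can keep the U-labels.
href = iri_to_uri("http://παράδειγμα.δοκιμή")
```

An identifier that is already a plain URI, such as http://el.idn.icann.org, passes through this sketch unchanged.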
For the third and last example you noted:
- It is also possible for the anchor to consist entirely of LDH labels:
-
- <a href="http://el.idn.icann.org">http://ελληνικά.idn.icann.org</a>
While that's true, I think it muddies the water here: the anchor of the second example, which uses A-labels, also consists entirely of LDH labels. As explained in "the two forms of an IDN label", an A-label is an LDH label. How about adding a reference to RFC 3696 somewhere? --Frank 10:15, 24 October 2007 (UTC)
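The purely syntactic point (an A-label is made up of letters, digits and hyphens only) can be checked with a short Python sketch. The helper names `is_ldh_label` and `looks_like_a_label` are hypothetical, the check covers syntax only (the Punycode part of an A-label is not validated), and the A-label is produced with Python's built-in `idna` codec.

```python
import re

# LDH syntax: ASCII letters, digits and hyphen, 1 to 63 characters,
# with no hyphen at the start or at the end.
_LDH = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_label(label: str) -> bool:
    """True if the label uses only the letter-digit-hyphen repertoire."""
    return _LDH.fullmatch(label) is not None


def looks_like_a_label(label: str) -> bool:
    """True for an LDH label carrying the ACE prefix 'xn--' (syntax only)."""
    return is_ldh_label(label) and label.lower().startswith("xn--")


u_label = "ελληνικά"
a_label = u_label.encode("idna").decode("ascii")

# The U-label is outside the LDH repertoire, while its A-label and the
# plain label "el" are both inside it.
checks = (is_ldh_label(u_label), is_ldh_label(a_label), is_ldh_label("el"))
```

Under these assumptions, both the A-label form and the plain "el" form pass the same LDH test, which is why "consists entirely of LDH labels" does not distinguish the second example from the third.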
- My inclination is to render the article correct by simplifying the terminology, in preference to increasing its technical rigor. There are additional RFCs that would otherwise be worth citing where, for example, you will note that an "A-label" is not to be seen as a type of LDH-label. I'd be delighted to see a parallel discussion of the underlying concepts in real hardcore terms, but think it would be best kept separate from the introductory presentation. If you'd care to suggest headings for one or several such articles, I will set them up for collective editing. --Cary 10:34, 24 October 2007 (UTC)
IMO you don't need hardcore explanations here; the relevant IDNA and IRI RFCs, like RFC 3987, already offer this. Actually, I proposed referencing RFC 3696 for an explanation of "LDH" because it's not "hardcore". Quite the contrary: it's an informational RFC, perfectly readable for any user who can read English texts. It's okay if you don't like RFC 3696; my main point is that your URL terminology isn't simplified, it's erroneous and misleading. At some point in time (X)HTML validators will start to flag "unencoded" IRIs as invalid URIs.
For one (X)HTML validator that point in time was yesterday; it now says "invalid". I recall the day when the W3C validator started to identify one of my pages as "invalid": I was seriously annoyed, and it took me months to figure out what was wrong.
With respect to correct terminology in "non-hardcore" articles, check out Richard's IDN and IRI article; I think it's readable. --Frank 11:07, 24 October 2007 (UTC)
- The URIs in the tables on both the main page and the e-mail test page have now been stripped of all unpermitted characters. --Cary 10:00, 22 November 2007 (UTC)
IDNAbis
In the IDN status report somebody wrote about the IDNAbis drafts:
- Documents such as these (known in the IETF as "Internet Drafts") are frequently updated. Updates result in changes of the number at the end of the name. The ICANN IDN pages will be kept up to date with links to the current versions as they evolve.
The author could instead link to the generic URIs that always point to the latest drafts, and add the new IDNAbis protocol draft. --Frank 16:51, 3 November 2007 (UTC)
Toplabel
After a quick scan of some RFCs and drafts mentioned on the new IDNAbis page, I think that so far no RFC has bothered to update the following statement in RFC 1123 (labelled as discussion in section 2.1):
- However, a valid host name can never have the dotted-decimal form #.#.#.#, since at least the highest-level component label will be alphabetic.
Another name for the highest-level component label is <toplabel>, and several RFCs have tried to specify a syntax for it.
- RFC 2396 tried LDH starting with <alpha>. It was obsoleted by RFC 3986, but various document types still reference RFC 2396.
- RFC 4408 tried LDH based on RFC 3696, i.e. "not all-numeric".
- RFC.usefor (waiting for its number) tried LDH based on RFC 3696 adding "not a single ALPHA".
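The three variants listed above can be compared side by side in a rough Python sketch. The function names are hypothetical, and each check is a simplified reading of the rule as summarized in this list, not a transcription of the ABNF in the respective document.

```python
import re

# LDH label: letters, digits, hyphen; no leading or trailing hyphen; max 63.
_LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")
# RFC 2396 style: LDH, but the first character must be a letter.
_ALPHA_FIRST = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def toplabel_alpha_first(label: str) -> bool:
    """RFC 2396 reading: LDH starting with <alpha>."""
    return _ALPHA_FIRST.fullmatch(label) is not None


def toplabel_not_all_numeric(label: str) -> bool:
    """RFC 3696 / RFC 4408 reading: LDH, but not all-numeric."""
    return _LDH.fullmatch(label) is not None and not label.isdigit()


def toplabel_not_single_alpha(label: str) -> bool:
    """usefor reading: as above, and additionally not a single ALPHA."""
    return toplabel_not_all_numeric(label) and not (
        len(label) == 1 and label.isalpha()
    )


def top_label(host: str) -> str:
    """Return the highest-level component label of a dotted host name."""
    return host.rstrip(".").rsplit(".", 1)[-1]


# A dotted-decimal string such as 1.2.3.4 has the all-numeric top label "4",
# which all three variants reject; that is the RFC 1123 observation.
dotted_decimal_ok = toplabel_not_all_numeric(top_label("1.2.3.4"))
```

The variants differ only at the edges: a label such as "1a" is rejected by the first check but accepted by the other two, and a single letter is accepted by the first two but rejected by the third.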
A quick fix could be to submit an erratum for RFC 1123, quoting RFC 3696 as evidence:
- There is an additional rule that essentially requires that top-level domain names not be all-numeric.
--Frank 16:51, 3 November 2007 (UTC)
Update: I've now submitted this issue as an erratum. --Frank 10:14, 20 November 2007 (UTC)
