doc/rfc/rfc4690.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
1873
1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
1890
1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
2059
2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
2073
2074
2075

Network Working Group                                         J. Klensin
Request for Comments: 4690                                  P. Faltstrom
Category: Informational                                    Cisco Systems
                                                                 C. Karp
                                       Swedish Museum of Natural History
                                                                     IAB
                                                          September 2006


  Review and Recommendations for Internationalized Domain Names (IDNs)

Status of This Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This note describes issues raised by the deployment and use of
   Internationalized Domain Names.  It describes problems both at the
   time of registration and for use of those names in the DNS.  It
   recommends that IETF should update the RFCs relating to IDNs and a
   framework to be followed in doing so, as well as summarizing and
   identifying some work that is required outside the IETF.  In
   particular, it proposes that some changes be investigated for the
   Internationalizing Domain Names in Applications (IDNA) standard and
   its supporting tables, based on experience gained since those
   standards were completed.

Table of Contents

   1. Introduction ....................................................3
      1.1. The Role of IDNs and This Document .........................3
      1.2. Status of This Document and Its Recommendations ............4
      1.3. The IDNA Standard ..........................................4
      1.4. Unicode Documents ..........................................5
      1.5. Definitions ................................................5
           1.5.1. Language ............................................6
           1.5.2. Script ..............................................6
           1.5.3. Multilingual ........................................6
           1.5.4. Localization ........................................7
           1.5.5. Internationalization ................................7


Klensin, et al.              Informational                      [Page 1]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


      1.6. Statements and Guidelines ..................................7
           1.6.1. IESG Statement ......................................8
           1.6.2. ICANN Statements ....................................8
   2. General Problems and Issues ....................................11
      2.1. User Conceptions, Local Character Sets, and Input issues ..11
      2.2. Examples of Issues ........................................13
           2.2.1. Language-Specific Character Matching ...............13
           2.2.2. Multiple Scripts ...................................13
           2.2.3. Normalization and Character Mappings ...............14
           2.2.4. URLs in Printed Form ...............................16
           2.2.5. Bidirectional Text .................................17
           2.2.6. Confusable Character Issues ........................17
           2.2.7. The IESG Statement and IDNA issues .................19
   3. Migrating to New Versions of Unicode ...........................20
      3.1. Versions of Unicode .......................................20
      3.2. Version Changes and Normalization Issues ..................21
           3.2.1. Unnormalized Combining Sequences ...................21
           3.2.2. Combining Characters and Character Components ......22
           3.2.3. When does normalization occur? .....................23
   4. Framework for Next Steps in IDN Development ....................24
      4.1. Issues within the Scope of the IETF .......................24
           4.1.1. Review of IDNA .....................................24
           4.1.2. Non-DNS and Above-DNS Internationalization
                  Approaches .........................................25
           4.1.3. Security Issues, Certificates, etc. ................25
           4.1.4. Protocol Changes and Policy Implications ...........27
           4.1.5. Non-US-ASCII in Local Part of Email Addresses ......27
           4.1.6. Use of the Unicode Character Set in the IETF .......27
      4.2. Issues That Fall within the Purview of ICANN ..............28
           4.2.1. Dispute Resolution .................................28
           4.2.2. Policy at Registries ...............................28
           4.2.3. IDNs at the Top Level of the DNS ...................29
   5. Specific Recommendations for Next Steps ........................29
      5.1. Reduction of Permitted Character List .....................29
           5.1.1. Elimination of All Non-Language Characters .........30
           5.1.2. Elimination of Word-Separation Punctuation .........30
      5.2. Updating to New Versions of Unicode .......................30
      5.3. Role and Uses of the DNS ..................................31
      5.4. Databases of Registered Names .............................31
   6. Security Considerations ........................................31
   7. Acknowledgements ...............................................32
   8. References .....................................................32
      8.1. Normative References ......................................32
      8.2. Informative References ....................................33


Klensin, et al.              Informational                      [Page 2]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


1.  Introduction

1.1.  The Role of IDNs and This Document

   While IDNs have been advocated as the solution to a wide range of
   problems, this document is written from the perspective that they are
   no more and no less than DNS names, reflecting the same requirements
   for use, stability, and accuracy as traditional "hostnames", but
   using a much larger collection of permitted characters.  In
   particular, while IDNs represent a step toward an Internet that is
   equally accessible from all languages and scripts, they, at best,
   address only a small part of that very broad objective.  There has
   been controversy since IDNs were first suggested about how important
   they will actually turn out to be; that controversy will probably
   continue.  Accessibility from all languages is an important
   objective, hence it is important that our standards and definitions
   for IDNs be smoothly adaptable to additional scripts as they are
   added to the Unicode character set.

   The utility of IDNs must be evaluated in terms of their application
   by users and in protocols: the ability to simply put a name into the
   DNS and retrieve it is not, in and of itself, important.  From this
   point of view, IDNs will be useful and effective if they provide
   stable and predictable references -- references that are no less
   stable and predictable, and no less secure, than their ASCII
   counterparts.

   This combination of objectives and criteria has proven very difficult
   to satisfy.  Experience in developing the IDNA standard and during
   the initial years of its implementation and deployment suggests that
   it may be impossible to fully satisfy all of them and that
   engineering compromises are needed to yield a result that is
   workable, even if not completely satisfactory.  Based on that
   experience and issues that have been raised, it is now appropriate to
   review some of the implications of IDNs, the decisions made in
   defining them, and the foundation on which they rest and determine
   whether changes are needed and, if so, which ones.

   The design of the DNS itself imposes some additional constraints.  If
   the DNS is to remain globally interoperable, there are specific
   characteristics that no implementation of IDNs, or the DNS more
   generally, can change.  For example, because the DNS is a global
   hierarchal administrative namespace with only a single name at any
   given node, there is one and only one owner of each domain name.
   Also, when strings are looked up in the DNS, positive responses can
   only reflect exact matches: if there is no exact match, then one gets
   an error reply, not a list of near matches or other supplemental
   information.  Searches and approximate matchings are not possible.


Klensin, et al.              Informational                      [Page 3]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   Finally, because the DNS is a distributed system where any server
   might cache responses, and later use those cached responses to
   attempt to satisfy queries before a global lookup is done, every
   server must use the same matching criteria.

1.2.  Status of This Document and Its Recommendations

   This document reviews the IDN landscape from an IETF perspective and
   presents the recommendations and conclusions of the IAB, based
   partially on input from an ad hoc committee charged with reviewing
   IDN issues and the path forward (see Section 7).  Its recommendations
   are advice to the IETF, or in a few cases to other bodies, for topics
   to be investigated and actions to be taken if those bodies, after
   their examinations, consider those actions appropriate.

1.3.  The IDNA Standard

   During 2002, the IETF completed the following RFCs that, together,
   define IDNs:

   RFC 3454  Preparation of Internationalized Strings ("Stringprep")
      [RFC3454].
      Stringprep is a generic mechanism for taking a Unicode string and
      converting it into a canonical format.  Stringprep itself is just
      a collection of rules, tables, and operations.  Any protocol or
      algorithm that uses it must define a "Stringprep profile", which
      specifies which of those rules are applied, how, and with which
      characteristics.

   RFC 3490  Internationalizing Domain Names in Applications (IDNA)
      [RFC3490].
      IDNA is the base specification in this group.  It specifies that
      Nameprep is used as the Stringprep profile for domain names, and
      that Punycode is the relevant encoding mechanism for use in
      generating an ASCII-compatible ("ACE") form of the name.  It also
      applies some additional conversions and character filtering that
      are not part of Nameprep.

   RFC 3491  Nameprep: A Stringprep Profile for Internationalized Domain
      Names (IDN) [RFC3491].
      Nameprep is designed to meet the specific needs of IDNs and, in
      particular, to support case-folding for scripts that support what
      are traditionally known as upper- and lowercase forms of the same
      letters.  The result of the Nameprep algorithm is a string
      containing a subset of the Unicode Character set, normalized and
      case-folded so that case-insensitive comparison can be made.


Klensin, et al.              Informational                      [Page 4]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   RFC 3492  Punycode: A Bootstring encoding of Unicode for
      Internationalized Domain Names in Applications (IDNA) [RFC3492].
      Punycode is a mechanism for encoding a Unicode string in ASCII
      characters.  The characters used are the same the subset of
      characters that are allowed in the hostname definition of DNS,
      i.e., the "letter, digit, and hyphen" characters, sometimes known
      as "LDH".

1.4.  Unicode Documents

   Unicode is used as the base, and defining, character set for IDNs.
   Unicode is standardized by the Unicode Consortium, and synchronized
   with ISO to create ISO/IEC 10646 [ISO10646].  At the time the RFCs
   mentioned earlier were created, Unicode was at Version 3.2.  For
   reasons explained later, it was necessary to pick a particular,
   then-current, version of Unicode when IDNA was adopted.
   Consequently, the RFCs are explicitly dependent on Unicode Version
   3.2 [Unicode32].  There is, at present, no established mechanism for
   modifying the IDNA RFCs to use newer Unicode versions (see
   Section 3.1).

   Unicode is a very large and complex character set.  (The term
   "character set" or "charset" is used in a way that is peculiar to the
   IETF and may not be the same as the usage in other bodies and
   contexts.)  The Unicode Standard and related documents are created
   and maintained by the Unicode Technical Committee (UTC), one of the
   committees of the Unicode Consortium.

   The Consortium first published The Unicode Standard [Unicode10] in
   1991, and continues to develop standards based on that original work.
   Unicode is developed in conjunction with the International
   Organization for Standardization, and it shares its character
   repertoire with ISO/IEC 10646.  Unicode and ISO/IEC 10646 function
   equivalently as character encodings, but The Unicode Standard
   contains much more information for implementers, covering -- in depth
   -- topics such as bitwise encoding, collation, and rendering.  The
   Unicode Standard enumerates a multitude of character properties,
   including those needed for supporting bidirectional text.  The
   Unicode Consortium and ISO standards do use slightly different
   terminology.

1.5.  Definitions

   The following terms and their meanings are critical to understanding
   the rest of this document and to discussions of IDNs more generally.
   These terms are derived from [RFC3536], which contains additional
   discussion of some of them.


Klensin, et al.              Informational                      [Page 5]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


1.5.1.  Language

   A language is a way that humans interact.  The use of language occurs
   in many forms, including speech, writing, and signing.

   Some languages have a close relationship between the written and
   spoken forms, while others have a looser relationship.  RFC 3066
   [RFC3066] discusses languages in more detail and provides identifiers
   for languages for use in Internet protocols.  Computer languages are
   explicitly excluded from this definition.  The most recent IETF work
   in this area, and on script identification (see below), is documented
   in [RFC4645] and [RFC4646].

1.5.2.  Script

   A script is a set of graphic characters used for the written form of
   one or more languages.  This definition is the one used in
   [ISO10646].

   Examples of scripts are Arabic, Cyrillic, Greek, Han (the so-called
   ideographs used in writing Chinese, Japanese, and Korean), and
   "Latin".  Arabic, Greek, and Latin are, of course, also names of
   languages.

   Historically, the script that is known as "Latin" in Unicode and most
   contexts associated with information technology standards is known in
   the linguistic community as "Roman" or "Roman-derived".  The latter
   terminology distinguishes between the Latin language and the
   characters used to write it, especially in Republican times, from the
   much richer and more decorated script derived and adapted from those
   characters.  Since IDNA is defined using Unicode and that standard
   used the term "LATIN" in its character names and descriptions, that
   terminology will be used in this document as well except when
   "Roman-derived" is needed for clarity.  However, readers approaching
   this document from a cultural or linguistic standpoint should be
   aware that the use of, or references to, "Latin script" in this
   document refers to the entire collection of Roman-derived characters,
   not just the characters used to write the Latin language.  Some other
   issues with script identification and relationships with other
   standards are discussed in [RFC4646].

1.5.3.  Multilingual

   The term "multilingual" has many widely-varying definitions and thus
   is not recommended for use in standards.  Some of the definitions
   relate to the ability to handle international characters; other
   definitions relate to the ability to handle multiple charsets; and
   still others relate to the ability to handle multiple languages.


Klensin, et al.              Informational                      [Page 6]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   While this term has been deprecated for IETF-related uses and does
   not otherwise appear in this document, a discussion here seemed
   appropriate since the term is still widely used in some discussions
   of IDNs.

1.5.4.  Localization

   Localization is the process of adapting an internationalized
   application platform or application to a specific cultural
   environment.  In localization, the same semantics are preserved while
   the syntax or presentation forms may be changed.

   Localization is the act of tailoring an application for a different
   language or script or culture.  Some internationalized applications
   can handle a wide variety of languages.  Typical users understand
   only a small number of languages, so the program must be tailored to
   interact with users in just the languages they know.

   Somewhat different definitions for localization and
   internationalization (see below) are used by groups other than the
   IETF.  See [W3C-Localization] for one example.

1.5.5.  Internationalization

   In the IETF, the term "internationalization" is used to describe
   adding or improving the handling of non-ASCII text in a protocol.
   Other bodies use the term in other ways, often with subtle variation
   in meaning.  The term "internationalization" is often abbreviated
   "i18n" (and localization as "l10n").

   Many protocols that handle text only handle the characters associated
   with one script (often, a subset of the characters used in writing
   English text), or leave the question of what character set is used up
   to local guesswork (which leads to interoperability problems).
   Adding non-ASCII text to such a protocol allows the protocol to
   handle more scripts, with the intention of being able to include all
   of the scripts that are useful in the world.  It is naive (sic) to
   believe that all English words can be written in ASCII, various
   mythologies notwithstanding.

1.6.  Statements and Guidelines

   When the IDNA RFCs were published, the IESG and ICANN made statements
   that were intended to guide deployment and future work.  In recent
   months, ICANN has updated its statement and others have also made
   contributions.  It is worth noting that the quality of understanding
   of internationalization issues as applied to the DNS has evolved


Klensin, et al.              Informational                      [Page 7]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   considerably over the last few years.  Organizations that took
   specific positions a year or more ago might not make exactly the same
   statements today.

1.6.1.  IESG Statement

   The IESG made a statement on IDNA [IESG-IDN]:

      IDNA, through its requirement of Nameprep [RFC3491], uses
      equivalence tables that are based only on the characters
      themselves; no attention is paid to the intended language (if any)
      for the domain name.  However, for many domain names, the intended
      language of one or more parts of the domain name actually does
      matter to the users.

      Similarly, many names cannot be presented and used without
      ambiguity unless the scripts to which their characters belong are
      known.  In both cases, this additional information should be of
      concern to the registry.

   The statement is longer than this, but these paragraphs are the
   important ones.  The rest of the statement consists of explanations
   and examples.

1.6.2.  ICANN Statements

1.6.2.1.  Initial ICANN Guidelines

   Soon after the IDNA standards were adopted, ICANN produced an initial
   version of its "IDN Guidelines" [ICANNv1].  This document was
   intended to serve two purposes.  The first was to provide a basis for
   releasing the Generic Top Level Domain (gTLD) registries that had
   been established by ICANN from a contractual restriction on the
   registration of labels containing hyphens in the third and fourth
   positions.  The second was to provide a general framework for the
   development of registry policies for the implementation of IDNs.

   One of the key components of this framework prescribed strict
   compliance with RFCs 3490, 3491, and 3492.  With the framework, ICANN
   specified that IDNA was to be the sole mechanism to be used in the
   DNS to represent IDNs.

   Limitations on the characters available for inclusion in IDNs were
   mandated by two mechanisms.  The first was by requiring an
   "inclusion-based approach (meaning that code points that are not
   explicitly permitted by the registry are prohibited) for identifying
   permissible


Klensin, et al.              Informational                      [Page 8]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   code points from among the full Unicode repertoire."  The second
   mechanism required the association of every IDN with a specific
   language, with additional policies also being language based:

   "In implementing the IDN standards, top-level domain registries will
   (a) associate each registered internationalized domain name with one
   language or set of languages,
   (b) employ language-specific registration and administration rules
   that are documented and publicly available, such as the reservation
   of all domain names with equivalent character variants in the
   languages associated with the registered domain name, and,
   (c) where the registry finds that the registration and administration
   rules for a given language would benefit from a character variants
   table, allow registrations in that language only when an appropriate
   table is available. ...  In implementing the IDN standards, top-level
   domain registries should, at least initially, limit any given domain
   label (such as a second-level domain name) to the characters
   associated with one language or set of languages only."

   It was left to each TLD registry to define the character repertoire
   it would associate with any given language.  This led to significant
   variation from registry to registry, with further heterogeneity in
   the underlying language-based IDN policies.  If the guidelines had
   made provision for IDN policies also being based on script, a
   substantial amount of the resulting ambiguity could have been
   avoided.  However, they did not, and the sequence of events leading
   to the present review of IDNA was thus triggered.

1.6.2.2.  ICANN Version 2 Guidelines

   One of the responses of the TLD registries to what was widely
   perceived as a crisis situation was to invoke the mechanism described
   in the initial guidelines: "As the deployment of IDNs proceeds, ICANN
   and the IDN registries will review these Guidelines at regular
   intervals, and revise them as necessary based on experience."

   The pivotal requirement was the modification of the guidelines to
   permit script-based policies for IDNs.  Further concern was expressed
   about the need for realistically implementable mechanisms for the
   propagation of TLD registry policies into the lower levels of their
   name trees.  In addition to the anticipated increase of constraint on
   the protocol level, one obvious additional approach would be to
   replace the guidelines by an instrument that itself had clear status
   in the IETF's normative framework.  A BCP was therefore seen as the
   appropriate focus for longer-term effort.  The most pressing issues
   would be dealt with in the interim by incremental modification to the
   guidelines, but no need was seen for the detailed further development
   of those guidelines once that incremental modification was complete.


Klensin, et al.              Informational                      [Page 9]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   The outcome of this action was a version 2.0 of the guidelines
   [ICANNv2], which was endorsed by the ICANN Board on November 8, 2005
   for a period of nine months.  The Board stated further that it "tasks
   the IDN working group to continue its important work and return to
   the board with specific IDN improvement recommendations before the
   ICANN Meeting in Morocco" and "supports the working group's continued
   action to reframe the guidelines completely in a manner appropriate
   for further development as a Best Current Practices (BCP) document,
   to ensure that the Guideline directions will be used deeper into the
   DNS hierarchy and within TLD's where ICANN has a lesser policy
   relationship."

   Retaining the inclusion-based approach established in version 1.0,
   the crucial addition to the policy framework is that:

   "All code points in a single label will be taken from the same script
   as determined by the Unicode Standard Annex #24: Script Names at
   http://www.unicode.org/reports/tr24.  Exception to this is
   permissible for languages with established orthographies and
   conventions that require the commingled use of multiple scripts.  In
   such cases, visually confusable characters from different scripts
   will not be allowed to coexist in a single set of permissible
   codepoints unless a corresponding policy and character table is
   clearly defined."

   Additionally:

   "Permissible code points will not include: (a) line symbol-drawing
   characters (as those in the Unicode Box Drawing block), (b) symbols
   and icons that are neither alphanumeric nor ideographic language
   characters, such as typographic and pictographic dingbats, (c)
   characters with well-established functions as protocol elements, (d)
   punctuation marks used solely to indicate the structure of
   sentences."

   Attention has been called to several points that are not adequately
   dealt with (if at all) in the version 2.0 guidelines but that ought
   to be included in the policy framework without waiting for the
   production and release of a document based on a "best practices"
   model.  The term "BCP" above does not necessarily refer to an IETF
   consensus document.

   The intention in November 2005 was for the recommended major revision
   to be put to the ICANN Board prior to its meeting in Morocco (in late
   June 2006), but for the changes to be collated incrementally and
   appear in interim version 2.n releases of the guidelines.  The IAB's
   understanding is that, while there has been some progress with this,


Klensin, et al.              Informational                     [Page 10]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   other issues relating to IDNs subsequently diverted much of the
   energy that was intended to be devoted to the more extensive
   treatment of the guidelines.

2.  General Problems and Issues

   This section interweaves problems and issues of several types.  Each
   subsection outlines something that is perceived to be a problem or
   issue "with IDNs", therefore needing correction.  Some of these
   issues can be at least partially resolved by making changes to
   elements of the IDNA protocol or tables.  Others will exist as long
   as people have expectations of IDNs that are inconsistent with the
   basic DNS architecture.  It is important to identify this entire
   range of problems because users, registrants, and policy makers often
   do not understand the protocol and other technical issues but only
   the difference between what they believe happens or should happen and
   what actually happens.  As long as those differences exist, there
   will be demands for functionality or policy changes for IDNs.  Of
   course, some of these demands will be less realistic than others, but
   even the realistic ones should be understood in the same context as
   the others.

   Most of the issues that have been raised, and that are discussed in
   this document, exist whether IDNA remains tied to Unicode 3.2 or
   whether migration to new Unicode versions is contemplated.  A
   migration path is necessary to accommodate newly-coded scripts and to
   permit the maximum number of languages and scripts to be represented
   in domain names.  However, the migration issues are largely separate
   from those involving a single Unicode version or Version 3.2 in
   particular, so they have been separated into this section and
   Section 3.

2.1.  User Conceptions, Local Character Sets, and Input issues

   The labels of the DNS are just strings of characters that are not
   inherently tied to a particular language.  As mentioned briefly in
   the Introduction, DNS labels that could not lexically be words in any
   language are possible and indeed common.  There appears to be no
   reason to impose protocol restrictions on IDNs that would restrict
   them more than all-ASCII hostname labels have been restricted.  For
   that reason, even describing DNS labels or strings of them as "names"
   is something of a misnomer, one that has probably added to user
   confusion about what to expect.

   Ordinarily, people use "words" when they think of things and wish
   others to think of them too, for example, "orange", "tree",
   "restaurant" or "Acme Inc".  Words are normally in a specific
   language, such as English or Swedish.  The character-string labels


Klensin, et al.              Informational                     [Page 11]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   supported by the DNS are, as suggested above, not inherently "words".
   While it is useful, especially for mnemonic value or to identify
   objects, for actual words to be used as DNS labels, other constraints
   on the DNS make it impossible to guarantee that it will be possible
   to represent every word in every language as a DNS label,
   internationalized or not.

   When writing or typing the label (or word), a script must be selected
   and a charset must be picked for use with that script.  The choice of
   charset is typically not under the control of the user on a per-word
   or per-document basis, but may depend on local input devices,
   keyboard or terminal drivers, or other decisions made by operating
   system or even hardware designers and implementers.

   If that charset, or the local charset being used by the relevant
   operating system or application software, is not Unicode, a further
   conversion must be performed to produce Unicode.  How often this is
   an issue depends on estimates of how widely Unicode is deployed as
   the native character set for hardware, operating systems, and
   applications.  Those estimates differ widely, but it should be noted
   that, among other difficulties:

   o  ISO 8859 versions [ISO.8859.2003] and even national variations of
      ISO 646 [ISO.646.1991], are still widely used in parts of Europe;

   o  code-table switching methods, typically based on the techniques of
      ISO 2022 [ISO.2022.1986] are still in general use in many parts of
      the world, especially in Japan with Shift-JIS and its variations;
      and

   o  computing, systems, and communications in China tend to use one or
      more of the national "GB" standards rather than native Unicode.

   Additionally, not all charsets define their characters in the same
   way and not all preexisting coding systems were incorporated into
   Unicode without changes.  Sometimes local distinctions were made that
   Unicode does not make or vice versa.  Consequently, conversion from
   other systems to Unicode may potentially lose information.

   The Unicode string that results from this processing -- processing
   that is trivial in a Unicode-native system but that may be
   significant in others -- is then used as input to IDNA.


Klensin, et al.              Informational                     [Page 12]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


2.2.  Examples of Issues

   While much of the discussion below is stated in terms of Unicode
   codings and associated rules, the IAB believes that some of the
   issues are actually not about the Unicode character set per se, but
   about how distributed matching systems operate in reality, and about
   what implications the distributed delayed search for stored data that
   characterizes the DNS has on the mapping algorithms.

2.2.1.  Language-Specific Character Matching

   There are similar words that can be expressed in multiple languages.
   Consider, for example, the name Torbjorn in Norwegian and Swedish.
   In Norwegian it is spelled with the character U+00F8 (LATIN SMALL
   LETTER O WITH STROKE) in the second syllable, while in Swedish it is
   spelled with U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS).  Those
   characters are not treated as equivalent according to the Unicode
   Standard and its Annexes while most people speaking Swedish, Danish,
   or Norwegian probably think they are equivalent.

   It is neither possible nor desirable to make these characters
   equivalent on a global basis.  To do so would, for this example,
   rationalize the situation in Sweden while causing considerable
   confusion in Germany because the U+00F8 character is never used in
   the German language.  But the "variant" model introduced in [RFC3743]
   and [RFC4290] can be used by a registry to prevent the worst
   consequence of the possible confusion, by ensuring either that both
   names are registered to the same party in a given domain or that one
   of them is completely prohibited.

2.2.2.  Multiple Scripts

   There are languages in the world that can be expressed using multiple
   scripts.  For example, some Eastern European and Central Asian
   languages can be expressed in either Cyrillic or Latin (see
   Section 1.5.2) characters, or some African and Southeast Asian
   languages can be expressed in either Arabic or Latin characters.  A
   few languages can even be written in three different scripts.  In
   other cases, the language is typically written in a combination of
   scripts (e.g., Kanji, Kana, and Romaji for Japanese; Hangul and Hanji
   for Korean).  Because of this, the same word, in the same language,
   can be expressed in different ways.  For some languages, only a
   single script is normally used to write a single word; for others,
   mixed scripts are required; and, for still others, special
   circumstances may dictate mixing scripts in labels although that is
   not normally done for "words".  For IDN purposes, these variations
   make the definition of "script" extremely sensitive, especially since
   ICANN is now recommending that it be used as the primary basis for


Klensin, et al.              Informational                     [Page 13]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   registry policies.  However essential it may be to prohibit mixed-
   script labels, additional policy nuance is required for "languages
   with established orthographies and conventions that require the
   commingled use of multiple scripts".

2.2.3.  Normalization and Character Mappings

   Unicode contains several different models for representing
   characters.  The Chinese (Han)-derived characters of the "CJK"
   (Chinese, Japanese, and Korean) languages are "unified", i.e.,
   characters with common derivation and similar appearances are
   assigned to the same code point.  European characters derived from a
   Greek-Latin base are separated into separate code blocks for Latin,
   Greek, and Cyrillic even when individual characters are identical in
   both form and semantics.  Separate code points based on font
   differences alone are generally prohibited, but a large number of
   characters for "mathematical" use have been assigned separate code
   points even though they differ from base ASCII characters only by
   font attributes such as "script", "bold", or "italic".  Some
   characters that often appear together are treated as typographical
   digraphs with specific code points assigned to the combination,
   others require that the two-character sequences be used, and still
   others are available in both forms.  Some Roman-derived letters that
   were developed as decorated variations on the basic Latin letter
   collection (e.g., by addition of diacritical marks) are assigned code
   points as individual characters, others must be built up as two (or
   more) character sequences using "combining characters".

   Many of these differences result from the desire to maintain backward
   compatibility while the standard evolved historically, and are hence
   understandable.  However, the DNS requires precise knowledge of which
   codes and code sequences represent the same character and which ones
   do not.  Limiting the potential difficulties with confusable
   characters (see Section 2.2.6) requires even more knowledge of which
   characters might look alike in some fonts but not in others.  These
   variations make it difficult or impossible to apply a single set of
   rules to all of Unicode and, in doing so, satisfy everyone and their
   perceived needs.  Instead, more or less complex mapping tables,
   defined on a character-by-character basis, are required to
   "normalize" different representations of the same character to a
   single form so that matching is possible.

   Unless normalization rules, such as those that underlie Nameprep, are
   applied, characters that are essentially identical will not match in
   the DNS, creating many opportunities for problems.  The most common
   of these problems is that, due to the processing applied (and
   discussed above) before a word is represented as a Unicode string, a
   single word can end up being expressed as several different Unicode


Klensin, et al.              Informational                     [Page 14]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   strings.  Even if normalization rules are applied, some strings that
   are considered identical by users will not compare equal.  That
   problem is discussed in more detail elsewhere in this document,
   particularly in Section 3.2.1.

   IDNA attempts to compensate for these problems by using a
   normalization algorithm defined by the Unicode Consortium.  This
   algorithm can change a sequence of one or more Unicode characters to
   another set of characters.  One example is that the base character
   U+0061 (LATIN SMALL LETTER A) followed by U+0308 (COMBINING
   DIAERESIS) is changed to the single Unicode character U+00E4 (LATIN
   SMALL LETTER A WITH DIAERESIS).

   This Unicode normalization process accounts only for simple character
   equivalences, not equivalences that are language or script dependent.
   For example, as mentioned above, the characters U+00F8 (LATIN SMALL
   LETTER O WITH STROKE) and U+00F6 (LATIN SMALL LETTER O WITH
   DIAERESIS) are considered to match in Swedish (and some other
   languages), but not for all languages that use either of the
   characters.  Having these characters be treated as equivalent in some
   contexts and not in others requires decisions and mechanisms that, in
   turn, depend much more on context than either IDNA or the Unicode
   character-based normalization tables can provide.

   Additional complications occur if the sequences are more complicated
   or if an attacker is making a deliberate effort to confuse the
   normalization process.  For example, if the sequence U+0069 U+0307
   (LATIN SMALL LETTER I followed by COMBINING DOT ABOVE) appears, the
   Unicode Normalization Method known as NFKC maps it into U+00EF (LATIN
   SMALL LETTER I WITH DIAERESIS), which is what one would predict.  But
   consider U+0131 U+0308 (LATIN SMALL LETTER DOTLESS I and COMBINING
   DIAERESIS):  is that the same character?  Is U+0131 U+0307 U+0307
   (dotless i and two combining dot-above characters) equivalent to
   U+00EF or U+0069, or neither?  NFKC does not appear to tell us, nor
   does the definition of U+0307 appear to tell us what happens when it
   is combined with other "symbol above" arrangements (unlike some of
   the "accent above" combining characters, which more or less specify
   kerning).  Similar issues arise when U+00EF is combined with various
   dot-above combining characters.  Each of these questions provides
   some opportunities for spoofing if different display implementations
   interpret the rules in different ways.

   If we leave Latin scripts and examine those based on Chinese
   characters, we see there is also an absence of specific, lexigraphic,
   rules for transformations between Traditional and Simplified Chinese.
   Even if there were such rules, unification of Japanese and Korean


Klensin, et al.              Informational                     [Page 15]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   characters with Chinese ones would make it impossible to normalize
   Traditional Chinese into Simplified Chinese ones without causing
   problems in Japanese and Korean use of the same characters.

   More generally, while some mappings, such as those between
   precomposed Latin script characters and the equivalent multiple code
   point composed character sequences, depend only on the characters
   themselves, in many or most cases, such as the case with Swedish
   above, the mapping is language or culturally dependent.  There have
   been discussions as to whether different canonicalization rules (in
   addition to or instead of Unicode normalization) should be, or could
   be, applied differently to different languages or scripts.  The fact
   that most scripts included in Unicode have been initially
   incorporated by copying an existing standard more or less intact has
   impact on the optimization of these algorithms and on forward
   compatibility.  Even if the language is known and language-specific
   rules can be defined, dependencies on the language do not disappear.
   Canonicalization operations are not possible unless they either
   depend only on short sequences of text or have significant context
   available that is not obvious from the text itself.  DNS lookups and
   many other operations do not have a way to capture and utilize the
   language or other information that would be needed to provide that
   context.

   These variations in languages and in user perceptions of characters
   make it difficult or impossible to provide uniform algorithms for
   matching Unicode strings in a way that no end users are ever
   surprised by the result.  For closely-related scripts or characters,
   surprises may even be frequent.  However, because uniform algorithms
   are required for mappings that are applied when names are looked up
   in the DNS, the rules that are chosen will always represent an
   approximation that will be more or less successful in minimizing
   those user surprises.  The current Nameprep and Stringprep algorithms
   use mapping tables to "normalize" different representations of the
   same text to a single form so that matching is possible.

   More details on the creation of the normalization algorithms can be
   found in the Unicode Specification and the associated Technical
   Reports [UTR] and Annexes.  Technical Report #36 [UTR36] and [UTR39]
   are specifically related to the IDN discussion.

2.2.4.  URLs in Printed Form

   URLs and other identifiers appear, not only in electronic forms from
   which they can (at least in principle) be accurately copied and
   "pasted" but in printed forms from which the user must transcribe
   them into the computer system.  This is often known as the "side-of-
   the-bus problem" because a particularly problematic version of it


Klensin, et al.              Informational                     [Page 16]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   requires that the user be able to observe and accurately remember a
   URL that is quickly glimpsed in a transient form -- a billboard seen
   while driving, a sign on the side of a passing vehicle, a television
   advertisement that is not frequently repeated or on-screen for a long
   time, and so on.

   The difficulty, in short, is that two Unicode strings that are
   actually different might look exactly the same, especially when there
   is no time to study them.  This is because, for example, some glyphs
   in Cyrillic, Greek, and Latin do look the same, but have been
   assigned different code points in Unicode.  Worse, one needs to be
   reasonably familiar with a script and how it is used to understand
   how much characters can reasonably vary as the result of artistic
   fonts and typography.  For example, there are a few fonts for Latin
   characters that are sufficiently highly ornamented that an observer
   might easily confuse some of the characters with characters in Thai
   script.  Uppercase ITC Blackadder (a registered trademark of
   International Typeface Corporation) and Curlz MT are two fairly
   obvious examples; these fonts use loops at the end of serifs,
   creating a resemblance to Thai (in some fonts) for some characters.

2.2.5.  Bidirectional Text

   Some scripts (and because of that some words in some languages) are
   written not left to right, but right to left.  And, to complicate
   things, one might have something written in Arabic script right to
   left that includes some characters that are read from left to right,
   such as European-style digits.  This implies that some texts might
   have a mixed left-to-right AND right-to-left order (even though in
   most implementations, and in IDNA, all texts have a major direction,
   with the other as an exception).

   IDNA permits the inclusion of European digits in a label that is
   otherwise a sequence of right-to-left characters, but prohibits most
   other mixed-directional (or bidirectional) strings.  This prohibition
   can cause other problems such as the rejection of some otherwise
   linguistically and culturally sensible strings.  As Unicode and
   conventions for handling so-called bidirectional ("BIDI") strings
   evolve, the prohibition in IDNA should be reviewed and reevaluated.

2.2.6.  Confusable Character Issues

   Similar-looking characters in identifiers can cause actual problems
   on the Internet since they can result, deliberately or accidentally,
   in people being directed to the wrong host or mailbox by believing
   that they are typing, or clicking on, intended characters that are
   different from those that actually appear in the domain name or
   reference.  See Section 4.1.3 for further discussion of this issue.


Klensin, et al.              Informational                     [Page 17]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   IDNs complicate these issues, not only by providing many additional
   characters that look sufficiently alike to be potentially confused,
   but also by raising new policy questions.  For example, if a language
   can be written in two different scripts, is a label constructed from
   a word written in one script equivalent to a label constructed from
   the same word written in the other script?  Is the answer the same
   for words in two different languages that translate into each other?

   It is now generally understood that, in addition to the collision
   problems of possibly equivalent words and hence labels, it is
   possible to utilize characters that look alike -- "confusable"
   characters -- to spoof names in order to mislead or defraud users.
   That issue, driven by particular attacks such as those known as
   "phishing", has introduced stronger requirements for registry efforts
   to prevent problems than were previously generally recognized as
   important.

   One commonly-proposed approach is to have a registry establish
   restrictions on the characters, and combinations of characters, it
   will permit to be included in a string to be registered as a label.
   Taking the Swedish top-level domain, .SE, as an example, a rule might
   be adopted that the registry "only accepts registrations in Swedish,
   using Latin script, and because of this, Unicode characters Latin-a,
   -b, -c,...".  But, because there is not a 1:1 mapping between country
   and language, even a Country Code Top Level Domain (ccTLD) like .SE
   might have to accept registrations in other languages.  For example,
   there may be a requirement for Finnish (the second most-used language
   in Sweden).  What rules and code points are then defined for Finnish?
   Does it have special mappings that collide with those that are
   defined for Swedish?  And what does one do in countries that use more
   than one script?  (Finnish and Swedish use the same script.)  In all
   cases, the dispute will ultimately be about whether two strings are
   the same (or confusingly similar) or not.  That, in turn, will
   generate a discussion of how one defines "what is the same" and "what
   is similar enough to be a problem".

   Another example arose recently that further illustrates the problem.
   If one were to use Cyrillic characters to represent the country code
   for Russia in a localized equivalent to the ccTLD label, the
   characters themselves would be indistinguishable from the Latin
   characters "P" and "Y" (in either lower- or uppercase) in most fonts.
   We presume this might cause some consternation in Paraguay.

   These difficulties can never be completely eliminated by algorithmic
   means.  Some of the problem can be addressed by appropriate tuning of
   the protocols and their tables, other parts by registry actions to
   reduce confusion and conflicts, and still other parts can be


Klensin, et al.              Informational                     [Page 18]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   addressed by careful design of user interfaces in application
   programs.  But, ultimately, some responsibility to avoid being
   tricked or harmfully confused will rest with the user.

   Another registry technique that has been extensively explored
   involves looking at confusable characters and confusion between
   complete labels, restricting the labels that can be registered based
   on relationships to what is registered already.  Registries that
   adopt this approach might establish special mapping rules such as:

   1.  If you register something with code point A, domain names with B
       instead of A will be blocked from registration by others (where B
       is a character at a separate code point that has a confusingly
       similar appearance to A).

   2.  If you register something with code point A, you also get domain
       name with B instead of A.

   These approaches are discussed in more detail for "CJK" characters in
   RFC 3743 [RFC3743] and more generally in RFC 4290 [RFC4290].

2.2.7.  The IESG Statement and IDNA issues

   The issues above, at least as they were understood at the time,
   provided the background for the IESG statement included in
   Section 1.6.1 (which, in turn, was part of the basis for the initial
   ICANN Guidelines) that a registry should have a policy about the
   scripts, languages, code points and text directions for which
   registrations will be accepted.  While "accept all" might be an
   acceptable policy, it implies there is also a dispute resolution
   process that takes the problems listed above into account.  This
   process must be designed for dealing with all types of potential
   disputes.  For example, issues might arise between registrant and
   registry over a decision by the registry on collisions with already
   registered domain names and between registrant and trademark holder
   (that a domain name infringes on a trademark).  In both cases, the
   parties disagreeing have different views on whether two strings are
   "equivalent" or not.  They may believe that a string that is not
   allowed to be registered is actually different from one that is
   already registered.  Or they might believe that two strings are the
   same, even though the rules adopted by the registry to prevent
   confusion define them as two different domain names.


Klensin, et al.              Informational                     [Page 19]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


3.  Migrating to New Versions of Unicode

3.1.  Versions of Unicode

   While opinions differ about how important the issues are in practice,
   the use of Unicode and its supporting tables for IDNA appears to be
   far more sensitive to subtle changes than it is in typical Unicode
   applications.  This may be, at least in part, because many other
   applications are internally sensitive only to the appearance of
   characters and not to their representation.  Or those applications
   may be able to take effective advantage of script, language, or
   character class identification.  The working group that developed
   IDNA concluded that attempting to encode any ancillary character
   information into the DNS label would be impractical and unwise, and
   the IAB, based in part on the comments in the ad hoc committee, saw
   no reason to review that decision.

   The Unicode Consortium has sometimes used the likelihood of a
   combination of characters actually appearing in a natural language as
   a criterion for the safety of a possible change.  However, as
   discussed above, DNS names are often fabrications -- abbreviations,
   strings deliberately formed to be unusual, members of a series
   sequenced by numbers or other characters, and so on.  Consequently, a
   criterion that considers a change to be safe if it would not be
   visible in properly-constructed running text is not helpful for DNS
   purposes: a change that would be safe under that criterion could
   still be quite problematic for the DNS.

   This sensitivity to changes has made it quite difficult to migrate
   IDNA from one version of Unicode to the next if any changes are made
   that are not strictly additive.  A change in a code point assignment
   or definition may be extremely disruptive if a DNS label has been
   defined using the earlier form and any of its previous components has
   been moved from one table position or normalization rule to another.
   Unicode normalization tables, tables of scripts or languages and
   characters that belong to them, and even tables of confusable
   characters as an adjunct to security recommendations may be very
   helpful in designing registry restrictions on registrations and
   applications provisions for avoiding or identifying suspicious names.
   Ironically, they also extend the sensitivity of IDNA and its
   implementations to all forms of change between one version of Unicode
   and the next.  Consequently, they make Unicode version migration more
   difficult.

   An example of the type of change that appears to be just a small
   correction from one perspective but may be problematic from another
   was the correction to the normalization definition in 2004
   [Unicode-PR29].  Community input suggested that the change would


Klensin, et al.              Informational                     [Page 20]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   cause problems for Stringprep, but the Unicode Technical Committee
   decided, on balance, that the change was worthwhile.  Because of
   difficulties with consistency, some deployed implementations have
   decided to adopt the change and others have not, leading to subtle
   incompatibilities.

   This situation leads to a dilemma.  On the one hand, it is completely
   unacceptable to freeze IDNA at a Unicode version level that excludes
   more recently-defined characters and scripts that are important to
   those who use them.  On the other hand, it is equally unacceptable to
   migrate from one version of Unicode to the next if such migration
   might invalidate an existing registered DNS name or some of its
   registered properties or might make the string or representation of
   that name ambiguous.  If IDNA is to be modified to accommodate new
   versions of Unicode, the IETF will need to work with the Unicode
   Consortium and other bodies to find an appropriate balance in this
   area, but progress will be possible only if all relevant parties are
   able to fairly consider and discuss possible decisions that may be
   very difficult and unpalatable.

   It would also prove useful if, during the course of that dialog, the
   need for Unicode Consortium concern with security issues in
   applications of the Unicode character set could be clarified.  It
   would be unfortunate from almost every perspective considered here,
   if such matters slowed the inclusion of as yet unencoded scripts.

3.2.  Version Changes and Normalization Issues

3.2.1.  Unnormalized Combining Sequences

   One of the advantages of the Unicode model of combining characters,
   as with previous systems that use character overstriking to
   accomplish similar purposes, is that it is possible to use sequences
   of code points to generate characters that are not explicitly
   provided for in the character set.  However, unless sequences that
   are not explicitly provided for are prohibited by some mechanism
   (such as the normalization tables), such combining sequences can
   permit two related dangers.

   o  The first is another risk of character confusion, especially if
      the relationship of the combining character with characters it
      combines with are not precisely defined or unexpected combinations
      of combining characters are used.  That issue is discussed in more
      detail, with an example, in Section 2.2.3.

   o  These same issues also inherently impact the stability of the
      normalization tables.  Suppose that, somewhere in the world, there
      is a character that looks like a Roman-derived lowercase "i", but


Klensin, et al.              Informational                     [Page 21]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


      with three (not one or two) dots above it.  And suppose that the
      users of that character agree to represent it by combining a
      traditional "i" (U+0069) with a combining diaeresis (U+0308).  So
      far, no problem.  But, later, a broader need for this character is
      discovered and it is coded into Unicode either as a single
      precomposed character or, more likely under existing rules, by
      introducing a three-dot-above combining character.  In either
      case, that version of Unicode should include a rule in NFKC that
      maps the "i"-plus-diaeresis sequence into the new, approved, one.
      If one does not do so, then there is arguably a normalization that
      should occur that does not.  If one does so, then strings that
      were valid and normalized (although unanticipated) under the
      previous versions of Unicode become unnormalized under the new
      version.  That, in turn, would impact IDNA comparisons because,
      effectively, it would introduce a change in the matching rules.

   It would be useful to consider rules that would avoid or minimize
   these problems with the understanding that, for reasons given
   elsewhere, simply minimizing it may not be good enough for IDNA.  One
   partial solution might be to ban any combination of a base character
   and a combining character that does not appear in a hypothetical
   "anticipated combinations" table from being used in a domain name
   label.  The next subsection discusses a more radical, if impractical,
   view of the problem and its solutions.

3.2.2.  Combining Characters and Character Components

   For several reasons, including those discussed above, one thing that
   increases IDNA complexity and the need for normalization is that
   combining characters are permitted.  Without them, complexity might
   be reduced enough to permit easier transitions to new versions.  The
   community should consider the impact of entirely prohibiting
   combining characters from IDNs.  While it is almost certainly
   unfeasible to introduce this change into Unicode as it is now defined
   and doing so would be extremely disruptive even if it were feasible,
   the thought experiment can be helpful in understanding both the
   issues and the implications of the paths not taken.  For example, one
   consequence of this, of course, is that each new language or script,
   and several existing ones, would require that all of its characters
   have Unicode assignments to specific, precomposed, code points.

   Note that this is not currently permitted within Unicode for Latin
   scripts.  For non-Latin scripts, some such code points have been
   defined.  The decisions that govern the assignment of such code
   points are managed entirely within the Unicode Consortium.  Were the
   IETF to choose to reduce IDNA complexity by excluding combining
   characters, no doubt there would be additional input to the Unicode
   Consortium from users and proponents of scripts that precomposed


Klensin, et al.              Informational                     [Page 22]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   characters be required.  The IAB and the IETF should examine whether
   it is appropriate to press the Unicode Consortium to revise these
   policies or otherwise to recommend actions that would reduce the need
   for normalization and the related complexities.  However, we have
   been told that the Technical Committee does not believe it is
   reasonable or feasible to add all possible precomposed characters to
   Unicode.  If Unicode cannot be modified to contain the precomposed
   characters necessary to support existing languages and scripts, much
   less new ones, this option for IDN restrictions will not be feasible.

3.2.3.  When does normalization occur?

   In many Unicode applications, the preferred solution is to pick a
   style of normalization and require that all text that is stored or
   transmitted be normalized to that form.  (This is the approach taken
   in ongoing work in the IETF on a standard Unicode text form
   [net-utf8]).  IDNA does not impose this requirement.  Text is
   normalized and case-reduced at registration time, and only the
   normalized version is placed in the DNS.  However, there is no
   requirement that applications show only the native (and lower-case
   where appropriate) characters associated with the normalized form in
   discussions or references such as URLs.  If conventions used for
   all-ASCII DNS labels are to be extended to internationalized forms,
   such a requirement would be unreasonable, since it would prohibit the
   use of mixed-case references for clarity or market identification.
   It might even be culturally inappropriate.  However, without that
   restriction, the comparison that will ultimately be made in the DNS
   will be between strings normalized at different times and under
   different versions of Unicode.  The assertion that a string in
   normalized form under one version of Unicode will still be in
   normalized form under all future versions is not sufficient.
   Normalization at different times also requires that a given source
   string always normalizes to the same target string, regardless of the
   version under which it is normalized.  That criterion is much more
   difficult to fulfill.  The discussion above suggests that it may even
   be impossible.

   Ignoring these issues with combining characters entirely, as IDNA
   effectively does today, may leave us "stuck" at Unicode 3.2, leading
   either to incompatibility differences in applications that otherwise
   use a modern version of Unicode (while IDN remains at Unicode 3.2) or
   to painful transitions to new versions.  If decisions are made
   quickly, it may still be possible to make a one-time version upgrade
   to Version 4.1 or Version 5 of Unicode.  However, unless we can
   impose sufficient global restrictions to permit smooth transitions,
   upgrading to versions beyond that one are likely to be painful (e.g.,
   potentially requiring changing strings already in the DNS or even a
   new Punycode prefix) or impossible.


Klensin, et al.              Informational                     [Page 23]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


4.  Framework for Next Steps in IDN Development

4.1.  Issues within the Scope of the IETF

4.1.1.  Review of IDNA

   The IETF should consider reviewing RFCs 3454, 3490, 3491, and/or
   3492, and update, replace, or supplement them to meet the criteria of
   this paragraph (one or more of them may prove impractical after
   further study).  Any new versions or additional specifications should
   be adapted to the version of Unicode that is current when they are
   created.  Ideally, they should specify a path for adapting to future
   versions of Unicode (some suggestions below may facilitate this).
   The IETF should also consider whether there are significant
   advantages to mapping some groups of characters, such as code points
   assigned to font variations, into others or whether clarity and
   comprehensibility for the user would be better served by simply
   prohibiting those characters.  More generally, it appears that it
   would be worthwhile for the IETF to review whether the Unicode
   normalization rules now invoked by the Stringprep profile in Nameprep
   are optimal for the DNS or whether more restrictive rules, or an even
   more restrictive set of permitted character combinations, would
   provide better support for DNS internationalization.

   The IAB has concluded that there is a consensus within the broader
   community that lists of code points should be specified by the use of
   an inclusion-based mechanism (i.e., identifying the characters that
   are permitted), rather than by excluding a small number of characters
   from the total Unicode set as Stringprep and Nameprep do today.  That
   conclusion should be reviewed by the IETF community and action taken
   as appropriate.

   We suggest that the individuals doing the review of the code points
   should work as a specialized design team.  To the extent possible,
   that work should be done jointly by people with experience from the
   IETF and deep knowledge of the constraints of the DNS and application
   design, participants from the Unicode Consortium, and other people
   necessary to be able to reach a generally-accepted result.  Because
   any work along these lines would be modifications and updates to
   standards-track documents, final review and approval of any proposals
   would necessarily follow normal IETF processes.

   It is worth noting that sufficiently extreme changes to IDNA would
   require a new Punycode prefix, probably with long-term support for
   both the old prefix and the new one in both registration arrangements
   and applications.  An alternative, which is almost certainly
   impractical, would be some sort of "flag day", i.e., a date on which
   the old rules are simultaneously abandoned by everyone and the new


Klensin, et al.              Informational                     [Page 24]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   ones adopted.  However, preliminary analysis indicates that few, if
   any, of the changes recommended for consideration elsewhere in this
   document would require this type of version change.  For example,
   suppose additional restrictions, such as those implied above, are
   imposed on what can be registered.  Those restrictions might require
   policy decisions about how labels are to be disposed of if they
   conformed to the earlier rules but not to the new ones.  But they
   would not inherently require changes in the protocol or prefix.

4.1.2.  Non-DNS and Above-DNS Internationalization Approaches

   The IETF should once again examine the extent to which it is
   appropriate to try to solve internationalization problems via the DNS
   and what place the many varieties of so-called "keyword systems" or
   other Internet navigational techniques might have.  Those techniques
   can be designed to impose fewer constraints, or at least different
   constraints, than IDNA and the DNS.  As discussed elsewhere in this
   document, IDNA cannot support information about scripts, languages,
   or Unicode versions on lookup.  As a consequence of the nature of DNS
   lookups, characters and labels either match or do not match; a near-
   match is simply not a possible concept in the DNS.  By contrast,
   observation of near-matching is common in human communication and in
   matching operations performed by people, especially when they have a
   particular script or language context in mind.  The DNS is further
   constrained by a fairly rigid internal aliasing system (via CNAME and
   DNAME resource records), while some applications of international
   naming may require more flexibility.  Finally, the rigid hierarchy of
   the DNS --and the tendency in practice for it to become flat at
   levels nearest the root-- and the need for names to be unique are
   more suitable for some purposes than others and may not be a good
   match for some purposes for which people wish to use IDNs.  Each of
   these constraints can be relaxed or changed by one or more systems
   that would provide alternatives to direct use of the DNS by users.
   Some of the issues involved are discussed further in Section 5.3 and
   various ideas have been discussed in detail in the IETF or IRTF.
   Many of those ideas have even been described in Internet Drafts or
   other documents.  As experience with IDNs and with expectations for
   them accumulates, it will probably become appropriate for the IETF or
   IRTF to revisit the underlying questions and possibilities.

4.1.3.  Security Issues, Certificates, etc.

   Some characters look like others, often as the result of common
   origins.  The problem with these "confusable" characters, often
   incorrectly called homographs, has always existed when characters are
   presented to humans who interpret what is displayed and then make
   decisions based on what is seen.  This is not a problem that exists
   only when working with internationalized domain names, but they make


Klensin, et al.              Informational                     [Page 25]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   the problem worse.  The result of a survey that would explain what
   the problems are might be interesting.  Many of these issues are
   mentioned in Unicode Technical Report #36 [UTR36].

   In this and other issues associated with IDNs, precise use of
   terminology is important lest even more confusion result.  The
   definition of the term 'homograph' that normally appears in
   dictionaries and linguistic texts states that homographs are
   different words that are spelled identically (for example, the
   adjective 'brief' meaning short, the noun 'brief' meaning a document,
   and the verb 'brief' meaning to inform).  By definition, letters in
   two different alphabets are not the same, regardless of similarities
   in appearance.  This means that sequences of letters from two
   different scripts that appear to be identical on a computer display
   cannot be homographs in the accepted sense, even if they are both
   words in the dictionary of some language.  Assuming that there is a
   language written with Cyrillic script in which "cap" is a word,
   regardless of what it might mean, it is not a homograph of the
   Latin-script English word "cap".

   When the security implications of visually confusable characters were
   brought to the forefront in 2005, the term homograph was used to
   designate any instance of graphic similarity, even when comparing
   individual characters.  This usage is not only incorrect, but risks
   introducing even more confusion and hence should be avoided.  The
   current preferred terminology is to describe these similar-looking
   characters as "confusable characters" or even "confusables".

   Many people have suggested that confusable characters are a problem
   that must be addressed, at least in part, directly in the user
   interfaces of application software.  While it should almost certainly
   be part of a complete solution, that approach creates it own set of
   difficulties.  For example, a user switching between systems, or even
   between applications on the same system, may be surprised by
   different types of behavior and different levels of protection.  In
   addition, it is unclear how a secure setup for the end user should be
   designed.  Today, in the web browser, a padlock is a traditional way
   of describing some level of security for the end user.  Is this
   binary signaling enough?  Should there be any connection between a
   risk for a displayed string including confusable characters and the
   padlock or similar signaling to the user?

   Many web browsers have adopted a convention, based on a "whitelist"
   or similar technique, of restricting the display of native characters
   to subdomains of top-level domains that are deemed to have safe
   practices for the registration of potentially confusable labels.
   IDNs in other domains are displayed as Punycode.  These techniques
   may not be sufficiently sensitive to differences in policies among


Klensin, et al.              Informational                     [Page 26]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   top-level domains and their subdomains and so, while they are clearly
   helpful, they may not be adequate.  Are other methods of dealing with
   confusable characters possible?  Would other methods of identifying
   and listing policies about avoiding confusing registrations be
   feasible and helpful?

   It would be interesting to see a more coordinated effort in
   establishing guidelines for user interfaces.  If nothing else, the
   current whitelists are browser specific and both can, and do, differ
   between implementations.

4.1.4.  Protocol Changes and Policy Implications

   Some potential protocol or table changes raise important policy
   issues about what to do with existing, registered, names.  Should
   such changes be needed, their impact must be carefully evaluated in
   the IETF, ICANN, and possibly other forums.  In particular, protocol
   or policy changes that would not permit existing names to be
   registered under the newer rules should be considered carefully,
   balancing their importance against possible disruption and the issues
   of invalidating older names against the importance of consistency as
   seen by the user.

4.1.5.  Non-US-ASCII in Local Part of Email Addresses

   Work is going on in the IETF related to the local part of email
   addresses.  It should be noted that the local part of email addresses
   has much different syntax and constraints than a domain name label,
   so to directly apply IDNA on the local part is not possible.

4.1.6.  Use of the Unicode Character Set in the IETF

   Unicode and the closely-related ISO 10646 are the only coded
   character sets that aspire to include all of the world's characters.
   As such, they permit use of international characters without having
   to identify particular character coding standards or tables.  The
   requirement for a single character set is particularly important for
   use with the DNS since there is no place to put character set
   identification.  The decision to use Unicode as the base for IETF
   protocols going forward is discussed in [RFC2277].  The IAB does not
   see any reason to revisit the decision to use Unicode in IETF
   protocols.


Klensin, et al.              Informational                     [Page 27]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


4.2.  Issues That Fall within the Purview of ICANN

4.2.1.  Dispute Resolution

   IDNs create new types of collisions between trademarks and domain
   names as well as collisions between domain names.  These have impact
   on dispute resolution processes used by registries and otherwise.  It
   is important that deployment of IDNs evolve in parallel with review
   and updating of ICANN or registry-specific dispute resolution
   processes.

4.2.2.  Policy at Registries

   The IAB recommends that registries use an inclusion-based model when
   choosing what characters to allow at the time of registration.  This
   list of characters is in turn to be a subset of what is allowed
   according to the updated IDNA standard.  The IAB further recommends
   that registries develop their inclusion-based models in parallel with
   dispute resolution process at the registry itself.

   Most established policies for dealing with claimed or apparent
   confusion or conflicts of names are based on dispute resolution.
   Decisions about legitimate use or registration of one or more names
   are resolved at or after the time of registration on a case-by-case
   basis and using policies that are specific to the particular DNS zone
   or jurisdiction involved.  These policies have generally not been
   extended below the level of the DNS that is directly controlled by
   the top-level registry.

   Because of the number of conflicts that can be generated by the
   larger number of available and confusable characters in Unicode, we
   recommend that registration-restriction and dispute resolution
   policies be developed to constrain registration of IDNs and zone
   administrators at all levels of the DNS tree.  Of course, many of
   these policies will be less formal than others and there is no
   requirement for complete global consistency, but the arguments for
   reduction of confusable characters and other issues in TLDs should
   apply to all zones below that specific TLD.

   Consistency across all zones can obviously only be accomplished by
   changes to the protocols.  Such changes should be considered by the
   IETF if particular restrictions are identified that are important and
   consistent enough to be applied globally.

   Some potential protocol changes or changes to character-mapping
   tables might, if adopted, have profound registry policy implications.
   See Section 4.1.4.


Klensin, et al.              Informational                     [Page 28]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


4.2.3.  IDNs at the Top Level of the DNS

   The IAB has concluded that there is not one issue with IDNs at the
   top level of the DNS (IDN TLDs) but at least three very separate
   ones:

   o  If IDNs are to be entered in the root zone, decisions must first
      be made about how these TLDs are to be named and delegated.  These
      decisions fall within the traditional IANA scope and are ICANN
      issues today.

   o  There has been discussion of permitting some or all existing TLDs
      to be referenced by multiple labels, with those labels presumably
      representing some understanding of the "name" of the TLD in
      different languages.  If actual aliases of this type are desired
      for existing domains, the IETF may need to consider whether the
      use of DNAME records in the root is appropriate to meet that need,
      what constraints, if any, are needed, whether alternate
      approaches, such as those of [RFC4185], are appropriate or whether
      further alternatives should be investigated.  But, to the extent
      to which aliases are considered desirable and feasible, decisions
      presumably must be made as to which, if any, root IDN labels
      should be associated with DNAME records and which ones should be
      handled by normal delegation records or other mechanisms.  That
      decision is one of DNS root-level namespace policy and hence falls
      to ICANN although we would expect ICANN to pay careful attention
      to any technical, operational, or security recommendations that
      may be produced by other bodies.

   o  Finally, if IDN labels are to be placed in the root zone, there
      are issues associated with how they are to be encoded and
      deployed.  This area may have implications for work that has been
      done, or should be done, in the IETF.

5.  Specific Recommendations for Next Steps

   Consistent with the framework described above, the IAB offers these
   recommendations as steps for further consideration in the identified
   groups.

5.1.  Reduction of Permitted Character List

   Generalize from the original "hostname" rules to non-ASCII
   characters, permitting as few characters as possible to do that job.
   This would involve a restrictive model for characters permitted in
   IDN labels, thus contrasting with the approach used to develop the
   original IDNA/Nameprep tables.  That approach was to include all
   Unicode characters that there was not a clear reason to exclude.


Klensin, et al.              Informational                     [Page 29]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   The specific recommendation here is to specify such internationalized
   hostnames.  Such an activity would fall to the IETF, although the
   task of developing the appropriate list of permitted characters will
   require effort both in the IETF and elsewhere.  The effort should be
   as linguistically and culturally sensitive as possible, but smooth
   and effective operation of the DNS, including minimizing of
   complexity, should be primary goals.  The following should be
   considered as possible mechanisms for achieving an appropriate
   minimum number of characters.

5.1.1.  Elimination of All Non-Language Characters

   Unicode characters that are not needed to write words or numbers in
   any of the world's languages should be eliminated from the list of
   characters that are appropriate in DNS labels.  In addition to such
   characters as those used for box-drawing and sentence punctuation,
   this should exclude punctuation for word structure and other
   delimiters.  While DNS labels may conveniently be used to express
   words in many circumstances, the goal is not to express words (or
   sentences or phrases), but to permit the creation of unambiguous
   labels with good mnemonic value.

5.1.2.  Elimination of Word-Separation Punctuation

   The inclusion of the hyphen in the original hostname rules is a
   historical artifact from an older, flat, namespace.  The community
   should consider whether it is appropriate to treat it as a simple
   legacy property of ASCII names and not attempt to generalize it to
   other scripts.  We might, for example, not permit claimed equivalents
   to the hyphen from other scripts to be used in IDNs.  We might even
   consider banning use of the hyphen itself in non-ASCII strings or,
   less restrictively, strings that contained non-Latin characters.

5.2.  Updating to New Versions of Unicode

   As new scripts, to support new languages, continue to be added to
   Unicode, it is important that IDNA track updates.  If it does not do
   so, but remains "stuck" at 3.2 or some single later version, it will
   not be possible to include labels in the DNS that are derived from
   words in languages that require characters that are available only in
   later versions.  Making those upgrades is difficult, and will
   continue to be difficult, as long as new versions require, not just
   addition of characters, but changes to canonicalization conventions,
   normalization tables, or matching procedures (see Section 3.1).
   Anything that can be done to lower complexity and simplify forward
   transitions should be seriously considered.


Klensin, et al.              Informational                     [Page 30]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


5.3.  Role and Uses of the DNS

   We wish to remind the community that there are boundaries to the
   appropriate uses of the DNS.  It was designed and implemented to
   serve some specific purposes.  There are additional things that it
   does well, other things that it does badly, and still other things it
   cannot do at all.  No amount of protocol work on IDNs will solve
   problems with alternate spellings, near-matches, searching for
   appropriate names, and so on.  Registration restrictions and
   carefully-designed user interfaces can be used to reduce the risk and
   pain of attempts to do some of these things gone wrong, as well as
   reducing the risks of various sort of deliberate bad behavior, but,
   beyond a certain point, use of the DNS simply because it is available
   becomes a bad tradeoff.  The tradeoff may be particularly unfortunate
   when the use of IDNs does not actually solve the proposed problem.
   For example, internationalization of DNS names does not eliminate the
   ASCII protocol identifiers and structure of URIs [RFC3986] and even
   IRIs [RFC3987].  Hence, DNS internationalization itself, at any or
   all levels of the DNS tree, is not a sufficient response to the
   desire of populations to use the Internet entirely in their own
   languages and the characters associated with those languages.

   These issues are discussed at more length, and alternatives
   presented, in [RFC2825], [RFC3467], [INDNS], and [DNS-Choices].

5.4.  Databases of Registered Names

   In addition to their presence in the DNS, IDNs introduce issues in
   other contexts in which domain names are used.  In particular, the
   design and content of databases that bind registered names to
   information about the registrant (commonly described as "whois"
   databases) will require review and updating.  For example, the whois
   protocol itself [RFC3912] has no standard capability for handling
   non-ASCII text: one cannot search consistently for, or report, either
   a DNS name or contact information that is not in ASCII characters.
   This may provide some additional impetus for a switch to IRIS
   [RFC3981] [RFC3982] but also raises a number of other questions about
   what information, and in what languages and scripts, should be
   included or permitted in such databases.

6.  Security Considerations

   This document is simply a discussion of IDNs and IDNA issues; it
   raises no new security concerns.  However, if some of its
   recommendations to reduce IDNA complexity, the number of available
   characters, and various approaches to constraining the use of
   confusable characters, are followed and prove successful, the risks
   of name spoofing and other problems may be reduced.


Klensin, et al.              Informational                     [Page 31]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


7.  Acknowledgements

   The contributions to this report from members of the IAB-IDN ad hoc
   committee are gratefully acknowledged.  Of course, not all of the
   members of that group endorse every comment and suggestion of this
   report.  In particular, this report does not claim to reflect the
   views of the Unicode Consortium as a whole or those of particular
   participants in the work of that Consortium.

   The members of the ad hoc committee were: Rob Austein, Leslie Daigle,
   Tina Dam, Mark Davis, Patrik Faltstrom, Scott Hollenbeck, Cary Karp,
   John Klensin, Gervase Markham, David Meyer, Thomas Narten, Michael
   Suignard, Sam Weiler, Bert Wijnen, Kurt Zeilenga, and Lixia Zhang.

   Thanks are due to Tina Dam and others associated with the ICANN IDN
   Working Group for contributions of considerable specific text, to
   Marcos Sanz and Paul Hoffman for careful late-stage reading and
   extensive comments, and to Pete Resnick for many contributions and
   comments, both in conjunction with his former IAB service and
   subsequently.  Olaf M. Kolkman took over IAB leadership for this
   document after Patrik Faltstrom and Pete Resnick stepped down in
   March 2006.

   Members of the IAB at the time of approval of this document were:
   Bernard Aboba, Loa Andersson, Brian Carpenter, Leslie Daigle, Patrik
   Faltstrom, Bob Hinden, Kurtis Lindqvist, David Meyer, Pekka Nikander,
   Eric Rescorla, Pete Resnick, Jonathan Rosenberg and Lixia Zhang.

8.  References

8.1.  Normative References

   [ISO10646]          International Organization for Standardization,
                       "Information Technology - Universal Multiple-
                       Octet Coded Character Set (UCS) - Part 1:
                       Architecture and Basic Multilingual Plane"",
                       ISO/IEC 10646-1:2000, October 2000.

   [RFC3454]           Hoffman, P. and M. Blanchet, "Preparation of
                       Internationalized Strings ("stringprep")",
                       RFC 3454, December 2002.

   [RFC3490]           Faltstrom, P., Hoffman, P., and A. Costello,
                       "Internationalizing Domain Names in Applications
                       (IDNA)", RFC 3490, March 2003.


Klensin, et al.              Informational                     [Page 32]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   [RFC3491]           Hoffman, P. and M. Blanchet, "Nameprep: A
                       Stringprep Profile for Internationalized Domain
                       Names (IDN)", RFC 3491, March 2003.

   [RFC3492]           Costello, A., "Punycode: A Bootstring encoding of
                       Unicode for Internationalized Domain Names in
                       Applications (IDNA)", RFC 3492, March 2003.

   [Unicode32]         The Unicode Consortium, "The Unicode Standard,
                       Version 3.0", 2000.
                       (Reading, MA, Addison-Wesley, 2000.  ISBN
                       0-201-61633-5).  Version 3.2 consists of the
                       definition in that book as amended by the Unicode
                       Standard Annex #27: Unicode 3.1
                       (http://www.unicode.org/reports/tr27/) and by the
                       Unicode Standard Annex #28: Unicode 3.2
                       (http://www.unicode.org/reports/tr28/).

8.2.  Informative References

   [DNS-Choices]       Faltstrom, P., "Design Choices When Expanding
                       DNS", Work in Progress, June 2005.

   [ICANNv1]           ICANN, "Guidelines for the Implementation of
                       Internationalized Domain Names, Version 1.0",
                       March 2003, <http://www.icann.org/general/
                       idn-guidelines-20jun03.htm>.

   [ICANNv2]           ICANN, "Guidelines for the Implementation of
                       Internationalized Domain Names, Version 2.0",
                       November 2005, <http://www.icann.org/general/
                       idn-guidelines-20sep05.htm>.

   [IESG-IDN]          Internet Engineering Steering Group (IESG), "IESG
                       Statement on IDN", IESG Statements IDN Statement,
                       February 2003, <http://www.ietf.org/IESG/
                       STATEMENTS/IDNstatement.txt>.

   [INDNS]             National Research Council, "Signposts in
                       Cyberspace: The Domain Name System and Internet
                       Navigation", National Academy Press ISBN 0309-
                       09640-5 (Book) 0309-54979-5 (PDF), 2005, <http://
                       www7.nationalacademies.org/cstb/pub_dns.html>.

   [ISO.2022.1986]     International Organization for Standardization,
                       "Information Processing: ISO 7-bit and 8-bit
                       coded character sets: Code extension techniques",
                       ISO Standard 2022, 1986.


Klensin, et al.              Informational                     [Page 33]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   [ISO.646.1991]      International Organization for Standardization,
                       "Information technology - ISO 7-bit coded
                       character set for information interchange",
                       ISO Standard 646, 1991.

   [ISO.8859.2003]     International Organization for Standardization,
                       "Information processing - 8-bit single-byte coded
                       graphic character sets - Part 1: Latin alphabet
                       No. 1 (1998) - Part 2: Latin alphabet No. 2
                       (1999) - Part 3: Latin alphabet No. 3 (1999) -
                       Part 4: Latin alphabet No. 4 (1998) - Part 5:
                       Latin/Cyrillic alphabet (1999) - Part 6: Latin/
                       Arabic alphabet (1999) - Part 7: Latin/Greek
                       alphabet (2003) - Part 8: Latin/Hebrew alphabet
                       (1999) - Part 9: Latin alphabet No. 5 (1999) -
                       Part 10: Latin alphabet No. 6 (1998) - Part 11:
                       Latin/Thai alphabet (2001) - Part 13: Latin
                       alphabet No. 7 (1998) - Part 14: Latin alphabet
                       No. 8 (Celtic) (1998) - Part 15: Latin alphabet
                       No. 9 (1999) - Part 16: Part 16: Latin alphabet
                       No. 10 (2001)", ISO Standard 8859, 2003.

   [RFC2277]           Alvestrand, H., "IETF Policy on Character Sets
                       and Languages", BCP 18, RFC 2277, January 1998.

   [RFC2825]           IAB and L. Daigle, "A Tangled Web: Issues of
                       I18N, Domain Names, and the Other Internet
                       protocols", RFC 2825, May 2000.

   [RFC3066]           Alvestrand, H., "Tags for the Identification of
                       Languages", BCP 47, RFC 3066, January 2001.

   [RFC3467]           Klensin, J., "Role of the Domain Name System
                       (DNS)", RFC 3467, February 2003.

   [RFC3536]           Hoffman, P., "Terminology Used in
                       Internationalization in the IETF", RFC 3536,
                       May 2003.

   [RFC3743]           Konishi, K., Huang, K., Qian, H., and Y. Ko,
                       "Joint Engineering Team (JET) Guidelines for
                       Internationalized Domain Names (IDN) Registration
                       and Administration for Chinese, Japanese, and
                       Korean", RFC 3743, April 2004.

   [RFC3912]           Daigle, L., "WHOIS Protocol Specification",
                       RFC 3912, September 2004.


Klensin, et al.              Informational                     [Page 34]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


   [RFC3981]           Newton, A. and M. Sanz, "IRIS: The Internet
                       Registry Information Service (IRIS) Core
                       Protocol", RFC 3981, January 2005.

   [RFC3982]           Newton, A. and M. Sanz, "IRIS: A Domain Registry
                       (dreg) Type for the Internet Registry Information
                       Service (IRIS)", RFC 3982, January 2005.

   [RFC3986]           Berners-Lee, T., Fielding, R., and L. Masinter,
                       "Uniform Resource Identifier (URI): Generic
                       Syntax", STD 66, RFC 3986, January 2005.

   [RFC3987]           Duerst, M. and M. Suignard, "Internationalized
                       Resource Identifiers (IRIs)", RFC 3987,
                       January 2005.

   [RFC4185]           Klensin, J., "National and Local Characters for
                       DNS Top Level Domain (TLD) Names", RFC 4185,
                       October 2005.

   [RFC4290]           Klensin, J., "Suggested Practices for
                       Registration of Internationalized Domain Names
                       (IDN)", RFC 4290, December 2005.

   [RFC4645]           Ewell, D., "Initial Language Subtag Registry",
                       RFC 4645, September 2006.

   [RFC4646]           Phillips, A. and M. Davis, "Tags for Identifying
                       Languages", BCP 47, RFC 4646, September 2006.

   [UTR]               Unicode Consortium, "Unicode Technical Reports",
                       <http://www.unicode.org/reports/>.

   [UTR36]             Davis, M. and M. Suignard, "Unicode Technical
                       Report #36: Unicode Security Considerations",
                       November 2005, <http://www.unicode.org/draft/
                       reports/tr36/tr36.html>.

   [UTR39]             Davis, M. and M. Suignard, "Unicode Technical
                       Standard #39 (proposed): Unicode Security
                       Considerations", July 2005, <http://
                       www.unicode.org/draft/reports/tr39/tr39.html>.

   [Unicode-PR29]      The Unicode Consortium, "Public Review Issue #29:
                       Normalization Issue", Unicode PR 29,
                       February 2004.

   [Unicode10]         The Unicode Consortium, "The Unicode Standard,


Klensin, et al.              Informational                     [Page 35]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


                       Version 1.0", 1991.

   [W3C-Localization]  Ishida, R. and S. Miller, "Localization vs.
                       Internationalization", W3C International/
                       questions/qa-i18n.txt, December 2005.

   [net-utf8]          Klensin, J. and M. Padlipsky, "Unicode Format for
                       Network Interchange", Work in Progress,
                       April 2006.

Authors' Addresses

   John C Klensin
   1770 Massachusetts Ave, #322
   Cambridge, MA  02140
   USA

   Phone: +1 617 491 5735
   EMail: john-ietf@jck.com


   Patrik Faltstrom
   Cisco Systems

   EMail: paf@cisco.com


   Cary Karp
   Swedish Museum of Natural History
   Box 50007
   Stockholm  SE-10405
   Sweden

   Phone: +46 8 5195 4055
   EMail: ck@nrm.museum


   IAB

   EMail: iab@iab.org


Klensin, et al.              Informational                     [Page 36]
^L
RFC 4690                 IAB -- IDN Next Steps            September 2006


Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).


Klensin, et al.              Informational                     [Page 37]
^L