Independent Submission                                    A. El-Sherbiny
Request for Comments: 5564                                      M. Farah
Category: Informational                                         UN-ESCWA
ISSN: 2070-1721                                              I. Oueichek
                                            Syrian Telecom Establishment
                                                             A. Al-Zoman
                                                          SaudiNIC, CITC
                                                           February 2010


                  Linguistic Guidelines for the Use of
                the Arabic Language in Internet Domains

Abstract

   This document constitutes technical specifications for the use of
   Arabic in Internet domain names and provides linguistic guidelines
   for Arabic domain names.  It addresses Arabic-specific linguistic
   issues pertaining to the use of the Arabic language in domain names.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This is a contribution to the RFC Series, independently of any other
   RFC stream.  The RFC Editor has chosen to publish this document at
   its discretion and makes no statement about its value for
   implementation or deployment.  Documents approved for publication by
   the RFC Editor are not a candidate for any level of Internet
   Standard; see Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc5564.
















El-Sherbiny, et al.           Informational                     [Page 1]
^L
RFC 5564               Arabic Character Guidelines         February 2010


Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

   This document may not be modified, and derivative works of it may not
   be created, except to format it for publication as an RFC or to
   translate it into languages other than English.

Table of Contents

   1. Introduction ....................................................2
   2. Arabic Language-Specific Issues .................................3
      2.1. Linguistic Issues ..........................................4
           2.1.1. Diacritics (Tashkeel) and Shadda ....................4
           2.1.2. Kasheeda or Tatweel (Horizontal Character
                  Size Extension) .....................................5
           2.1.3. Character Folding ...................................5
      2.2. Supported Character Set ....................................6
      2.3. Arabic Linguistic Issues Affected by Technical
           Constraints ................................................8
           2.3.1. Numerals ............................................8
           2.3.2. The Space Character .................................8
   3. Summary and Conclusion ..........................................8
   4. Security Considerations .........................................9
   5. Acknowledgments .................................................9
   6. References ......................................................9
      6.1. Normative References .......................................9
      6.2. Informative References .....................................9

1.  Introduction

   In March 2003, the Internet Engineering Task Force (IETF) issued a
   set of RFCs for Internationalized Domain Names (IDN) ([1], [2], and
   [3]), which were intended to become the de facto standard for all
   languages.  In 2007 and 2008, the following working drafts were
   released, proposing revisions to the IDNA protocol:

   o  Internationalized Domain Names for Applications (IDNA):
      Background, Explanation, and Rationale [5]

   o  Internationalized Domain Names in Applications (IDNA): Protocol
      [6]

   o  An updated IDNA criterion for right-to-left scripts [7]

   o  The Unicode code points and IDNA [8]

   These documents are known collectively as "IDNA2008".

   This document constitutes a technical specification for the
   implementation of the IDN standards in the case of the Arabic
   language.  It allows the use of standard language tables to write
   domain names in Arabic characters and should therefore be considered
   a logical extension to the IDN standards.  It presents guidelines
   for the proper use of Arabic characters with the IDN standards in an
   Arabic language context.

   This document reflects the recommendations of the Arab Working Group
   on Arabic Domain Names (AWG-ADN), established by the League of Arab
   States (LAS), based on standardisation efforts of the United Nations
   Economic and Social Commission for Western Asia (UN-ESCWA) and on
   that group's document, "Guidelines for an Arabic Internet Domain
   Name" [9].  This document is also in full harmony with recent
   rigorous discussions that took place within the major language
   communities that use the Arabic script in their languages.

   This document provides guidelines for the ways Arabic characters may
   be used for registering Internet domain names and how
   language-specific issues should be handled.  A few rules are
   recommended for application at the protocol level.

   The key words "MUST", "REQUIRED", "SHOULD", "RECOMMENDED", and "MAY"
   in this document are to be interpreted as described in RFC 2119 [4].

   Comments on this document are solicited and should be addressed to
   the working group's mailing list at ESCWA-ICTD@un.org and/or the
   author(s).

2.  Arabic Language-Specific Issues

   The main objective of creating Arabic domain names is to provide a
   vehicle for increasing Internet use among all strata of the
   Arabic-speaking communities.

   Conversely, a domain name that is not user-friendly would add to the
   ambiguity and eccentricity of the Internet as perceived by the
   Arabic-speaking communities, thus hindering the spread of the
   Internet and leading to further isolation of these communities at
   the global level.

   Hence, there have been intensive efforts (especially those
   spearheaded by Dr. Al-Zoman and contributed to by UN-ESCWA and its
   Arabic Domain Names Task Force (ADN-TF)) to reach consensus on a
   multitude of linguistic issues with the following goals:

   o  To define the accepted Arabic character set to be used for writing
      domain names in Arabic, which is the subject of this document.

   o  To define the top-level domains of the Arabic domain name tree
      structure (i.e., Arabic gTLDs and ccTLDs).  This goal will be
      handled in a separate document.

   The first meeting of the AWG-ADN, held in Damascus in
   January-February 2005, gave special attention to the following:

   o  Simplification of the domain names, whenever possible, to
      facilitate the interaction of the Arabic user with the Internet.

   o  Adoption of solutions that do not lead to confusion either in
      reading or in writing, provided that this does not compromise the
      linguistic correctness of the words used.

   o  Prohibition of mixing Arabic and non-Arabic letters within a
      domain name label.

2.1.  Linguistic Issues

   A number of linguistic issues have been raised with respect to the
   use of the Arabic language in domain names.  This section highlights
   some of them.  It is based on the papers of Dr. Al-Zoman ([10] and
   [11]) and on the report of the first meeting of AWG-ADN [12].  For
   details, the reader is encouraged to review these references.

2.1.1.  Diacritics (Tashkeel) and Shadda

   Tashkeel and Shadda are accent marks placed above or below Arabic
   letters to indicate proper pronunciation.  They thus serve to
   distinguish words that share the same base characters but differ in
   meaning.

   Neither Tashkeel nor Shadda is permitted in zone files when
   registering domain names in the Arabic language, although they are
   permitted in the current edition of IDNA2008.  If necessary, they
   can be supported or ignored in the user interface through local
   mappings and stripped before IDNA processing.

   The following are their Unicode code points:

      U+064B ARABIC FATHATAN
      U+064C ARABIC DAMMATAN
      U+064D ARABIC KASRATAN
      U+064E ARABIC FATHA
      U+064F ARABIC DAMMA
      U+0650 ARABIC KASRA
      U+0651 ARABIC SHADDA
      U+0652 ARABIC SUKUN
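
   As an illustration only, a user interface could strip these marks
   before IDNA processing along the following lines.  This is a minimal
   sketch, not part of the guidelines: the names TASHKEEL and
   strip_tashkeel are hypothetical, and the sample string is simply a
   vocalized word chosen for the example.

```python
# Hypothetical helper (not defined by this document): remove Tashkeel
# and Shadda from user input before the label goes to IDNA processing.

# The eight code points listed above, U+064B through U+0652.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0652 + 1)}


def strip_tashkeel(label: str) -> str:
    """Return the label with every Tashkeel/Shadda mark removed."""
    return "".join(ch for ch in label if ch not in TASHKEEL)


# MEEM+DAMMA, HAH+FATHA, MEEM+SHADDA+FATHA, DAL (a fully vocalized word).
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"

# Only the four base letters remain after stripping.
bare = strip_tashkeel(vocalized)
```

   Stripping in the user interface keeps the registered form free of
   diacritics while still letting users type them.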

2.1.2.  Kasheeda or Tatweel (Horizontal Character Size Extension)

   Kasheeda (U+0640 ARABIC TATWEEL) must not be used in Arabic domain
   names.  The Kasheeda is not a letter and has no effect on
   pronunciation.  It is used to extend the horizontal length or change
   the shape of the preceding letter for graphical purposes in Arabic
   writing.  Accordingly, it has no value for the writing of domain
   names.  The same applies to all languages using the Arabic script.
   The authors recommend that it be disallowed at the protocol level.
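
   A registry-side check for the Kasheeda might look like the sketch
   below.  The function name contains_kasheeda is an assumption made
   for illustration; the document itself does not define any interface.

```python
# Hypothetical check: detect the Kasheeda so that the label can be
# rejected rather than silently altered.

TATWEEL = "\u0640"  # ARABIC TATWEEL (Kasheeda)


def contains_kasheeda(label: str) -> bool:
    """Return True if the label contains U+0640."""
    return TATWEEL in label


# AIN, TATWEEL, REH, BEH: the same word stretched with a Kasheeda.
stretched = "\u0639\u0640\u0631\u0628"
# AIN, REH, BEH: the plain spelling.
plain = "\u0639\u0631\u0628"

stretched_has_kasheeda = contains_kasheeda(stretched)
plain_has_kasheeda = contains_kasheeda(plain)
```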

2.1.3.  Character Folding

   Character folding is the process by which multiple letters (which
   may be similar in shape) are folded into one shape.  Examples of
   such folding for Arabic characters include:

   o  Folding Teh Marbuta (U+0629) and Heh (U+0647) at the end of a word

   o  Folding different forms of Hamzah (U+0622, U+0623, U+0625, U+0627)

   o  Folding Alef Maksura (U+0649) and Yeh (U+064A) at the end of a
      word

   o  Folding Waw with Hamzah Above (U+0624) and Waw (U+0648)

   With respect to the Arabic language, character folding is not
   acceptable because it changes the meaning of words and violates
   spelling rules.  Replacing a character valid for use in domain names
   with another valid character of similar shape yields a word with a
   different meaning.  With folding, a single word would stand for
   every word that can be formed from the combinations of the folded
   characters, and those other words would be masked by it [10].

   Misspellings and handwriting errors do occur and lead to the mixing
   of different characters, although this is not the case in published
   and printed materials.  One of the motivations of this effort is to
   preserve the language, particularly given the spread of
   globalization.  Character folding works against this motivation
   because it would have a negative effect on the principles and ethics
   of the language.  Technology should work to preserve the language,
   not to destroy it.  Thus, character folding should not be allowed.
   The case of digits is treated in a separate section below.
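
   The masking effect described above can be made concrete with a small
   sketch.  It assumes the four folding groups from the examples in
   this section and, for simplicity, ignores the "at the end of a word"
   conditions.  The names FOLD_GROUPS, fold_class, and masked_spellings
   are hypothetical and exist only for this illustration.

```python
from itertools import product

# Folding groups taken from the examples in this section.  Positional
# conditions ("at the end of a word") are deliberately ignored here.
FOLD_GROUPS = [
    "\u0629\u0647",              # Teh Marbuta, Heh
    "\u0622\u0623\u0625\u0627",  # the Alef forms listed under Hamzah
    "\u0649\u064A",              # Alef Maksura, Yeh
    "\u0624\u0648",              # Waw with Hamzah Above, Waw
]


def fold_class(ch: str) -> str:
    """Return every character that ch would be confused with if folded."""
    for group in FOLD_GROUPS:
        if ch in group:
            return group
    return ch


def masked_spellings(word: str) -> set:
    """Return all spellings that folding would collapse into one name."""
    choices = [fold_class(ch) for ch in word]
    return {"".join(combo) for combo in product(*choices)}


# ALEF WITH HAMZA ABOVE, MEEM, TEH MARBUTA: four Alef forms times two
# final letters give eight distinct spellings that folding would merge.
spellings = masked_spellings("\u0623\u0645\u0629")
```

   Because folding is not allowed, each of these spellings remains a
   distinct label and is compared code point by code point.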

2.2.  Supported Character Set

   A domain name to be written in Arabic must be composed of a sequence
   of the following Unicode characters, with the FULL STOP (U+002E)
   separating the labels.  These are based on Unicode version 5.0.  The
   tables below are constructed using an inclusion-based approach.
   Thus, characters that are not part of these tables are prohibited.

             +---------+-------------------------------------+
             | Unicode | Character Name                      |
             +---------+-------------------------------------+
             | 0621    | ARABIC LETTER HAMZA                 |
             | 0622    | ARABIC LETTER ALEF WITH MADDA ABOVE |
             | 0623    | ARABIC LETTER ALEF WITH HAMZA ABOVE |
             | 0624    | ARABIC LETTER WAW WITH HAMZA ABOVE  |
             | 0625    | ARABIC LETTER ALEF WITH HAMZA BELOW |
             | 0626    | ARABIC LETTER YEH WITH HAMZA ABOVE  |
             | 0627    | ARABIC LETTER ALEF                  |
             | 0628    | ARABIC LETTER BEH                   |
             | 0629    | ARABIC LETTER TEH MARBUTA           |
             | 062A    | ARABIC LETTER TEH                   |
             | 062B    | ARABIC LETTER THEH                  |
             | 062C    | ARABIC LETTER JEEM                  |
             | 062D    | ARABIC LETTER HAH                   |
             | 062E    | ARABIC LETTER KHAH                  |
             | 062F    | ARABIC LETTER DAL                   |
             | 0630    | ARABIC LETTER THAL                  |
             | 0631    | ARABIC LETTER REH                   |
             | 0632    | ARABIC LETTER ZAIN                  |
             | 0633    | ARABIC LETTER SEEN                  |
             | 0634    | ARABIC LETTER SHEEN                 |
             | 0635    | ARABIC LETTER SAD                   |
             | 0636    | ARABIC LETTER DAD                   |
             | 0637    | ARABIC LETTER TAH                   |
             | 0638    | ARABIC LETTER ZAH                   |
             | 0639    | ARABIC LETTER AIN                   |
             | 063A    | ARABIC LETTER GHAIN                 |
             | 0641    | ARABIC LETTER FEH                   |
             | 0642    | ARABIC LETTER QAF                   |
             | 0643    | ARABIC LETTER KAF                   |
             | 0644    | ARABIC LETTER LAM                   |
             | 0645    | ARABIC LETTER MEEM                  |
             | 0646    | ARABIC LETTER NOON                  |
             | 0647    | ARABIC LETTER HEH                   |
             | 0648    | ARABIC LETTER WAW                   |
             | 0649    | ARABIC LETTER ALEF MAKSURA          |
             | 064A    | ARABIC LETTER YEH                   |
             | 0660    | ARABIC-INDIC DIGIT ZERO             |
             | 0661    | ARABIC-INDIC DIGIT ONE              |
             | 0662    | ARABIC-INDIC DIGIT TWO              |
             | 0663    | ARABIC-INDIC DIGIT THREE            |
             | 0664    | ARABIC-INDIC DIGIT FOUR             |
             | 0665    | ARABIC-INDIC DIGIT FIVE             |
             | 0666    | ARABIC-INDIC DIGIT SIX              |
             | 0667    | ARABIC-INDIC DIGIT SEVEN            |
             | 0668    | ARABIC-INDIC DIGIT EIGHT            |
             | 0669    | ARABIC-INDIC DIGIT NINE             |
             +---------+-------------------------------------+

        Source: Supporting the Arabic Language in Domain Names [10]
         Table 1: CHARACTERS FROM UNICODE ARABIC TABLE (0600-06FF)

                       +---------+-----------------+
                       | Unicode | Digit Name      |
                       +---------+-----------------+
                       | 0030    | DIGIT ZERO      |
                       | 0031    | DIGIT ONE       |
                       | 0032    | DIGIT TWO       |
                       | 0033    | DIGIT THREE     |
                       | 0034    | DIGIT FOUR      |
                       | 0035    | DIGIT FIVE      |
                       | 0036    | DIGIT SIX       |
                       | 0037    | DIGIT SEVEN     |
                       | 0038    | DIGIT EIGHT     |
                       | 0039    | DIGIT NINE      |
                       | 002D    | HYPHEN-MINUS    |
                       +---------+-----------------+

        Source: Supporting the Arabic Language in Domain Names [10]
      Table 2: CHARACTERS FROM UNICODE BASIC LATIN TABLE (0000-007F)
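
   The inclusion-based approach of Tables 1 and 2 can be sketched as a
   simple membership test.  The code point ranges below are read
   directly from the two tables; the names ALLOWED, _span, and
   disallowed_characters are assumptions made for this illustration.

```python
def _span(first: int, last: int) -> set:
    """Return the set of characters from first to last, inclusive."""
    return {chr(cp) for cp in range(first, last + 1)}


ALLOWED = (
    _span(0x0621, 0x063A)    # Table 1: HAMZA through GHAIN
    | _span(0x0641, 0x064A)  # Table 1: FEH through YEH
    | _span(0x0660, 0x0669)  # Table 1: Arabic-Indic digits
    | _span(0x0030, 0x0039)  # Table 2: ASCII digits
    | {"\u002D"}             # Table 2: HYPHEN-MINUS
)


def disallowed_characters(name: str) -> list:
    """List characters outside Tables 1 and 2, in order of appearance.

    FULL STOP (U+002E) is treated only as the label separator.
    """
    found = []
    for label in name.split("."):
        for ch in label:
            if ch not in ALLOWED and ch not in found:
                found.append(ch)
    return found


# Two labels made only of letters from Table 1.
ok_name = "\u0645\u0648\u0642\u0639.\u0645\u062B\u0627\u0644"
# A label containing TATWEEL (U+0640), which is absent from the tables.
bad_name = "\u0645\u0640\u0648"

ok_result = disallowed_characters(ok_name)
bad_result = disallowed_characters(bad_name)
```

   Anything not listed is rejected, so Tashkeel, Shadda, Kasheeda, and
   Latin letters all fail the same test without special cases.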





2.3.  Arabic Linguistic Issues Affected by Technical Constraints

   In this section, technical aspects of some linguistic issues are
   discussed.

2.3.1.  Numerals

   In the Arab countries, there are two sets of numerical digits used:

   o  Set I: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) mostly used in the western
      part of the Arab world.

   o  Set II: (U+0660, U+0661, U+0662, U+0663, U+0664, U+0665, U+0666,
      U+0667, U+0668, U+0669) mostly used in the eastern part of the
      Arab world.

   Both sets may be supported in the user interface; however, the rule
   of numeral homogeneity must be observed.  The rule specifies that
   digits from the Arabic-Indic set of numerals (U+0660 to U+0669) must
   not be mixed with ASCII digits (U+0030 to U+0039) within the same
   Arabic domain name label.  Thus, the appearance of a digit from one
   set prevents the use of any digit from the other set.
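
   The rule of numeral homogeneity lends itself to a per-label check.
   The following is a minimal sketch; the function name
   is_numerally_homogeneous is hypothetical.

```python
ARABIC_INDIC_DIGITS = {chr(cp) for cp in range(0x0660, 0x0669 + 1)}
ASCII_DIGITS = set("0123456789")


def is_numerally_homogeneous(label: str) -> bool:
    """Return False only if the label mixes digits from both sets."""
    chars = set(label)
    uses_arabic_indic = bool(chars & ARABIC_INDIC_DIGITS)
    uses_ascii = bool(chars & ASCII_DIGITS)
    return not (uses_arabic_indic and uses_ascii)


# MEEM followed by Arabic-Indic ONE and TWO: a single digit set.
same_set = is_numerally_homogeneous("\u0645\u0661\u0662")
# MEEM followed by Arabic-Indic ONE and ASCII "2": mixed digit sets.
mixed_sets = is_numerally_homogeneous("\u0645\u06612")
```

   The check is applied to each label separately, since the rule is
   stated per domain name label.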

2.3.2.  The Space Character

   The space character is strictly disallowed in domain names, as it is
   a control character.  Instead, the hyphen (Al-sharta, i.e., U+002D)
   is proposed as a separator between Arabic words to avoid the
   confusion that can arise if the words are typed without a separator.

   It is acceptable to use the hyphen to separate words within the same
   domain name label.
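
   A user interface that accepts a phrase typed with spaces could
   substitute the hyphen before building the label.  This is only a
   sketch of one possible mapping; the function name join_words is an
   assumption, and nothing in this document mandates such a mapping.

```python
def join_words(phrase: str) -> str:
    """Join whitespace-separated words with HYPHEN-MINUS (U+002D)."""
    # str.split() with no argument drops leading, trailing, and
    # repeated whitespace, so no empty words are produced.
    return "-".join(phrase.split())


# Two Arabic words typed with two spaces between them.
typed = "\u0645\u0648\u0642\u0639  \u0639\u0631\u0628\u064A"
label = join_words(typed)
```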

3.  Summary and Conclusion

   The proposed guidelines are in full accordance with the IETF IDN
   standards and take into account Arabic-language-specific issues,
   striking a compromise between the grammatical rules of the Arabic
   language and ease of use of that language on the Internet.

   In summary, the guidelines specify that, in Arabic domain names:

   o  Accent marks (Tashkeel and Shadda) are not permitted.

   o  Character folding is not permitted.

   o  If a numeral from the Arabic-Indic or ASCII digit sets appears in
      a label, numeral homogeneity is required.

   o  The space character is not permitted; where a word separator is
      needed, the hyphen must be used instead.

4.  Security Considerations

   No particular security considerations could be identified regarding
   the use of Arabic characters in writing domain names.  In particular,
   any potential visual confusion between different character strings is
   avoided using the guidelines proposed in this document.

5.  Acknowledgments

   ESCWA ICT Division provided support and funding for the development
   of this document with the objective of reaching a standard for
   comprehensive Arabic domain names.  Thanks are due to SaudiNIC for
   its continuous efforts in supporting the development of Arabic domain
   names.

   John Klensin and Harald Alvestrand reviewed the document and provided
   useful editorial and substantive support to enrich it.

6.  References

6.1.  Normative References

   [1]   Faltstrom, P., Hoffman, P., and A. Costello,
         "Internationalizing Domain Names in Applications (IDNA)", RFC
         3490, March 2003.

   [2]   Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile
         for Internationalized Domain Names (IDN)", RFC 3491, March
         2003.

   [3]   Costello, A., "Punycode: A Bootstring encoding of Unicode for
         Internationalized Domain Names in Applications (IDNA)", RFC
         3492, March 2003.

   [4]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", BCP 14, RFC 2119, March 1997.

6.2.  Informative References

   [5]   Klensin, J., "Internationalized Domain Names for Applications
         (IDNA): Definitions, Background and Rationale", Work in
         Progress, September 2008.

   [6]   Klensin, J., "Internationalized Domain Names in Applications
         (IDNA): Protocol", Work in Progress, September 2008.

   [7]   Alvestrand, H. and C. Karp, "An updated IDNA criterion for
         right-to-left scripts", Work in Progress, July 2008.

   [8]   Faltstrom, P., "The Unicode Codepoints and IDNA", Work in
         Progress, July 2008.

   [9]   United Nations Economic and Social Commission for Western Asia
         (UN-ESCWA), "Guidelines for an Arabic Domain Name System
         (ADNS)", Work in Progress, November 2007.

   [10]  Al-Zoman, A., "Supporting the Arabic Language in Domain Names",
         October 2003, <http://www.arabic-domains.org/docs/
         NIC-docs/SupportingArabicDomainNmaes.pdf>.

   [11]  Al-Zoman, A., "Arabic Top-Level Domains", Paper presented in
         Expert Group Meeting on Promotion of Digital Arabic Content,
         the United Nations, Economic and Social Commission for Western
         Asia, Beirut, June 2003.

   [12]  League of Arab States, "Report of the first meeting of AWG-ADN,
         Damascus", February 2005,
         <http://www.arabic-domains.org/ar/intrnational-entites.php>.

Authors' Addresses

   Ayman El-Sherbiny
   Information and Communication Technology Division ESCWA
   UN-House
   P.O. Box 11-8575
   Beirut
   Lebanon

   EMail: El-sherbiny@un.org


   Mansour Farah
   Information and Communication Technology Division ESCWA
   UN-House
   P.O. Box 11-8575
   Beirut
   Lebanon

   EMail: farah14@un.org


   Ibaa Oueichek
   Syrian Telecom Establishment
   Damascus
   Syria

   EMail: oueichek@scs-net.org


   Abdulaziz H. Al-Zoman, PhD
   SaudiNIC, General Directorate of Internet Services
   IT Sector, CITC
   King Abdulaziz City for Science and Technology
   PO Box 6086
   Riyadh  11442
   Saudi Arabia

   EMail: azoman@citc.gov.sa