basis/unicode/UCA/CollationTest.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
       "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Language" content="en-us">
<link rel="stylesheet" href="https://www.unicode.org/reports/reports-v2.css" type="text/css">
<title>UCA Conformance Tests</title>
</head>

<body>

<table class="header">
  <tbody><tr>
    <td class="icon"><a href="https://www.unicode.org">
    <img alt="[Unicode]" src="https://www.unicode.org/webscripts/logo60s2.gif" align="middle" border="0" height="33" width="34"></a>&nbsp;&nbsp;
    <a class="bar" href="https://www.unicode.org/reports/tr10/">Unicode Collation Algorithm</a></td>
  </tr>
  <tr>
    <td class="gray">&nbsp;</td>
  </tr>
</tbody></table>
<div class="body">
  <h1>Unicode® Collation Algorithm<br>Conformance Tests</h1>
  <h2 align="center">Version 15.0.0<br>2022-08-12</h2>
<p>The following files provide conformance tests for the Unicode Collation Algorithm
  (<a href="https://www.unicode.org/reports/tr10/tr10-47.html">UTS #10: Unicode Collation Algorithm</a>).</p>
  <ul>
    <li>CollationTest_SHIFTED.txt</li>
    <li>CollationTest_NON_IGNORABLE.txt</li>
    <li>CollationTest_SHIFTED_SHORT.txt</li>
    <li>CollationTest_NON_IGNORABLE_SHORT.txt</li>
  </ul>
  <p>These files are large, and thus packaged in zip format to save download time.</p>

  <blockquote>
    <p><b>Note:</b> These files test the sort order of an untailored DUCET table.
    If you are using an implementation of the
    <a href="https://www.unicode.org/reports/tr35/tr35-collation.html#CLDR_Collation_Algorithm">CLDR Collation Algorithm</a>
    with its <a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation">tailored root collation data</a>,
    for example ICU or a library that uses ICU for collation,
    then you need to test with files that reflect that sort order.
    The CLDR collation conformance test files have
    the same names (except for an added _CLDR infix)
    and structures as the ones here for the DUCET.
    You can find them in the <a href="https://github.com/unicode-org/cldr/tree/main/common/uca">CLDR GitHub repo in the folder “common/uca”</a>,
    or in the <a href="https://www.unicode.org/Public/cldr/">CLDR data file download area</a>,
    in the “cldr-common-*.zip” file, again in the folder “common/uca”.
    Select the files for the version of CLDR that is used in the implementation.</p>
  </blockquote>

<h2>Format</h2>
  <p>There are four different files:</p>
  <ul>
    <li>The shifted vs non-ignorable files correspond to the two alternate
      <a href="https://www.unicode.org/reports/tr10/tr10-47.html#Variable_Weighting">Variable Weighting</a> values.</li>
    <li>The SHORT versions omit the comments, for more compact storage.</li>
  </ul>
<p>The format is illustrated by the following example:</p>
  <pre>0385 0021;  # (΅) GREEK DIALYTIKA TONOS  [0316 015D | 0020 0032 0020 | 0002 0002 0002 |]</pre>
  <p>The part before the semicolon is the hex representation of a sequence of Unicode code points.
  After the hash mark is a comment. This comment is purely informational, and may change in the
  future. Currently it consists of the characters of the sequence in parentheses,
  the name of the first code point, and a representation of
  the sort key for the sequence.</p>
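  <p>As an illustration of the line format described above, the following Python sketch converts one line
  into a string. The function name <code>parse_test_line</code> is hypothetical, and the sketch assumes
  that blank lines and comment-only lines carry no test data and can be skipped.</p>

```python
from typing import Optional


def parse_test_line(line: str) -> Optional[str]:
    """Return the test string encoded on one line, or None if the line has no data."""
    # Drop the informational comment (everything after the hash mark), if present.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None  # blank line or comment-only line (assumed to be skippable)
    # The part before the semicolon is a space-separated list of hex code points.
    hex_field = data.split(";", 1)[0]
    return "".join(chr(int(code_point, 16)) for code_point in hex_field.split())


example = "0385 0021;  # (΅) GREEK DIALYTIKA TONOS  [0316 015D | 0020 0032 0020 | 0002 0002 0002 |]"
parsed = parse_test_line(example)
```

  <p>Because a Python <code>str</code> is a sequence of code points, this sketch also accepts lines
  that contain surrogate code points.</p>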
  <p>The sort key representation is in square brackets. It uses a vertical bar for the ZERO
  separator. Between the bars are the primary, secondary, tertiary, and quaternary weights (if any),
  in hex.</p>
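  <p>For debugging, the bracketed sort key in a comment can be split into one list of weights per level.
  The following Python sketch assumes the comment layout shown in the example line, which may change
  in the future; the function name <code>parse_sort_key</code> is hypothetical.</p>

```python
import re
from typing import List, Optional


def parse_sort_key(line: str) -> Optional[List[List[int]]]:
    """Return the weights of the bracketed sort key, one list per level."""
    # Assumption: the sort key is the last bracketed group on the line.
    match = re.search(r"\[([^\]]*)\]\s*$", line)
    if match is None:
        return None  # for example, a line from a SHORT file, which has no comment
    # Each vertical bar stands for the ZERO separator between two levels.
    levels = match.group(1).split("|")
    return [[int(weight, 16) for weight in level.split()] for level in levels]


example = "0385 0021;  # (΅) GREEK DIALYTIKA TONOS  [0316 015D | 0020 0032 0020 | 0002 0002 0002 |]"
sort_key = parse_sort_key(example)
```

  <p>An empty final list means that nothing follows the last separator, as in the example line.</p>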
  <blockquote>
    <p><b>Note:</b> The sort key is purely informational. UCA does <i>not</i>
    require the production of any particular sort key, as long as the results of comparisons
    match.</p>
  </blockquote>

  <h2>Testing</h2>
  <p>The files are designed so that each line sorts as greater than or equal to
  the previous one, when using the UCA and the
  <a href="https://www.unicode.org/reports/tr10/tr10-47.html#Default_Unicode_Collation_Element_Table">Default
  Unicode Collation Element Table</a>.
  A test program can read in each line, compare it to
  the previous line, and signal an error if the order is not correct. The exact comparison that should be
  used is as follows:</p>
  <ol>
    <li>Read the next line.</li>
    <li>Parse the sequence up to the semicolon, and convert it into a Unicode string.</li>
    <li>Compare that string with the string on the previous line, according to the UCA
    implementation, with strength = identical level (using S3.10).</li>
    <li>If the previous string is greater than the current string, then stop with an error.</li>
    <li>Continue to the next line (step 1).</li>
  </ol>
  <p>If there are any errors, then the UCA implementation is not conformant.</p>
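  <p>The test procedure above can be sketched in Python as follows. The sketch is not tied to any
  particular collation library: <code>compare</code> is a hypothetical callback that must be supplied by
  the UCA implementation under test and must return a negative, zero, or positive number.
  <code>code_point_compare</code> is only a stand-in so that the sketch runs; it does not implement the UCA.</p>

```python
from typing import Callable, Iterable, Optional, Tuple


def first_order_violation(
    lines: Iterable[str],
    compare: Callable[[str, str], int],
) -> Optional[Tuple[int, str, str]]:
    """Return (line_number, previous, current) for the first misordered pair, else None."""
    previous = None
    for line_number, line in enumerate(lines, start=1):
        # Steps 1 and 2: read the line and parse the hex sequence before the semicolon.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # assumption: blank and comment-only lines are skipped
        current = "".join(chr(int(cp, 16)) for cp in data.split(";", 1)[0].split())
        # Steps 3 and 4: stop with an error if the previous string is greater.
        if previous is not None and compare(previous, current) > 0:
            return (line_number, previous, current)
        # Step 5: continue with the next line.
        previous = current
    return None


def code_point_compare(a: str, b: str) -> int:
    """Stand-in comparator (plain code point order), NOT a UCA comparison."""
    return (a > b) - (a < b)


violation = first_order_violation(["0061;", "0062;", "0061;"], code_point_compare)
```

  <p>In a real test run, <code>lines</code> would be the contents of one of the test files, and
  <code>compare</code> would apply strength = identical with the variable weighting setting that matches the file.</p>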
  <p>These files contain test cases that include ill-formed strings, with surrogate code points.
  Implementations that do not weight surrogate code points the same way as reserved code points
  may filter out such lines in the test cases before testing for conformance.</p>
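  <p>One possible way to filter out such test cases is sketched below in Python. The function names are
  hypothetical, and the input is assumed to be a list of strings that have already been parsed from the test file.</p>

```python
from typing import Iterable, List


def contains_surrogate(text: str) -> bool:
    """Return True if the string contains a surrogate code point (U+D800..U+DFFF)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def without_surrogate_cases(strings: Iterable[str]) -> List[str]:
    """Keep only the test strings that contain no surrogate code points."""
    return [text for text in strings if not contains_surrogate(text)]


kept = without_surrogate_cases(["a", "\ud800a", "b\udfff", "\U0001F600"])
```

  <p>Removing lines does not disturb the test, because the remaining lines are still in non-decreasing order.</p>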

  <h2>Migration</h2>
  <h3>Tie-breaker</h3>
  <p>Beginning with UCA 6.2,
  the test data strings are compared with strength = identical,
  using UCA S3.10 as a tie-breaker, which compares the NFD forms of the strings in code point order.
  Before UCA 6.2, the test files did not use strength = identical,
  and instead used a comparison of the unnormalized strings as the tie-breaker.<br>
  Therefore, implementations that use the UCA test files to test
  multiple versions of UCA need to use different tie-breaker comparisons
  depending on the UCA version.</p>
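  <p>A version-dependent tie-breaker might look like the following Python sketch. It would be called only
  when the weights at all collation levels compare equal. The function name and the
  <code>uca_version</code> parameter are hypothetical, and the sketch assumes that the older tie-breaker
  also used code point order, which the text above does not state.</p>

```python
import unicodedata
from typing import Tuple


def tie_break(a: str, b: str, uca_version: Tuple[int, int]) -> int:
    """Break a tie between two strings whose collation weights are all equal."""
    if uca_version >= (6, 2):
        # UCA 6.2 and later: S3.10 compares the NFD forms of the strings.
        a = unicodedata.normalize("NFD", a)
        b = unicodedata.normalize("NFD", b)
    # Python compares str values code point by code point.
    # Assumption: before UCA 6.2 the unnormalized strings were compared the same way.
    return (a > b) - (a < b)


new_result = tie_break("\u00e9", "e\u0301", (6, 2))
old_result = tie_break("\u00e9", "e\u0301", (6, 1))
```

  <p>The two calls differ because U+00E9 and the sequence 0065 0301 have the same NFD form but different
  unnormalized code points.</p>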

  <h3>Discontiguous contractions</h3>
  <p>Test data files for UCA 6.1 and earlier versions were generated with code that
  had a bug in the contraction matching.
  In that code, matches for certain contractions of Tibetan characters were found
  despite intervening combining marks,
  so that some test cases were not in proper order according to the UCA and the DUCET.
  UCA 6.2 test files omitted the relevant test cases.
  For UCA 6.3, the test data generation code was fixed and those test cases were restored.</p>

  <p>For example, in the defective test data generation code,
  the strings 0FB2 0F80 0F71 0334 and 0F77 0334 compared equal.
  (U+0F77 is the TIBETAN VOWEL SIGN VOCALIC RR.)
  However, UCA processing with the DUCET will not find the contraction 0FB2 0F71 0F80:</p>
  <ul>
    <li>UCA Step 1 normalizes 0FB2 0F80 0F71 0334 to 0FB2 0334 0F71 0F80.</li>
    <li>Step 2.1 only finds a match for S=0FB2.</li>
    <li>S2.1.1 loops over each of the following three characters C,
    but there is no table entry for any of those three S+C.
    In particular, there is no DUCET mapping for 0FB2+0F71
    (see <i><a href="https://www.unicode.org/reports/tr10/tr10-47.html#Well_Formed_DUCET">Tibetan and
    Well-Formedness of DUCET</a></i>).</li>
    <li>The loop exits without finding any match beyond S=0FB2.</li>
  </ul>
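  <p>The normalization in Step 1 of the example above can be reproduced with Python's standard
  <code>unicodedata</code> module, as in the following sketch. It covers only the reordering of the combining
  marks, not the contraction matching, and the helper name <code>hex_sequence</code> is hypothetical.</p>

```python
import unicodedata


def hex_sequence(text: str) -> str:
    """Format a string as space-separated hex code points, as in the test files."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


original = "\u0FB2\u0F80\u0F71\u0334"

# Canonical combining class of each character; NFD sorts each run of
# characters with nonzero classes into ascending class order.
combining_classes = [unicodedata.combining(ch) for ch in original]

normalized = unicodedata.normalize("NFD", original)
print(hex_sequence(original), "->", hex_sequence(normalized))
```

  <p>After normalization, U+0334 sits between U+0FB2 and the two vowel signs, which is why the matching
  in Step 2.1 has to consider discontiguous sequences at all.</p>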

  <p>See “Also note that the Algorithm employs two distinct contraction matching methods:”
  at the end of <i>Section 7.2,
  <a href="https://www.unicode.org/reports/tr10/tr10-47.html#Step_2">Produce Collation Element Arrays</a></i>.</p>

  <hr width="50%">
  <p class="copyright">© 2022 Unicode, Inc. All Rights Reserved.
  The Unicode Consortium makes no expressed or implied warranty
  of any kind, and assumes no liability for errors or omissions. No liability
  is assumed for incidental and consequential damages in connection with or arising
  out of the use of the information or programs contained or accompanying this
  technical report. The Unicode
  <a href="https://www.unicode.org/copyright.html">Terms of Use</a> apply.
  </p>
  <p class="copyright">Unicode and the Unicode logo are trademarks
  of Unicode, Inc., and are registered in some jurisdictions.</p>
</div>

</body></html>