Commit f98c3c59 authored by redshiftzero, committed by Cheryl Sabella

docs 36789: resolve incorrect note regarding UTF-8 (GH-13111)

parent af8646c8
@@ -135,17 +135,22 @@ used than UTF-8.) UTF-8 uses the following rules:
 UTF-8 has several convenient properties:
 1. It can handle any Unicode code point.
-2. A Unicode string is turned into a sequence of bytes containing no embedded zero
-   bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
-   processed by C functions such as ``strcpy()`` and sent through protocols that
-   can't handle zero bytes.
+2. A Unicode string is turned into a sequence of bytes that contains embedded
+   zero bytes only where they represent the null character (U+0000). This means
+   that UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent
+   through protocols that can't handle zero bytes for anything other than
+   end-of-string markers.
 3. A string of ASCII text is also valid UTF-8 text.
 4. UTF-8 is fairly compact; the majority of commonly used characters can be
    represented with one or two bytes.
 5. If bytes are corrupted or lost, it's possible to determine the start of the
    next UTF-8-encoded code point and resynchronize. It's also unlikely that
    random 8-bit data will look like valid UTF-8.
+6. UTF-8 is a byte oriented encoding. The encoding specifies that each
+   character is represented by a specific sequence of one or more bytes. This
+   avoids the byte-ordering issues that can occur with integer and word oriented
+   encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending
+   on the hardware on which the string was encoded.
 References
......
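The two claims this change touches, item 2 (zero bytes appear only for U+0000) and item 6 (no byte-order dependence), can be checked with a short Python sketch. It is an illustration only, not part of the commit; the helper name `zero_byte_positions` is hypothetical and the sample strings are arbitrary.

```python
def zero_byte_positions(text, encoding="utf-8"):
    """Return the indexes of zero bytes in the encoded form of text.

    Hypothetical helper, written only to illustrate the doc change.
    """
    return [i for i, b in enumerate(text.encode(encoding)) if b == 0]


# Item 2: non-null characters never produce a zero byte in UTF-8 ...
no_nul = zero_byte_positions("caf\u00e9 \u20ac")

# ... but U+0000 itself is encoded as a single zero byte.
with_nul = zero_byte_positions("a\x00b")

# By contrast, UTF-16 produces zero bytes even for plain ASCII text.
utf16_ascii = zero_byte_positions("A", "utf-16-le")

# Item 6: UTF-8 yields one fixed byte sequence for a code point, while
# UTF-16 yields different sequences depending on the byte order chosen.
euro = "\u20ac"
euro_utf8 = euro.encode("utf-8")
euro_utf16_le = euro.encode("utf-16-le")
euro_utf16_be = euro.encode("utf-16-be")

print(no_nul, with_nul, utf16_ascii)
print(euro_utf8, euro_utf16_le, euro_utf16_be)
```

The explicit `utf-16-le` and `utf-16-be` codecs are used here so the result does not depend on the machine running the example; the plain `utf-16` codec prepends a byte order mark and uses the platform's native order.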