UTF-8 vs UTF-16 vs Unicode Encoding Guide for Developers

UTF-8 vs UTF-16 vs UTF-32 explained for developers — codepoints, surrogate pairs, BOM, MySQL utf8mb4 traps, and JS length lies. Learn how to pick the right encoding.

Short answer to what most searches for utf-8 unicode encoding are actually asking: Unicode and UTF-8 are not the same thing. Unicode is a giant numbered table that assigns a codepoint (a number like U+1F600) to every character. UTF-8, UTF-16, and UTF-32 are byte representations, three different ways of turning those codepoints into bytes.

UTF-8 is the one you almost always want. It is byte-identical to ASCII for English text, scales up to four bytes per codepoint (enough to cover every emoji), and is mandated by JSON, HTML5, and most modern protocols.

This guide is for the developer who has been bitten: the MySQL Incorrect string value error on a 😀, the JavaScript surprise of "😀".length === 2, the CSV that opens fine in cat but garbled in Excel. We will walk from codepoints up through UTF-8 byte mechanics, surrogate pairs, BOMs, nine languages’ default behavior, and eight production pitfalls, then close with a decision matrix and FAQ.

Want to verify a byte sequence as you read? Paste any string into the Base64 Decoder/Encoder. The decoded payload is exactly the UTF-8 byte stream that this article explains.

Why encoding still bites you in 2026

Three scenarios, all from real bug trackers in the last twelve months:

  1. MySQL rejects an emoji. A user submits Hello 😀 and the server returns Incorrect string value: '\xF0\x9F\x98\x80'. The table is utf8, the developer thinks “that’s UTF-8, what’s wrong?”, and the answer is buried in MySQL history (covered in section 7).
  2. A character counter ships broken. A 280-character tweet validator uses text.length, accepts a message full of emoji, and the API rejects it. The reverse also happens: a valid post is refused by the front end. Symptom diagnosed in section 4.
  3. Local HTML turns “中文” into “ä¸­æ–‡”. A developer saves a file as UTF-8 with no charset declaration, opens it in a browser that guesses Windows-1252, and watches mojibake bloom. This is the BOM / charset declaration story in section 5, with parallels to the URL Encoding & Decoding Guide where the same byte-vs-character mismatch wrecks query strings.

What you get out of this guide: by the last page you will (a) distinguish Unicode from UTF-8 in one sentence, (b) pick between UTF-8, UTF-16, and UTF-32 for any new project, (c) write code that correctly counts emoji in every major language, and (d) debug any charset bug from byte stream alone. The character encoding rabbit hole is deep, but the working surface area you need day to day is small.

What is Unicode? Codepoints vs characters vs glyphs

Unicode is a character table that assigns a unique number, a codepoint such as U+1F600, to every character. UTF-8, UTF-16, and UTF-32 are encodings that translate those codepoints into bytes. Unicode itself stores no bytes; it only defines the mapping from abstract character to integer.

Four terms get tangled because they often refer to the same visible mark:

Four layers you must separate

  • Codepoint (U+0041, U+1F600): the integer Unicode assigns. The space runs from U+0000 to U+10FFFF, roughly 1.1 million slots, of which about 150,000 are currently assigned.
  • Character (or abstract character): the semantic identity, Latin capital A, grinning face emoji.
  • Glyph: the visual shape a font renders. One character has many glyphs: a serif A, an italic A, a hand-drawn A. Unicode does not care about glyphs.
  • Grapheme cluster: what a user perceives as a single “character.” Often one codepoint, sometimes several. The letter á can be one codepoint U+00E1 or two codepoints a + U+0301 (combining acute accent). The guide to character limits across Twitter, SMS, and SEO covers how each platform draws this line differently.

If you remember nothing else, remember: codepoint → encoding → bytes → rendering. Each arrow can break independently.

Codepoint notation, U+XXXX and \uXXXX

You will see codepoints written in several flavors. U+0041 is the canonical Unicode notation: four to six hex digits, prefixed U+. In source code:

  • JavaScript / JSON: "\u0041" (four hex digits, BMP only) and, in JavaScript only, "\u{1F600}" (ES6 braces, any codepoint).
  • Python: "\u0041" (4 digits), "\U00000041" (8 digits, capital U), "\N{LATIN CAPITAL LETTER A}" (by name).
  • Shell / git log / sed output: you often see raw UTF-8 bytes such as \xc3\xa9 for é. That is not a codepoint, that is the encoded form, which leads us to section 3.
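
As a quick check of these notations in JavaScript, here is a minimal sketch. The helper name toUPlus is made up for illustration and is not a built-in.

```javascript
// Escapes and the characters they denote.
const a = "\u0041";        // four hex digits, BMP only
const grin = "\u{1F600}";  // ES6 brace form, any codepoint

// Hypothetical helper: format a codepoint in U+XXXX notation (at least 4 hex digits).
function toUPlus(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

console.log(a, toUPlus(a.codePointAt(0)));           // A U+0041
console.log(grin, toUPlus(grin.codePointAt(0)));     // 😀 U+1F600
console.log(String.fromCodePoint(0x1F600) === grin); // true
```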

The 17 planes, BMP and beyond

Unicode partitions its codepoint space into 17 planes of 65,536 codepoints each, 17 × 2^16 = 1,114,112.

  • Plane 0, the Basic Multilingual Plane (BMP): U+0000 to U+FFFF. Latin, CJK ideographs, Cyrillic, Arabic, Greek, almost every script you encounter in legacy text lives here.
  • Planes 1-16, the supplementary planes: U+10000 to U+10FFFF. Most emoji (U+1F600 and friends), rare CJK characters, historical scripts (Egyptian hieroglyphs, cuneiform), musical notation.

The BMP / supplementary boundary at U+FFFF is the single most important number in this article. It is where UTF-16 stops being one code unit per character, where UTF-8 jumps from three bytes to four, and where MySQL’s misnamed utf8 character set gives up.
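
Because every plane holds exactly 2^16 codepoints, the plane number is just the codepoint shifted right by 16 bits. A small sketch, with planeOf and isBmp as illustrative names:

```javascript
// Hypothetical helper: plane number (0 to 16) of a Unicode codepoint.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode codepoint");
  }
  return codePoint >> 16; // each plane holds 2^16 codepoints
}

const isBmp = (codePoint) => planeOf(codePoint) === 0;

console.log(planeOf(0x4E2D));   // 0  (中 lives in the BMP)
console.log(planeOf(0x1F600));  // 1  (😀 lives in Plane 1)
console.log(planeOf(0x10FFFF)); // 16 (last codepoint of the last plane)
console.log(isBmp(0xFFFF), isBmp(0x10000)); // true false
```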

Quick sanity check with emoji

"a"        → 1 codepoint  U+0061             → 1 grapheme
"é" (NFC)  → 1 codepoint  U+00E9             → 1 grapheme
"é" (NFD)  → 2 codepoints U+0065 U+0301      → 1 grapheme
"😀"        → 1 codepoint  U+1F600 (Plane 1)  → 1 grapheme
"👨‍👩‍👧"      → 5 codepoints (3 people + 2 ZWJ U+200D) → 1 grapheme

The last row is where things get awkward. The family emoji is one user-perceived character, but five codepoints joined by Zero-Width Joiners. Every layer of the stack can count it differently, and section 7 trap 6 is the bug report this disagreement files.
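
The rows above can be reproduced in JavaScript. This is a sketch that assumes a runtime with Intl.Segmenter; the measure helper is hypothetical.

```javascript
// Hypothetical helper: count one string at four different layers.
function measure(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    utf8Bytes: new TextEncoder().encode(text).length,
    utf16Units: text.length,
    codepoints: [...text].length,
    graphemes: [...segmenter.segment(text)].length,
  };
}

const nfd = "e\u0301";                                    // é as e + combining acute
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // 👨‍👩‍👧

console.log(measure(nfd));                  // 2 codepoints, 1 grapheme
console.log(measure(nfd.normalize("NFC"))); // 1 codepoint, 1 grapheme
console.log(measure(family));               // 5 codepoints, 8 UTF-16 units, 1 grapheme
```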

UTF-8 encoding mechanics, how 1-4 bytes work

UTF-8 encodes Unicode codepoints in 1 to 4 bytes. The ASCII range (U+0000 to U+007F) uses 1 byte, identical to its ASCII encoding. Higher codepoints use multi-byte sequences where the first byte signals total length and every continuation byte starts with the bit pattern 10xxxxxx. This self-describing layout is the reason UTF-8 came out on top.

The byte-pattern table, UTF-8 in one diagram

Codepoint range | UTF-8 bytes | Byte pattern
--- | --- | ---
U+0000 to U+007F | 1 byte | 0xxxxxxx
U+0080 to U+07FF | 2 bytes | 110xxxxx 10xxxxxx
U+0800 to U+FFFF | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Each x is a data bit drawn from the codepoint’s binary representation. The leading 0 / 110 / 1110 / 11110 tells the decoder how many bytes total; the leading 10 marks every continuation byte. That redundancy is what makes UTF-8 self-synchronizing: drop a byte and you can resume at the next start byte instead of corrupting everything downstream.

Worked example, encoding 中 (U+4E2D)

Codepoint 0x4E2D falls in U+0800 to U+FFFF, so we use the 3-byte template.

  1. Binary: 0x4E2D = 0100 1110 0010 1101 (16 bits).
  2. Split 4-6-6 to fit the x slots: 0100 / 111000 / 101101.
  3. Substitute into 1110xxxx 10xxxxxx 10xxxxxx: 11100100 10111000 10101101.
  4. Hex: 0xE4 0xB8 0xAD.

That is exactly why 中 becomes %E4%B8%AD after URL-encoding: percent-encoding wraps each UTF-8 byte in %XX, it does not encode the codepoint directly. Section 7 trap 3 details the chain.

Worked example, encoding 😀 (U+1F600)

Codepoint 0x1F600 exceeds the BMP, so we use the 4-byte template.

  1. Binary: 0x1F600 = 0 0001 1111 0110 0000 0000 (21 bits, padded).
  2. Split 3-6-6-6: 000 / 011111 / 011000 / 000000.
  3. Substitute into 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 11110000 10011111 10011000 10000000.
  4. Hex: 0xF0 0x9F 0x98 0x80.

Those four bytes are what MySQL’s utf8 character set chokes on. It allocates three bytes per character maximum. Section 7 trap 1 has the fix.
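
Both worked examples can be reproduced with a hand-written encoder that follows the byte-pattern table. This is a sketch for illustration only (encodeCodePoint and hex are made-up names); in real code the built-in TextEncoder does this job.

```javascript
// Hypothetical helper: encode one codepoint to UTF-8 by hand.
function encodeCodePoint(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError("not an encodable Unicode scalar value");
  }
  if (cp <= 0x7F) return [cp];                                   // 0xxxxxxx
  if (cp <= 0x7FF) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 110xxxxx 10xxxxxx
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const hex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(encodeCodePoint(0x4E2D)));  // E4 B8 AD
console.log(hex(encodeCodePoint(0x1F600))); // F0 9F 98 80
// Cross-check against the built-in encoder.
console.log(hex([...new TextEncoder().encode("\u{1F600}")])); // F0 9F 98 80
```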

Why UTF-8 came out on top

ASCII compatibility. A file of pure ASCII text is identical at the byte level to its UTF-8 encoding. Decades of tools that predate Unicode (grep, awk, classic shell pipes) continue to work for that subset.

Self-synchronization. Continuation bytes always start with 10, which never collides with any start byte. Lose one byte in a network transfer and you resync at the next character boundary instead of cascading garbage down the pipe.
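
Resynchronizing takes only a bit mask: skip bytes whose top two bits are 10 until a start byte appears. A minimal sketch, with nextBoundary as an illustrative name:

```javascript
// Hypothetical helper: from an arbitrary byte offset, advance to the next byte
// that is not a continuation byte (10xxxxxx), i.e. the next character start.
function nextBoundary(bytes, offset) {
  let i = offset;
  while (i < bytes.length && (bytes[i] & 0xC0) === 0x80) i++;
  return i;
}

// Byte layout: 61 | E4 B8 AD | F0 9F 98 80 | 62
const bytes = new TextEncoder().encode("a中😀b");

// Pretend we landed in the middle of 中 (offset 2 is a continuation byte).
const start = nextBoundary(bytes, 2);
console.log(start);                                           // 4
console.log(new TextDecoder().decode(bytes.subarray(start))); // 😀b
```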

No byte order. UTF-8 is a stream of bytes, not 16-bit or 32-bit units, so endianness is irrelevant. UTF-16 and UTF-32 need a Byte Order Mark to declare which end goes first; UTF-8 does not, and usually should not (see section 5).

Invalid UTF-8, what the spec forbids

A strict decoder will reject these byte sequences:

  • 5- or 6-byte sequences. Early RFCs allowed them; RFC 3629 (2003) capped UTF-8 at 4 bytes to match the 21-bit Unicode space.
  • Overlong encodings. Encoding / as three bytes 0xE0 0x80 0xAF instead of one byte 0x2F. Once a fertile source of directory-traversal exploits in path validators that decoded after sanitizing.
  • Surrogate codepoints (U+D800 to U+DFFF). These are reserved for UTF-16 and must never appear in UTF-8, paired or not.
  • Truncated sequences. A 3-byte start byte followed by only one continuation byte, common when user input is cut at a byte boundary in the middle of a multi-byte character.
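
One way to see a strict decoder reject each of these cases is TextDecoder with the fatal option, which throws instead of substituting U+FFFD. The isValidUtf8 wrapper below is a hypothetical helper, not a platform API.

```javascript
// Hypothetical helper: true when the bytes form well-formed UTF-8.
function isValidUtf8(byteValues) {
  try {
    // fatal: true makes decode() throw a TypeError on any malformed sequence.
    new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array(byteValues));
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8([0xE4, 0xB8, 0xAD]));             // true  (中)
console.log(isValidUtf8([0xF8, 0x88, 0x80, 0x80, 0x80])); // false (5-byte form)
console.log(isValidUtf8([0xE0, 0x80, 0xAF]));             // false (overlong "/")
console.log(isValidUtf8([0xED, 0xA0, 0x80]));             // false (surrogate U+D800)
console.log(isValidUtf8([0xE4, 0xB8]));                   // false (truncated)
```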

To see any of this concretely, drop a string into the Base64 Decoder/Encoder, encode it, then decode it back as bytes. The byte array between encoder and decoder is the UTF-8 stream this section describes.

UTF-16 and surrogate pairs, why JavaScript length lies

The most common search around utf-8 vs utf-16 is really “why does "😀".length equal 2 in my code?” The answer is surrogate pairs, and it is a 1990s decision that JavaScript, Java, C#, and Windows all inherited.

UTF-16 in one paragraph

UTF-16 represents Unicode using 16-bit code units. Characters in the BMP (U+0000 to U+FFFF) take exactly one code unit. Characters in the supplementary planes (U+10000 to U+10FFFF) take two code units, called a surrogate pair: a high surrogate in U+D800 to U+DBFF followed by a low surrogate in U+DC00 to U+DFFF. That U+D800 to U+DFFF block is permanently reserved in Unicode so no real character lives there. UTF-16 is the internal string format for JavaScript, Java, C# (.NET), Windows kernel APIs, Objective-C NSString, and Qt, all designed when 65,536 characters looked like plenty.
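
The surrogate-pair arithmetic is short enough to write out: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch with illustrative helper names:

```javascript
// Hypothetical helper: split a supplementary codepoint into its surrogate pair.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError("not a supplementary-plane codepoint");
  }
  const offset = cp - 0x10000;           // 20 bits remain
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Hypothetical inverse: rebuild the codepoint from a pair.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
console.log(fromSurrogatePair(0xD83D, 0xDE00).toString(16)); // 1f600
```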

The String.length trap

"a".length          // 1   — BMP, single code unit
"é".length          // 1   — BMP (U+00E9), single code unit
"中".length         // 1   — BMP (U+4E2D), single code unit
"😀".length         // 2   — supplementary plane (U+1F600), surrogate pair!
"a😀".length        // 3   — one BMP + two surrogate units

String.prototype.length reports the number of UTF-16 code units, not the number of characters. Anything from the supplementary planes reads as 2. The same trap exists in Java’s String.length() and C#’s string.Length.

Counting codepoints correctly in JS

[..."😀"].length              // 1 — spread iterator walks codepoints
Array.from("😀").length       // 1 — Array.from also walks codepoints
"😀".match(/./gu).length      // 1 — /u flag = unicode-aware regex

// "😀".charAt(0) returns the lone high surrogate (visually broken)
"😀".codePointAt(0)           // 128512 — the full codepoint U+1F600

The spread operator and Array.from use the iterator protocol, which the language spec defines as walking codepoints. Plain index access (str[0], charAt) still returns code units and will hand you half a surrogate pair on emoji.

Python, len() already does the right thing (almost)

len("😀")           # 1   — Python 3 strings are codepoint-indexed
len("👨‍👩‍👧")        # 5   — codepoints (3 humans + 2 ZWJ), not graphemes
# Python 2 was byte-indexed by default — len("😀") returned 4

Python 3 stores strings in a flexible 1-, 2-, or 4-byte representation (PEP 393) and indexes by codepoint. len("😀") is 1, but that is still not the grapheme count: the family emoji reads as 5. To count user-perceived characters you need a grapheme library: Intl.Segmenter in JavaScript (Node 16+, all current browsers) or the grapheme or regex packages in Python. Swift is the notable exception among mainstream languages: its String.count counts graphemes by default.

UTF-16 vs UCS-2, the silent migration

Before 1996, Unicode promised to fit in 16 bits and the corresponding encoding was UCS-2, a fixed 2-byte mapping. Unicode 2.0 broke that promise by adding the supplementary planes. UTF-16 is the patched version using surrogate pairs. The JavaScript spec still cites the old UCS-2 vocabulary in places, which is why the language tolerates lone surrogates that should be illegal. The “WTF-16” jokes are real. Web platform APIs that must produce UTF-8 do not pass lone surrogates through, because a lone surrogate cannot be encoded as valid UTF-8: TextEncoder and fetch bodies replace it with U+FFFD, and encodeURIComponent throws a URIError.
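
A short sketch of how a lone surrogate behaves at the encoding boundary. The hasLoneSurrogate helper is hypothetical and works on UTF-16 code units, so the regex deliberately omits the u flag.

```javascript
const lone = "\uD83D"; // high surrogate with no partner; legal in a JS string
console.log(lone.length); // 1

// TextEncoder must emit valid UTF-8, so it substitutes U+FFFD (bytes EF BF BD).
const encodedLone = [...new TextEncoder().encode(lone)];
console.log(encodedLone); // [ 239, 191, 189 ]

// encodeURIComponent refuses instead of substituting.
let errorName = null;
try {
  encodeURIComponent(lone);
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // URIError

// Hypothetical helper: detect unpaired surrogates before they reach an encoder.
function hasLoneSurrogate(text) {
  return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(text);
}

console.log(hasLoneSurrogate(lone));        // true
console.log(hasLoneSurrogate("\u{1F600}")); // false
```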

UTF-32, BOM, and the byte order question

UTF-32, the simple, wasteful one

UTF-32 uses a fixed 4 bytes per codepoint. U+0041 is stored as 0x00000041, U+1F600 as 0x0001F600. The advantage is constant-time random access: the n-th codepoint sits at byte offset 4n. The disadvantage is size. Pure ASCII text balloons to four times its UTF-8 footprint, and BMP CJK text grows by a third relative to UTF-8 (and doubles relative to UTF-16). Almost no system stores UTF-32 on disk. Internally, Python 3 chooses 1, 2, or 4 bytes per string based on the highest codepoint; the Linux fontconfig stack uses UTF-32 for its in-memory glyph tables.
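
To make the size trade-off concrete, here is a sketch that computes the byte footprint of a string in each encoding form, ignoring any BOM. The encodedSizes helper is hypothetical.

```javascript
// Hypothetical helper: byte size of a string in each encoding form (no BOM).
function encodedSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length,
    utf16: text.length * 2,      // 2 bytes per UTF-16 code unit
    utf32: [...text].length * 4, // 4 bytes per codepoint
  };
}

console.log(encodedSizes("hello")); // { utf8: 5, utf16: 10, utf32: 20 }
console.log(encodedSizes("中文"));   // { utf8: 6, utf16: 4, utf32: 8 }
console.log(encodedSizes("😀"));    // { utf8: 4, utf16: 4, utf32: 4 }
```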

Byte order, why endianness matters for UTF-16 / UTF-32

UTF-8 is a stream of single bytes, so endianness does not apply. UTF-16 and UTF-32 operate on multi-byte units, and different CPUs disagree about which end of a number comes first.

U+0041 ('A') in UTF-16 BE → 00 41
U+0041 ('A') in UTF-16 LE → 41 00

x86 and ARM CPUs are little-endian; older PowerPC and “network byte order” are big-endian. When you write a UTF-16 file you must commit to one and tell the reader which, which is what the BOM is for.
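
DataView makes the byte-order choice explicit through its littleEndian argument, which is enough to reproduce the two layouts above. A sketch, with encodeUtf16 as a made-up helper name:

```javascript
// Hypothetical helper: encode a string as UTF-16 bytes in the requested byte order.
function encodeUtf16(text, littleEndian) {
  const view = new DataView(new ArrayBuffer(text.length * 2));
  for (let i = 0; i < text.length; i++) {
    // charCodeAt returns one UTF-16 code unit; setUint16 picks the byte order.
    view.setUint16(i * 2, text.charCodeAt(i), littleEndian);
  }
  return new Uint8Array(view.buffer);
}

console.log([...encodeUtf16("A", false)]); // [ 0, 65 ]  big-endian: 00 41
console.log([...encodeUtf16("A", true)]);  // [ 65, 0 ]  little-endian: 41 00

// Reading back requires naming the same byte order.
console.log(new TextDecoder("utf-16le").decode(encodeUtf16("A😀", true))); // A😀
```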

The BOM, what it is, when to use

A Byte Order Mark is U+FEFF placed at the start of a file. Encoded, it announces both the encoding and (for UTF-16 / UTF-32) the byte order.

Encoding | BOM bytes
--- | ---
UTF-8 | EF BB BF
UTF-16 BE | FE FF
UTF-16 LE | FF FE
UTF-32 BE | 00 00 FE FF
UTF-32 LE | FF FE 00 00

The UTF-8 BOM exists, but it carries no byte-order information because UTF-8 has no byte order. Its only job is to declare “this file is UTF-8”, useful for tools that have no other signal, harmful for tools that expect the file to begin with a magic number or directive.
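
The BOM table translates directly into a sniffing routine. This is a sketch (sniffBom is a hypothetical helper); note that the UTF-32 LE signature has to be tested before UTF-16 LE because both begin with FF FE.

```javascript
// Hypothetical helper: identify a BOM at the start of a byte array.
function sniffBom(bytes) {
  const startsWith = (signature) => signature.every((b, i) => bytes[i] === b);
  if (startsWith([0x00, 0x00, 0xFE, 0xFF])) return "UTF-32BE";
  // Ambiguity: FF FE 00 00 could also be a UTF-16 LE BOM followed by U+0000;
  // this sketch resolves it in favor of UTF-32 LE.
  if (startsWith([0xFF, 0xFE, 0x00, 0x00])) return "UTF-32LE";
  if (startsWith([0xEF, 0xBB, 0xBF])) return "UTF-8";
  if (startsWith([0xFE, 0xFF])) return "UTF-16BE";
  if (startsWith([0xFF, 0xFE])) return "UTF-16LE";
  return null; // no BOM: fall back to headers, declarations, or detection
}

console.log(sniffBom([0xEF, 0xBB, 0xBF, 0x68, 0x69])); // UTF-8
console.log(sniffBom([0xFF, 0xFE, 0x41, 0x00]));       // UTF-16LE
console.log(sniffBom([0x68, 0x69]));                   // null
```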

BOM decision matrix, should I add it?

Format | UTF-8 BOM | UTF-16 BOM | UTF-32 BOM
--- | --- | --- | ---
HTML | No (breaks <!doctype> detection in old parsers) | |
JSON | No (RFC 8259 forbids it) | |
JavaScript / CSS source | Avoid (older Node and IE choke) | |
CSV opened in Excel | Yes (Excel reads non-BOM UTF-8 as ANSI and mangles CJK) | |
XML | Optional (XML declaration already states encoding) | Required | Required
Plain text .txt | Optional (older Windows Notepad adds one when saving as UTF-8) | Required | Required

Short rule: drop the UTF-8 BOM from anything served on the web; add it to CSVs you want Excel to open; let the reader decide for everything else.
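
In code, the short rule comes down to two small helpers: one that prepends U+FEFF for Excel-bound CSV and one that strips it from text headed elsewhere. Both names are hypothetical.

```javascript
const BOM = "\uFEFF";

// Hypothetical helper: UTF-8 bytes for a CSV that Excel should open correctly.
function toExcelCsvBytes(csvText) {
  const text = csvText.startsWith(BOM) ? csvText : BOM + csvText;
  return new TextEncoder().encode(text); // output begins with EF BB BF
}

// Hypothetical helper: drop a leading BOM from text headed for the web or a JSON parser.
function stripBom(text) {
  return text.startsWith(BOM) ? text.slice(1) : text;
}

const csvBytes = toExcelCsvBytes("名前,都市\n太郎,東京\n");
console.log([...csvBytes.subarray(0, 3)]);  // [ 239, 187, 191 ]
console.log(stripBom(BOM + '{"ok":true}')); // {"ok":true}
```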

9 languages side-by-side, default encoding behavior

Cross-language work is where this knowledge pays off. The same string "a😀é" reports a length of 4 in JavaScript, 3 in Python 3, and 7 in Go, depending on whether the runtime counts UTF-16 code units, codepoints, or bytes.

The cross-language behavior table

Language | Source file encoding | String storage | length / len counts | Default I/O encoding | 4-byte emoji safe?
--- | --- | --- | --- | --- | ---
JavaScript (V8 / SpiderMonkey) | UTF-8 | UTF-16 | UTF-16 code units | UTF-8 (Node, Web) | Yes, but .length === 2
Python 3 | UTF-8 (PEP 3120) | dynamic 1 / 2 / 4 byte (PEP 393) | codepoints | locale-dependent (UTF-8 on most Unix systems); UTF-8 mode via PEP 540 (3.7+) | Yes, len === 1
Java | UTF-8 (javac default) | UTF-16 | UTF-16 code units | platform charset → UTF-8 (JEP 400, JDK 18+) | Yes, but .length() === 2
Go | UTF-8 | UTF-8 bytes | bytes (utf8.RuneCountInString for codepoints) | UTF-8 | Yes, len(s) returns bytes
Rust | UTF-8 | UTF-8 bytes (String invariant) | .len() bytes, .chars().count() codepoints | UTF-8 | Yes, explicit
C# (.NET) | UTF-8 (default since .NET Core 3.0) | UTF-16 | UTF-16 code units | UTF-8 (Encoding.Default since .NET 5) | Yes, but .Length === 2
Ruby | UTF-8 (since 2.0) | per-string encoding tag | codepoints (.length) | UTF-8 | Yes, length === 1
PHP | (no source encoding) | byte string | bytes (strlen); mb_strlen for codepoints | depends on default_charset | Yes, with mb_* family
MySQL | | column charset | bytes (LENGTH), characters (CHAR_LENGTH) | character_set_* system vars | Only with utf8mb4

How to read the table

There are three philosophies in play, each with its own failure mode:

  • UTF-8 internal (Go, Rust, Ruby). The native string is UTF-8 bytes. In Go and Rust the built-in length is a byte count; Ruby’s length counts codepoints. Convert to codepoints or graphemes only when you cross a UI or validation boundary.
  • UTF-16 internal (JavaScript, Java, C#). Inherited from 1990s assumptions; length is code units, so a surrogate pair counts as 2. Use codepoint-aware iteration for any user-facing count.
  • Codepoint-indexed (Python 3). len gives codepoints, which feels right until you meet ZWJ emoji, at which point you still need a grapheme library.

PHP is the special case. Its built-in str* functions all operate on bytes, treating UTF-8 sequences as opaque blobs. Every non-ASCII project must use the mb_* (multibyte) family, and the recurring bug reports show how often that gets missed.

Working rule: keep UTF-8 as the wire format everywhere (files, HTTP bodies, database columns) and convert to your runtime’s native string type at the boundary. This is the “UTF-8 sandwich” we return to in section 8.

8 real-world encoding pitfalls: Mojibake, utf8mb4, and charset detection

The patterns below come up in every code review on a globalized codebase.

Trap 1: MySQL utf8 is a 3-byte lie, switch to utf8mb4

Symptom. INSERT INTO users (bio) VALUES ('Hello 😀'); returns Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio'.

Root cause. MySQL’s historical utf8 is an alias for utf8mb3: a UTF-8 variant capped at three bytes per character. Any codepoint above U+FFFF (most emoji, tens of thousands of rare CJK characters, most historical scripts) requires four UTF-8 bytes and is rejected.

Fix.

ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
SET NAMES utf8mb4;  -- client connection

# my.cnf (server configuration file, not SQL)
[mysqld]
character-set-server = utf8mb4
collation-server     = utf8mb4_unicode_ci

MySQL 8.0 still ships utf8 as the utf8mb3 alias. utf8mb3 is deprecated but not yet removed. Use utf8mb4 for every new column, every new database, every new connection. There is no upside to the legacy variant.

Trap 2: Windows-1252 fallback, the question mark mystery

Symptom. A .txt exported from a Windows colleague’s Notepad reads “smart quotes” and an em dash on their machine. On your server it becomes ? or U+FFFD (replacement character).

Root cause. Older Notepad defaults to Windows-1252 (CP-1252), which encodes the curly quote “ as 0x93. A UTF-8 decoder sees 0x93 as a stray continuation byte (top two bits 10) without a preceding start byte and substitutes the replacement character.

Fix. Detect the source encoding (file on Unix, chardet / charset-normalizer in Python, jschardet in Node), decode with the correct codec, then re-encode as UTF-8 before saving. Standardizing on UTF-8 at ingestion stops the recurrence.
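
When a detection library is not available, a common fallback is to try strict UTF-8 first and treat a failure as the legacy code page. The sketch below assumes the only other candidate is Windows-1252 and that the runtime's TextDecoder supports that label (browsers do; Node needs full ICU, which official builds include). The decodeLegacyText helper is hypothetical, and this is a heuristic, not real detection.

```javascript
// Hypothetical helper: strict UTF-8 first, Windows-1252 as the assumed fallback.
function decodeLegacyText(bytes) {
  try {
    const text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return { text, encoding: "utf-8" };
  } catch {
    const text = new TextDecoder("windows-1252").decode(bytes);
    return { text, encoding: "windows-1252" };
  }
}

// "café" as saved by a Windows-1252 editor: é is the single byte 0xE9.
const legacy = new Uint8Array([0x63, 0x61, 0x66, 0xE9]);
const { text, encoding } = decodeLegacyText(legacy);
console.log(encoding, text); // windows-1252 café

// Re-encode as UTF-8 before saving so the problem does not recur.
const utf8 = new TextEncoder().encode(text);
console.log([...utf8]); // [ 99, 97, 102, 195, 169 ]
```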

Trap 3: URL percent-encoding ≠ UTF-8 (but builds on it)

Symptom. fetch("/search?q=中文") returns 404 from one backend framework and works from another.

Root cause. Percent-encoding operates on bytes, not on codepoints. 中 is one codepoint but three UTF-8 bytes (E4 B8 AD), each separately percent-encoded as %E4%B8%AD, nine ASCII characters in the URL. A framework that decodes the URL as Latin-1 instead of UTF-8 will hand the handler the three garbled bytes interpreted as three single-byte characters.

Fix. Use encodeURIComponent("中文") on the client (browsers do UTF-8 + percent-encode in one step) and confirm the server framework decodes URLs as UTF-8 (all modern frameworks default to it). For visual confirmation, paste 中文 into the URL Decoder/Encoder and watch it become %E4%B8%AD%E6%96%87. The full chain is covered in the URL Encoding & Decoding Guide.
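
The two decoding paths can be compared side by side. In this sketch, percentDecodeToBytes is a hypothetical helper that assumes well-formed input; real code should use decodeURIComponent or the framework's own parser.

```javascript
// Hypothetical helper: turn a percent-encoded string into raw bytes.
// Assumes well-formed input in which every non-ASCII byte is percent-encoded.
function percentDecodeToBytes(encoded) {
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    if (encoded[i] === "%") {
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return Uint8Array.from(bytes);
}

const encoded = encodeURIComponent("中文");
console.log(encoded); // %E4%B8%AD%E6%96%87

const rawBytes = percentDecodeToBytes(encoded);
console.log(new TextDecoder("utf-8").decode(rawBytes)); // 中文

// What a Latin-1 decoder produces: one character per byte, six in total.
const asLatin1 = String.fromCharCode(...rawBytes);
console.log(asLatin1.length); // 6
```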

Trap 4: Base64 input is bytes, but you typed a string

Symptom. btoa("你好") throws InvalidCharacterError: The string contains characters outside the Latin1 range.

Root cause. btoa was designed in the ASCII / Latin-1 era. It expects each input character to fit in a single byte (codepoints 0-255). 你好 is UTF-16 in the JS engine with codepoints U+4F60 U+597D, both well above 255.

Fix. Encode to UTF-8 bytes first, then Base64-encode those bytes.

// Wrong:
btoa("你好");  // throws

// Correct:
const bytes = new TextEncoder().encode("你好");
// Uint8Array(6) [228, 189, 160, 229, 165, 189]
const b64 = btoa(String.fromCharCode(...bytes));
// "5L2g5aW9"

The longer story is in Understanding Base64 and the Base64 Complete Guide; the Base64 Decoder/Encoder does the conversion in one step and shows the intermediate byte stream.
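
For completeness, a round-trip sketch that also covers decoding. The helper names are hypothetical; the loop replaces the spread call because spreading a very large byte array into String.fromCharCode can exceed the engine's argument limit.

```javascript
// Hypothetical helper: string → UTF-8 bytes → Base64.
function utf8ToBase64(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b); // one char per byte
  return btoa(binary);
}

// Hypothetical helper: Base64 → bytes → UTF-8 string.
function base64ToUtf8(b64) {
  const bytes = Uint8Array.from(atob(b64), (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64("你好"));      // 5L2g5aW9
console.log(base64ToUtf8("5L2g5aW9")); // 你好
```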

Trap 5: String.length for validation (Twitter / SMS limits)

Symptom. A 280-character composer validates client-side, then the API returns 422. Or the reverse, a perfectly fine post is refused by the client.

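Root cause. JavaScript’s .length counts UTF-16 code units, so a single supplementary-plane emoji counts as 2. The server counts by its own rule, which may be codepoints, graphemes, or a weighted scheme (Twitter’s twitter-text rules, for example, weight characters rather than simply counting code units). The character count is wrong in opposite directions depending on which side you trust.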

Fix. Use [...text].length for codepoint count, or Intl.Segmenter for true grapheme count (the Bluesky / iMessage approach). Platform-by-platform numbers and SMS GSM-7 versus UCS-2 boundaries are catalogued in the character & word limits guide for Twitter, SMS, and Instagram.
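
A validator can make the counting unit an explicit parameter so that client and server agree on it. A sketch with hypothetical names; which unit is correct depends on what the receiving API documents.

```javascript
// Hypothetical helper: count a string in the unit the receiving API uses.
function countBy(text, unit) {
  switch (unit) {
    case "utf16":
      return text.length;
    case "codepoint":
      return [...text].length;
    case "grapheme": {
      const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
      return [...seg.segment(text)].length;
    }
    default:
      throw new Error(`unknown unit: ${unit}`);
  }
}

function fitsLimit(text, limit, unit) {
  return countBy(text, unit) <= limit;
}

const post = "😀".repeat(150);
console.log(fitsLimit(post, 280, "utf16"));     // false (300 code units)
console.log(fitsLimit(post, 280, "codepoint")); // true  (150 codepoints)
console.log(fitsLimit(post, 280, "grapheme"));  // true  (150 graphemes)
```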

Trap 6: ZWJ emoji families count as N codepoints, 1 grapheme

Symptom. "👨‍👩‍👧".length === 8. Counting codepoints gives 5. To the user it is one image.

Root cause. Zero-Width Joiner (U+200D) glues multiple emoji codepoints into a single rendered cluster, three person emoji plus two ZWJs equals five codepoints, eight UTF-16 code units, one grapheme.

Fix.

const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...seg.segment("👨‍👩‍👧")].length;  // 1

Intl.Segmenter is in Node 16+ and every current browser. For older runtimes, the grapheme-splitter package implements UAX #29.

Trap 7: JSON \uXXXX escape, codepoints above U+FFFF need a surrogate pair

Symptom. A JSON payload contains "\uD83D\uDE00" and the receiving decoder either renders it correctly as 😀 or shows two box characters, depending on whether it understands surrogate pairs in JSON.

Root cause. JSON’s \uXXXX escape only accepts exactly four hex digits, i.e. one UTF-16 code unit. Encoding 😀 (U+1F600) requires the surrogate pair \uD83D\uDE00. There is no \u{...} brace syntax in JSON.

Fix. Either accept the surrogate pair (every spec-compliant parser handles it) or write the emoji literally. JSON allows any UTF-8 character outside the escape syntax, and most modern parsers prefer that form.
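
Both forms can be checked against JavaScript's built-in JSON parser, which also confirms that the brace syntax is rejected:

```javascript
const escaped = '"\\uD83D\\uDE00"'; // JSON text containing the escaped surrogate pair
const literal = '"😀"';             // JSON text containing the raw character

console.log(JSON.parse(escaped) === "\u{1F600}"); // true
console.log(JSON.parse(literal) === "\u{1F600}"); // true
console.log(JSON.stringify("\u{1F600}"));         // "😀" (written literally)

// The ES6 brace form is valid JavaScript but not valid JSON.
let braceError = null;
try {
  JSON.parse('"\\u{1F600}"');
} catch (err) {
  braceError = err.name;
}
console.log(braceError); // SyntaxError
```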

Trap 8: HTTP Content-Type: charset= defaults are not what you think

Symptom. A UTF-8 HTML page renders as Mojibake in one browser and correctly in another.

Root cause. RFC 2616 originally mandated ISO-8859-1 as the default for text/* responses with no explicit charset. RFC 7231 (2014) removed that default, leaving each browser to guess. Some sniff content, some fall back to UTF-8, some default to the system locale.

Fix. Always send Content-Type: text/html; charset=utf-8 from the server and <meta charset="utf-8"> in the document head. Either alone works; sending both is the better-safe-than-sorry option for legacy proxies that strip headers.
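
On the server side, the fix might look like the handler below. It is a sketch written in the shape Node's http.createServer expects; the handler name and page content are made up for illustration.

```javascript
// Hypothetical page; the meta tag repeats the charset for clients that lose the header.
const PAGE =
  '<!doctype html><html><head><meta charset="utf-8"><title>中文</title></head>' +
  "<body>Hello 😀</body></html>";

// Hypothetical request handler: declares the charset and sends UTF-8 bytes.
function handleRequest(req, res) {
  const body = new TextEncoder().encode(PAGE); // serialize as UTF-8 bytes
  res.writeHead(200, {
    "Content-Type": "text/html; charset=utf-8",
    "Content-Length": body.length, // byte length, not PAGE.length
  });
  res.end(body);
}
```

Passing handleRequest to http.createServer and calling listen on a port of your choice would serve the page. Content-Length is computed from the encoded bytes because PAGE.length counts UTF-16 code units and would truncate the response.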

To watch any of these traps live at the byte level, the Base64 Decoder/Encoder is the fastest microscope: paste a string, encode to Base64, and the decoded payload is the UTF-8 stream.

Choosing the right encoding, decision matrix

For the utf-8 vs utf-16 question, the answer is almost always UTF-8. The table below covers the edge cases.

Decision matrix

Scenario | Pick | Why
--- | --- | ---
Web pages, API JSON, source files | UTF-8 (no BOM) | ASCII-compatible, no byte order, smallest for Latin text, RFC 8259 mandates UTF-8 for JSON
Heavy CJK storage (Chinese DB, Japanese game data) | UTF-8 (utf8mb4) | UTF-8 uses 3 bytes per CJK character vs UTF-16’s 2, but ASCII overhead from markup and JSON keys still leaves UTF-8 ahead in practice, and the surrounding ecosystem is UTF-8
Windows native API, legacy Java / C# code | UTF-16 | Platform default; converting at every API call invites bugs
Index-heavy in-memory text processing | UTF-32 | Constant-time codepoint access; worth it only for parser hot paths
CSV opened in Excel on Windows | UTF-8 with BOM | Excel reads BOM-less UTF-8 as ANSI and mangles CJK headers
New project, no constraints | UTF-8 (no BOM) | The encoding wars settled years ago

Two rules of thumb

  1. Default to UTF-8 everywhere unless a platform forces otherwise. The W3C, IETF, and Unicode Consortium all agree.
  2. Convert at the boundary, not in the middle. Decode bytes to your language’s native string type on ingest. Operate on strings, never bytes, in business logic. Encode back to UTF-8 on output. This “UTF-8 sandwich” removes the entire class of mid-pipeline mojibake bugs.
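
A minimal sketch of the sandwich for a single pipeline stage. The function name and the username scenario are hypothetical; the point is the shape: bytes in, strings in the middle, bytes out.

```javascript
// Hypothetical pipeline stage: bytes in, bytes out, strings in between.
function normalizeUsername(inputBytes) {
  // 1. Decode at the boundary; fail loudly on invalid UTF-8.
  const text = new TextDecoder("utf-8", { fatal: true }).decode(inputBytes);
  // 2. Business logic works on strings only.
  const cleaned = text.trim().normalize("NFC");
  // 3. Encode at the boundary.
  return new TextEncoder().encode(cleaned);
}

const input = new TextEncoder().encode("  Jose\u0301  "); // NFD é plus padding
const output = normalizeUsername(input);
console.log(new TextDecoder().decode(output)); // José
console.log(output.length);                    // 5
```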

Frequently asked questions

Is UTF-8 always backward compatible with ASCII?

Yes. Any valid ASCII file is bit-identical to its UTF-8 representation. The first 128 codepoints (U+0000 to U+007F) encode as a single byte with the high bit clear. Legacy ASCII-only tools (early grep, sed, classic shell pipes) process pure-ASCII UTF-8 files without modification. Trouble starts only when non-ASCII bytes (high bit set) enter the stream.

Should I use UTF-8 BOM in my files?

Default to no. HTML, JSON, JavaScript, and CSS files break or warn in some parsers when a BOM appears at the start. The standard exception is CSV intended for Excel on Windows. Without the BOM, Excel guesses ANSI and mangles Chinese, Japanese, or Korean headers. See the BOM decision matrix in section 5.

Why does "😀".length === 2 in JavaScript?

JavaScript strings are stored as UTF-16, and .length returns the number of code units, not characters. 😀 (U+1F600) lives in the supplementary plane and requires a surrogate pair, two 16-bit code units, so .length is 2. Use [..."😀"].length, Array.from("😀").length, or Intl.Segmenter for a true count.

What’s the difference between Unicode and UTF-8?

Unicode is the character table that assigns a codepoint (a number like U+1F600) to every character. UTF-8 is one of several encodings that translate those codepoints into bytes (1 to 4 bytes per codepoint). Unicode defines what a character is; UTF-8 defines how it travels through a file or network. UTF-16 and UTF-32 are alternative encodings of the same Unicode table.

Is utf8mb4 always safer than utf8 in MySQL?

Yes for new projects. MySQL’s utf8 is the misnamed 3-byte-limited variant utf8mb3, which cannot store any character above U+FFFF (most emoji, many rare CJK characters, most historical scripts). utf8mb4 is full 4-byte UTF-8. The one caveat is index length: each utf8mb4 character may take 4 bytes, so the legacy 767-byte InnoDB index prefix limit caps an indexed column at 191 characters (767 / 4, rounded down). The limit rises to 3072 bytes with innodb_large_prefix, which is on by default from MySQL 5.7.7 and is standard behavior in 8.0.

How do I detect the encoding of an unknown file?

Use file on Unix, chardet or charset-normalizer in Python, or jschardet in Node. None are perfect; they statistically guess from byte distribution. UTF-8 detection is highly reliable thanks to the continuation-byte pattern. Windows-1252, ISO-8859-1, and other single-byte legacy encodings are nearly indistinguishable from each other, so detection often comes down to language heuristics.

Can UTF-16 represent every Unicode character?

Yes. UTF-16 covers all 1,114,112 codepoints. BMP characters (U+0000 to U+FFFF) use one 16-bit code unit (2 bytes), and supplementary plane characters (U+10000 to U+10FFFF) use surrogate pairs (4 bytes). Coverage is identical to UTF-8 and UTF-32; only the byte layout and processing semantics differ. The choice between them is about ecosystem fit, not capability.
