UTF-8 vs UTF-16 vs Unicode Encoding Guide for Developers

Short answer to what most searches for utf-8 unicode encoding are actually asking: Unicode and UTF-8 are not the same thing. Unicode is a giant numbered table that assigns a codepoint (a number like U+1F600) to every character. UTF-8, UTF-16, and UTF-32 are byte representations, three different ways of turning those codepoints into bytes.

UTF-8 is the one you almost always want. It is byte-identical to ASCII for English text, needs at most four bytes per codepoint (which covers every emoji codepoint), and is mandated by JSON, HTML5, and most modern protocols.

This guide is for the developer who has been bitten: the MySQL Incorrect string value error on a 😀, the JavaScript surprise of "😀".length === 2, the CSV that opens fine in cat but garbled in Excel. We will walk from codepoints up through UTF-8 byte mechanics, surrogate pairs, BOMs, nine languages’ default behavior, and eight production pitfalls, then close with a decision matrix and FAQ.

Want to verify a byte sequence as you read? Paste any string into the Base64 Decoder/Encoder. The decoded payload is exactly the UTF-8 byte stream that this article explains.

Why encoding still bites you in 2026

Three scenarios, all from real bug trackers in the last twelve months:

MySQL rejects an emoji. A user submits Hello 😀 and the server returns Incorrect string value: '\xF0\x9F\x98\x80'. The table is utf8, the developer thinks “that’s UTF-8, what’s wrong?”, and the answer is buried in MySQL history (covered in section 7).
A character counter ships broken. A 280-character tweet validator uses text.length, accepts a message full of emoji, and the API rejects it. The reverse also happens: a valid post is refused by the front end. Symptom diagnosed in section 4.
Local HTML turns into “ä¸æ–‡”. A developer saves a file as UTF-8 with no charset declaration, opens it in a browser that guesses Windows-1252, and watches Mojibake bloom. This is the BOM / charset declaration story in section 5, with parallels to the URL Encoding & Decoding Guide where the same byte-vs-character mismatch wrecks query strings.

What you get out of this guide: by the last page you will (a) distinguish Unicode from UTF-8 in one sentence, (b) pick between UTF-8, UTF-16, and UTF-32 for any new project, (c) write code that correctly counts emoji in every major language, and (d) debug any charset bug from byte stream alone. The character encoding rabbit hole is deep, but the working surface area you need day to day is small.

What is Unicode? Codepoints vs characters vs glyphs

Unicode is a character table that assigns a unique number, a codepoint such as U+1F600, to every character. UTF-8, UTF-16, and UTF-32 are encodings that translate those codepoints into bytes. Unicode itself stores no bytes; it only defines the mapping from abstract character to integer.

Four terms get tangled because they often refer to the same visible mark:

Four layers you must separate

Codepoint (U+0041, U+1F600): the integer Unicode assigns. The space runs from U+0000 to U+10FFFF, roughly 1.1 million slots, of which about 150,000 are currently assigned.
Character (or abstract character): the semantic identity, Latin capital A, grinning face emoji.
Glyph: the visual shape a font renders. One character has many glyphs: a serif A, an italic A, a hand-drawn A. Unicode does not care about glyphs.
Grapheme cluster: what a user perceives as a single “character.” Often one codepoint, sometimes several. The letter á can be one codepoint U+00E1 or two codepoints a + U+0301 (combining acute accent). The character limits across Twitter, SMS, and SEO covers how each platform draws this line differently.

If you remember nothing else, remember: codepoint → encoding → bytes → rendering. Each arrow can break independently.

Codepoint notation, `U+XXXX` and `\uXXXX`

You will see codepoints written in several flavors. U+0041 is the canonical Unicode notation: four to six hex digits, prefixed U+. In source code:

JavaScript / JSON: "\u0041" (four hex digits, BMP only) and "\u{1F600}" (ES6 braces, any codepoint; JavaScript only, not valid in JSON).
Python: "\u0041" (4 digits), "\U00000041" (8 digits, capital U), "\N{LATIN CAPITAL LETTER A}" (by name).
Shell / git log / sed output: you often see raw UTF-8 bytes such as \xc3\xa9 for é. That is not a codepoint, that is the encoded form, which leads us to section 3.

The 17 planes, BMP and beyond

Unicode partitions its codepoint space into 17 planes of 65,536 codepoints each, 17 × 2^16 = 1,114,112.

Plane 0, the Basic Multilingual Plane (BMP): U+0000 to U+FFFF. Latin, CJK ideographs, Cyrillic, Arabic, Greek, almost every script you encounter in legacy text lives here.
Planes 1-16, the supplementary planes: U+10000 to U+10FFFF. Most emoji (U+1F600 and friends), rare CJK characters, historical scripts (Egyptian hieroglyphs, cuneiform), musical notation.

The BMP / supplementary boundary at U+FFFF is the single most important number in this article. It is where UTF-16 stops being one code unit per character, where UTF-8 jumps from three bytes to four, and where MySQL’s misnamed utf8 character set gives up.

Quick sanity check with emoji

"a"        → 1 codepoint  U+0061             → 1 grapheme
"é" (NFC)  → 1 codepoint  U+00E9             → 1 grapheme
"é" (NFD)  → 2 codepoints U+0065 U+0301      → 1 grapheme
"😀"        → 1 codepoint  U+1F600 (Plane 1)  → 1 grapheme
"👨‍👩‍👧"      → 5 codepoints (3 people + 2 ZWJ U+200D) → 1 grapheme

The last row is where things get awkward. The family emoji is one user-perceived character, but five codepoints joined by Zero-Width Joiners. Every layer of the stack can count it differently, and section 7 trap 6 is the bug report this disagreement files.
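The rows above can be reproduced with the standard library. This is a small sketch, not a reference implementation; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(text: str) -> list[str]:
    """Return the codepoints of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

nfc = unicodedata.normalize("NFC", "e\u0301")   # composed form
nfd = unicodedata.normalize("NFD", "\u00e9")    # decomposed form
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man ZWJ woman ZWJ girl

print(describe(nfc))     # ['U+00E9']
print(describe(nfd))     # ['U+0065', 'U+0301']
print(describe(family))  # ['U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467']
print(nfc == nfd)        # False: same grapheme, different codepoint sequences
```

The final comparison is the practical consequence: two strings that look identical on screen compare unequal until both are normalized to the same form.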

UTF-8 encoding mechanics, how 1-4 bytes work

UTF-8 encodes Unicode codepoints in 1 to 4 bytes. Codepoints in the ASCII range (U+0000–U+007F) take 1 byte, and that byte is identical to the ASCII encoding. Higher codepoints use multi-byte sequences where the first byte signals total length and every continuation byte starts with the bit pattern 10xxxxxx. This self-describing layout is the reason UTF-8 came out on top.

The byte-pattern table, UTF-8 in one diagram

Codepoint range	UTF-8 bytes	Byte pattern
`U+0000` – `U+007F`	1 byte	`0xxxxxxx`
`U+0080` – `U+07FF`	2 bytes	`110xxxxx 10xxxxxx`
`U+0800` – `U+FFFF`	3 bytes	`1110xxxx 10xxxxxx 10xxxxxx`
`U+10000` – `U+10FFFF`	4 bytes	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

Each x is a data bit drawn from the codepoint’s binary representation. The leading 0 / 110 / 1110 / 11110 tells the decoder how many bytes total; the leading 10 marks every continuation byte. That redundancy is what makes UTF-8 self-synchronizing: drop a byte and you can resume at the next start byte instead of corrupting everything downstream.

Worked example, encoding `中` (U+4E2D)

Codepoint 0x4E2D falls in U+0800–U+FFFF, so we use the 3-byte template.

Binary: 0x4E2D = 0100 1110 0010 1101 (16 bits).
Split 4-6-6 to fit the x slots: 0100 / 111000 / 101101.
Substitute into 1110xxxx 10xxxxxx 10xxxxxx: 11100100 10111000 10101101.
Hex: 0xE4 0xB8 0xAD.

That is exactly why 中 becomes %E4%B8%AD after URL-encoding: percent-encoding wraps each UTF-8 byte in %XX, it does not encode the codepoint directly. Section 7 trap 3 details the chain.

Worked example, encoding `😀` (U+1F600)

Codepoint 0x1F600 exceeds the BMP, so we use the 4-byte template.

Binary: 0x1F600 = 0 0001 1111 0110 0000 0000 (21 bits, padded).
Split 3-6-6-6: 000 / 011111 / 011000 / 000000.
Substitute into 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 11110000 10011111 10011000 10000000.
Hex: 0xF0 0x9F 0x98 0x80.

Those four bytes are what MySQL’s utf8 character set chokes on. It allocates three bytes per character maximum. Section 7 trap 1 has the fix.
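Both worked examples can be checked mechanically. The function below is a hand-rolled sketch of the byte-pattern table, written for illustration only (the name `utf8_encode` is hypothetical); in real code `str.encode("utf-8")` does the same job.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one codepoint by hand, following the byte-pattern table."""
    if 0xD800 <= codepoint <= 0xDFFF or not 0 <= codepoint <= 0x10FFFF:
        raise ValueError(f"not an encodable codepoint: {codepoint:#x}")
    if codepoint <= 0x7F:                       # 0xxxxxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0x3F)])
    if codepoint <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0x3F),
                      0b10000000 | (codepoint & 0x3F)])
    return bytes([0b11110000 | (codepoint >> 18),   # 11110xxx + 3 continuations
                  0b10000000 | ((codepoint >> 12) & 0x3F),
                  0b10000000 | ((codepoint >> 6) & 0x3F),
                  0b10000000 | (codepoint & 0x3F)])

print(utf8_encode(0x4E2D).hex(" "))   # e4 b8 ad
print(utf8_encode(0x1F600).hex(" "))  # f0 9f 98 80
```

Each branch shifts the codepoint right to pick out a 6-bit group, masks it with 0x3F, and ORs in the marker bits from the table.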

Why UTF-8 came out on top

ASCII compatibility. A file of pure ASCII text is identical at the byte level to its UTF-8 encoding. Decades of tools that predate Unicode (grep, awk, classic shell pipes) continue to work for that subset.

Self-synchronization. Continuation bytes always start with 10, which never collides with any start byte. Lose one byte in a network transfer and you resync at the next character boundary instead of cascading garbage down the pipe.

No byte order. UTF-8 is a stream of bytes, not 16-bit or 32-bit units, so endianness is irrelevant. UTF-16 and UTF-32 need a Byte Order Mark to declare which end goes first; UTF-8 does not, and usually should not (see section 5).

Invalid UTF-8, what the spec forbids

A strict decoder will reject these byte sequences:

5- or 6-byte sequences. Early RFCs allowed them; RFC 3629 (2003) capped UTF-8 at 4 bytes to match the 21-bit Unicode space.
Overlong encodings. Encoding / as three bytes 0xE0 0x80 0xAF instead of one byte 0x2F. Once a fertile source of directory-traversal exploits in path validators that decoded after sanitizing.
Surrogate codepoints (U+D800–U+DFFF). These are reserved for UTF-16 and must never appear in UTF-8, whether lone or paired.
Truncated sequences. A 3-byte start byte followed by only one continuation byte, common when user input is cut at a byte boundary in the middle of a multi-byte character.
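Python's built-in UTF-8 codec is strict by default, so it makes a convenient way to test the four cases above. The sample byte strings and the helper name `is_valid_utf8` are illustrative choices, not taken from any specification.

```python
samples = {
    "5-byte sequence": b"\xf8\x88\x80\x80\x80",
    "overlong '/'": b"\xe0\x80\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "truncated 3-byte sequence": b"\xe4\xb8",
}

def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes under the strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False

for label, data in samples.items():
    print(f"{label}: {is_valid_utf8(data)}")   # False for all four
```

When rejecting input outright is too harsh, `data.decode("utf-8", errors="replace")` substitutes U+FFFD instead of raising.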

To see any of this concretely, drop a string into the Base64 Decoder/Encoder, encode it, then decode it back as bytes. The byte array between encoder and decoder is the UTF-8 stream this section describes.

UTF-16 and surrogate pairs, why JavaScript `length` lies

The most common search around utf-8 vs utf-16 is really “why does "😀".length equal 2 in my code?” The answer is surrogate pairs, and it is a 1990s decision that JavaScript, Java, C#, and Windows all inherited.

UTF-16 in one paragraph

UTF-16 represents Unicode using 16-bit code units. Characters in the BMP (U+0000–U+FFFF) take exactly one code unit. Characters in the supplementary planes (U+10000–U+10FFFF) take two code units, called a surrogate pair: a high surrogate in U+D800–U+DBFF followed by a low surrogate in U+DC00–U+DFFF. That U+D800–U+DFFF block is permanently reserved in Unicode so no real character lives there. UTF-16 is the internal string format for JavaScript, Java, C# (.NET), Windows kernel APIs, Objective-C NSString, and Qt, all designed when 65,536 characters looked like plenty.
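The surrogate-pair arithmetic is short enough to write out. A minimal sketch, with the hypothetical helper `to_surrogate_pair`, cross-checked against Python's own UTF-16 encoder:

```python
def to_surrogate_pair(codepoint: int) -> tuple[int, int]:
    """Split a supplementary-plane codepoint into (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only supplementary-plane codepoints need a pair")
    offset = codepoint - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")                 # D83D DE00
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00
```

Subtracting 0x10000 is what makes the 20 remaining bits fit exactly into two 10-bit halves, one per surrogate.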

The `String.length` trap

"a".length          // 1   — BMP, single code unit
"é".length          // 1   — BMP (U+00E9), single code unit
"中".length         // 1   — BMP (U+4E2D), single code unit
"😀".length         // 2   — supplementary plane (U+1F600), surrogate pair!
"a😀".length        // 3   — one BMP + two surrogate units

String.prototype.length reports the number of UTF-16 code units, not the number of characters. Anything from the supplementary plane reads as 2. The same trap exists in Java’s String.length() and C#’s string.Length.

Counting codepoints correctly in JS

[..."😀"].length              // 1 — spread iterator walks codepoints
Array.from("😀").length       // 1 — Array.from also walks codepoints
"😀".match(/./gu).length      // 1 — /u flag = unicode-aware regex

// "😀".charAt(0) returns the lone high surrogate (visually broken)
"😀".codePointAt(0)           // 128512 — the full codepoint U+1F600

The spread operator and Array.from use the iterator protocol, which the language spec defines as walking codepoints. Plain index access (str[0], charAt) still returns code units and will hand you half a surrogate pair on emoji.

Python, `len()` already does the right thing (almost)

len("😀")           # 1   — Python 3 strings are codepoint-indexed
len("👨‍👩‍👧")        # 5   — codepoints (3 humans + 2 ZWJ), not graphemes
# Python 2 was byte-indexed by default — len("😀") returned 4

Python 3 stores strings in a flexible 1-, 2-, or 4-byte representation (PEP 393) and indexes by codepoint. len("😀") is 1, but it is still not the grapheme count. The family emoji still reads as 5. To count user-perceived characters you need a grapheme library: Intl.Segmenter in JavaScript (Node 16+, all current browsers), or the third-party grapheme or regex packages in Python. Swift is the only mainstream language whose default string length (String.count) counts graphemes.

UTF-16 vs UCS-2, the silent migration

Before 1996, Unicode promised to fit in 16 bits and the corresponding encoding was UCS-2, a fixed 2-byte mapping. Unicode 2.0 broke that promise by adding the supplementary planes. UTF-16 is the patched version using surrogate pairs. The JavaScript spec still cites the old UCS-2 vocabulary in places, which is why the language tolerates lone surrogates that should be illegal. The “WTF-16” jokes are real. Web platform APIs that encode to UTF-8 (fetch bodies, TextEncoder) do not pass lone surrogates through: they substitute U+FFFD, because a lone surrogate cannot be encoded as valid UTF-8.

UTF-32, BOM, and the byte order question

UTF-32, the simple, wasteful one

UTF-32 uses a fixed 4 bytes per codepoint. U+0041 is stored as 0x00000041, U+1F600 as 0x0001F600. The advantage is constant-time random access: the n-th codepoint sits at byte offset 4n. The disadvantage is size. Pure ASCII text balloons to four times its UTF-8 footprint, and CJK text grows by a third over UTF-8 (4 bytes against 3) and doubles relative to UTF-16. Almost no system stores UTF-32 on disk. Internally, Python 3 chooses 1, 2, or 4 bytes per string based on the highest codepoint; the Linux fontconfig stack uses UTF-32 for its in-memory glyph tables.
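The size trade-off is easy to measure. The sketch below encodes two arbitrary sample strings (chosen here for illustration) in all three forms; the `-le` codec names are used so that no BOM is added to the count.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of text in each encoding, without a BOM."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_sizes("hello world"))
# {'utf-8': 11, 'utf-16-le': 22, 'utf-32-le': 44}
print(encoded_sizes("中文编码指南"))
# {'utf-8': 18, 'utf-16-le': 12, 'utf-32-le': 24}
```

For the six-character CJK sample, UTF-32 takes 24 bytes against 18 for UTF-8 and 12 for UTF-16.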

Byte order, why endianness matters for UTF-16 / UTF-32

UTF-8 is a stream of single bytes, so endianness does not apply. UTF-16 and UTF-32 operate on multi-byte units, and different CPUs disagree about which end of a number comes first.

U+0041 ('A') in UTF-16 BE → 00 41
U+0041 ('A') in UTF-16 LE → 41 00

x86 and ARM CPUs are little-endian; older PowerPC and “network byte order” are big-endian. When you write a UTF-16 file you must commit to one and tell the reader which, which is what the BOM is for.

The BOM, what it is, when to use

A Byte Order Mark is U+FEFF placed at the start of a file. Encoded, it announces both the encoding and (for UTF-16 / UTF-32) the byte order.

Encoding	BOM bytes
UTF-8	`EF BB BF`
UTF-16 BE	`FE FF`
UTF-16 LE	`FF FE`
UTF-32 BE	`00 00 FE FF`
UTF-32 LE	`FF FE 00 00`

The utf-8 BOM exists, but it carries no byte-order information because UTF-8 has no byte order. Its only job is to declare “this file is UTF-8”, useful for tools that have no other signal, harmful for tools that expect the file to begin with a magic number or directive.
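Python ships the BOM byte sequences as constants in the `codecs` module, and its `utf-8-sig` codec shows what "declaring UTF-8" means in practice. A short sketch; the sample payload `id,name` is made up.

```python
import codecs

print(codecs.BOM_UTF8.hex(" "))      # ef bb bf
print(codecs.BOM_UTF16_BE.hex(" "))  # fe ff
print(codecs.BOM_UTF16_LE.hex(" "))  # ff fe
print(codecs.BOM_UTF32_BE.hex(" "))  # 00 00 fe ff
print(codecs.BOM_UTF32_LE.hex(" "))  # ff fe 00 00

raw = codecs.BOM_UTF8 + "id,name".encode("utf-8")
print(repr(raw.decode("utf-8")))      # '\ufeffid,name'  (BOM survives as U+FEFF)
print(repr(raw.decode("utf-8-sig")))  # 'id,name'        (BOM stripped)
```

The plain `utf-8` codec keeps the BOM as an invisible first character, which is how it ends up glued to the first column name or the first JSON token.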

BOM decision matrix, should I add it?

Format	UTF-8 BOM	UTF-16 BOM	UTF-32 BOM
HTML	No (breaks `<!doctype>` detection in old parsers)	—	—
JSON	No (RFC 8259 forbids it)	—	—
JavaScript / CSS source	Avoid (older Node and IE choke)	—	—
CSV opened in Excel	Yes (Excel reads non-BOM UTF-8 as ANSI and mangles CJK)	—	—
XML	Optional (XML declaration already states encoding)	Required	Required
Plain text `.txt`	Optional (Windows Notepad added one by default before Windows 10 version 1903; newer builds save BOM-less UTF-8)	Required	Required

Short rule: drop the UTF-8 BOM from anything served on the web; add it to CSVs you want Excel to open; let the reader decide for everything else.
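For the Excel case, one way to add the BOM is to build the CSV as text and encode it with `utf-8-sig`. This is a sketch under that assumption; the function name `csv_bytes_for_excel` and the sample rows are hypothetical.

```python
import csv
import io

def csv_bytes_for_excel(rows: list[list[str]]) -> bytes:
    """Serialize rows as CSV and prepend a UTF-8 BOM for Excel."""
    buffer = io.StringIO(newline="")    # let the csv module control line endings
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8-sig")  # writes EF BB BF once, at the start

payload = csv_bytes_for_excel([["名前", "city"], ["山田", "東京"]])
print(payload[:3].hex(" "))  # ef bb bf
```

When writing straight to disk, `open(path, "w", encoding="utf-8-sig", newline="")` has the same effect.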

9 languages side-by-side, default encoding behavior

Cross-language work is where this knowledge pays off. The same string "a😀é" (with é as the single codepoint U+00E9) reports a length of 3, 4, or 7 depending on which runtime your Bash script calls: codepoints, UTF-16 code units, or UTF-8 bytes.

The cross-language behavior table

Language	Source file encoding	String storage	`length` / `len` counts	Default I/O encoding	4-byte emoji safe?
JavaScript (V8 / SpiderMonkey)	UTF-8	UTF-16	UTF-16 code units	UTF-8 (Node, Web)	Yes, but `.length === 2`
Python 3	UTF-8 (PEP 3120)	dynamic 1 / 2 / 4 byte (PEP 393)	codepoints	locale-dependent; UTF-8 mode is opt-in (PEP 540, 3.7+) and becomes the default in 3.15 (PEP 686)	Yes, `len === 1`
Java	UTF-8 (javac default since JDK 18, JEP 400)	UTF-16	UTF-16 code units	platform charset → UTF-8 (JEP 400, JDK 18+)	Yes, but `.length() === 2`
Go	UTF-8	UTF-8 bytes	bytes (`utf8.RuneCountInString` for codepoints)	UTF-8	Yes, `len(s)` returns bytes
Rust	UTF-8	UTF-8 bytes (`String` invariant)	`.len()` bytes, `.chars().count()` codepoints	UTF-8	Yes, explicit
C# (.NET)	UTF-8 (default since .NET Core 3.0)	UTF-16	UTF-16 code units	UTF-8 (`Encoding.Default` since .NET 5)	Yes, but `.Length === 2`
Ruby	UTF-8 (since 2.0)	per-string encoding tag	codepoints (`.length`)	UTF-8	Yes, `length === 1`
PHP	(no source encoding)	byte string	bytes (`strlen`); `mb_strlen` for codepoints	depends on `default_charset`	Yes, with `mb_*` family
MySQL	—	column charset	bytes (`LENGTH`), characters (`CHAR_LENGTH`)	`character_set_*` system vars	Only with `utf8mb4`

How to read the table

There are three philosophies in play, each with its own failure mode:

UTF-8 internal (Go, Rust, Ruby). The native string is UTF-8 bytes. Go and Rust report length in bytes; Ruby reports codepoints and exposes the byte count through .bytesize. Either way the count is well-defined but is only what it says it is. Convert to codepoints or graphemes only when you cross a UI or validation boundary.
UTF-16 internal (JavaScript, Java, C#). Inherited from 1990s assumptions; length is code units, so a surrogate pair counts as 2. Use codepoint-aware iteration for any user-facing count.
Codepoint-indexed (Python 3). len gives codepoints, which feels right until you meet ZWJ emoji, at which point you still need a grapheme library.

PHP is the special case. Its built-in str* functions all operate on bytes, treating UTF-8 sequences as opaque blobs. Every non-ASCII project must use the mb_* (multibyte) family, and the recurring bug reports show how often that gets missed.

Working rule: keep UTF-8 as the wire format everywhere (files, HTTP bodies, database columns) and convert to your runtime’s native string type at the boundary. This is the “UTF-8 sandwich” we return to in section 8.
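The three counting conventions in the table can all be computed from one Python string, which is handy when a length limit must agree across services written in different languages. A minimal sketch; the helper name `lengths` is made up.

```python
def lengths(text: str) -> dict[str, int]:
    """Length of text as different runtimes would report it."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),            # Go len(s), Rust .len(), PHP strlen
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # JS .length, Java .length(), C# .Length
        "codepoints": len(text),                            # Python len, Ruby .length
    }

print(lengths("a\U0001F600\u00e9"))
# {'utf8_bytes': 7, 'utf16_units': 4, 'codepoints': 3}
```

None of the three is a grapheme count; that still needs a UAX #29 implementation.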

8 real-world encoding pitfalls: Mojibake, utf8mb4, and charset detection

The patterns below come up in every code review on a globalized codebase.

Trap 1: MySQL `utf8` is a 3-byte lie, switch to `utf8mb4`

Symptom. INSERT INTO users (bio) VALUES ('Hello 😀'); returns Incorrect string value: '\xF0\x9F\x98\x80' for column 'bio'.

Root cause. MySQL’s historical utf8 is an alias for utf8mb3: a UTF-8 variant capped at three bytes per character. Any codepoint above U+FFFF (every emoji, several thousand rare CJK characters, all historical scripts) requires four UTF-8 bytes and is rejected.

Fix.

ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
SET NAMES utf8mb4;  -- client connection

# my.cnf
[mysqld]
character-set-server = utf8mb4
collation-server     = utf8mb4_unicode_ci

MySQL 8.0 still ships utf8 as the utf8mb3 alias. utf8mb3 is deprecated but not yet removed. Use utf8mb4 for every new column, every new database, every new connection. There is no upside to the legacy variant.

Trap 2: Windows-1252 fallback, the question mark mystery

Symptom. A .txt exported from a Windows colleague’s Notepad reads "smart quotes" and an em dash on their machine. On your server it becomes ? or U+FFFD (replacement character).

Root cause. Older Notepad defaults to Windows-1252 (CP-1252), which encodes the left curly quote “ as 0x93. A UTF-8 decoder sees 0x93 (binary 10010011) as a stray continuation byte, top bits 10, with no preceding start byte, and substitutes the replacement character.

Fix. Detect the source encoding (file on Unix, chardet / charset-normalizer in Python, jschardet in Node), decode with the correct codec, then re-encode as UTF-8 before saving. Standardizing on UTF-8 at ingestion stops the recurrence.
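The failure and the repair look like this in Python. The sample bytes are invented to mimic what a Windows-1252 editor would save, and the sketch assumes the source encoding is already known to be CP-1252.

```python
raw = b"\x93smart quotes\x94 \x97 em dash"   # bytes as saved in Windows-1252

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)                        # invalid start byte

lossy = raw.decode("utf-8", errors="replace")  # U+FFFD wherever a byte was invalid
text = raw.decode("cp1252")                    # correct codec: quotes and dash restored
utf8_bytes = text.encode("utf-8")              # re-encode once, store as UTF-8
print(utf8_bytes.decode("utf-8") == text)      # True
```

The `errors="replace"` path is the lossy one: once the bytes have been turned into U+FFFD, the original punctuation cannot be recovered.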

Trap 3: URL percent-encoding ≠ UTF-8 (but builds on it)

Symptom. fetch("/search?q=中文") returns 404 from one backend framework and works from another.

Root cause. Percent-encoding operates on bytes, not on codepoints. 中 is one codepoint but three UTF-8 bytes (E4 B8 AD), each separately percent-encoded as %E4%B8%AD, nine ASCII characters in the URL. A framework that decodes the URL as Latin-1 instead of UTF-8 will hand the handler the three garbled bytes interpreted as three single-byte characters.

Fix. Use encodeURIComponent("中文") on the client (browsers do UTF-8 + percent-encode in one step) and confirm the server framework decodes URLs as UTF-8 (all modern frameworks default to it). For visual confirmation, paste 中文 into the URL Decoder/Encoder and watch it become %E4%B8%AD%E6%96%87. The full chain is covered in the URL Encoding & Decoding Guide.
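The same chain can be reproduced on the server side with `urllib.parse`. The Latin-1 call below simulates a misconfigured framework; it is an illustration, not the behavior of any specific one.

```python
from urllib.parse import quote, unquote

encoded = quote("中文")    # encode to UTF-8 bytes, then %XX per byte
print(encoded)             # %E4%B8%AD%E6%96%87
print(unquote(encoded))    # 中文 (UTF-8 is the default)

wrong = unquote(encoded, encoding="latin-1")
print(len(wrong))          # 6: one garbled character per byte
```

Two characters went in and six came out, which is the signature of bytes being decoded with a single-byte codec.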

Trap 4: Base64 input is bytes, but you typed a string

Symptom. btoa("你好") throws InvalidCharacterError: The string contains characters outside the Latin1 range.

Root cause. btoa was designed in the ASCII / Latin-1 era. It expects each input character to fit in a single byte (codepoints 0-255). 你好 is UTF-16 in the JS engine with codepoints U+4F60 U+597D, both well above 255.

Fix. Encode to UTF-8 bytes first, then Base64-encode those bytes.

// Wrong:
btoa("你好");  // throws

// Correct:
const bytes = new TextEncoder().encode("你好");
// Uint8Array(6) [228, 189, 160, 229, 165, 189]
const b64 = btoa(String.fromCharCode(...bytes));
// "5L2g5aW9"

The longer story is in Understanding Base64 and the Base64 Complete Guide; the Base64 Decoder/Encoder does the conversion in one step and shows the intermediate byte stream.
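Python makes the bytes-versus-string distinction explicit, because `base64.b64encode` refuses `str` input altogether. A minimal sketch of the same round trip; the helper names are hypothetical.

```python
import base64

def b64_encode_text(text: str) -> str:
    """UTF-8 encode, then Base64 encode; returns an ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def b64_decode_text(payload: str) -> str:
    """Base64 decode, then interpret the bytes as UTF-8."""
    return base64.b64decode(payload).decode("utf-8")

print(list("你好".encode("utf-8")))  # [228, 189, 160, 229, 165, 189]
print(b64_encode_text("你好"))       # 5L2g5aW9
print(b64_decode_text("5L2g5aW9"))   # 你好
```

The output matches the JavaScript version above because both sides agreed on UTF-8 as the byte representation before Base64 was applied.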

Trap 5: `String.length` for validation (Twitter / SMS limits)

Symptom. A 280-character composer validates client-side, then the API returns 422. Or the reverse, a perfectly fine post is refused by the client.

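Root cause. JavaScript’s .length counts UTF-16 code units, so a single emoji counts as 2 and a ZWJ family as 8. The server rarely counts the same way: X/Twitter’s twitter-text library, for example, applies its own weighting rules rather than counting code units. The character count is wrong in opposite directions depending on which side you trust.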

Fix. Use [...text].length for codepoint count, or Intl.Segmenter for true grapheme count (the Bluesky / iMessage approach). Platform-by-platform numbers and SMS GSM-7 versus UCS-2 boundaries are catalogued in the character & word limits guide for Twitter, SMS, and Instagram.

Trap 6: ZWJ emoji families count as N codepoints, 1 grapheme

Symptom. "👨‍👩‍👧".length === 8. Counting codepoints gives 5. To the user it is one image.

Root cause. Zero-Width Joiner (U+200D) glues multiple emoji codepoints into a single rendered cluster, three person emoji plus two ZWJs equals five codepoints, eight UTF-16 code units, one grapheme.

Fix.

const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...seg.segment("👨‍👩‍👧")].length;  // 1

Intl.Segmenter is in Node 16+ and every current browser. For older runtimes, the grapheme-splitter package implements UAX #29.

Trap 7: JSON `\uXXXX` escape, codepoints above U+FFFF need a surrogate pair

Symptom. A JSON payload contains "\uD83D\uDE00" and the receiving decoder either renders it correctly as 😀 or shows two box characters, depending on whether it understands surrogate pairs in JSON.

Root cause. JSON’s \uXXXX escape accepts exactly four hex digits, i.e. one UTF-16 code unit. Encoding 😀 (U+1F600) as an escape requires the surrogate pair \uD83D\uDE00. There is no \u{...} brace syntax in JSON.

Fix. Either accept the surrogate pair (every spec-compliant parser handles it) or write the emoji literally. JSON allows any UTF-8 character outside the escape syntax, and most modern parsers prefer that form.
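Python's `json` module produces both forms, which makes the difference easy to inspect. A short sketch using only the standard library:

```python
import json

emoji = "\U0001F600"
escaped = json.dumps(emoji)                      # ensure_ascii=True is the default
literal = json.dumps(emoji, ensure_ascii=False)  # emoji written as-is

print(escaped)                        # "\ud83d\ude00"
print(literal == f'"{emoji}"')        # True
print(json.loads(escaped) == emoji)   # True: the pair is recombined on parse
print(len(escaped), len(literal))     # 14 3
```

Both documents decode to the same one-codepoint string; the escaped form is pure ASCII and survives transports that are not 8-bit clean, while the literal form is shorter.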

Trap 8: HTTP `Content-Type: charset=` defaults are not what you think

Symptom. A UTF-8 HTML page renders as Mojibake in one browser and correctly in another.

Root cause. RFC 2616 originally mandated ISO-8859-1 as the default for text/* responses with no explicit charset. RFC 7231 (2014) removed that default, leaving each browser to guess. Some sniff content, some fall back to UTF-8, some default to the system locale.

Fix. Always send Content-Type: text/html; charset=utf-8 from the server and <meta charset="utf-8"> in the document head. Either alone works; sending both is the belt-and-braces option for legacy proxies that strip headers.
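What a wrong guess does to the page body can be simulated directly. The sketch assumes the client fell back to Windows-1252, which is one common fallback rather than the only one.

```python
body = "中文".encode("utf-8")       # what the server actually sent

correct = body.decode("utf-8")      # client honored charset=utf-8
mojibake = body.decode("cp1252")    # client guessed a single-byte legacy codec
print(len(correct), len(mojibake))  # 2 6

# Mojibake of this kind is reversible as long as no byte was dropped or replaced:
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == correct)          # True
```

The repair only works on text that has been mis-decoded once; after a second save under the wrong encoding, the original bytes are usually gone.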

To watch any of these traps live at the byte level, the Base64 Decoder/Encoder is the fastest microscope: paste a string, encode to Base64, and the decoded payload is the UTF-8 stream.

Choosing the right encoding, decision matrix

For the utf-8 vs utf-16 question, the answer is almost always UTF-8. The table below covers the edge cases.

Decision matrix

Scenario	Pick	Why
Web pages, API JSON, source files	UTF-8 (no BOM)	ASCII-compatible, no byte order, smallest for Latin text, RFC 8259 mandates UTF-8 for JSON
Heavy CJK storage (Chinese DB, Japanese game data)	UTF-8 (`utf8mb4`)	UTF-8 uses 3 bytes per CJK character vs UTF-16’s 2, but ASCII overhead from markup and JSON keys still leaves UTF-8 ahead in practice, and the surrounding ecosystem is UTF-8
Windows native API, legacy Java / C# code	UTF-16	Platform default; converting at every API call invites bugs
Index-heavy in-memory text processing	UTF-32	Constant-time codepoint access; worth it only for parser hot paths
CSV opened in Excel on Windows	UTF-8 with BOM	Excel reads BOM-less UTF-8 as ANSI and mangles CJK headers
New project, no constraints	UTF-8 (no BOM)	The encoding wars settled years ago

Two rules of thumb

Default to UTF-8 everywhere unless a platform forces otherwise. The W3C, IETF, and Unicode Consortium all agree.
Convert at the boundary, not in the middle. Decode bytes to your language’s native string type on ingest. Operate on strings, never bytes, in business logic. Encode back to UTF-8 on output. This “UTF-8 sandwich” removes the entire class of mid-pipeline mojibake bugs.
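In code, the sandwich is three steps with the string type in the middle. A deliberately tiny sketch; `handle_request` and its greeting logic are hypothetical stand-ins for real business logic.

```python
def handle_request(raw_body: bytes) -> bytes:
    """Bytes in, bytes out; str everywhere in between."""
    text = raw_body.decode("utf-8")        # 1. decode on ingest (strict: fail early)
    greeting = f"Hello, {text.strip()}!"   # 2. business logic sees only str
    return greeting.encode("utf-8")        # 3. encode on output

response = handle_request("  zoë 😀 ".encode("utf-8"))
print(response.decode("utf-8"))            # Hello, zoë 😀!
```

Because decoding is strict, malformed input raises `UnicodeDecodeError` at the boundary instead of surfacing later as garbled output.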

Frequently asked questions

Is UTF-8 always backward compatible with ASCII?

Yes. Any valid ASCII file is bit-identical to its UTF-8 representation. The first 128 codepoints (U+0000–U+007F) encode as a single byte with the high bit clear. Legacy ASCII-only tools (early grep, sed, classic shell pipes) process pure-ASCII UTF-8 files without modification. Trouble starts only when non-ASCII bytes (high bit set) enter the stream.

Should I use UTF-8 BOM in my files?

Default to no. HTML, JSON, JavaScript, and CSS files break or warn in some parsers when a BOM appears at the start. The standard exception is CSV intended for Excel on Windows. Without the BOM, Excel guesses ANSI and mangles Chinese, Japanese, or Korean headers. See the BOM decision matrix in section 5.

Why does `"😀".length === 2` in JavaScript?

JavaScript strings are stored as UTF-16, and .length returns the number of code units, not characters. 😀 (U+1F600) lives in the supplementary plane and requires a surrogate pair, two 16-bit code units, so .length is 2. Use [..."😀"].length, Array.from("😀").length, or Intl.Segmenter for a true count.

What’s the difference between Unicode and UTF-8?

Unicode is the character table that assigns a codepoint (a number like U+1F600) to every character. UTF-8 is one of several encodings that translate those codepoints into bytes (1 to 4 bytes per codepoint). Unicode defines what a character is; UTF-8 defines how it travels through a file or network. UTF-16 and UTF-32 are alternative encodings of the same Unicode table.

Is `utf8mb4` always safer than `utf8` in MySQL?

Yes for new projects. MySQL’s utf8 is the misnamed 3-byte-limited variant utf8mb3, which cannot store any character above U+FFFF (every emoji, many rare CJK characters, all historical scripts). utf8mb4 is full 4-byte UTF-8. The one caveat is index length: each utf8mb4 character may take 4 bytes, so the legacy 767-byte InnoDB index key limit caps an indexed prefix at 191 characters (767 / 4, rounded down). Enabling innodb_large_prefix with the DYNAMIC or COMPRESSED row format lifts the limit to 3072 bytes; that is the default configuration in later MySQL 5.7 releases and in 8.0.

How do I detect the encoding of an unknown file?

Use file on Unix, chardet or charset-normalizer in Python, or jschardet in Node. None are perfect; they statistically guess from byte distribution. UTF-8 detection is highly reliable thanks to the continuation-byte pattern. Windows-1252, ISO-8859-1, and other single-byte legacy encodings are nearly indistinguishable from each other, so detection often comes down to language heuristics.
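When no detection library is available, the reliability of UTF-8 validation suggests a simple fallback chain: try UTF-8 first and drop to a legacy codec only if that fails. The sketch below assumes Windows-1252 is the most likely legacy encoding for your data, which will not hold everywhere; `decode_best_effort` is a made-up name.

```python
def decode_best_effort(data: bytes) -> tuple[str, str]:
    """Return (text, codec_used), preferring strict UTF-8."""
    for encoding in ("utf-8-sig", "cp1252"):   # utf-8-sig also accepts BOM-less UTF-8
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), "latin-1"   # never fails: every byte maps to a codepoint

print(decode_best_effort("中文".encode("utf-8")))   # ('中文', 'utf-8-sig')
print(decode_best_effort(b"\x93hi\x94")[1])         # cp1252
```

The final Latin-1 step guarantees a result but not a correct one, so it is worth logging which codec was used.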

Can UTF-16 represent every Unicode character?

Yes. UTF-16 covers every Unicode character: all 1,114,112 codepoints except the 2,048 surrogate values, which are reserved and never assigned to characters. BMP characters (U+0000–U+FFFF) use one 16-bit code unit (2 bytes), and supplementary plane characters (U+10000–U+10FFFF) use surrogate pairs (4 bytes). Coverage is identical to UTF-8 and UTF-32; only the byte layout and processing semantics differ. The choice between them is about ecosystem fit, not capability.