The Challenge of International Characters in URLs
The original URL specification (RFC 1738, 1994) was designed for ASCII text only, completely excluding the vast majority of the world's written languages. This created a fundamental problem for internationalization: how do you create URLs for content in Japanese, Arabic, Chinese, Russian, Hindi, or any of the hundreds of other scripts used around the world? The solution evolved over time through two complementary mechanisms: percent-encoding of UTF-8 bytes for URL paths and query parameters, and Internationalized Domain Names (IDN) with Punycode encoding for domain names. Understanding both mechanisms is essential for anyone building multilingual web applications, international APIs, or global e-commerce platforms. The good news is that modern browsers and frameworks handle most of the heavy lifting automatically — but knowing the underlying mechanics helps you debug issues, choose the right tools, and design systems that work correctly across all human languages.
How UTF-8 Encoding Works in URLs
When a non-ASCII character appears in a URL path or query parameter, it must be encoded as its UTF-8 byte sequence, with each byte percent-encoded. UTF-8 is a variable-length encoding that represents Unicode characters using 1 to 4 bytes. The process works as follows: (1) take the character's Unicode code point, (2) encode it as UTF-8 bytes, (3) percent-encode each byte with a % prefix. For example, the Chinese character 中 ("middle", U+4E2D) encodes to the UTF-8 bytes 0xE4, 0xB8, 0xAD, yielding the URL encoding %E4%B8%AD. The Arabic letter ain ع (U+0639) becomes the UTF-8 bytes 0xD8, 0xB9, encoding as %D8%B9. The Devanagari letter अ (U+0905) becomes %E0%A4%85. A single party popper emoji 🎉 (U+1F389) uses four bytes: %F0%9F%8E%89. This is why international URLs can look so much longer than their displayed form: each non-ASCII character expands to 6 to 12 characters in the encoded form, three encoded characters per UTF-8 byte.
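The three-step process above can be sketched with Python's standard library, where urllib.parse.quote performs the UTF-8 encoding and percent-encoding in one call:

```python
from urllib.parse import quote, unquote

# quote() takes each character's UTF-8 bytes and percent-encodes them.
print(quote("中"))    # 3-byte character -> %E4%B8%AD
print(quote("ع"))     # 2-byte character -> %D8%B9
print(quote("🎉"))    # 4-byte character -> %F0%9F%8E%89

# Decoding reverses the process: the bytes are gathered and decoded as UTF-8.
print(unquote("%E4%B8%AD"))  # 中
```

By default quote() leaves "/" unescaped (suitable for paths); pass safe="" when encoding a single path segment or query value that should have every reserved character escaped.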
Internationalized Resource Identifiers (IRIs)
IRIs (Internationalized Resource Identifiers), defined in RFC 3987, extend the URL syntax to allow Unicode characters directly in the identifier. An IRI looks like a URL but can contain characters from any Unicode script without percent-encoding: for example, https://example.jp/検索?q=東京 carries Japanese text directly in its path and query. Browsers display URLs in IRI form for readability: when you visit a Japanese or Arabic website, your browser's address bar shows the Unicode characters rather than their encoded equivalents. However, when the browser actually sends the HTTP request, it converts the IRI to a URI by encoding non-ASCII characters as UTF-8 percent-encoded sequences. This conversion is transparent to users but important for developers to understand: the string in the address bar and the string in the HTTP request may be different representations of the same resource. When building web crawlers, link parsers, or URL validators, always handle both forms.
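The IRI-to-URI conversion that browsers perform can be approximated with the standard library. This is a simplified sketch (it assumes an https IRI with no userinfo and ignores some IDNA edge cases), not a full RFC 3987 mapping; the function name iri_to_uri is our own:

```python
from urllib.parse import urlsplit, urlunsplit, quote

def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII-only URI: Punycode for the host,
    percent-encoded UTF-8 for the path, query, and fragment."""
    parts = urlsplit(iri)
    # The host uses Punycode (IDNA), not percent-encoding.
    netloc = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe="/"),     # keep path separators
        quote(parts.query, safe="=&"),   # keep key=value&... structure
        quote(parts.fragment, safe=""),
    ))

print(iri_to_uri("https://example.jp/検索?q=値"))
# https://example.jp/%E6%A4%9C%E7%B4%A2?q=%E5%80%A4
```

Note the asymmetry this makes visible: the same Unicode text is Punycode-encoded when it appears in the host, but percent-encoded when it appears in the path or query.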
Internationalized Domain Names (IDN) and Punycode
Domain names have their own internationalization mechanism separate from percent-encoding. The DNS system, which resolves domain names to IP addresses, was designed for ASCII characters only and uses a different encoding called Punycode (defined in RFC 3492). Punycode converts Unicode domain labels to an ASCII-compatible encoding (ACE) form that starts with the prefix xn--. For example, the Chinese domain 中文.com becomes xn--fiq228c.com in Punycode, and the Japanese domain 日本語.jp becomes xn--wgv71a309e.jp. Modern browsers handle this translation automatically: users see the Unicode domain name, but DNS queries use the Punycode form. When building applications that need to handle domain names, use an IDN library rather than implementing Punycode yourself. In Node.js, the built-in url module handles IDN automatically; in PHP, idn_to_ascii() and idn_to_utf8() convert between forms.
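In Python, the built-in "idna" codec (which implements the older IDNA 2003 rules; most registries now use IDNA 2008, implemented by the third-party idna package) converts between a Unicode label and its ACE form, and the "punycode" codec exposes the raw Punycode step without the xn-- prefix:

```python
# ToASCII: nameprep the label, Punycode-encode it, prepend "xn--".
ace = "中文".encode("idna")        # b'xn--fiq228c'

# Raw Punycode, without the ACE prefix.
raw = "中文".encode("punycode")    # b'fiq228c'

# ToUnicode reverses the conversion.
original = b"xn--fiq228c".decode("idna")  # '中文'

print(ace, raw, original)
```

For anything user-facing, prefer a maintained IDN library over the raw codecs; the codecs shown here are fine for illustrating the round trip.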
Handling International Text in Search Queries
One of the most common use cases for international URL encoding is search functionality. When users search for text in their native language, the query must be properly encoded for the URL. Modern browsers handle this automatically for HTML forms: when a Japanese user submits a query through a search box, the browser encodes it as percent-encoded UTF-8 sequences, even though the address bar may still display the Unicode characters for readability. Server-side, the query string is automatically decoded by your web framework, so you receive the original Unicode string. For server-side URL construction, always use your language's encoding library. In Python, urllib.parse.urlencode({'q': 'query'}) correctly produces the encoded form. In Java, URLEncoder.encode("query", StandardCharsets.UTF_8) handles the UTF-8 encoding. In PHP, urlencode('query') or http_build_query(['q' => 'query']); note that both encode spaces as +, which is correct for query strings, while rawurlencode() produces %20 for path segments. Older APIs that default to the platform encoding require the UTF-8 encoding to be specified explicitly; always specify UTF-8 to ensure consistent behavior across different server environments.
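A minimal round trip in Python shows both halves of the process, using a Japanese search term as an example: urlencode builds the query string the browser would send, and parse_qs performs the decoding a web framework does for you:

```python
from urllib.parse import urlencode, parse_qs

# Build a query string from a Unicode search term. urlencode()
# percent-encodes the UTF-8 bytes and turns the space into '+'.
qs = urlencode({"q": "東京 天気"})
print(qs)   # q=%E6%9D%B1%E4%BA%AC+%E5%A4%A9%E6%B0%97

# The server-side reverse: decode back to the original Unicode string.
term = parse_qs(qs)["q"][0]
print(term)  # 東京 天気
```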
Unicode Normalization and URL Canonicalization
A subtle but important issue in international URL encoding is Unicode normalization. Some characters can be represented in multiple equivalent ways in Unicode. The character é can be a single precomposed code point (U+00E9, "Latin small letter e with acute") or a combining sequence of two code points: the letter e (U+0065) followed by a combining acute accent (U+0301). Both look identical when rendered, but they produce different byte sequences, and therefore different percent-encoded URLs: %C3%A9 for the precomposed form vs. e%CC%81 for the decomposed form. This means two URLs that appear identical to users might actually be different strings, breaking cache lookups, canonical URL detection, and equality comparisons. The solution is to apply Unicode normalization (specifically NFC form) before encoding URLs. In JavaScript: str.normalize('NFC'). In Python: unicodedata.normalize('NFC', text). Web servers and browsers generally use NFC, making it the safe standard choice for URL construction.
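The divergence, and the fix, are easy to demonstrate in Python:

```python
import unicodedata
from urllib.parse import quote

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + combining acute accent; renders identically

# The two visually identical strings encode to different URLs:
print(quote(precomposed))  # %C3%A9
print(quote(decomposed))   # e%CC%81

# Normalizing to NFC before encoding collapses both to one canonical URL:
print(quote(unicodedata.normalize("NFC", decomposed)))  # %C3%A9
```

The same normalization step should be applied before storing URLs or using them as cache keys, so that lookups and comparisons work on a single canonical form.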
Right-to-Left Languages and Bidirectional URLs
URLs containing text in right-to-left (RTL) scripts like Arabic, Hebrew, Persian, or Urdu present additional complexity. While the URL structure itself always flows left-to-right (scheme, then host, then path, then query), the decoded text within those components may display right-to-left. Browsers generally handle the visual rendering correctly, but some development tools and log viewers may display bidirectional URLs confusingly. A more serious issue is bidirectional text spoofing: URLs that exploit Unicode bidirectional control characters to make a URL appear different from its actual destination. For example, a malicious URL might use RTL override characters to display a trusted domain name in the browser's status bar while actually linking to a different domain. Modern browsers have added safeguards against these attacks, and against the related IDN homograph attacks in which visually similar characters from different scripts (such as Cyrillic а in place of Latin a) impersonate a trusted domain, but developers should still be aware of the issue when displaying URLs to users or accepting URL input. Always sanitize and validate URLs from user input, and consider using a URL parser to extract and verify the actual hostname before trusting a displayed URL.
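A defensive check along those lines might look like the following sketch; the function name verified_hostname and the policy of rejecting any URL containing bidi controls are our own choices, not a standard API:

```python
from typing import Optional
from urllib.parse import urlsplit

# Unicode bidirectional control characters (marks, embeddings, overrides,
# and isolates) that can make displayed text differ from its logical order.
BIDI_CONTROLS = frozenset(
    "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)

def verified_hostname(url: str) -> Optional[str]:
    """Return the parsed hostname, or None if the URL contains bidi
    control characters that could spoof its displayed destination."""
    if any(ch in BIDI_CONTROLS for ch in url):
        return None
    return urlsplit(url).hostname

print(verified_hostname("https://example.com/path"))          # example.com
print(verified_hostname("https://evil.example/\u202etrusted.com"))  # None
```

The key point is to trust only the hostname the parser extracts, never the string as rendered; legitimate RTL text in a path does not need bidi control characters, so rejecting them outright is usually a safe policy.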
Practical Guide: Building Multilingual URLs
Here are the key principles for building URLs that work correctly with international content. Always use UTF-8 as your encoding throughout your stack — database, application server, HTTP headers, and HTML meta charset. Apply Unicode normalization (NFC) before encoding URLs with international text. Use library functions for encoding rather than manual implementation. Test with actual international content from each language your application supports — don't assume that because it works for English it works for Japanese or Arabic. Store and compare URLs in their normalized, encoded form for consistency. Set the lang attribute on your HTML pages to help browsers apply correct text direction and rendering. Use the hreflang attribute in your HTML to help search engines understand language and region targeting. Test with internationalized domain names if your platform supports them — verify that your URL validation logic accepts IDN hostnames. Our free URL encoder/decoder tool fully supports UTF-8 encoding and decoding of any Unicode character, including CJK characters, emoji, and text from all world scripts — try it to verify how your international strings will appear in URLs.
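The principles above can be combined into a single helper. This is a sketch under simplifying assumptions (https-only, no port or fragment, string parameters); the build_url function is our own illustration, not a library API:

```python
import unicodedata
from urllib.parse import quote, urlencode

def build_url(host: str, path_segments: list, params: dict) -> str:
    """Assemble an ASCII-safe URL from Unicode parts: NFC-normalize every
    piece, Punycode the host, percent-encode path and query as UTF-8."""
    def nfc(s: str) -> str:
        return unicodedata.normalize("NFC", s)

    ace_host = nfc(host).encode("idna").decode("ascii")
    path = "/".join(quote(nfc(seg), safe="") for seg in path_segments)
    query = urlencode({nfc(k): nfc(v) for k, v in params.items()})
    return f"https://{ace_host}/{path}" + (f"?{query}" if query else "")

print(build_url("example.jp", ["検索"], {"q": "値"}))
# https://example.jp/%E6%A4%9C%E7%B4%A2?q=%E5%80%A4
```

Normalizing before encoding, encoding the host and the path with their separate mechanisms, and never hand-assembling percent sequences covers the most common sources of international URL bugs.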