Why use GlyphWorks?

Every operation runs in your browser — paste credentials, internal tooling strings, or any sensitive text without transmitting it anywhere.

Security inspection in one place

Most Unicode tools show you a character's name and encoding. GlyphWorks also flags the characters that matter for security work: invisible zero-width characters used to bypass filters, bidi override controls used in filename spoofing attacks, and homoglyphs that can make a Cyrillic domain name look like its ASCII equivalent. You don't need three separate tools.

Every encoding you'd ever need to look up

The output covers UTF-8 bytes, UTF-16 code units, HTML named and numeric entities in both decimal and hex, CSS escape syntax, and JavaScript string escape syntax. Whether you're debugging a character encoding issue in a web page, writing a CSS content property, or tracking down a mismatch in a JavaScript string comparison, the right representation is already there.

Handles grapheme clusters, not just code units

A single visible character (a flag emoji, a skin-toned hand gesture, a letter with combining marks) can be made up of multiple codepoints. JavaScript's length property and most naive string tools count UTF-16 code units, so they report a number larger than what you see. GlyphWorks uses the browser's Intl.Segmenter to count grapheme clusters (what a human sees as one character) and reports that count separately from the codepoint count.
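To make the difference concrete, here is a minimal sketch of the three counts in plain JavaScript. The function name countUnits is made up for illustration and is not GlyphWorks's own code; it only assumes a runtime that provides Intl.Segmenter.

```javascript
// Count the three different "lengths" of a string.
// countUnits is a hypothetical helper, not part of GlyphWorks.
function countUnits(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: text.length,                                 // UTF-16 code units
    codePoints: Array.from(text).length,                    // string iterator yields codepoints
    graphemes: Array.from(segmenter.segment(text)).length,  // user-perceived characters
  };
}

const flag = "\u{1F1EB}\u{1F1F7}"; // two regional indicators, rendered as one flag
const eAcute = "e\u0301";          // 'e' followed by a combining acute accent

console.log(countUnits(flag));   // { codeUnits: 4, codePoints: 2, graphemes: 1 }
console.log(countUnits(eAcute)); // { codeUnits: 2, codePoints: 2, graphemes: 1 }
```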

Normalization with one click

If your text contains characters that can be represented multiple ways, GlyphWorks shows all four Unicode normal forms (NFC, NFD, NFKC, and NFKD) side by side and lets you apply any of them directly. NFC is the usual choice for storage and comparison; NFD separates diacritics from base characters for linguistic processing. No separate tool or library invocation needed.

Unicode codepoints: what they are and why they matter

Every Unicode character has a codepoint — a number between U+0000 and U+10FFFF assigned by the Unicode Consortium. The Latin letter 'a' is U+0061; the euro sign is U+20AC; the grinning face emoji is U+1F600. Codepoints are the stable identity of a character: encoding schemes like UTF-8 and UTF-16 are just different ways of storing that number in bytes. GlyphWorks reports the codepoint for every character in the string, along with its official Unicode name, General Category, block, and script — the same data that defines what a character is in the Unicode Standard.
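If you want to see codepoints without any tool, a short sketch like the one below lists them in the usual U+XXXX notation. The helper name listCodePoints is hypothetical. Note that JavaScript has no built-in lookup for a character's official name or block, so this covers only the number itself.

```javascript
// Format each codepoint of a string as U+XXXX (at least four hex digits).
// listCodePoints is a hypothetical helper for illustration.
function listCodePoints(text) {
  // Array.from iterates by codepoint, so surrogate pairs stay together.
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(listCodePoints("a\u20AC\u{1F600}"));
// [ 'U+0061', 'U+20AC', 'U+1F600' ]
```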

UTF-8 and UTF-16: how the same codepoint becomes different bytes

UTF-8 encodes each codepoint as 1 to 4 bytes, chosen to be compact for ASCII text: codepoints below 128 use one byte, matching ASCII exactly. Higher codepoints use a lead byte that signals the sequence length, followed by one to three continuation bytes of the form 10xxxxxx. UTF-16 uses 2-byte code units: codepoints in the Basic Multilingual Plane (U+0000–U+FFFF) encode as a single unit; codepoints above U+FFFF encode as a surrogate pair, a high unit in D800–DBFF followed by a low unit in DC00–DFFF. JavaScript strings are exposed as sequences of UTF-16 code units, which is why str.length is 2 for an emoji with a single codepoint above U+FFFF. GlyphWorks shows both encodings as hex byte sequences so you can see exactly what ends up in memory or on the wire.
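The two encodings are easy to compare in a few lines. This is a rough sketch with hypothetical helper names (utf8Hex, utf16Hex), using the standard TextEncoder for UTF-8 and charCodeAt for UTF-16 code units. It prints code units rather than raw UTF-16 bytes, since the byte order depends on whether the data is stored big-endian or little-endian.

```javascript
// UTF-8 bytes as hex, via the standard TextEncoder (which always emits UTF-8).
function utf8Hex(text) {
  return Array.from(new TextEncoder().encode(text), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

// UTF-16 code units as hex, read directly from the string.
function utf16Hex(text) {
  const units = [];
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units.join(" ");
}

console.log(utf8Hex("a"));          // 61
console.log(utf8Hex("\u20AC"));     // E2 82 AC
console.log(utf8Hex("\u{1F600}"));  // F0 9F 98 80
console.log(utf16Hex("\u20AC"));    // 20AC
console.log(utf16Hex("\u{1F600}")); // D83D DE00 (a surrogate pair)
```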

HTML entities, CSS escapes, and JavaScript escapes

The same character needs to be represented differently depending on context. In HTML, the less-than sign must be written as &lt; or &#60; to avoid being parsed as a tag. In a CSS content property, the euro sign can be written as \20AC. In a JavaScript string, the same character is \u20AC or, in ES2015 and later, \u{20AC}. GlyphWorks provides all three representations for every character: a named entity when one exists, decimal and hex numeric references for HTML, and the correct escape syntax for CSS and JS.
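As a rough illustration, the numeric forms can all be derived from the codepoint alone. The function escapesFor below is a hypothetical sketch, not GlyphWorks's implementation. It leaves out named entities, which need a lookup table, and it does not add the optional trailing space that a CSS escape needs when the next character is itself a hex digit.

```javascript
// Build the numeric escape forms for the first codepoint of a string.
// escapesFor is a hypothetical helper; named HTML entities are not covered.
function escapesFor(ch) {
  const cp = ch.codePointAt(0);
  const hex = cp.toString(16).toUpperCase();
  return {
    htmlDecimal: `&#${cp};`,
    htmlHex: `&#x${hex};`,
    css: `\\${hex}`, // add a space after it if a hex digit follows
    // \uXXXX only reaches U+FFFF; higher codepoints need the \u{...} form.
    js: cp <= 0xffff ? `\\u${hex.padStart(4, "0")}` : `\\u{${hex}}`,
  };
}

console.log(escapesFor("<"));
// { htmlDecimal: '&#60;', htmlHex: '&#x3C;', css: '\\3C', js: '\\u003C' }
console.log(escapesFor("\u20AC"));
// { htmlDecimal: '&#8364;', htmlHex: '&#x20AC;', css: '\\20AC', js: '\\u20AC' }
```

The doubled backslashes in the printed output are just how the console quotes a string; each value contains a single backslash.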

Security: invisible characters, bidi controls, and homoglyphs

Several kinds of Unicode character are routinely exploited in attacks. Zero-width characters (U+200B, U+FEFF, U+200D, and others) are invisible but change how software processes a string; they are used to bypass keyword filters, insert hidden content in documents, or create file names that look identical but differ. Bidi override characters (U+202E, U+202D, and related) can reverse the visible order of text in a UI while the stored bytes read differently. This is the basis of the 'right-to-left override' filename attack, where a file stored as evil[U+202E]gpj.exe displays as evilexe.jpg. Homoglyphs are characters that look identical to ASCII letters but have different codepoints: the Cyrillic 'а' (U+0430) is visually indistinguishable from the Latin 'a' (U+0061) in most fonts, enabling spoofed domain names and identifiers. GlyphWorks flags all three categories with clear warning badges.
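A simple scan for the first two categories might look like the sketch below. The codepoint lists are short and illustrative, not the full set a real checker would need, and the names (ZERO_WIDTH, BIDI_CONTROLS, flagSuspicious) are invented for this example rather than taken from GlyphWorks.

```javascript
// Illustrative, deliberately incomplete lists of risky codepoints.
const ZERO_WIDTH = new Set([0x200b, 0x200c, 0x200d, 0x2060, 0xfeff]);
const BIDI_CONTROLS = new Set([
  0x202a, 0x202b, 0x202c, 0x202d, 0x202e, // embeddings, overrides, pop
  0x2066, 0x2067, 0x2068, 0x2069,         // isolates
]);

// Return one finding per suspicious codepoint, with its UTF-16 offset.
function flagSuspicious(text) {
  const findings = [];
  let index = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (ZERO_WIDTH.has(cp)) {
      findings.push({ index, cp, kind: "zero-width" });
    } else if (BIDI_CONTROLS.has(cp)) {
      findings.push({ index, cp, kind: "bidi-control" });
    }
    index += ch.length; // advance by 1 or 2 code units
  }
  return findings;
}

// The filename from the right-to-left override example.
console.log(flagSuspicious("evil\u202Egpj.exe"));
// [ { index: 4, cp: 8238, kind: 'bidi-control' } ]
```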

Why the same character can look identical but be completely different

Every character on your screen has a number assigned to it by the Unicode Standard — called a codepoint. The letter ‘a’ is U+0061. But the Cyrillic alphabet has a character that looks exactly like ‘a’ in almost every font — and its number is U+0430. These are homoglyphs: visually identical characters with different identities. Attackers use them to register domain names or create identifiers that look legitimate but aren’t. GlyphWorks shows you the true identity of every character in your text, so a suspicious string has nowhere to hide.
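One common heuristic for spotting homoglyphs is to check whether a single word mixes scripts. The sketch below uses JavaScript's Unicode property escapes and a hypothetical scriptsUsed helper that knows about only three scripts. Treat it as a hint rather than proof: a spoof written entirely in Cyrillic would pass, and legitimately mixed text would be flagged.

```javascript
// Report which of a few scripts appear in a string.
// scriptsUsed is a hypothetical helper covering only three scripts.
function scriptsUsed(text) {
  const scripts = {
    Latin: /\p{Script=Latin}/u,
    Cyrillic: /\p{Script=Cyrillic}/u,
    Greek: /\p{Script=Greek}/u,
  };
  const found = new Set();
  for (const ch of text) {
    for (const [name, pattern] of Object.entries(scripts)) {
      if (pattern.test(ch)) found.add(name);
    }
  }
  return [...found]; // in order of first appearance
}

console.log(scriptsUsed("paypal"));      // [ 'Latin' ]
console.log(scriptsUsed("p\u0430ypal")); // [ 'Latin', 'Cyrillic' ]
```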

How the same text is stored differently depending on where it is

When you type a character, different systems store it in different ways. UTF-8 (used in most web content) stores common Latin characters as a single byte, and other characters as two to four bytes. UTF-16 (used inside JavaScript) uses two bytes for most characters and four for the rest. HTML needs characters that look like code, such as < and >, to be written in a special form so they don’t confuse the browser’s parser. CSS and JavaScript each have their own syntax for including unusual characters in source code. GlyphWorks shows all of these representations at once, so you know exactly how to write any character in whatever context you’re working in.

The invisible characters hiding in your copy-pasted text

Some Unicode characters are completely invisible. Zero-width spaces, zero-width joiners, and various format characters take up no visible space on screen but exist in the string. They can sneak into text when you copy from PDFs, websites, or word processors. This causes subtle bugs: string comparisons fail because two strings that look identical actually differ, form validation rejects input that looks valid, and log searches miss entries because the search term and the stored value don’t quite match. GlyphWorks shows you every character in your text — including the ones you can’t see — and lets you clean them out with one click.
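If you need the same cleanup in your own code, a minimal sketch looks like this. The text above does not say exactly which characters the one-click cleanup removes, so the list here is an assumption, and stripInvisible is a made-up name. Be careful with real-world text: removing the zero-width joiner (U+200D) breaks up emoji sequences, and some scripts rely on U+200C and U+200D for correct shaping.

```javascript
// Remove a small, assumed set of invisible characters:
// U+200B zero-width space, U+200C non-joiner, U+200D joiner,
// U+2060 word joiner, U+FEFF byte order mark / zero-width no-break space.
function stripInvisible(text) {
  return text.replace(/[\u200B-\u200D\u2060\uFEFF]/g, "");
}

const pasted = "pass\u200Bword";
console.log(pasted === "password");                 // false
console.log(stripInvisible(pasted) === "password"); // true
```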

Unicode normalization: why the same word can be two different strings

A character with a diacritic — like the letter é — can be stored two ways in Unicode: as a single precomposed character (U+00E9) or as a base letter ‘e’ followed by a combining acute accent mark (U+0065 + U+0301). Both look identical, but the strings are byte-for-byte different. This causes broken string comparisons, search mismatches, and duplicate entries in databases. Unicode defines four “normal forms” that standardize which representation to use. NFC (the most common) combines characters into their precomposed form — what you want for most storage and comparison. GlyphWorks shows you how a string changes under each form, and lets you normalize it directly.
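JavaScript's built-in String.prototype.normalize makes this easy to try yourself. The sketch below compares the two spellings of é and then lists all four forms for a string; normalForms is a hypothetical helper name, not something GlyphWorks exposes.

```javascript
const composed = "\u00E9";    // precomposed é
const decomposed = "e\u0301"; // 'e' + combining acute accent

console.log(composed === decomposed); // false
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true
console.log(composed.normalize("NFD").length); // 2

// Return a string under each of the four normal forms.
function normalForms(text) {
  return Object.fromEntries(
    ["NFC", "NFD", "NFKC", "NFKD"].map((form) => [form, text.normalize(form)])
  );
}

// The 'fi' ligature (U+FB01) is only split by the compatibility forms.
const forms = normalForms("\uFB01");
console.log(forms.NFC === "\uFB01"); // true
console.log(forms.NFKC === "fi");    // true
```

Normalizing both sides to the same form before comparing or storing is what avoids the mismatches described above.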

What is this character, exactly?