Gukhanmun design docs
Gukhanmun is a library for converting Korean text written in mixed script (國漢文混用體) into hangul-only text. It is the successor to Seonbi, narrowed in scope to the hanja conversion pipeline and broadened along several axes: streaming I/O, pluggable dictionaries, lattice-based segmentation, and a wider range of output formats. The project is implemented in Rust and exposed as a Rust library, a command-line tool, WebAssembly bindings, and Node-API bindings.
Goals and non-goals
The library converts hanja words in Korean text into their hangul readings, optionally annotating them with the source hanja for disambiguation, ruby markup, or stylistic reasons. It does so without disturbing structure or content outside the text it is asked to transform; regions marked as non-Korean or as preserved (code blocks and the like) pass through untouched. It streams input and output where possible, with buffering bounded by the size of contiguous conversion spans that contain hanja and by opt-in cross-context disambiguation. It accepts pluggable dictionaries that may be in-memory, mmap-backed, or otherwise opaque to the engine. It ships a default dictionary derived from the South Korean Standard Korean Language Dictionary (標準國語大辭典). It is usable from Rust, from Node.js via Node-API, from browsers and Deno via WebAssembly, and from a command line.
The library deliberately does not provide broader Korean typographic adjustments such as smart quotes, dashes, ellipses, or citation marks; those remain Seonbi's job. It does not provide a fully HTML5-conformant parser; the HTML scanner is fragment-oriented and recovers from minor malformations but is not a substitute for html5ever. It does not roundtrip Markdown byte-for-byte; semantic preservation is the contract, and original-form preservation is best effort. It does not translate between languages or transliterate beyond the hanja-to-hangul mapping.
Design principles
Three principles run through the design.
The engine is format-neutral. The same engine processes HTML, Markdown, and plain text because each format is read into a single intermediate representation and written back from it. Adapters at the boundary handle format specifics: scanning, serialization, and the format-specific notion of what counts as a preserved region or a block boundary. The engine never inspects HTML tag names, Markdown event variants, or anything else that would couple it to one format. This is what makes a single test fixture meaningful across all three formats, and what makes adding a new format a contained piece of work.
Responsibilities are split into independently selectable pipeline stages. The Reader parses an input format into the IR. The Engine finds hanja-containing lexical spans, segments them, and emits annotations. Middlewares transform the IR stream by adjusting annotation flags. The Renderer turns annotations into concrete text or markup. The Writer serializes the IR back to the target format. A user replacing one stage does not perturb the others; the boundaries are explicit values, not method calls.
Streaming is the default where correctness allows it. Operations that fundamentally require lookahead buffer until the relevant context boundary. In HTML and Markdown, the default per-block homophone window usually reaches that boundary at paragraphs, list items, headings, and similar scopes. Plain text has no block scopes, so the same default window is document-wide: an annotation on a later line can force disambiguating hanja on an earlier line, and a byte stream cannot revise text already written to stdout. Callers that require immediate plain-text output can disable homophone marking, accepting the corresponding loss of disambiguation.
Streaming conversion also preserves fallback annotation spans exactly. A
trailing run of fallback hanja is held until a following non-convertible
boundary or EOF, even when the dictionary's maximum word length is shorter.
This can make fallback-only hanja input less eager than dictionary lookahead
alone would require, but it keeps render modes such as hangul-hanja-parens
equivalent to one-shot conversion.
Architecture
The pipeline has five stages.
A reader parses bytes into a stream of input tokens. The engine reads input tokens and emits output tokens, with the difference that output tokens may include annotation tokens carrying both an original hanja form and its hangul reading. Middlewares walk the output stream and adjust the flags on those annotations. The renderer expands each annotation into concrete text or markup according to the flags and the chosen rendering mode. The writer serializes the final stream into the output format.
Intermediate representation
The intermediate representation is a flat stream of tokens parameterized by a scope data type that belongs to the adapter rather than the engine. The engine knows the shape of a token, but the contents of the scope payload (raw HTML attributes, the variant of a pulldown-cmark event, and so on) are opaque to it.
The token sent into the engine is one of:
- Open(Scope<S>): enter a structural scope, such as an HTML element or a Markdown block. The Scope<S> carries the adapter's opaque data plus three pre-computed flags described below.
- Close: leave the most recent scope. The engine maintains the stack itself; the adapter does not have to repeat which scope is closing.
- Text(Cow<str>): a chunk of text that the engine may transform.
- Verbatim(Cow<str>): a chunk of text that must pass through untouched, such as the contents of an HTML <code> element or a Markdown code span. The adapter, not the engine, decides what is verbatim.
The token coming out of the engine is one of:
- Open(Scope<S>), Close, Text(Cow<str>), Verbatim(Cow<str>): the same forms, passed through.
- Annotated(Annotation): a position in the stream where the engine converted a hanja word. The annotation carries the original hanja, the hangul reading, and flags describing why the conversion happened and what subsequent stages might want to do with it.
The Annotation carries policy flags. homophone is set when the effective
dictionary entry set or the active context indicates another hanja word shares
this hangul reading, so a reader of the rendered output cannot recover the word
from hangul alone. require_hanja is set when the source dictionary or a user
directive demands the original hanja be shown alongside the hangul.
require_hangul is set in the opposite direction: the source preserves the
hanja in the output, but a hangul gloss is required (used for the Original
rendering mode that keeps mixed script as the default presentation).
skip_annotation is set by user directives that want the renderer to emit only
the primary plain text form. first_in_context is set when this is the first
occurrence of the hanja word within the current context window, where the
window is a block, a section, or the document depending on configuration.
from_dictionary distinguishes a dictionary match from a
character-by-character fallback; renderers may choose to mark these differently
for debugging.
The split between InputToken and OutputToken is the most consequential
shape decision in the IR. The alternative we considered was a single Token
enum that includes Annotated from the start, with adapters emitting
Annotated only as a future possibility. We rejected it for two reasons.
First, it would force the input-side Annotated variant to remain a no-op
throughout the reader and engine, polluting every match in the codebase with
an unreachable arm. Second, it would obscure the contract that the engine
produces annotations and renderers consume them. Two distinct types make
the dataflow legible at a glance and let the type system enforce that an
unrendered annotation cannot reach a writer.
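As an illustration of the two-type split, the following is a minimal sketch rather than the crate's actual definitions. The field layout of Annotation, the bare Scope wrapper (its pre-computed flags are omitted), and the pass_through helper are assumptions made for the example; only the variant and flag names come from the text above.

```rust
use std::borrow::Cow;

/// Adapter-owned payload `S`; the engine never inspects it.
/// (The pre-computed flags mentioned above are omitted in this sketch.)
#[derive(Debug, Clone, PartialEq)]
pub struct Scope<S> {
    pub data: S,
}

/// Hypothetical field layout; only the flag names come from the design text.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Annotation {
    pub hanja: String,
    pub hangul: String,
    pub homophone: bool,
    pub require_hanja: bool,
    pub require_hangul: bool,
    pub skip_annotation: bool,
    pub first_in_context: bool,
    pub from_dictionary: bool,
}

/// What a reader hands to the engine: no `Annotated` variant exists here.
#[derive(Debug, Clone, PartialEq)]
pub enum InputToken<'a, S> {
    Open(Scope<S>),
    Close,
    Text(Cow<'a, str>),
    Verbatim(Cow<'a, str>),
}

/// What the engine emits: the same forms plus `Annotated`.
#[derive(Debug, Clone, PartialEq)]
pub enum OutputToken<'a, S> {
    Open(Scope<S>),
    Close,
    Text(Cow<'a, str>),
    Verbatim(Cow<'a, str>),
    Annotated(Annotation),
}

/// Hypothetical helper: every input form has an output counterpart,
/// so this match needs no unreachable arm.
pub fn pass_through<'a, S>(token: InputToken<'a, S>) -> OutputToken<'a, S> {
    match token {
        InputToken::Open(scope) => OutputToken::Open(scope),
        InputToken::Close => OutputToken::Close,
        InputToken::Text(text) => OutputToken::Text(text),
        InputToken::Verbatim(text) => OutputToken::Verbatim(text),
    }
}
```

Because InputToken has no Annotated variant, a reader cannot construct one even by accident, which is the property the paragraph above argues for.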
Scope data
The opaque scope payload obeys a small trait:
The engine treats the current scope's is_preserve() as the single point of
truth for whether to skip a Text token. Adapters that need inheritance encode
the effective answer in each opened scope. The HTML adapter's ScopeData
implementation aggregates several concerns into that one flag: the inherited
lang attribute compared against a Korean predicate, the current tag and
preserved ancestors (against the preserved-tag list of pre, code, kbd,
script, style, and textarea), and any user-supplied predicate over raw
attributes. The engine itself does not know what a lang attribute is.
This is a departure from Seonbi, which threads a parallel LangHtmlEntity
annotation through the pipeline. We removed it because the engine never needs
both signals separately; it only ever asks the single question, is this text
skipped? Pushing lang inheritance into the adapter (where the tag-name
preservation list already lives) keeps the engine ignorant of HTML and keeps
the IR free of fields the engine does not consume.
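To make the folding of lang inheritance and tag preservation into one flag concrete, here is a minimal sketch. It is not the HTML adapter's real code: the HtmlScope struct, its open constructor, and the rule that a missing lang attribute counts as Korean are all assumptions, and the user-supplied attribute predicate is left out.

```rust
/// Hypothetical trait shape; only `is_preserve` is sketched here.
pub trait ScopeData {
    fn is_preserve(&self) -> bool;
}

const PRESERVED_TAGS: [&str; 6] = ["pre", "code", "kbd", "script", "style", "textarea"];

/// Assumption: a missing lang attribute counts as Korean.
fn is_korean(lang: Option<&str>) -> bool {
    match lang {
        None => true,
        Some(l) => {
            let l = l.to_ascii_lowercase();
            l == "ko" || l.starts_with("ko-")
        }
    }
}

/// Hypothetical scope payload. Tag names are expected in lowercase.
#[derive(Debug, Clone)]
pub struct HtmlScope {
    in_preserved_tag: bool, // this tag or any ancestor is on the preserved list
    lang: Option<String>,   // effective (inherited) lang value
}

impl HtmlScope {
    /// Computes the effective state when the scope is opened, so the engine
    /// never has to walk ancestors or look at attributes.
    pub fn open(parent: Option<&HtmlScope>, tag: &str, own_lang: Option<&str>) -> Self {
        let inherited_tag = parent.map_or(false, |p| p.in_preserved_tag);
        let lang = own_lang
            .map(str::to_string)
            .or_else(|| parent.and_then(|p| p.lang.clone()));
        HtmlScope {
            in_preserved_tag: inherited_tag || PRESERVED_TAGS.contains(&tag),
            lang,
        }
    }
}

impl ScopeData for HtmlScope {
    fn is_preserve(&self) -> bool {
        self.in_preserved_tag || !is_korean(self.lang.as_deref())
    }
}
```

Note that the two concerns inherit differently in this sketch: a preserved ancestor stays in force for every descendant, while a nested lang="ko" can switch conversion back on inside a non-Korean element.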
Engine
The engine has three duties: identify convertible lexical spans that contain
hanja, segment those spans into dictionary and fallback edges, and emit
annotations and fallback text accordingly. Most spans are contiguous runs of
hanja, but dictionary entries may also include Korean native or hangul
fragments around hanja, as in 汽車길 or 色깔論. It also has a fallback
phoneticizer that maps single hanja characters to hangul and applies the
initial sound law where appropriate, and a numeral converter that translates
hanja digits to either hangul or Arabic numerals depending on configuration.
Lattice segmentation
Segmenting a hanja-containing span into dictionary words and fallback fragments is the engine's core decision, and the obvious algorithm is wrong.
The obvious algorithm is eager longest-match: scan left to right, at each
position take the longest dictionary entry that starts there, and continue from
its end. This is what Seonbi does. Its failure mode is mis-segmentation when a
longer prefix consumes characters that belong to a more natural split. Consider
a dictionary containing 行事, 行事場, 場所, and 入口. On the input
行事場入口, eager segmentation produces 行事場 + 入口, which is correct.
On the input 行事場所, eager segmentation produces 行事場 + 所, which
leaves 所 to the character-by-character fallback even though the segmentation
行事 + 場所 would have covered both segments with the dictionary.
The span is not limited to hanja-only text. Some entries in the Standard Korean
Language Dictionary mix native Korean and hanja, such as 汽車길 for
기찻길, 祭祀날 for 제삿날, 洗手대야 for 세숫대야, 火김 for
홧김, and 色깔論 for 색깔론. A dictionary edge may therefore consume a
mixed-script prefix from the current text cursor. Fallback edges, however, are
still created only for hanja characters that no dictionary edge covers; hangul
that is not part of a dictionary match passes through as ordinary text.
The correct algorithm is dynamic programming over a lattice. For each character
position i in the conversion span, the engine queries the dictionary for
every match that starts at i and considers a single-hanja fallback edge as a
backup when the current character is hanja. A ranking function compares the
alternatives; we choose the best segmentation by Viterbi-style backtracking
from the end of the span.
The ranking function is deliberately simple. It first maximizes the number of
characters covered by dictionary matches, so 行事 + 場所 beats 行事場 +
所 because the former leaves no fallback. Among paths with the same
dictionary coverage, it prefers fewer segments, so a whole-word match such as
天地 beats the component split 天 + 地. Remaining ties are kept
deterministic by preserving the first candidate that reached the same score.
The cost of lattice segmentation is bounded by the conversion-span length times the maximum dictionary entry length. In normal Korean text, spans that contain hanja are short: most are one to four characters, almost never more than ten, except for rare mixed-script dictionary entries. The dictionary's maximum entry length is on the order of ten characters. The per-span runtime is therefore tens of dictionary lookups, which the FST and CDB backends handle in microseconds.
An eager segmentation strategy remains available as an option for callers who do not need lattice accuracy and want to reduce per-span overhead. The default is lattice.
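The lattice search and its ranking can be sketched in a few lines. This is a simplified illustration, not the engine's implementation: it assumes the span contains only hanja, uses a HashMap as a stand-in dictionary, and the Segment type and segment function are names invented for the example.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Segment {
    Dict(String),     // covered by a dictionary entry
    Fallback(String), // a single hanja left to the fallback phoneticizer
}

/// Dynamic programming over character positions.
/// Score = (characters covered by dictionary entries, number of segments):
/// higher coverage wins, then fewer segments; on a full tie the first
/// candidate that reached the score is kept.
pub fn segment(span: &str, dict: &HashMap<&str, &str>, max_len: usize) -> Vec<Segment> {
    let chars: Vec<char> = span.chars().collect();
    let n = chars.len();
    // best[j] = (coverage, segments, start of last edge, last edge is a dictionary edge)
    let mut best: Vec<Option<(usize, usize, usize, bool)>> = vec![None; n + 1];
    best[0] = Some((0, 0, 0, false));
    for i in 0..n {
        let Some((cov, segs, _, _)) = best[i] else { continue };
        // Candidate edges from i: every dictionary match, then the fallback edge.
        let mut edges: Vec<(usize, bool)> = Vec::new();
        for len in 1..=max_len.min(n - i) {
            let key: String = chars[i..i + len].iter().collect();
            if dict.contains_key(key.as_str()) {
                edges.push((len, true));
            }
        }
        edges.push((1, false));
        for (len, is_dict) in edges {
            let cand_cov = cov + if is_dict { len } else { 0 };
            let cand_segs = segs + 1;
            let j = i + len;
            let better = match best[j] {
                None => true,
                Some((c, s, _, _)) => cand_cov > c || (cand_cov == c && cand_segs < s),
            };
            if better {
                best[j] = Some((cand_cov, cand_segs, i, is_dict));
            }
        }
    }
    // Backtrack from the end of the span.
    let mut out = Vec::new();
    let mut j = n;
    while j > 0 {
        let (_, _, i, is_dict) = best[j].expect("fallback edges make every position reachable");
        let text: String = chars[i..j].iter().collect();
        out.push(if is_dict { Segment::Dict(text) } else { Segment::Fallback(text) });
        j = i;
    }
    out.reverse();
    out
}
```

With a dictionary holding 行事, 行事場, 場所, and 入口, this sketch splits 行事場所 into 行事 + 場所 and 行事場入口 into 行事場 + 入口, matching the examples above.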
Fallback phoneticizer
When no dictionary entry covers a hanja character, the fallback phoneticizer
converts the character to hangul by looking up its canonical reading in the
embedded Unihan-derived character map. The character map is built from the
Unicode kHangul property and embedded in gukhanmun-core as a generated
sorted table queried by binary search. It covers thousands of hanja and keeps
the default build independent of network access.
For the first character of a word produced by the fallback, that is, the first
character of a fallback-only run or the first character after a dictionary
match, the initial sound law (頭音法則) optionally applies. The law converts a
small set of word-initial hangul syllables that originally began with ㄴ or ㄹ
into their South Korean orthography forms (녀 becomes 여, 려 becomes 여, 례
becomes 예, and so on). The conversion table is small (sixteen entries) and
lives in gukhanmun-core. The toggle for the law applies only to the fallback;
dictionary entries are assumed to encode the correct reading already (a South
Korean dictionary stores 來日 as 내일, a North Korean dictionary stores it as
래일).
Two small rules from Seonbi survive in the fallback because they are not
learnable from per-character mappings alone. First, the ryeol-ryul rule
(列, 律) is treated as part of the initial sound law: when the law is
enabled and these characters follow a syllable whose final jamo is ㄴ or
absent, their pronunciation glides to 열 or 율 rather than 렬 or 률. When the
law is disabled, as in North Korean orthography, they remain 렬 or 률. Second,
the hanja-numeral rule: a sequence of two or more hanja digits is read as a
single word with the initial sound law applied at its head and not at internal
positions. Both rules are encoded as small parsers over the character stream.
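The two rules can be illustrated on hangul readings directly. The sketch below is an assumption-laden stand-in: it replaces the lookup table in gukhanmun-core with Unicode jamo arithmetic, and it applies the ryeol-ryul rule to the readings 렬 and 률 rather than to specific hanja. The function names are invented for the example.

```rust
const HANGUL_BASE: u32 = 0xAC00;
const NIEUN: u32 = 2; // initial-consonant index of ㄴ
const RIEUL: u32 = 5; // initial-consonant index of ㄹ
const IEUNG: u32 = 11; // initial-consonant index of ㅇ

/// Splits a precomposed syllable into (initial, medial, final) indices.
fn decompose(c: char) -> Option<(u32, u32, u32)> {
    let code = c as u32;
    if !(0xAC00..=0xD7A3).contains(&code) {
        return None;
    }
    let off = code - HANGUL_BASE;
    Some((off / 588, (off % 588) / 28, off % 28))
}

fn compose(initial: u32, medial: u32, fin: u32) -> char {
    char::from_u32(HANGUL_BASE + (initial * 21 + medial) * 28 + fin)
        .expect("indices come from a valid syllable")
}

/// Word-initial rule of South Korean orthography.
/// Medial indices: ㅏ=0 ㅐ=1 ㅑ=2 ㅕ=6 ㅖ=7 ㅗ=8 ㅚ=11 ㅛ=12 ㅜ=13 ㅠ=17 ㅡ=18 ㅣ=20.
pub fn initial_sound_law(c: char) -> char {
    let Some((initial, medial, fin)) = decompose(c) else { return c };
    let new_initial = match (initial, medial) {
        (NIEUN, 6 | 12 | 17 | 20) => IEUNG,         // 녀 뇨 뉴 니
        (RIEUL, 2 | 6 | 7 | 12 | 17 | 20) => IEUNG, // 랴 려 례 료 류 리
        (RIEUL, 0 | 1 | 8 | 11 | 13 | 18) => NIEUN, // 라 래 로 뢰 루 르
        _ => initial,
    };
    compose(new_initial, medial, fin)
}

/// Non-initial 렬/률 after a syllable that ends in a vowel or in ㄴ.
/// Final index 0 means no final consonant; 4 is ㄴ.
pub fn ryeol_ryul(prev: char, c: char) -> char {
    let after_vowel_or_n = matches!(decompose(prev), Some((_, _, 0 | 4)));
    match c {
        '렬' if after_vowel_or_n => '열',
        '률' if after_vowel_or_n => '율',
        _ => c,
    }
}
```

A toggle for the law would simply skip both functions, which leaves 래일-style readings intact for North Korean output.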
Numeral conversion
Hanja numerals are a special case because they can be converted to three
different surface forms and the right one depends on context. The library
exposes a NumeralStrategy option with four variants:
- hangul-phonetic
- positional-arabic
- additive-arabic
- smart
The hangul-phonetic strategy is Seonbi's behavior and the default for both
the ko-kr and ko-kp presets. The positional-arabic strategy treats a
digit-only sequence (〇一二三四五六七八九, plus their variants) as positional
notation and converts it to Arabic. The additive-arabic strategy parses
sequences that contain place markers (十百千萬億兆京) using stack-based
accumulation and produces Arabic, handling the Korean convention that 一 is
elidable before 十 (11 is written 十一, not 一十一). The smart strategy
looks at the surrounding context: if a unit hanja follows (年月日時分秒號世紀
and so on), it uses additive-arabic; if not, and the run is pure digits of
length four or more, it uses positional-arabic (matching the year
convention); otherwise it falls back to hangul-phonetic.
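The two Arabic conversions can be sketched as follows. This is an illustration under stated assumptions, not the library's parser: it uses a running section accumulator instead of the stack-based accumulation described above, ignores variant digit forms, and does not validate the order of place markers. The function names are invented for the example.

```rust
/// Maps a hanja digit to its value (variant forms are not handled here).
fn digit(c: char) -> Option<u128> {
    "〇一二三四五六七八九".chars().position(|d| d == c).map(|p| p as u128)
}

/// positional-arabic: each digit maps to one Arabic digit, e.g. years.
pub fn positional_arabic(run: &str) -> Option<String> {
    run.chars()
        .map(|c| digit(c).and_then(|d| char::from_digit(d as u32, 10)))
        .collect()
}

/// additive-arabic: digits multiply the place marker that follows them.
/// Returns None for input this simplified parser does not understand.
pub fn additive_arabic(run: &str) -> Option<u128> {
    let (mut total, mut section) = (0u128, 0u128);
    let mut pending: Option<u128> = None; // digit waiting for its place marker
    for c in run.chars() {
        if let Some(d) = digit(c) {
            if pending.is_some() {
                return None; // two digits in a row: not additive notation
            }
            pending = Some(d);
            continue;
        }
        let small: u128 = match c {
            '十' => 10,
            '百' => 100,
            '千' => 1_000,
            _ => 0,
        };
        let large: u128 = match c {
            '萬' => 10u128.pow(4),
            '億' => 10u128.pow(8),
            '兆' => 10u128.pow(12),
            '京' => 10u128.pow(16),
            _ => 0,
        };
        if small > 0 {
            // An elided 一 counts as 1, so 十 alone is 10.
            section += pending.take().unwrap_or(1) * small;
        } else if large > 0 {
            section += pending.take().unwrap_or(0);
            total += section.max(1) * large; // bare 萬 is 10000
            section = 0;
        } else {
            return None;
        }
    }
    Some(total + section + pending.unwrap_or(0))
}
```

For example, additive_arabic("二千二十四") accumulates 2000, then 20, then 4, while positional_arabic("二〇二四") maps digit by digit; both arrive at 2024 by different routes.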
Numeral conversion runs inside the fallback path on segments that the lattice
has identified as not matching the dictionary. The hangul-phonetic strategy
emits a fallback Annotated token, preserving the original hanja numeral so
renderers such as HangulHanjaParens can still show the source text. Arabic
numeral strategies may emit plain text instead, since their output is a numeric
normalization rather than a hangul reading of the source hanja.
Dictionaries
A dictionary is anything that can be asked what matches start at a given text position.
matches_at is the unusual part. The natural signature for a Korean text
application would be longest_match, but eager longest-match is exactly the
algorithm we reject in the engine. The trait surfaces every match starting at
a position because the lattice segmenter needs the full set in order to score
alternatives. The input string is the text suffix at the current cursor, not
only a pre-cut hanja run; this lets dictionaries contain mixed-script keys such
as 汽車길 as long as the match itself contains at least one hanja character.
A Match carries the matched byte length, the hangul reading, and a
MatchMark. byte_len is the UTF-8 length of the matched dictionary key,
which may include both hanja and hangul:
The marks come from the source dictionary's build-time metadata. The bundled Standard Korean Language Dictionary's CDB and FST files are accompanied by a rules file that enumerates hard-to-read characters and ambiguous readings that should be hanja-annotated.
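A minimal sketch of what such a trait and an in-memory implementation might look like follows. The exact signatures are assumptions: the real trait may return an iterator rather than a Vec, the MatchMark fields here simply mirror the require_hanja and require_hangul columns of the rules file, and this toy dictionary does not enforce that a key contains at least one hanja character.

```rust
use std::collections::BTreeMap;

/// Hypothetical mark layout, mirroring the rules-file columns.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct MatchMark {
    pub require_hanja: bool,
    pub require_hangul: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Match {
    pub byte_len: usize, // UTF-8 length of the matched key
    pub reading: String, // hangul reading of the whole key
    pub mark: MatchMark,
}

pub trait HanjaDictionary {
    /// Every entry whose key is a prefix of `text`, shortest first.
    fn matches_at(&self, text: &str) -> Vec<Match>;
}

/// Toy ordered-map dictionary in the spirit of MapDictionary.
pub struct MapDictionary {
    entries: BTreeMap<String, String>,
    max_key_chars: usize,
}

impl MapDictionary {
    pub fn new(pairs: &[(&str, &str)]) -> Self {
        let entries: BTreeMap<String, String> = pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        let max_key_chars = entries.keys().map(|k| k.chars().count()).max().unwrap_or(0);
        Self { entries, max_key_chars }
    }
}

impl HanjaDictionary for MapDictionary {
    fn matches_at(&self, text: &str) -> Vec<Match> {
        // Try each prefix of `text` that ends on a character boundary.
        text.char_indices()
            .take(self.max_key_chars)
            .filter_map(|(start, c)| {
                let end = start + c.len_utf8();
                self.entries.get(&text[..end]).map(|reading| Match {
                    byte_len: end,
                    reading: reading.clone(),
                    mark: MatchMark::default(),
                })
            })
            .collect()
    }
}
```

Given entries for 汽車 and 汽車길, a call on the suffix "汽車길을 따라" yields both matches (6 and 9 bytes), which is the full set the lattice segmenter needs to score.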
Built-in implementations
UnihanCharDict exposes the per-character Unihan reading table as a
HanjaDictionary so callers can compose those readings through the same public
dictionary interface as other sources. It returns canonical pre-initial-sound-
law readings from a generated sorted table built from the Unicode kHangul
property; stateful fallback rules such as initial sound law and numeral
grouping remain engine behavior.
MapDictionary is the small in-memory dictionary used for tests,
programmatically supplied entries, and custom vocabularies that are already in
process memory. It is backed by an ordered map so it stays dependency-light and
usable in the no_std core crate. Callers that need compact serialized data,
mmap-friendly loading, or large static dictionaries should use the
gukhanmun-fst backend instead.
ChainDictionary composes a sequence of dictionaries with a precedence policy.
A caller can chain a small user dictionary (highest priority), a
domain-specific dictionary, the Standard Korean Language Dictionary, and, when
canonical single-character dictionary matches are desired, UnihanCharDict
(lowest priority).
External backends
Two external dictionary backends ship as separate crates.
gukhanmun-cdb wraps a CDB file as a HanjaDictionary. CDB is djb's constant
database: a static on-disk hash table with $O(1)$ lookups and trivial format
documentation. The naive use of CDB as a hanja-to-hangul map fails because CDB
is a hash table without prefix iteration; there is no way to ask whether some
key starts with a given byte sequence. We work around this by encoding the
dictionary as a trie embedded in the CDB key space. At build time,
gukhanmun-mkdict enumerates every prefix of every entry and stores a record
for each, with a one-byte flag distinguishing complete words from intermediate
prefixes:
Lookup walks one character at a time from the cursor position. On a miss, no
longer match is possible from this position and the walk terminates. On a hit
with is_complete = 1, the match is yielded; on a hit with is_complete = 0,
the walk continues to look for longer matches. The cost is
$O(\text{max\_word\_chars})$ CDB lookups per position, and each CDB lookup
is $O(1)$. The size cost is real: every prefix of every entry occupies a
record, so the bundled stdict CDB is roughly twice the size of the source TSV.
The trade-off is that CDB's simplicity (six syscalls of file format,
public-domain reference implementations) makes the backend trivially auditable.
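The prefix encoding and the lookup walk can be sketched with a HashMap standing in for the CDB file, since both offer exact-key lookup only. The record layout here (a bool in place of the one-byte flag, string keys) and the PrefixTable name are assumptions for illustration, not the on-disk format.

```rust
use std::collections::HashMap;

/// Stand-in for a CDB file: exact-key lookups only, no prefix iteration.
pub struct PrefixTable {
    // key prefix -> (is_complete, reading); reading is empty for pure prefixes
    records: HashMap<String, (bool, String)>,
}

impl PrefixTable {
    /// Build step: store one record per prefix of every entry.
    pub fn build(entries: &[(&str, &str)]) -> Self {
        let mut records: HashMap<String, (bool, String)> = HashMap::new();
        for &(key, reading) in entries {
            for (start, c) in key.char_indices() {
                let prefix = &key[..start + c.len_utf8()];
                let complete = prefix.len() == key.len();
                let rec = records
                    .entry(prefix.to_string())
                    .or_insert((false, String::new()));
                if complete {
                    *rec = (true, reading.to_string());
                }
            }
        }
        Self { records }
    }

    pub fn record_count(&self) -> usize {
        self.records.len()
    }

    /// Lookup step: walk one character at a time from the cursor.
    /// Returns (byte_len, reading) for every complete word found.
    pub fn matches_at(&self, text: &str) -> Vec<(usize, String)> {
        let mut out = Vec::new();
        for (start, c) in text.char_indices() {
            let end = start + c.len_utf8();
            match self.records.get(&text[..end]) {
                None => break, // no stored prefix, so no longer match can exist
                Some((true, reading)) => out.push((end, reading.clone())),
                Some((false, _)) => {} // intermediate prefix: keep walking
            }
        }
        out
    }
}
```

Two entries, 行事 and 行事場, produce three records (行, 行事, 行事場), which shows where the size overhead comes from: shared prefixes are stored once, but every distinct prefix costs a record.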
gukhanmun-fst wraps an fst::Map as a HanjaDictionary. The FST (finite
state transducer) supports prefix iteration natively and compresses better than
CDB-as-trie, but its on-disk format is less universally implemented. We provide
both because users have different priorities; CDB is the choice for
code-auditability and trivial mmap support, while FST is the choice for small
WebAssembly bundles.
Dictionary tooling
gukhanmun-mkdict is a separate CLI for building CDB and FST dictionary files. It accepts TSV, CSV, and JSON Lines input, supports merging multiple input files with a configurable conflict policy, validates the result with a round-trip pass, and embeds build metadata (source, license, build date) in the dictionary header.
A rules file is a TSV with columns kind, pattern, require_hanja,
require_hangul, reason. The kind selects one of three matchers:
entry (exact hanja key), contains (any entry whose hanja key contains
the hanja substring pattern), or reading (any entry whose hangul reading
equals pattern). contains patterns must be hanja-only because dictionary
keys can be mixed-script (e.g. 布告하다); accepting a hangul substring
would silently mark unrelated entries. Multiple rules touching the same
entry are OR-merged on the mark bits. A rule must set at least one of
require_hanja / require_hangul, must carry a non-empty reason, and
must match at least one entry; unmatched rules fail the build by default so
that the rules file does not drift away from the dictionary it annotates.
--allow-unmatched-rules is an escape hatch for partial dictionaries that share
a rules file with a larger build.
The same binary is invoked by gukhanmun-stdict's build script to produce the bundled Standard Korean Language Dictionary FST file, so end-users and the library itself share the build path. The CDB backend uses the same normalized inputs and validation path for user-built dictionaries. Bugs in the build path get caught by the library's own integration tests before they reach users.
Middlewares
The renderer's input is an OutputToken stream where each Annotated token
carries flags. The flags describe what the engine knows about the annotation:
was it from the dictionary, is there a homophone, is this the first time we
have seen this word. What they do not describe is what the renderer should do
about it; that is set by middlewares.
Splitting policy (which annotations should be presented with hanja, which with
hangul only, when does “first” reset) from form (parentheses, ruby, hangul
only) gives us a cleaner pipeline than Seonbi has. In Seonbi, the rendering
function decides both what to present and how. As a result, the
homophone-disambiguating renderer must internally re-implement a
homophone-detecting pass, and there is no way to swap in a different homophone
heuristic without rewriting the renderer. In Gukhanmun the middlewares are
stateful filters on the OutputToken stream, and the renderer is a usually
stateless translator from Annotated tokens to concrete text and markup.
Built-in middlewares
HomophoneMarker scans the stream and sets homophone = true on annotations
whose hangul reading is shared by another hanja form in the effective
dictionary entry set or within the configured context window. It builds a
single reading-to-hanja index from HanjaDictionary::entries() when the
backend exposes entries; lookup-only dictionaries fall back to
has_homophone() and still get context-local marking. The window is one of
per-block (default), per-document, or off. Per-block windows buffer only
until the next scope whose
is_block_boundary() returns true, which is typically a paragraph or a list
item. Per-document windows buffer the whole stream and are appropriate only
when the input is small or when full accuracy matters more than latency.
Plain text has no block scopes, so per-block is document-wide there. This is
intentional: if 漢字 appears on line 1 and 翰字 appears on line 100, both
lines must render with disambiguating hanja, and a streaming writer cannot
retroactively rewrite line 1 after line 100 is seen.
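The buffering requirement follows directly from the shape of the computation, as the sketch below shows for the context-local half of the marker. It is an illustration only: the pared-down Annotation struct and the mark_homophones function are assumptions, and the dictionary-wide reading index is left out.

```rust
use std::collections::{HashMap, HashSet};

/// Pared-down annotation for this example.
#[derive(Debug, Clone, PartialEq)]
pub struct Annotation {
    pub hanja: String,
    pub hangul: String,
    pub homophone: bool,
}

/// Marks every annotation in one context window whose reading is shared by
/// a different hanja form in the same window.
pub fn mark_homophones(window: &mut [Annotation]) {
    // Pass 1: collect the distinct hanja forms seen for each reading.
    let mut forms: HashMap<String, HashSet<String>> = HashMap::new();
    for a in window.iter() {
        forms
            .entry(a.hangul.clone())
            .or_default()
            .insert(a.hanja.clone());
    }
    // Pass 2: only now can earlier annotations be flagged.
    for a in window.iter_mut() {
        if forms[&a.hangul].len() > 1 {
            a.homophone = true;
        }
    }
}
```

The second pass may flag the very first annotation because of something seen last, so nothing in the window can be written out until the window closes.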
FirstOccurrenceFilter clears require_hanja and require_hangul on
annotations after their first occurrence within a configurable context, leaving
the first occurrence as-is so the reader still encounters the gloss once. The
context is one of per-block, per-section, or per-document. The section
variant resets at any heading boundary, which the HTML and Markdown adapters
expose through is_section_boundary() on heading scopes.
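Unlike homophone marking, this filter needs no lookahead. A minimal sketch, assuming a simplified token type (Tok) in which a Boundary variant stands for whichever block, section, or document boundary the configuration selects:

```rust
use std::collections::HashSet;

/// Simplified stream item for this example.
#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    Boundary, // the configured context ends here
    Annotated {
        hanja: String,
        require_hanja: bool,
        require_hangul: bool,
    },
}

/// Keeps the gloss flags on the first occurrence of each hanja form within a
/// context and clears them on later occurrences. Streams item by item.
pub fn first_occurrence_filter(
    upstream: impl Iterator<Item = Tok>,
) -> impl Iterator<Item = Tok> {
    let mut seen: HashSet<String> = HashSet::new();
    upstream.map(move |tok| match tok {
        Tok::Boundary => {
            seen.clear(); // a new context starts with a clean slate
            Tok::Boundary
        }
        Tok::Annotated { hanja, require_hanja, require_hangul } => {
            let first = seen.insert(hanja.clone()); // true if not seen before
            Tok::Annotated {
                hanja,
                require_hanja: require_hanja && first,
                require_hangul: require_hangul && first,
            }
        }
    })
}
```

The only state is the set of forms already seen, which is why this filter composes with streaming output where a per-document homophone window does not.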
UserDirectives applies a user-supplied set of rules. A rule is a predicate
over the hanja form plus an action: set require_hanja, set require_hangul,
or skip the annotation entirely (which collapses to plain hangul or plain hanja
text depending on the active renderer). The rule predicate may be a literal
string set, a glob, or an arbitrary closure for Rust callers; the JavaScript
bindings expose only the literal-set form to avoid the cost of per-token
cross-boundary calls.
Custom middlewares
A middleware is an impl Iterator<Item = OutputToken<S>> taking an upstream
iterator. The trait surface is small enough to write inline. Users who want,
for example, to mark technical-term annotations from a glossary write a
middleware that holds the glossary set and updates require_hanja on hits.
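The glossary example might be written as an iterator adapter along these lines. This is a sketch under assumptions: the OutputToken enum here is a simplified, non-generic stand-in for the real token type, and GlossaryMarker is a name invented for the example.

```rust
use std::collections::HashSet;

/// Simplified stand-in for the real output token type.
#[derive(Debug, Clone, PartialEq)]
pub enum OutputToken {
    Text(String),
    Annotated {
        hanja: String,
        hangul: String,
        require_hanja: bool,
    },
}

/// Hypothetical middleware: wraps an upstream iterator and sets
/// `require_hanja` on annotations whose hanja form is in the glossary.
pub struct GlossaryMarker<I> {
    pub upstream: I,
    pub glossary: HashSet<String>,
}

impl<I: Iterator<Item = OutputToken>> Iterator for GlossaryMarker<I> {
    type Item = OutputToken;

    fn next(&mut self) -> Option<OutputToken> {
        let token = self.upstream.next()?;
        Some(match token {
            OutputToken::Annotated { hanja, hangul, require_hanja } => {
                let hit = self.glossary.contains(&hanja);
                OutputToken::Annotated {
                    hanja,
                    hangul,
                    require_hanja: require_hanja || hit,
                }
            }
            other => other, // everything else passes through unchanged
        })
    }
}
```

Because the adapter is itself an iterator, it can be stacked before or after the built-in middlewares without either side knowing about the other.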
Renderers
The renderer expands Annotated tokens into concrete Text, Open, and
Close tokens according to its mode and the annotation's flags. Five renderers
ship with the library.
The HangulOnly renderer emits the hangul reading alone. If require_hanja or
homophone is set, it emits 한글(한자). If require_hangul is set, the
result is already hangul, so nothing changes.
The HangulHanjaParens renderer always emits 한글(한자). The
require_hangul flag is satisfied by the hangul half, and require_hanja by
the hanja half.
The HanjaHangulParens renderer always emits 한자(한글). This is useful for
academic and historical-document styles that lead with hanja.
The Ruby renderer emits a <ruby> element with a sub-mode that determines
which side is the base: <ruby>한글<rt>한자</rt></ruby> for on-hangul, and
<ruby>한자<rt>한글</rt></ruby> for on-hanja. If the current scope returns
false from allows_inline_markup(), the renderer falls back to parens.
The Original renderer emits the original hanja as plain text. Only
annotations with require_hangul, or those marked by a user directive, receive
a gloss; the gloss appears in either parens or ruby form depending on a
sub-option. This is the mode for “keep mixed script, gloss only the difficult
characters”, which is the style this very design document uses on its Korean
edition.
Renderers are pure functions over a single Annotated token plus the current
scope's allows_inline_markup() value. They produce a small fixed-size
sequence of output tokens (one for plain hangul or hanja, three for parens,
between five and nine for ruby). They contain no state and no buffering.
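The pure-function character of rendering can be shown with a small sketch. It departs from the real renderers in ways worth stating: it returns a String instead of a token sequence, omits the Original mode, and assumes that the parenthesized fallback for ruby keeps the same base-first order. The Mode enum and render function are names invented for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode {
    HangulOnly,
    HangulHanjaParens,
    HanjaHangulParens,
    RubyOnHangul,
    RubyOnHanja,
}

/// Pared-down annotation for this example.
pub struct Annotation<'a> {
    pub hanja: &'a str,
    pub hangul: &'a str,
    pub require_hanja: bool,
    pub homophone: bool,
}

/// Output depends only on the annotation, the mode, and one scope predicate.
pub fn render(a: &Annotation<'_>, mode: Mode, allows_inline_markup: bool) -> String {
    match mode {
        Mode::HangulOnly if a.require_hanja || a.homophone => {
            format!("{}({})", a.hangul, a.hanja)
        }
        Mode::HangulOnly => a.hangul.to_string(),
        Mode::HangulHanjaParens => format!("{}({})", a.hangul, a.hanja),
        Mode::HanjaHangulParens => format!("{}({})", a.hanja, a.hangul),
        Mode::RubyOnHangul if allows_inline_markup => {
            format!("<ruby>{}<rt>{}</rt></ruby>", a.hangul, a.hanja)
        }
        Mode::RubyOnHanja if allows_inline_markup => {
            format!("<ruby>{}<rt>{}</rt></ruby>", a.hanja, a.hangul)
        }
        // Ruby is not allowed in this scope: fall back to parens (assumed order).
        Mode::RubyOnHangul => format!("{}({})", a.hangul, a.hanja),
        Mode::RubyOnHanja => format!("{}({})", a.hanja, a.hangul),
    }
}
```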
The renderer is the right place to make form decisions because the form depends
on what other tokens are in the stream at the same scope position; for example,
<ruby> inside <pre> is wrong, and the decision needs the scope stack, which
is what flows through the IR. Putting form decisions elsewhere would either
duplicate the scope tracking or require the renderer to know about other
middlewares.
Format adapters
Three adapters ship with the library. A new adapter requires only Reader and
Writer implementations that translate between the format's tokens and the IR.
HTML
The HTML adapter implements a hand-written scanner that produces
InputToken<HtmlScopeData> events. The scanner is a near-direct port of
Seonbi's Text.Seonbi.Html.Scanner module, which has been used on real-world
Korean web content for several years. It is fragment-oriented: the input may be
a complete document, a body, or a fragment, and the scanner emits the events it
sees rather than attempting tree construction.
We considered using html5ever or lol_html. html5ever is the reference HTML5 parser for Rust; it produces a DOM and handles every edge case in the HTML5 specification. lol_html is Cloudflare's streaming HTML rewriter; it is selector-driven and integrates with WebAssembly well. We chose a hand-written scanner because the Seonbi approach has two specific virtues that fit our model. First, it preserves raw attribute strings rather than parsing them, which means the writer can serialize an unchanged scope exactly as it appeared in the input. Second, it is small enough to fit comfortably in a WebAssembly bundle, where bringing along html5ever would dominate the binary size. The cost is that our scanner is not HTML5-conformant; we accept the trade-off.
The scanner recovers from minor errors. Mismatched closing tags pop the most
recently opened scope of the same name, or, if none, are emitted as text. Unrecognized
constructs that begin with < but are not valid tag, comment, or CDATA starts
are emitted as text characters. Entirely malformed input still produces a token
stream, just one with structural anomalies; the engine is robust to anomalies
(it does not assume scope matching) and the writer emits whatever scopes it
received.
HtmlScopeData::is_preserve() returns true if the current element is one of
pre, code, kbd, script, style, or textarea, or if the inherited
lang attribute is not Korean. The lang inheritance is computed inside the
adapter: each Open event evaluates its raw attributes for a lang value, and
the adapter maintains its own lang stack. This is the only place in the
codebase that knows what lang means, and the only place that knows the
preserved-tag list. The engine receives the consequences as is_preserve() and
does not reproduce either rule.
Markdown
The Markdown adapter is a thin layer over pulldown-cmark. Each
pulldown-cmark::Event becomes one or more IR tokens. Start(Tag) and
End(Tag) become Open and Close. Text becomes Text. Code becomes
Verbatim. Inline HTML, which pulldown-cmark exposes as Event::Html, is
passed through a second pass of the HTML scanner so that constructs like
<q lang="ja"> inside a paragraph receive proper lang handling.
Output is via pulldown-cmark-to-cmark, which serializes an event stream back
to Markdown. Semantic preservation is contractual: the output, when re-parsed,
produces the same logical structure. Byte-for-byte preservation is best effort:
setext headings may become ATX, link reference definitions may be inlined, soft
breaks may be regularized. Users who require byte-fidelity should process
rendered HTML rather than Markdown.
We chose pulldown-cmark rather than writing a Markdown parser because the
work of CommonMark conformance is large and well-handled by pulldown-cmark,
and because the event-stream API is a near-perfect match for our IR. The
alternative was markdown-rs, which is more recent and produces an AST rather
than an event stream; we preferred event streams for the streaming property.
Plain text
The plain-text adapter wraps the entire input in a single scope and emits one
Text token. The output is the concatenation of Text tokens. Ruby rendering
is not meaningful in plain text and falls back to parens. The CLI can stream
plain text before EOF only when no document-wide middleware can change already
rendered output. In practice, the ko-kp preset streams because homophone
marking is off, while the default ko-kr preset keeps the plain-text output
until EOF so cross-line homophones are rendered correctly.
Distribution
Rust workspace
The Rust source is organized as a Cargo workspace with the following layout:
- Cargo.toml: workspace manifest
- DESIGN.md: symbolic link to DESIGN.en.md
- DESIGN.en.md, DESIGN.ko-Kore.md: design documentation
- crates/
  - gukhanmun-core/: IR types, engine, dictionary trait, lattice
    segmenter, fallback phoneticizer, initial sound law tables, embedded
    UnihanCharDict. No I/O, no format-specific code, minimal dependencies.
    Suitable for no_std environments with alloc.
  - gukhanmun-html/: HTML scanner and serializer. HtmlScopeData
    implementation.
  - gukhanmun-markdown/: Markdown adapter atop pulldown-cmark.
  - gukhanmun-cdb/: CDB-trie dictionary backend.
  - gukhanmun-fst/: FST dictionary backend.
  - gukhanmun-stdict/: bundled Standard Korean Language Dictionary as an
    embedded FST byte array.
  - gukhanmun-mkdict/: CLI for building CDB and FST dictionaries from TSV,
    CSV, or JSONL inputs.
  - gukhanmun/: the umbrella library crate. Re-exports from the others
    under feature flags, exposes the high-level Builder API, defines the
    umbrella Error enum.
  - gukhanmun-cli/: the gukhanmun command-line binary.
  - gukhanmun-wasm/: WebAssembly bindings via wasm-bindgen.
  - gukhanmun-napi/: Node-API bindings via napi-rs.
The umbrella crate's feature flags compose the others. Default features enable
HTML, Markdown, and the bundled stdict. CDB and FST are individually
selectable. Disabling everything yields a Rust-API-only build with just the
engine and UnihanCharDict, suitable for embedded targets.
JavaScript packages
The JavaScript side is split into a type-only package and one package per runtime implementation:
- @gukhanmun/types (npm and JSR): TypeScript interfaces, type aliases, error class declarations, and the GukhanmunFactory interface. Contains no runtime code; npm emits declarations only, while JSR receives the .ts source directly. This is the canonical API contract; both implementations satisfy it structurally.
- @gukhanmun/wasm (npm and JSR): WebAssembly implementation. Re-exports the types. Loads its .wasm artifact via import.meta.url so that Deno and browsers can resolve it natively and Node 22+ can resolve it via the standard ESM loader.
- @gukhanmun/napi (npm only): Node-API implementation. Re-exports the types. Ships per-platform prebuilt binaries through napi-rs's optional-dependency packaging.
The data dictionaries ship as separate packages so that the runtime bundle stays small:
- @gukhanmun/stdict-fst (npm and JSR): the bundled stdict as an FST file, exported as a Uint8Array.
- @gukhanmun/stdict-cdb (npm and JSR): the same dictionary as a CDB file.
- @gukhanmun/stdict-min (npm and JSR): a reduced FST containing only homophonous entries and ambiguous readings, for size-sensitive contexts.
The reason for the type-only package is that the canonical API contract should
live in exactly one place. If the contract lived in @gukhanmun/wasm and
@gukhanmun/napi duplicated the types or imported them from one another,
version skew between the implementations would become a maintenance burden.
With @gukhanmun/types as a peerDependency of both implementations, users
get a single source of truth and a single set of types regardless of which
runtime they pick. We considered, and rejected, two alternatives: keeping the
types in the WASM package and having NAPI depend on it directly (asymmetric,
makes NAPI subordinate to WASM); and duplicating the types in both packages
(drift between copies, no obvious home for the canonical TSDoc comments).
Option enums on the JavaScript side are string union types rather than
const-asserted objects. This keeps @gukhanmun/types genuinely type-only: it
emits zero bytes of runtime code, which matters for bundles and which means the
JSR package's source can be a single .ts file with no transpilation step. The
trade-off is that the option strings are stringly typed at runtime; both
implementations validate them at the boundary and throw a GukhanmunError with
code invalid-input on unrecognized values.
Streaming on the JavaScript side uses the platform
TransformStream<string, string> interface, which is available across
browsers, Deno, Node 18+, and Bun. Chunks are JavaScript strings; encoding
concerns (TextDecoderStream, TextEncoderStream) live outside the gukhanmun
stream. Within the engine, a chunk that ends in the middle of a conversion
span, or close enough to a hanja character that a mixed-script dictionary key
could still cross the boundary, causes that trailing span to be held until the
next chunk arrives; everything before that point is flushed eagerly. The
dictionary lookahead part of that buffer is bounded by the dictionary's
max_word_chars plus a small constant for the lattice's outgoing state,
typically a few dozen characters. Fallback-only hanja runs are deliberately not
split at chunk boundaries, because render modes that show source hanja expose
annotation grouping; those runs flush at a later non-convertible boundary or
EOF.
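The hold-back decision at a chunk boundary can be sketched in Rust as follows. This is a simplified illustration, not the engine's code: split_flushable and is_hanja are hypothetical names, the hanja test covers only the main CJK ideograph blocks, and lookahead stands in for max_word_chars plus the lattice constant. It also starts the held span at a hanja run, so it does not model dictionary keys that begin with hangul.

```rust
/// Rough hanja test for this sketch: the main CJK ideograph blocks only.
fn is_hanja(c: char) -> bool {
    matches!(
        c,
        '\u{3400}'..='\u{4DBF}'     // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{F900}'..='\u{FAFF}'   // CJK Compatibility Ideographs
    )
}

/// Splits `buf` into (flush now, hold for the next chunk).
/// `lookahead` is a character count, an assumption standing in for
/// max_word_chars plus the small lattice constant.
pub fn split_flushable(buf: &str, lookahead: usize) -> (&str, &str) {
    let chars: Vec<(usize, char)> = buf.char_indices().collect();
    let tail_start = chars.len().saturating_sub(lookahead);
    // First hanja inside the lookahead window at the end of the chunk.
    let mut i = match (tail_start..chars.len()).find(|&i| is_hanja(chars[i].1)) {
        Some(i) => i,
        None => return (buf, ""), // nothing near the end can still grow
    };
    // Extend backwards so a contiguous hanja run is never cut in two.
    while i > 0 && is_hanja(chars[i - 1].1) {
        i -= 1;
    }
    buf.split_at(chars[i].0)
}
```

With a lookahead of 8, the chunk "안녕 漢字" flushes "안녕 " and holds "漢字"; a chunk whose last few characters are all hangul is flushed whole.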
Stateful middlewares can add their own lookahead requirement. A homophone marker
with a document-wide context must buffer until EOF to preserve exact rendering.
The per-block mode behaves the same way on plain text, because plain text has no
block scopes and the whole document acts as one block. Disabling that
middleware restores early streaming but also disables cross-line homophone
disambiguation.
We chose strings rather than Uint8Array for the streaming type because the
engine fundamentally works on Unicode scalar values: byte-level chunking would
force the adapter to do partial-codepoint reassembly at every boundary, which
the platform's TextDecoderStream already does correctly. Users who have a
byte stream chain it through TextDecoderStream and then through the gukhanmun
transform.
Dictionary configuration
The JavaScript dictionary configuration accepts either a file source or an in-memory map:
The two variants are distinguished by instanceof Map at runtime, and by
structural typing at compile time. The Map form is convenient for small
custom vocabularies created in code; the file form is for shipped dictionaries.
Registry matrix
CLI binaries ship as platform releases (Linux x86_64 and aarch64, macOS arm64 and x86_64, Windows x86_64) attached to the GitHub Releases for each version.
Versioning is lockstep across all packages. Every release tag advances every crate's version (in the Rust workspace) and every JavaScript package's version in tandem. Some packages have no functional change at a given release; their version still advances so that the cross-language story is unambiguous. We chose lockstep over per-package semver because the cost of mis-coordinated dependency ranges (a user installing @gukhanmun/wasm@1.2 with @gukhanmun/types@1.3 and getting confusing type errors) outweighs the cost of an occasional no-op version bump.
The CI workflow that fires on tag push publishes to crates.io, builds the per-platform NAPI prebuilts in parallel, publishes to npm, and publishes to JSR. Re-running a publish on the same tag is a no-op against the registries that reject overwriting.
Engineering policies
Errors
Each crate defines its own error enum via thiserror. The umbrella gukhanmun
crate aggregates them with #[from] so that callers can use ? across crate
boundaries without manual conversion. The pattern:
The #[non_exhaustive] attribute lets us add new variants in minor releases
without breaking callers; downstream match expressions are required to have a
wildcard arm. Each variant carries enough structured data to drive both
human-readable messages and machine consumers. std::error::Error::source()
chains are preserved through #[from] and #[source] so that walking an error
gives a complete trace.
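As a rough illustration of what the aggregation amounts to, the following std-only sketch writes out by hand the impls that thiserror's #[from] would derive. HtmlError, its NUL check, convert, and chain are hypothetical stand-ins for illustration; the real crates derive these impls and define their own variants.

```rust
use std::error::Error as StdError;
use std::fmt;

/// Hypothetical per-crate error, standing in for a scanner-level failure.
#[derive(Debug)]
pub struct HtmlError {
    pub offset: usize,
}

impl fmt::Display for HtmlError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unexpected NUL at byte {}", self.offset)
    }
}

impl StdError for HtmlError {}

/// Hypothetical umbrella enum; #[non_exhaustive] forces downstream wildcard arms.
#[derive(Debug)]
#[non_exhaustive]
pub enum Error {
    Html(HtmlError),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Html(_) => write!(f, "HTML scan failed"),
        }
    }
}

impl StdError for Error {
    // Hand-written equivalent of #[source]: expose the inner error as the cause.
    fn source(&self) -> Option<&(dyn StdError + 'static)> {
        match self {
            Error::Html(e) => Some(e),
        }
    }
}

// Hand-written equivalent of #[from]: lets `?` convert across the boundary.
impl From<HtmlError> for Error {
    fn from(e: HtmlError) -> Self {
        Error::Html(e)
    }
}

fn scan(input: &str) -> Result<(), HtmlError> {
    match input.find('\0') {
        Some(offset) => Err(HtmlError { offset }),
        None => Ok(()),
    }
}

pub fn convert(input: &str) -> Result<usize, Error> {
    scan(input)?; // HtmlError becomes Error through the From impl
    Ok(input.chars().count())
}

/// Walks the source() chain and collects each message, outermost first.
pub fn chain(err: &dyn StdError) -> Vec<String> {
    let mut out = vec![err.to_string()];
    let mut cur = err.source();
    while let Some(e) = cur {
        out.push(e.to_string());
        cur = e.source();
    }
    out
}
```

Calling chain on the error from convert("a\0b") yields the umbrella message followed by the scanner message, which is the kind of trace the bindings can materialize for JavaScript callers.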
Library crates do not use anyhow. The CLI does, because the CLI's job is to
print errors to a human, not to be inspected by other code.
The stream-level recovery policy is configurable. The default is
Recovery::Strict: the engine propagates any reader error and stops.
Recovery::Lenient causes the engine to log the error via tracing and emit a
Verbatim token for the unrecognized region so that downstream tokens still
flow.
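The difference between the two policies can be sketched with a toy token loop. Only the names Recovery::Strict, Recovery::Lenient, and Verbatim come from the text above; the Token and ReadError shapes and the drain function are assumptions made for this sketch, and the tracing call is reduced to a comment so the example stays std-only.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Recovery {
    Strict,
    Lenient,
}

/// Toy token type; the real IR has more variants.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Token {
    Text(String),
    Verbatim(String),
}

/// Hypothetical reader error that carries the unrecognized source text.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ReadError {
    pub raw: String,
}

/// Drains a reader under the given policy.
pub fn drain<I>(reader: I, policy: Recovery) -> Result<Vec<Token>, ReadError>
where
    I: IntoIterator<Item = Result<Token, ReadError>>,
{
    let mut out = Vec::new();
    for item in reader {
        match item {
            Ok(token) => out.push(token),
            Err(e) if policy == Recovery::Lenient => {
                // The engine would log the error through tracing here.
                out.push(Token::Verbatim(e.raw));
            }
            // Strict: propagate the first reader error and stop.
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```

Under Lenient the unrecognized region survives as a Verbatim token and later tokens still reach the writer; under Strict the same input returns the error and nothing after it is read.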
On the JavaScript side, errors are a single class with a discriminant code:
The bindings walk the Rust source() chain at the FFI boundary and materialize
a chain property on the error, so JavaScript callers can inspect causes
without needing further FFI calls.
Logging
gukhanmun-core and its siblings depend on the tracing crate
unconditionally. Library code uses tracing::trace!, tracing::debug!,
tracing::info!, tracing::warn!, and tracing::error! directly. The
overhead when no subscriber is registered is one atomic load and a branch per
call site, well under any threshold worth optimizing for.
Binaries that want to compile out the calls entirely (the WebAssembly build is
the obvious case) enable tracing's release_max_level_off feature in their
own Cargo.toml. That feature works at the binary level: it replaces every
tracing::*! invocation in every dependency with a no-op at compile time,
without requiring library code to be reconfigured.
We considered adding a library-level feature flag to make the tracing
dependency itself optional, with stub macros for the off path. The added
complexity (a log module per crate that conditionally re-exports tracing's
macros or provides stubs) is larger than the binary savings; we will revisit if
WebAssembly bundle measurements call for it.
Testing
The test suite has four parts.
Regression fixtures cover specific bug shapes that have appeared in Seonbi or in early Gukhanmun development. The relevant subset of Seonbi's test/data/ directory ports over directly; each fixture is a pair of input and expected-output files in HTML or Markdown, with a configuration sidecar.
Snapshot tests use insta to compare IR serializations against a stored
JSON. They are most useful for the engine and middleware crates, where the
input and output are token streams rather than text. A failed snapshot prints a
colored diff and offers an interactive accept-or-reject prompt.
Property-based tests use proptest to assert invariants over generated
inputs. The two invariants that matter most: reader-then-writer roundtrips a
token stream losslessly (modulo the documented Markdown best-effort caveats);
and engine-then-renderer applied to plain-hangul input is a no-op (the engine
should not invent annotations from text without hanja).
Conformance tests run the Markdown adapter against a selected subset of the CommonMark specification examples to verify that the adapter does not break syntax that Gukhanmun is not interested in changing.
CI runs all four under stable, beta, and the MSRV (minimum supported Rust version) of the workspace. The WASM build is also exercised for size regressions: a fixed size budget per artifact is enforced, and a regression that exceeds it fails the build.
Presets
The ko-kr preset matches the orthographic and lexical conventions of South
Korea: dictionary-driven readings, lattice segmentation, the initial sound law
applied to fallback fragments, and per-block homophone disambiguation that
emits hanja in parentheses when the reading is ambiguous within a paragraph.
The ko-kp preset matches the North Korean convention of writing Sino-Korean
words in hangul without the initial sound law applied (래일, 류행, 녀자), with
no bundled dictionary because the South Korean stdict's readings would be
incorrect for ko-KP.
The CLI exposes both as --preset ko-kr and --preset ko-kp. Individual
options remain settable to override the preset, for example
--preset ko-kr --no-stdict disables the bundled dictionary while keeping the
other South Korean defaults.
Initial sound law table
The South Korean orthography (한글 맞춤법, Chapter 3, Section 5, Clauses 10 to 12)
converts a small set of word-initial hangul syllables. The table is reproduced
from Seonbi's Text.Seonbi.Hanja module and is the source of truth for
gukhanmun-core's initial_sound_law_table constant.
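For orientation, the core of the rule can also be expressed arithmetically over precomposed syllables. The sketch below is rule-based rather than table-driven, so it is not the initial_sound_law_table constant itself; the function names are hypothetical, and the orthography's exceptions (dependent nouns, 렬/률 after a vowel, and so on) are out of scope.

```rust
const S_BASE: u32 = 0xAC00; // first precomposed hangul syllable, 가
const S_LAST: u32 = 0xD7A3; // last precomposed hangul syllable, 힣
const V_COUNT: u32 = 21; // medial vowels
const T_COUNT: u32 = 28; // finals, including "no final"

/// Applies the word-initial rule to one precomposed syllable.
fn convert_syllable(c: char) -> char {
    const NIEUN: u32 = 2; // initial ㄴ
    const RIEUL: u32 = 5; // initial ㄹ
    const IEUNG: u32 = 11; // initial ㅇ
    let code = c as u32;
    if !(S_BASE..=S_LAST).contains(&code) {
        return c; // not a precomposed hangul syllable
    }
    let idx = code - S_BASE;
    let lead = idx / (V_COUNT * T_COUNT);
    let vowel = idx / T_COUNT % V_COUNT;
    let tail = idx % T_COUNT;
    let new_lead = match (lead, vowel) {
        // 녀, 뇨, 뉴, 니 -> 여, 요, 유, 이
        (NIEUN, 6 | 12 | 17 | 20) => IEUNG,
        // 랴, 려, 례, 료, 류, 리 -> 야, 여, 예, 요, 유, 이
        (RIEUL, 2 | 6 | 7 | 12 | 17 | 20) => IEUNG,
        // 라, 래, 로, 뢰, 루, 르 -> 나, 내, 노, 뇌, 누, 느
        (RIEUL, 0 | 1 | 8 | 11 | 13 | 18) => NIEUN,
        _ => lead,
    };
    char::from_u32(S_BASE + (new_lead * V_COUNT + vowel) * T_COUNT + tail).unwrap_or(c)
}

/// Rewrites only the first syllable of `word`; finals are carried over unchanged.
pub fn apply_initial_sound_law(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => std::iter::once(convert_syllable(first)).chain(chars).collect(),
        None => String::new(),
    }
}
```

Applied to the ko-kp spellings 래일, 류행, and 녀자, this produces the ko-kr forms 내일, 유행, and 여자.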
Hanja numeral table
The fallback phoneticizer and the additive-arabic numeral strategy share a
single table of digit and place-marker hanja with their canonical values.
The additive-arabic strategy treats place markers as multipliers and adjacent
digit hanja as multiplicands, with the Korean elision rule that bare 十,
百, 千 mean 10, 100, 1000 respectively (not 一十, 一百, 一千).
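A minimal sketch of the additive reading, assuming a small illustrative subset of digits and place markers rather than the library's actual table. The function names are hypothetical, positional digit runs such as 二三 are rejected rather than interpreted, and marker ordering is not validated.

```rust
/// Digit hanja and their values (illustrative subset).
fn digit(c: char) -> Option<u64> {
    Some(match c {
        '一' => 1,
        '二' => 2,
        '三' => 3,
        '四' => 4,
        '五' => 5,
        '六' => 6,
        '七' => 7,
        '八' => 8,
        '九' => 9,
        _ => return None,
    })
}

/// Place-marker hanja and their multipliers (illustrative subset).
fn place(c: char) -> Option<u64> {
    Some(match c {
        '十' => 10,
        '百' => 100,
        '千' => 1_000,
        '萬' => 10_000,
        _ => return None,
    })
}

/// Reads a hanja numeral additively; returns None for anything else.
pub fn additive_arabic(s: &str) -> Option<u64> {
    let mut total = 0u64; // groups already closed by 萬
    let mut section = 0u64; // running sum below the current 萬 group
    let mut pending: Option<u64> = None; // digit waiting for its place marker
    for c in s.chars() {
        if let Some(d) = digit(c) {
            if pending.is_some() {
                return None; // two digits in a row: positional, out of scope
            }
            pending = Some(d);
        } else if let Some(p) = place(c) {
            if p == 10_000 {
                section += pending.take().unwrap_or(0);
                total += section.max(1) * p; // bare 萬 counts as 1 x 10000
                section = 0;
            } else {
                // Elision rule: a bare 十, 百, or 千 has an implicit multiplicand of 1.
                section += pending.take().unwrap_or(1) * p;
            }
        } else {
            return None;
        }
    }
    if s.is_empty() {
        return None;
    }
    Some(total + section + pending.unwrap_or(0))
}
```

For example, 三千五百二十一 reads as 3000 + 500 + 20 + 1, and a bare 十 reads as 10 without needing 一十.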
HTML preserved tags
The default HtmlScopeData::is_preserve() returns true for the following tag
names regardless of attribute content: pre, code, kbd, script, style,
textarea.
It additionally returns true when the inherited lang attribute's primary tag
is anything other than ko, kor, or a subtag-prefixed Korean form (ko-KR,
ko-Hang, ko-Kore, kor-KP, and so on). The Korean predicate matches
Seonbi's isKorean.
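The default rule can be approximated as below. This is a sketch under stated assumptions, not the HtmlScopeData implementation: the free functions are hypothetical, tag and language comparisons are made case-insensitive for convenience, and an element with no inherited lang is assumed not to be preserved on language grounds.

```rust
/// Tag names preserved regardless of attribute content.
const PRESERVED_TAGS: [&str; 6] = ["pre", "code", "kbd", "script", "style", "textarea"];

/// True when the primary subtag of a language tag is ko or kor.
pub fn is_korean(lang: &str) -> bool {
    // Compare the whole primary subtag, so that "kok" (Konkani) does not match.
    let primary = lang.split('-').next().unwrap_or("");
    primary.eq_ignore_ascii_case("ko") || primary.eq_ignore_ascii_case("kor")
}

/// Combines the tag-name rule with the inherited-language rule.
pub fn is_preserve(tag: &str, inherited_lang: Option<&str>) -> bool {
    let by_tag = PRESERVED_TAGS.iter().any(|t| tag.eq_ignore_ascii_case(t));
    // Assumption: a missing lang does not trigger preservation.
    let by_lang = inherited_lang.map_or(false, |lang| !is_korean(lang));
    by_tag || by_lang
}
```

Under this sketch, a p element inheriting lang="ko-Kore" is converted, while the same element inheriting lang="en" passes through untouched.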
Users who want to extend the list (for example, to add a project-specific
class="no-translate" attribute) pass an HtmlReaderOptions value with a
preserve_when predicate to read_html_fragment_with_options. The predicate
receives an HtmlElementInfo view of each opened element (its canonical tag
name, the raw attribute slice from the start tag, and the inherited lang
value) and returns true to preserve the scope. A predicate-matched scope
inherits its preserve flag to descendants, mirroring how the built-in
preserved tags propagate, so callers do not have to re-assert the rule on
every child. The CLI exposes the two most common shapes of this hook as
--html-preserve-class CLASS and --html-preserve-attr KEY[=VALUE] (both
repeatable, OR-composed, valid only with --format text/html). A
format-neutral skip closure on EngineOptions is contemplated for a future
release; it is not implemented today because every adapter that currently
ships can satisfy its preserve needs through its own ScopeData.
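How a predicate match propagates to descendants can be modeled with a small scope stack. Everything in this sketch is a hypothetical stand-in: ElementInfo only loosely mirrors the HtmlElementInfo view described above, Event is a toy scanner event, and has_class imitates the class-based shape of the CLI flag. The real signatures may differ.

```rust
/// Hypothetical stand-in for the element view passed to the predicate.
pub struct ElementInfo<'a> {
    pub tag: &'a str,
    pub attrs: &'a [(&'a str, &'a str)],
    pub lang: Option<&'a str>,
}

/// Toy scanner events: an element opens, or the innermost one closes.
pub enum Event<'a> {
    Open(ElementInfo<'a>),
    Close,
}

/// Returns, for each Open event in order, whether that scope is preserved.
pub fn preserved_scopes<'a, F>(events: &[Event<'a>], preserve_when: F) -> Vec<bool>
where
    F: Fn(&ElementInfo<'a>) -> bool,
{
    let mut stack: Vec<bool> = Vec::new();
    let mut out = Vec::new();
    for event in events {
        match event {
            Event::Open(info) => {
                // A scope is preserved if its parent is, or if the predicate matches.
                let inherited = stack.last().copied().unwrap_or(false);
                let preserved = inherited || preserve_when(info);
                stack.push(preserved);
                out.push(preserved);
            }
            Event::Close => {
                stack.pop();
            }
        }
    }
    out
}

/// Matches one whitespace-separated token of the class attribute.
pub fn has_class(info: &ElementInfo<'_>, class: &str) -> bool {
    info.attrs
        .iter()
        .any(|(key, value)| *key == "class" && value.split_ascii_whitespace().any(|c| c == class))
}
```

Because the flag is read from the parent scope on every open, a child of a matched element is preserved even though the predicate itself returns false for it, and a sibling opened after the matched element closes is not.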
CDB-trie key scheme
A CDB-trie database stores one record per prefix of each entry. The key is the UTF-8 bytes of the prefix. The value layout:
A separate well-known key, __gukhanmun_meta__, stores a small CBOR document
with build metadata: source, license, build date, original entry count, prefix
count, maximum entry length in characters and bytes.
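One thing a record per prefix allows is a trie-style walk over a flat key-value store: extend the candidate by one character at a time and stop at the first prefix that has no record. The sketch below illustrates that idea with an in-memory HashMap in place of a CDB file. The function names are hypothetical, and the bool value is only a stand-in for "this prefix is itself a complete entry"; it does not model the actual value layout.

```rust
use std::collections::HashMap;

/// Every character-boundary prefix of `entry`, shortest first.
pub fn prefixes(entry: &str) -> Vec<&str> {
    entry
        .char_indices()
        .map(|(i, c)| &entry[..i + c.len_utf8()])
        .collect()
}

/// Builds a map from prefix bytes to "is a complete entry".
pub fn build(entries: &[&str]) -> HashMap<Vec<u8>, bool> {
    let mut db = HashMap::new();
    for entry in entries {
        for prefix in prefixes(entry) {
            let complete = prefix.len() == entry.len();
            // A prefix stays marked complete once any entry equals it.
            *db.entry(prefix.as_bytes().to_vec()).or_insert(false) |= complete;
        }
    }
    db
}

/// Longest complete entry that `text` starts with, if any.
pub fn longest_match<'t>(db: &HashMap<Vec<u8>, bool>, text: &'t str) -> Option<&'t str> {
    let mut best = None;
    for prefix in prefixes(text) {
        match db.get(prefix.as_bytes()) {
            Some(true) => best = Some(prefix),
            Some(false) => {} // only a prefix of longer entries; keep walking
            None => break,    // no entry continues past this point
        }
    }
    best
}
```

With the entries 大韓 and 大韓民國, the input 大韓民族 matches 大韓: the walk passes through the prefix 大韓民, finds no record for 大韓民族, and falls back to the last complete entry seen.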
FST schema
The FST database stores one entry per dictionary word. The key is the UTF-8 bytes of the hanja form; the value is a 64-bit integer whose layout is:
The reading-string table is a contiguous block of UTF-8 bytes following the FST
itself. The side table is necessary because the FST value type is fixed-size;
we cannot store variable-length readings inline. The mark byte sits in the
value rather than the side table because checking it is hot for every lookup.
A metadata header at the file start (eight bytes of magic, version, layout offsets, and a CBOR metadata blob analogous to the CDB form) precedes the FST bytes.