Dictionary format
This document specifies the normalized dictionary input formats consumed by
gukhanmun-mkdict and the FST/CDB dictionary file layouts produced from those
inputs. It is the boundary between dictionary extractors, user-maintained
glossaries, and backend builders.
Canonical TSV input
The primary input format is UTF-8 TSV with a required header row. Each file must contain at least these columns:
These optional columns are recognized:
Boolean values are true, false, 1, or 0. Empty optional boolean cells
are treated as false.
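The boolean rule above can be sketched as a small parser. This is an illustration only: `parse_bool_cell` is a hypothetical name, and treating the literals as case-sensitive without trimming whitespace is an assumption that this document does not state.

```rust
/// Parse one optional boolean cell.
/// Hypothetical helper: the real builder's name and error type may differ.
/// Assumption: literals are matched exactly (case-sensitive, no trimming).
fn parse_bool_cell(cell: &str) -> Result<bool, String> {
    match cell {
        // An empty optional cell is treated as false.
        "" => Ok(false),
        "true" | "1" => Ok(true),
        "false" | "0" => Ok(false),
        other => Err(format!("invalid boolean value: {:?}", other)),
    }
}
```

Rejecting any other spelling keeps typos such as `yes` or `TRUE` from silently becoming false.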
Additional columns such as category, source, or note are allowed and are
reserved for future extensions. The current builder does not consume them; it
emits a short warning and ignores their values.
Example:
CSV and JSONL input
gukhanmun-mkdict chooses the input parser from each file extension. .tsv
uses the canonical TSV parser, .csv uses the CSV parser, .jsonl uses the
JSON Lines parser, and unknown extensions are treated as TSV for compatibility.
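A minimal sketch of the extension dispatch described above. `InputFormat` and `input_format_for` are hypothetical names, and matching the extension case-sensitively is an assumption.

```rust
use std::path::Path;

/// Hypothetical enum naming the three parsers described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InputFormat {
    Tsv,
    Csv,
    Jsonl,
}

/// Pick a parser from the file extension.
/// Unknown or missing extensions fall back to TSV for compatibility.
/// Assumption: the comparison is case-sensitive.
fn input_format_for(path: &Path) -> InputFormat {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("csv") => InputFormat::Csv,
        Some("jsonl") => InputFormat::Jsonl,
        _ => InputFormat::Tsv,
    }
}
```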
CSV files use the same header names as TSV:
Each JSONL line is one object. The boolean fields accept either snake_case or camelCase spellings:
Merge policy
When multiple input files are supplied, or when one input file repeats a
hanja key, gukhanmun-mkdict applies the configured merge policy:
- error: fail on the first duplicate key. This is the default.
- first-wins: keep the first entry and ignore later duplicates.
- last-wins: replace earlier entries with the last duplicate.
After merging, entries are sorted by UTF-8 key bytes before backend encoding. This makes generated FST and CDB artifacts deterministic for the same normalized inputs and metadata.
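The merge and sort steps can be sketched as follows. This is a simplified illustration, not the builder's code: `MergePolicy` and `merge_entries` are hypothetical names, and each entry is reduced to a (hanja, reading) pair with the mark bits omitted.

```rust
use std::collections::BTreeMap;

/// Hypothetical enum mirroring the three merge policies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MergePolicy {
    Error,
    FirstWins,
    LastWins,
}

/// Merge (hanja, reading) pairs in input order under the given policy.
/// A BTreeMap keyed by String iterates in UTF-8 byte order, so the
/// returned entries are already sorted for deterministic encoding.
fn merge_entries(
    entries: impl IntoIterator<Item = (String, String)>,
    policy: MergePolicy,
) -> Result<Vec<(String, String)>, String> {
    let mut merged: BTreeMap<String, String> = BTreeMap::new();
    for (hanja, reading) in entries {
        if merged.contains_key(&hanja) {
            match policy {
                MergePolicy::Error => return Err(format!("duplicate key: {}", hanja)),
                MergePolicy::FirstWins => continue,
                MergePolicy::LastWins => {}
            }
        }
        merged.insert(hanja, reading);
    }
    Ok(merged.into_iter().collect())
}
```

Using an ordered map makes the output independent of the order in which non-duplicate entries arrive, which is what makes the artifacts reproducible.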
Build metadata
Every generated backend file embeds CBOR metadata. The minimum keys are:
entry_count, version, max_word_chars, max_key_bytes, and
prefix_count are reserved and cannot be supplied with --metadata. Other
--metadata KEY=VAL pairs are preserved as string values.
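A sketch of how a `--metadata KEY=VAL` argument might be checked against the reserved keys. `parse_metadata_arg` is a hypothetical helper, and splitting at the first `=` is an assumption about the argument syntax.

```rust
/// Keys the builder computes itself; they cannot be supplied by the user.
const RESERVED_KEYS: [&str; 5] = [
    "entry_count",
    "version",
    "max_word_chars",
    "max_key_bytes",
    "prefix_count",
];

/// Split one KEY=VAL argument and reject reserved keys.
/// Assumption: the pair is split at the first '=', so values may contain '='.
/// The value is kept as a string, as the text above describes.
fn parse_metadata_arg(arg: &str) -> Result<(String, String), String> {
    let (key, value) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected KEY=VAL, got {:?}", arg))?;
    if RESERVED_KEYS.contains(&key) {
        return Err(format!("metadata key {:?} is reserved", key));
    }
    Ok((key.to_string(), value.to_string()))
}
```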
For reproducible builds, the builder does not use the current clock by default.
If build_date is not passed explicitly and SOURCE_DATE_EPOCH is set, the
epoch is formatted as UTC RFC 3339. If neither value is present, the fixed date
1970-01-01T00:00:00Z is used.
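The fallback order for build_date can be sketched with the standard library alone. The function names are hypothetical, and passing an explicit build_date through unvalidated is an assumption; the days-to-date conversion uses the well-known civil-from-days algorithm for the proleptic Gregorian calendar.

```rust
/// Convert days since 1970-01-01 into (year, month, day).
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Format Unix seconds as a UTC RFC 3339 timestamp.
fn format_epoch_rfc3339(epoch_secs: i64) -> String {
    let days = epoch_secs.div_euclid(86_400);
    let secs = epoch_secs.rem_euclid(86_400);
    let (y, m, d) = civil_from_days(days);
    format!(
        "{:04}-{:02}-{:02}T{:02}:{:02}:{:02}Z",
        y, m, d, secs / 3_600, (secs % 3_600) / 60, secs % 60
    )
}

/// Resolve build_date: explicit value, then SOURCE_DATE_EPOCH, then the fixed date.
/// Assumption: an explicit value is passed through without validation.
fn resolve_build_date(
    explicit: Option<&str>,
    source_date_epoch: Option<&str>,
) -> Result<String, String> {
    if let Some(date) = explicit {
        return Ok(date.to_string());
    }
    match source_date_epoch {
        Some(raw) => {
            let secs: i64 = raw
                .trim()
                .parse()
                .map_err(|e| format!("invalid SOURCE_DATE_EPOCH {:?}: {}", raw, e))?;
            Ok(format_epoch_rfc3339(secs))
        }
        None => Ok("1970-01-01T00:00:00Z".to_string()),
    }
}
```

A caller would typically pass `std::env::var("SOURCE_DATE_EPOCH").ok().as_deref()` as the second argument, so the clock is never consulted.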
FST backend file
The first implemented backend format is fst. The file contains:
- A fixed 64-byte little-endian header.
- A CBOR metadata map.
- fst::Map bytes.
- A contiguous UTF-8 reading string table.
The fixed header fields are:
Each FST key is the UTF-8 bytes of the hanja column. Each FST value is a
64-bit integer:
The reading bytes are stored in the reading string table, not in the FST value, because FST values are fixed-width integers.
At runtime, gukhanmun-fst decodes the fixed header and CBOR metadata eagerly.
The FST map bytes and contiguous reading table then share a single byte backing
store: owned heap bytes for FstDictionary::open() and
FstDictionary::from_bytes(), static bytes for
FstDictionary::from_static_bytes(). The crate intentionally does not expose a
safe file-backed mmap loader because Rust cannot enforce that another file
descriptor or process will keep the mapped file immutable while the dictionary
is live.
entries() and has_homophone() enumerate the FST and therefore remain
full-dictionary scans; callers that need repeated homophone checks should use
the core homophone middleware's batch index.
CDB backend file
The CDB backend format is cdb. It uses a trie embedded in CDB key space:
each prefix of each dictionary key is written as a CDB key. Prefix records
that are complete dictionary entries carry the reading and mark bits; prefix
records that only lead to longer entries carry no reading.
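The prefix expansion can be illustrated as below. This sketch does not reproduce the actual CDB record encoding: `prefix_records` is a hypothetical helper, mark bits are omitted, and generating prefixes on character boundaries (rather than on raw bytes) is an assumption.

```rust
use std::collections::BTreeMap;

/// Expand dictionary entries into one record per key prefix.
/// `Some(reading)` marks a complete entry; `None` marks a prefix that only
/// leads to longer entries.
/// Assumption: prefixes end on UTF-8 character boundaries.
fn prefix_records(entries: &[(&str, &str)]) -> BTreeMap<String, Option<String>> {
    let mut records = BTreeMap::new();
    for &(key, reading) in entries {
        for (idx, ch) in key.char_indices() {
            let end = idx + ch.len_utf8();
            // Create the prefix record if it is new; keep any existing reading.
            let slot = records.entry(key[..end].to_string()).or_insert(None);
            if end == key.len() {
                // The full key is a complete entry and carries its reading.
                *slot = Some(reading.to_string());
            }
        }
    }
    records
}
```

With such records, a longest-match lookup can stop as soon as a prefix is absent from the CDB, because no longer key can exist beyond it.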
The metadata CBOR map is stored under the reserved key
__gukhanmun_meta__.
The runtime CDB backend uses the cdb crate's file-backed reader and keeps the
backend opaque behind HanjaDictionary. This avoids loading the whole CDB file
into a Gukhanmun-owned buffer without adding an unsafe mmap surface.
Validation
With --validate, gukhanmun-mkdict writes the output, opens it again with
the selected backend, and checks that every merged entry can be recovered with
the same reading and mark bits. Validation failure is fatal.
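The round-trip check can be sketched generically. `validate_round_trip` is a hypothetical helper, the `lookup` closure stands in for whichever backend was reopened, and mark bits are omitted for brevity.

```rust
/// Check that every merged (hanja, reading) entry can be recovered.
/// `lookup` is a stand-in for a query against the reopened backend file.
fn validate_round_trip<F>(merged: &[(String, String)], lookup: F) -> Result<(), String>
where
    F: Fn(&str) -> Option<String>,
{
    for (hanja, expected) in merged {
        match lookup(hanja) {
            Some(ref actual) if actual == expected => {}
            Some(actual) => {
                return Err(format!(
                    "reading mismatch for {:?}: expected {:?}, got {:?}",
                    hanja, expected, actual
                ));
            }
            None => return Err(format!("entry {:?} missing from output", hanja)),
        }
    }
    Ok(())
}
```

Returning on the first mismatch matches the statement that validation failure is fatal.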