Conversion options

These flags control the linguistic rules applied during conversion.

Preset

--preset selects a preconfigured combination of defaults:

| Preset | Dictionary | Initial sound law | Homophone window | Use case |
|---|---|---|---|---|
| ko-kr (default) | Bundled stdict | Enabled | Per-block | South Korean orthography |
| ko-kp | None | Disabled | Off | North Korean orthography |

gukhanmun --preset ko-kp input.txt

Individual flags below override the preset's defaults.
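
The override rule can be pictured as a dictionary merge. The following is a minimal sketch, not Gukhanmun's actual implementation: the option keys (dictionary, initial_sound_law, disambiguation) and the resolve_options helper are hypothetical names chosen for illustration, and only the default values are taken from the preset table above.

```python
# Hypothetical option names; the values mirror the preset table.
PRESETS = {
    "ko-kr": {
        "dictionary": "stdict",
        "initial_sound_law": True,
        "disambiguation": "per-block",
    },
    "ko-kp": {
        "dictionary": None,
        "initial_sound_law": False,
        "disambiguation": "off",
    },
}


def resolve_options(preset="ko-kr", **overrides):
    """Start from the preset's defaults, then apply explicitly given flags."""
    options = dict(PRESETS[preset])  # copy so the preset itself is untouched
    options.update(overrides)        # explicit flags win over preset defaults
    return options


# Roughly what `--preset ko-kp --initial-sound-law` would resolve to.
print(resolve_options("ko-kp", initial_sound_law=True))
```

Only the flags that were actually passed appear in overrides, so every other setting keeps the preset's value.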

Segmentation strategy

--segmentation controls how word boundaries are found:

  • lattice (default): finds the globally optimal segmentation by evaluating all dictionary matches at every position with dynamic programming. Best for accuracy.
  • eager: greedy left-to-right longest-match. Faster but may mis-segment compound words.

gukhanmun --segmentation eager input.txt
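
The difference between the two strategies can be sketched on a toy example. This is an illustration under stated assumptions, not Gukhanmun's code: the function names, the Latin-letter toy dictionary, and the cost model (1 per dictionary word, a penalty of 10 per character with no dictionary match) are all hypothetical.

```python
def segment_eager(text, dictionary):
    """Greedy left-to-right longest match; unmatched characters stand alone."""
    max_len = max(map(len, dictionary), default=1)
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in dictionary or length == 1:
                out.append(piece)
                i += length
                break
    return out


def segment_lattice(text, dictionary, unknown_cost=10):
    """Dynamic programming over all dictionary matches at every position."""
    n = len(text)
    max_len = max(map(len, dictionary), default=1)
    # best[i] = (lowest cost of segmenting text[:i], start of its last piece)
    best = [(0, 0)] + [(float("inf"), 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in dictionary:
                cost = 1
            elif end - start == 1:
                cost = unknown_cost  # assumed penalty for an unmatched character
            else:
                continue
            total = best[start][0] + cost
            if total < best[end][0]:
                best[end] = (total, start)
    # Walk the back-pointers from the end to recover the pieces.
    out, end = [], n
    while end > 0:
        start = best[end][1]
        out.append(text[start:end])
        end = start
    return out[::-1]


toy_dictionary = {"ab", "abc", "cd"}
print(segment_eager("abcd", toy_dictionary))    # ['abc', 'd']
print(segment_lattice("abcd", toy_dictionary))  # ['ab', 'cd']
```

The greedy pass commits to the longest match "abc" and is left with an unmatched "d", while the dynamic-programming pass finds the segmentation in which every piece is a dictionary entry.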

Numeral handling

--numerals controls how hanja numerals are rendered:

| Strategy | 二〇一六年 | 十一月 | 一千二百三十四 |
|---|---|---|---|
| hangul-phonetic (default) | 이공일륙년 | 십일월 | 일천이백삼십사 |
| positional-arabic | 2016년 | | |
| additive-arabic | | 11월 | 1234 |
| smart | 2016년 | 11월 | 1234 |

gukhanmun --numerals smart input.txt
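
The two Arabic readings differ in how they interpret the characters. The sketch below is a hedged illustration and not the tool's implementation: the function names are hypothetical, only the digit run is handled (suffixes such as 年 and 月 are left out), and the rule used by smart here (additive when a unit character such as 十, 百, or 千 is present, positional otherwise) is an assumption inferred from the table.

```python
DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def positional(numeral: str) -> int:
    """Read each character as one decimal digit: 二〇一六 -> 2016."""
    return int("".join(str(DIGITS[ch]) for ch in numeral))


def additive(numeral: str) -> int:
    """Multiply digits by the unit that follows them and add up the terms."""
    total, pending = 0, 0
    for ch in numeral:
        if ch in UNITS:
            total += (pending or 1) * UNITS[ch]  # a bare 十 counts as 1 x 10
            pending = 0
        else:
            pending = DIGITS[ch]
    return total + pending


def smart(numeral: str) -> int:
    """Assumed heuristic: unit characters imply the additive reading."""
    if any(ch in UNITS for ch in numeral):
        return additive(numeral)
    return positional(numeral)


print(smart("二〇一六"), smart("十一"), smart("一千二百三十四"))  # 2016 11 1234
```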

Initial sound law

The initial sound law (頭音法則) is enabled by default for ko-kr and disabled for ko-kp. It applies only to the character-by-character fallback readings used for characters not found in any dictionary; dictionary entries already encode their correct readings.

| Input | Law enabled (ko-kr) | Law disabled (ko-kp) |
|---|---|---|
| 來日 | 내일 | 래일 |
| 理由 | 이유 | 리유 |
| 女子 | 여자 | 녀자 |
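
The rewrites in the table follow the standard rule: a word-initial ㄹ becomes ㅇ before ㅣ or a y-glide vowel and ㄴ otherwise, and a word-initial ㄴ becomes ㅇ before ㅣ or a y-glide vowel. The sketch below applies that rule to the first syllable of a reading using Unicode Hangul syllable arithmetic. It is an illustration only: the function name is hypothetical, and it does not cover the exceptions that real orthography has.

```python
HANGUL_BASE = 0xAC00           # code point of 가, the first precomposed syllable
N_MEDIAL, N_FINAL = 21, 28     # vowels and final consonants per initial
NIEUN, RIEUL, IEUNG = 2, 5, 11           # initial-consonant indices of ㄴ, ㄹ, ㅇ
IOTIZED = {2, 3, 6, 7, 12, 17, 20}       # vowel indices of ㅑ ㅒ ㅕ ㅖ ㅛ ㅠ ㅣ


def apply_initial_sound_law(reading: str) -> str:
    """Rewrite the first syllable of a hangul reading per the initial sound law."""
    if not reading:
        return reading
    offset = ord(reading[0]) - HANGUL_BASE
    if not 0 <= offset < 19 * N_MEDIAL * N_FINAL:
        return reading  # first character is not a precomposed hangul syllable
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    if initial == RIEUL:
        initial = IEUNG if medial in IOTIZED else NIEUN
    elif initial == NIEUN and medial in IOTIZED:
        initial = IEUNG
    first = chr(HANGUL_BASE + (initial * N_MEDIAL + medial) * N_FINAL + final)
    return first + reading[1:]


for raw in ("래일", "리유", "녀자"):
    print(raw, "->", apply_initial_sound_law(raw))
```

Only the first syllable is touched, which is why 래일 becomes 내일 while the second syllable stays as it is.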

Override with explicit flags:

gukhanmun --no-initial-sound-law input.txt  # disable
gukhanmun --initial-sound-law input.txt     # enable (redundant for ko-kr)

Homophone disambiguation

When the same hanja appears multiple times in a window, Gukhanmun can mark repeated occurrences so readers can tell them apart. --disambiguation sets the scope of that window:

| Value | Behaviour |
|---|---|
| off | No disambiguation |
| per-block (default for ko-kr) | Reset at paragraph/list/heading boundaries |
| per-section | Reset at heading boundaries |
| per-document | Track across the entire input |

gukhanmun --disambiguation per-section input.txt
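
One way to picture the scope values is as a rule for grouping blocks into windows. The sketch below is an assumption-laden illustration, not the tool's internals: the (kind, text) block representation and the split_windows helper are hypothetical.

```python
from typing import Iterable

# Hypothetical block model: (kind, text), kind is "heading", "paragraph" or "list".
Block = tuple[str, str]


def split_windows(blocks: Iterable[Block], scope: str) -> list[list[str]]:
    """Group block texts into the windows implied by a scope value."""
    windows: list[list[str]] = []
    for kind, text in blocks:
        if scope == "off":
            continue  # nothing is tracked, so there are no windows
        starts_new = (
            not windows                                      # very first block
            or scope == "per-block"                          # every block resets
            or (scope == "per-section" and kind == "heading")
        )
        if starts_new:
            windows.append([])
        windows[-1].append(text)
    return windows


doc = [("heading", "A"), ("paragraph", "p1"), ("paragraph", "p2"),
       ("heading", "B"), ("list", "l1")]
print(split_windows(doc, "per-section"))  # [['A', 'p1', 'p2'], ['B', 'l1']]
```

With per-document, all five blocks fall into a single window; with per-block, each block forms its own.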

First-occurrence clearing

--first-occurrence removes annotations from characters whose presentation was already forced earlier in the window:

| Value | Behaviour |
|---|---|
| off (default) | Never clear |
| per-block | Clear within a paragraph/block |
| per-section | Clear within a section |
| per-document | Clear across the entire document |

gukhanmun --first-occurrence per-section input.txt

Error recovery

--recovery controls behaviour when an unrecoverable parse error occurs (currently relevant for HTML input only):

  • strict (default): abort with an error
  • lenient: skip the problematic fragment and continue

gukhanmun -f text/html --recovery lenient input.html
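
The two modes amount to different reactions to the same failure. A minimal sketch, assuming the input has already been split into fragments and that a failed fragment surfaces as a ValueError; both the convert_fragments helper and the per-fragment convert callback are hypothetical and not part of Gukhanmun's documented interface.

```python
def convert_fragments(fragments, convert, recovery="strict"):
    """Convert each fragment; on failure either abort or skip the fragment."""
    out = []
    for fragment in fragments:
        try:
            out.append(convert(fragment))
        except ValueError:
            if recovery == "strict":
                raise   # strict: abort with the error
            # lenient: drop this fragment and carry on with the next one
    return out


def toy_convert(fragment):
    """Stand-in converter that fails on one specific fragment."""
    if fragment == "bad":
        raise ValueError("unrecoverable parse error")
    return fragment.upper()


print(convert_fragments(["a", "bad", "b"], toy_convert, recovery="lenient"))  # ['A', 'B']
```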