Variations on Japanese romanization
Japanese names, titles, and phrases often need to be converted into Latin letters, for example when writing for an English-speaking audience or when using computer software that only handles file names in ASCII. As I explored this problem, I found that there were many subtle variations of Japanese romanization used in the wild, each with a valid reason for existing. In this article I will try to give a near-complete overview of all the reasonable variations on how to romanize Japanese text.
Contents
- Basic kana
- Long vowels
- Kana n
- Small tsu
- Particles
- Spaces
- Hyphens
- Capitalization
- Foreign words
- Small kana
- Non-standard dakuten
- Losslessness
Basic kana
There are two major styles for romanizing kana: Nihon-shiki and Hepburn. Nihon-shiki has a very uniform structure of consonant (plus optional y) plus vowel, whereas Hepburn conveys the pronunciation more accurately at the expense of irregular spelling. For example, the romanization of the T line in Nihon-shiki is ta ti tu te to, but in Hepburn it’s ta chi tsu te to. The following list shows the kana (and only those kana) whose romanizations differ between Nihon-shiki and Hepburn. Each entry states, in order, the kana, the Nihon-shiki romanization, the Hepburn romanization, and any alternate romanizations:
- し: si, shi
- じ: zi, ji
- ち: ti, chi
- ぢ: di, ji, dji
- つ: tu, tsu
- づ: du, zu, dzu
- ふ: hu, fu
- しゃ: sya, sha
- しゅ: syu, shu
- しょ: syo, sho
- じゃ: zya, ja, jya
- じゅ: zyu, ju, jyu
- じょ: zyo, jo, jyo
- ちゃ: tya, cha
- ちゅ: tyu, chu
- ちょ: tyo, cho
- ぢゃ: dya, ja, dja
- ぢゅ: dyu, ju, dju
- ぢょ: dyo, jo, djo
Note: The forms {dji, dzu, dja, dju, djo} are modified from Hepburn and are for disambiguation. The forms {jya, jyu, jyo} are in between Hepburn and systematic romanization.
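As a rough illustration, the list above can be written as a lookup table with one column per system. This is only a sketch: the names `DIFFERING`, `SHARED`, and `romanize` are made up for this example, and `SHARED` holds just a handful of kana that are spelled the same in both systems so that the demo has something to work with.

```python
# Kana whose romanization differs: (Nihon-shiki, Hepburn), copied from the list above.
DIFFERING = {
    "し": ("si", "shi"), "じ": ("zi", "ji"), "ち": ("ti", "chi"),
    "ぢ": ("di", "ji"), "つ": ("tu", "tsu"), "づ": ("du", "zu"),
    "ふ": ("hu", "fu"),
    "しゃ": ("sya", "sha"), "しゅ": ("syu", "shu"), "しょ": ("syo", "sho"),
    "じゃ": ("zya", "ja"), "じゅ": ("zyu", "ju"), "じょ": ("zyo", "jo"),
    "ちゃ": ("tya", "cha"), "ちゅ": ("tyu", "chu"), "ちょ": ("tyo", "cho"),
    "ぢゃ": ("dya", "ja"), "ぢゅ": ("dyu", "ju"), "ぢょ": ("dyo", "jo"),
}

# A few kana that are identical in both systems (demo only, far from complete).
SHARED = {"た": "ta", "て": "te", "と": "to", "か": "ka", "ま": "ma"}


def romanize(kana, style="hepburn"):
    """Romanize a kana string using the demo tables above."""
    column = {"nihon-shiki": 0, "hepburn": 1}[style]
    out = []
    i = 0
    while i < len(kana):
        # Try a two-kana digraph (e.g. しゃ) before a single kana.
        for length in (2, 1):
            chunk = kana[i:i + length]
            if chunk in DIFFERING:
                out.append(DIFFERING[chunk][column])
                break
            if chunk in SHARED:
                out.append(SHARED[chunk])
                break
        else:
            raise ValueError(f"kana not in the demo tables: {kana[i]!r}")
        i += len(chunk)
    return "".join(out)


print(romanize("たちつてと", "nihon-shiki"))  # tatituteto
print(romanize("たちつてと", "hepburn"))      # tachitsuteto
```

Note that the Hepburn column maps both じ and ぢ to ji, which is exactly the ambiguity that the alternate forms such as dji are meant to remove.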
Long vowels
In spoken and written Japanese, there are words that differ only by the length of a vowel. There are two vowel lengths: single and double. Distinguishing this in rōmaji is an important goal, although not absolutely critical.
- Macron
A common scheme used in Japanese textbooks for English-speaking learners and on Wikipedia. Easier to understand than wāpuro.
e.g. Tōkyō, Ōsaka, sensē, onēsan, onīsan, okāsan, yūbe
- Circumflex
A simple variation on the macron scheme. Possibly invented because some typesetting systems don’t support the macron diacritic.
e.g. Tôkyô, Ôsaka, sensê, onêsan, onîsan, okâsan, yûbe
- Wāpuro
A very popular scheme outside of formal publications, found especially in the anime fansub, manga scanlation, and file sharing communities. Preserves the original orthography to the best extent out of all the schemes, which helps if the text needs to be converted back into Japanese for search/correlation/etc.
e.g. Toukyou, Oosaka, sensei, oneesan, oniisan, okaasan, yuube
- Doubling
Similar to wāpuro but favors pronunciation rather than kana spelling. I doubt that it’s used in the wild.
e.g. Tookyoo, Oosaka, sensee, oneesan, oniisan, okaasan, yuube
- Conflate with short vowels
Not used much in actual romanized sentences, but used very often in the official romanization of city names, etc.
e.g. Tokyo, Osaka, sense, onesan, onisan, yube
- oo/ou as oh
Common in situations where diacritics are not expected, such as in typical English writing. It’s rather ad hoc, and I think it looks ugly.
e.g. Tohkyoh, Ohsaka
All schemes that map the long vowels おう and おお to the same sequence of letters are inherently ambiguous. おう is the most common spelling in most Japanese words, but おお does arise occasionally, and it is critical in Japanese writing to distinguish the two.
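To make the ambiguity concrete, here is a small sketch that takes wāpuro-style rōmaji as input and re-spells its long vowels in some of the other schemes. The function name `restyle` and the scheme labels are invented for this example. It also assumes that every aa/ii/uu/ee/ei/oo/ou sequence really is a long vowel, which is not always true in real text (a word boundary can fall between the two vowels), so treat it as an illustration rather than a usable converter.

```python
import re

MACRON = {"a": "ā", "i": "ī", "u": "ū", "e": "ē", "o": "ō"}
CIRCUMFLEX = {"a": "â", "i": "î", "u": "û", "e": "ê", "o": "ô"}

# Wāpuro spellings of long vowels; the first letter is the vowel being lengthened.
LONG_VOWEL = re.compile(r"aa|ii|uu|ee|ei|oo|ou", re.IGNORECASE)


def restyle(wapuro, scheme):
    """Re-spell the long vowels of a wāpuro-style string in another scheme."""

    def replace(match):
        first = match.group(0)[0]
        vowel = first.lower()
        if scheme == "macron":
            out = MACRON[vowel]
        elif scheme == "circumflex":
            out = CIRCUMFLEX[vowel]
        elif scheme == "doubling":
            out = vowel * 2
        elif scheme == "drop":
            out = vowel
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        # Keep the capitalization of the first letter (e.g. Oosaka).
        return out.capitalize() if first.isupper() else out

    return LONG_VOWEL.sub(replace, wapuro)


print(restyle("Toukyou", "macron"))      # Tōkyō
print(restyle("Oosaka", "circumflex"))   # Ôsaka
print(restyle("sensei", "doubling"))     # sensee
print(restyle("Toukyou", "drop"))        # Tokyo

# The wāpuro inputs "tou" and "too" differ, but the macron output does not:
print(restyle("tou", "macron") == restyle("too", "macron"))  # True
```

The last line shows the information loss: once おう and おお have both become ō, there is no way to tell from the rōmaji alone which kana spelling was intended.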
For katakana long vowels, all the above options are applicable plus a few more. Whereas a hiragana long vowel uses a vowel kana as the second letter (e.g. くう), a katakana long vowel uses a horizontal mark as the second letter (e.g. クー). For example using the above schemes, セーラー can be romanized as sērā, sêrâ, seeraa, or sera. Additionally:
- Hyphen (wāpuro)
This reflects how katakana long vowels are entered into IMEs.
e.g. se-ra-
- Foreign spelling
This looks far more natural in romanized text. But it requires the reader to know the English/foreign pronunciation and map it back to katakana sounds on the fly.
e.g. sailor
Kana n
The kana ん requires some care because its pronunciation changes in front of some consonants, and because n + vowel is not the same as a single syllable (such as んう vs. ぬ).
- Always use n
Common.
e.g. sankaku, sanpo, senpai
- Sometimes use m
The old Hepburn scheme uses m when the next kana is b-, m-, or p-.
e.g. sankaku, sampo, sempai
- Use n’ (apostrophe)
Either always use n’, or drop the apostrophe in non-ambiguous cases.
e.g. san’kaku/sankaku, san’po/sanpo, sen’pai/senpai, ren’ai, Jun’ichi
- Always use nn
This is an artifact from input method editors (IMEs), but is never used in writing because it causes massive confusion with popular practices.
e.g. sannkaku, sannpo, sennpai, rennai, Junnichi
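The options above differ only in how ん is spelled depending on what follows it, which is easy to sketch in code. In this example the input is a list of already-romanized kana in which a standalone "n" stands for ん. The function name `join_morae` and the scheme labels are my own, and the code uses a plain ASCII apostrophe.

```python
VOWELS_AND_Y = {"a", "i", "u", "e", "o", "y"}
LABIALS = {"b", "m", "p"}  # consonants that trigger m in old Hepburn


def join_morae(morae, scheme):
    """Join romanized kana, spelling the standalone "n" (ん) per the chosen scheme."""
    out = []
    for i, mora in enumerate(morae):
        if mora != "n":
            out.append(mora)
            continue
        # First letter of the following kana, or "" at the end of the word.
        nxt = morae[i + 1][0] if i + 1 < len(morae) else ""
        if scheme == "always-n":
            out.append("n")
        elif scheme == "m":
            out.append("m" if nxt in LABIALS else "n")
        elif scheme == "apostrophe-always":
            out.append("n'")
        elif scheme == "apostrophe-minimal":
            # Only n + vowel or n + y could be misread as a single kana.
            out.append("n'" if nxt in VOWELS_AND_Y else "n")
        elif scheme == "nn":
            out.append("nn")
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return "".join(out)


print(join_morae(["sa", "n", "po"], "m"))                        # sampo
print(join_morae(["re", "n", "a", "i"], "apostrophe-minimal"))   # ren'ai
print(join_morae(["sa", "n", "ka", "ku"], "apostrophe-minimal")) # sankaku
print(join_morae(["re", "n", "a", "i"], "nn"))                   # rennai
```

With "always-n", れんあい comes out as renai, which could equally be read as れない; the apostrophe schemes exist to prevent exactly that.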
Small tsu
- Double the following consonant
This is essentially the universally adopted scheme. It generally works well enough but breaks down in some minor, esoteric edge cases, which are discussed later.
- Treat as an individual character
Romanize as xtu, xtsu, or otherwise. This is helpful if the goal is lossless romanization.
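The two approaches can be combined: double the consonant where that is unambiguous, and fall back to an explicit token otherwise. The sketch below does this for a list of romanized kana in which っ is left as-is. The name `apply_small_tsu` is made up, and simply repeating the first letter of the next kana is a simplification (some Hepburn variants spell っち as tchi rather than cchi, for instance).

```python
VOWELS = {"a", "i", "u", "e", "o"}


def apply_small_tsu(morae, fallback="xtu"):
    """Replace each っ by a copy of the next consonant, or by `fallback`."""
    out = []
    for i, mora in enumerate(morae):
        if mora != "っ":
            out.append(mora)
            continue
        nxt = morae[i + 1] if i + 1 < len(morae) else ""
        # Doubling needs a following consonant. Doubling n is avoided because
        # "nn" would then be confusable with ん followed by an n-row kana.
        if nxt and nxt != "っ" and nxt[0] not in VOWELS and nxt[0] != "n":
            out.append(nxt[0])
        else:
            out.append(fallback)
    return "".join(out)


print(apply_small_tsu(["te", "っ", "to"]))         # tetto
print(apply_small_tsu(["ga", "っ", "ko", "u"]))    # gakkou
print(apply_small_tsu(["っ", "i"]))                # xtui
print(apply_small_tsu(["a", "っ", "na"]))          # axtuna
```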
Particles
Three particles in Japanese have a different pronunciation than what their written kana symbol suggests. They are は, へ, を.
- Favor orthography
Romanize as ha, he, wo. This is good for lossless conversions.
- Favor phonetics
Romanize as wa, e, o. This is good for reading romanized text aloud.
- Favor phonetics but preserve wo
Romanize as wa, e, wo. This preserves the distinction between o and wo. Furthermore, を can be pronounced as wo in songs. (But there are a number of other liberties permitted in song lyric pronunciation, which I will not elaborate on.)
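In code, the three policies amount to three small lookup tables. The sketch below uses invented names (`PARTICLES`, `romanize_particle`) and assumes you already know that the kana in question is being used as a particle; working that out requires word segmentation, which is a separate and harder problem.

```python
# One table per policy; the keys are the three irregular particles.
PARTICLES = {
    "orthographic": {"は": "ha", "へ": "he", "を": "wo"},
    "phonetic": {"は": "wa", "へ": "e", "を": "o"},
    "phonetic-keep-wo": {"は": "wa", "へ": "e", "を": "wo"},
}


def romanize_particle(kana, policy):
    """Romanize は/へ/を when used as a particle, under the given policy."""
    return PARTICLES[policy][kana]


for policy in PARTICLES:
    print(policy, [romanize_particle(k, policy) for k in "はへを"])
```

Only the orthographic policy keeps these particles spelled the same as ordinary は, へ, and を, which is why it is the one to choose for lossless conversion.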
Spaces
Native Japanese text has no spaces. But in such text, a kana followed by a kanji strongly suggests a word boundary in between, which works well in practice. In rōmaji this is not possible, so spaces are necessary. Furthermore, every language that uses the Latin alphabet uses spaces in normal text.
- Japanese example with full kanji
例:貴方は家で御飯を食べたか。
- Space between words, including particles
This is the most common scheme.
e.g. anata wa ie de gohan wo tabeta ka.
- Space between words, no space before particle
This mimics kana spacing for beginners in native Japanese.
e.g. anatawa iede gohanwo tabetaka.
例:あなたは いえで ごはんを たべたか。
- Space between each kana
Possibly helpful for beginners, good for lossless transliteration, but not used in practice.
e.g. a na ta wa i e de go ha n wo ta be ta ka
- No space
This removes the work needed to find word boundaries in the Japanese text, but results in hard-to-read romanized text. However, the romanization is still lossless and unambiguous, since Japanese text does not have spaces to begin with.
e.g. anatawaiedegohanwotabetaka
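Once the sentence has been split into words and the particles have been identified, three of these spacing schemes are a matter of how the pieces are joined. The sketch below assumes that segmentation has already been done by hand: `SENTENCE` is the example sentence as (word, is_particle) pairs, and the function name `space_out` and the scheme labels are my own. Spacing between each kana is left out because it needs kana-level input rather than words.

```python
# The example sentence, pre-segmented by hand: (romanized word, is it a particle?)
SENTENCE = [
    ("anata", False), ("wa", True),
    ("ie", False), ("de", True),
    ("gohan", False), ("wo", True),
    ("tabeta", False), ("ka", True),
]


def space_out(tokens, scheme):
    """Join pre-segmented words according to a spacing scheme."""
    if scheme == "words":
        return " ".join(word for word, _ in tokens)
    if scheme == "none":
        return "".join(word for word, _ in tokens)
    if scheme == "attach-particles":
        groups = []
        for word, is_particle in tokens:
            if is_particle and groups:
                groups[-1] += word  # glue the particle onto the previous word
            else:
                groups.append(word)
        return " ".join(groups)
    raise ValueError(f"unknown scheme: {scheme}")


print(space_out(SENTENCE, "words"))             # anata wa ie de gohan wo tabeta ka
print(space_out(SENTENCE, "attach-particles"))  # anatawa iede gohanwo tabetaka
print(space_out(SENTENCE, "none"))              # anatawaiedegohanwotabetaka
```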
Hyphens
Closely related to spacing is hyphenation. Words that have a loose relationship can be joined with a hyphen instead of a space. Examples:
- Numbered items: dai-ichi (第一), ni-chōme (二丁目), san-jū (三十), yon-mai (四枚), go-ji (五時)
- Honorific suffixes: -san, -sama, -chan
- Honorific prefixes: o-, go-
Capitalization
Base techniques:
- Everything in lowercase
Simple, easy to read. Popular.
- Hiragana in lowercase, katakana in uppercase
Lossless. This scheme is sometimes used in fan-made song lyric romanizations.
Further considerations:
- Capitalize the first word in a sentence
Just like in European languages.
e.g. Kore ga watashi no go-shujin-sama desu.
- Capitalize proper nouns (names, etc.)
That is, capitalize the names of people/places/companies/products/publications/etc.
e.g. Mary-san to Yuki-san wa senshuu London e ikimashita.
- Title case for titles
Capitalize every word, or only capitalize significant words?
e.g. Pikachu wa Genki Desu ne
Foreign words
Foreign/
- Romanize as kana
Systematic but extremely ugly, and hard to recognize even for people who can read English.
e.g. pāsonaru konpyūtā, kurisumasu, makudonarudo, arubaito, saito
- Revert to original spelling
May require context and interpretation, have ambiguity, and/or require diacritics.
e.g. personal computer, Christmas, McDonald, arbeit, site/sight
Small kana
- Small vowels
Treat them as big kana, except for well-known cases. e.g. まぁ romanized as maa.
- Use the pseudo-consonant “x”
e.g. xa, xyu, xtsu
Non-standard dakuten
On rare occasions, usually for silly emphatic effect, a dakuten (゛) or han-dakuten (゜) is placed on a kana that does not normally accept such a diacritic. For example, ま゛. I cannot think of any reasonable way to romanize this, other than to ignore the (han-)dakuten and use the base kana.
Losslessness
Mathematically speaking, romanization can be thought of as a function that maps a sequence of kana letters to a sequence of Latin letters.
It’s easy to create a romanization scheme that is lossless (injective/
Here are some contrived edge cases to consider if your goal is to design a lossless romanization scheme:
Kana | Lossless romanization | Comment |
---|---|---|
っい | xtu i | No consonant to double |
あっな | a xtu na | Ambiguous consonant doubling |
てっっと | te xtu xtu to | Tests for consonant tripling |
しゃゃゅょさゃ | si xya xya xyu xyo sa xya | No good way to represent small kana after the first one |
くくうくううくううう | ku ku u ku u u ku u u u | Distinguishing more than two vowel lengths – fails macron and circumflex |
とおとう | to o to u | Fails vowel macron, circumflex, doubling, ou/oo as oh, and dropping |
ねえねい | ne e ne i | May fail vowel macron, circumflex, and doubling |
ニーニイニィ | NI - NI I NI XI | Fails any pronunciation-based scheme, because all three are pronounced the same |
にーにいにぃ | ni - ni i ni xi | Fails any pronunciation-based scheme, because all three are pronounced the same |
が゜お゛ | ga ° o " | Fails any scheme that ignores non-standard dakuten |
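The lossless romanizations in the table follow one simple recipe: give every kana its own unique token and put a space between tokens. Here is a minimal sketch of that recipe with a round-trip check. The names `TABLE`, `encode`, and `decode` are invented, and the table covers only the hiragana needed for a few of the rows above; katakana and the non-standard dakuten are left out.

```python
# One unique token per kana. Small kana get the pseudo-consonant "x".
TABLE = {
    "あ": "a", "い": "i", "う": "u", "え": "e", "お": "o",
    "く": "ku", "さ": "sa", "し": "si", "て": "te", "と": "to",
    "な": "na", "に": "ni", "ね": "ne",
    "っ": "xtu", "ゃ": "xya", "ゅ": "xyu", "ょ": "xyo", "ぃ": "xi",
    "ー": "-",
}
REVERSE = {token: kana for kana, token in TABLE.items()}

# If two kana shared a token, the reverse table would be smaller.
assert len(REVERSE) == len(TABLE), "tokens must be unique"


def encode(kana):
    """Kana string -> space-separated tokens."""
    return " ".join(TABLE[ch] for ch in kana)


def decode(romaji):
    """Space-separated tokens -> kana string."""
    return "".join(REVERSE[token] for token in romaji.split())


for sample in ["っい", "あっな", "てっっと", "しゃゃゅょさゃ", "とおとう", "ねえねい", "にーにいにぃ"]:
    romaji = encode(sample)
    print(sample, "->", romaji, "| round trip ok:", decode(romaji) == sample)
```

Because each kana maps to a distinct token and the spaces mark where one token ends, decoding is a plain reverse lookup. The price is output like te xtu xtu to, which nobody would want to read as ordinary text.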