AI Research & NLP

Pure Lexicon AI

An NLP research system that translates English loanwords into native Japanese vocabulary — using only existing Japanese words, never borrowings. No katakana allowed.

Python PyTorch Sentence-Transformers FAISS HuggingFace NLP

Overview

Pure Lexicon AI addresses a deeply cultural problem: as Japanese absorbs more English loanwords (カタカナ語, or katakana-go), parts of the native lexicon risk atrophy. What if words like "computer" could be expressed purely through native Japanese roots — the way the Meiji reformers coined 電話 (telephone) from 電 (electric) + 話 (speech)?

This system automates that process: given an English loanword, the AI finds the best native Japanese word combination (up to 3 words) that semantically approximates its meaning, scored by cosine similarity in a shared semantic embedding space.
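Concretely, the matching step amounts to a nearest-neighbour lookup in the shared embedding space. The sketch below illustrates it with a brute-force NumPy search standing in for FAISS; the vocabulary, random embeddings, and dimensionality are placeholders, not the project's actual data:

```python
import numpy as np

# Placeholder data: in the real system these embeddings come from a
# Sentence-Transformers model applied to the native Japanese vocabulary.
vocab = ["計算", "機械", "電話", "言葉"]
rng = np.random.default_rng(0)
vocab_emb = rng.normal(size=(len(vocab), 8))
vocab_emb /= np.linalg.norm(vocab_emb, axis=1, keepdims=True)

def nearest_words(query_emb, k=2):
    """Return the k native words closest to the query by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = vocab_emb @ q  # rows are unit vectors, so this is cosine similarity
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]
```

In the full system the query embedding would come from encoding the English loanword with the same model, so both sides live in one vector space.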

Technical Architecture

The system is built on three core pillars:

  • Semantic Embedding Mapping — Sentence-Transformers embed both the English input and the entire native Japanese vocabulary into the same high-dimensional vector space
  • Vocabulary Filtering — We filter out katakana-derived words from the candidate pool, keeping only native yamato kotoba and Sino-Japanese kango
  • Combinatorial Optimization — FAISS-powered vector similarity search finds single-word candidates, then greedy search evaluates 2-word and 3-word combinations, scored by cosine similarity with a preference for brevity
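The vocabulary-filtering pillar can be implemented with a simple Unicode check: any word containing a katakana character (including halfwidth forms) is excluded from the candidate pool. A minimal sketch, with a toy vocabulary for illustration:

```python
def is_katakana_derived(word: str) -> bool:
    """Heuristic filter: flag words containing any katakana character.
    Covers the Katakana block (U+30A0–U+30FF) and halfwidth katakana
    (U+FF66–U+FF9D)."""
    return any(
        "\u30a0" <= ch <= "\u30ff" or "\uff66" <= ch <= "\uff9d"
        for ch in word
    )

# Toy vocabulary: loanwords written in katakana are dropped,
# leaving only kanji/hiragana (yamato kotoba and kango) candidates.
vocabulary = ["電話", "コンピュータ", "言葉", "データ", "計算機"]
native_pool = [w for w in vocabulary if not is_katakana_derived(w)]
```

This catches loanwords by orthography alone; words of foreign origin that happen to be written in kanji would need a dictionary-based etymology check instead.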

The system prioritizes shorter combinations (1 word > 2 words > 3 words) to produce natural-sounding neologisms rather than awkward compound strings.
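One way to express this preference is a scoring function that rewards semantic closeness and deducts a small penalty for each extra word. A minimal sketch, where the penalty weight `alpha` is a hypothetical parameter, not a value from the project:

```python
import numpy as np

def combo_score(target, member_vecs, alpha=0.05):
    """Score a candidate combination: cosine similarity between the
    target embedding and the mean of the combination's word embeddings,
    minus alpha for each word beyond the first (brevity preference).
    alpha is an illustrative weight, chosen here for the sketch."""
    combo = np.mean(member_vecs, axis=0)
    cos = float(combo @ target / (np.linalg.norm(combo) * np.linalg.norm(target)))
    return cos - alpha * (len(member_vecs) - 1)
```

Under this scheme a 2-word combination must beat the best single word by more than `alpha` to win, which nudges the search toward compact, natural-sounding coinages.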

Research & Results

Initial tests on a vocabulary of ~18,000 native Japanese words produced encouraging outputs: the system's suggestions often echoed historical coinage patterns, reinforcing the idea that semantic space captures something genuine about conceptual structure.

A secondary LLM-based verification module was prototyped to filter morphologically unnatural combinations and rank outputs by grammatical fluency.

This project is ongoing research — we are exploring fine-tuning on classical Japanese texts to improve the semantic embedding space for archaic vocabulary coverage.