
One of the most interesting areas in IT is semantic technology: helping machines “understand” data. Computers don’t reason or infer like humans do, but we’re quite clever at designing systems and products – like LLMs – that behave as if they do.
I’ve just begun a year-long Japanese language course at the University of London, which is scratching my itch to be a student again. As a result, I’m thinking a lot about language. Inevitably, it also means I’m thinking a lot about meaning. And information.
I’m in product development from a liberal arts background where a search for meaning is pursued in some fashion by most fields. For social scientists, this search for meaning has created numerous ways in which their methods participate in the industry and improve products. (Jack Dorsey said he’d redesign Twitter with a social scientist if he could!)
Last month, the Tokyo Metropolitan Government announced it would begin officially allowing its 50,000 employees to trial ChatGPT at work. This follows an increasing number of prefectural governments officially introducing or trialing generative AI. While flipping through my Japanese-English dictionary for class, it occurred to me that the ways in which social science can influence backend systems for NLP tools were much less clear to me.
Knowledge representation must vary with different cultural norms, language structures, and cognitive styles, right? How does knowledge representation need to change for different languages, such as English and Japanese? How would database design be influenced by an anthropologist on the team?
Ontologies and meaning making
Information science handles data. Data isn’t always explicit; it can be qualitative, unreliable, or incomplete. Ontologies are an interesting tool not just in representing knowledge but in actively uncovering implicit information.
Unlike algorithms, ontologies map data to a controlled vocabulary, capturing both semantics and relationships. In doing so, new knowledge is created. This new knowledge comes from the insights this organizational method yields, but also from pattern recognition – the discovery of previously unknown (or unnoticed) dependencies or relationships between entities or ideas.
For example, a film might be recommended to a user based on their favorite books. This recommendation might come from any number of relationships identified between each piece of media, such as shared themes. An ontology can also be used to structure and fill out the metadata associated with an item for a consumer product in e-commerce. When combined with other technologies (like algorithms), contextual recommendations can feel ridiculously personalized. We feel, unexpectedly, understood by a computer.
In a multilingual context, terms with no direct translation create gaps if left un-indexed. Semantic mismatches between languages can make nuance a struggle too. For example, “kawaii” as an attribute in English refers to a very specific pastel, childlike aesthetic; in Japanese, “kawaii” is a common adjective applied to a wide variety of styles, objects, behaviors, and animals.
Cultural-linguistic challenges of knowledge representation
Database design is primarily concerned with structured data – a storage system and its management – and with ensuring integrity and efficiency. A central challenge for natural-language AI products is that they draw on unstructured data (like text crawled from the internet), which then needs to be stored and processed into a database – essentially made structured.
At the risk of overgeneralizing, American culture as expressed in the English language tends to represent knowledge in explicit and linear ways, and it emphasizes hierarchical structures.
Japanese culture – and thus language – is different. Proposals are circulated to all stakeholders and departments for feedback and endorsement; input and perspectives are sought from the group. In a culture that typically prefers consensus-driven decision making, knowledge tends to be represented as context-dependent, and multiple perspectives are often invited to be shared, heard, or noted. Rather than hierarchical thinking, holism tends to drive information design.
There’s a great video by Cynthia Zhou on how this affects Japanese UX design.
But to what extent does language itself impact knowledge representation? And can it be quantified?
Representing knowledge in English vs Japanese
Comparative studies of cognitive processes and other psycholinguistic experiments have taken on this challenge. But I’m curious about how this affects LLMs and their design. Knowledge representation helps computers understand, manipulate, and reason with information so they can provide meaningful responses to a human-provided input (the crafting of which is trendily called prompt engineering). This requires the semantics and structure of knowledge – of language – to be captured and modeled.
The most obvious difference is the literal linguistic representation of a thing and associated language encoding. For instance, this thing:

In English, it’s a cloud. In Japanese, it’s kumo. Pretty straightforward. But let’s look at another example. This (these) thing(s):

In English, we could say “cat” and “cats,” or perhaps “cat” and “three cats.” In Japanese, which lacks plural forms of nouns, both things are “neko.” If it felt important to the speaker (or cataloger) to specify quantity, they might add “sambiki” (a numeral plus the counter for small animals, meaning “three”) before “neko” – or they might not. This presents some challenges for tokenization.
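The tokenization gap is easy to see even with toy whitespace splitting. Real systems segment Japanese with a morphological analyzer (such as MeCab) or subword methods (such as byte-pair encoding); the sketch below just shows why whitespace alone recovers nothing:

```python
# English tokenizes trivially on whitespace; Japanese text has no
# word boundaries, so the same approach yields one opaque token.
english = "three cats"
japanese = "三匹の猫"  # sambiki no neko: "three (small animals) cats"

print(english.split())   # ['three', 'cats'] — count and plural are explicit
print(japanese.split())  # ['三匹の猫'] — one unsegmented string

# A real pipeline would use a morphological analyzer (e.g. MeCab) or a
# subword tokenizer (e.g. BPE) to split 三匹の猫 into meaningful units.
```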
Which leads to another, more conceptual, difference when representing knowledge in different languages. Depending on the domain, different levels of detail may or may not be required. Different cultures place value in or emphasis on different attributes or ideas (classical Japanese seasons come to mind). This also would vary depending on the system being designed, like an LLM designed for medical or health services.
But even the mundane should be considered. A simple example is the traffic light. In English, we refer to the go-color as “green.” In Japanese, it’s referred to as blue, “aoi.”
Loanwords need similar considerations. A “mansion” in English is a large house for a single, likely very wealthy, individual or family. In Japanese, “mansion” refers to a large building that houses many families or individuals – what an English speaker might call an apartment building. But apartment, or “apaato” in Japanese, means something different. On the reverse, in English we use the Japanese word “sensei” to refer to martial arts instructors. In Japan, it’s used for all teachers, as well as for doctors.
The goal of knowledge representation is to create machine-readable knowledge regardless of language. But multilingual AI systems require these linguistic and cultural considerations in their design and implementation.
I wanted to learn more about how ChatGPT has been received in Japan, and how effective the Japanese-language version of the chatbot is. Is it revolutionizing life and work as it is in English-speaking countries?
The chatbot doesn’t speak Japanese (at least not very well)
Turns out, LLMs in Japanese are a challenge.
Keisuke Sakaguchi, an NLP researcher at Tohoku University, told Nature that tools like ChatGPT struggle in Japanese due to differences in linguistic representation (the writing system) and limited data. Japanese contains over 2,000 kanji – with at least two pronunciations each, depending on context – in addition to two 48-character syllabaries. He notes, “Current public LLMs…often fall short in Japanese.”
To be useful and commercially viable, an LLM should reflect culture alongside language. For instance, if an LLM generates text in Japanese, it might lack standard expressions of politeness and appear as a mere translation from English.
That doesn’t stop chatbots from being useful, though. Earlier this year, Reuters reported that OpenAI’s third-largest source of traffic was Japan. Demand for this technology persists despite the shortcomings of what’s available in Japanese.
To overcome the limitations of English-centric LLMs like ChatGPT, Japan is developing its own AI chatbot systems. Academics, the government, and the commercial sector are all investing in various projects in this effort. The Rakuda Ranking, developed by a group of researchers, was built to assess these developments by testing how well an LLM can respond to questions on Japanese topics. It shows that, so far, Japanese LLMs are improving but still trail GPT-4.
Current LLM projects in Japan span government, academia, and industry. Japan’s Ministry of Education, Culture, Sports, Science and Technology is funding an AI program for scientific research. The supercomputer Fugaku is being used to train an LLM on Japanese-language input in a collaboration between the Tokyo Institute of Technology, Tohoku University, Fujitsu, and RIKEN; the model is expected to be open source when released to the public. Then there are companies like NEC and SoftBank, which are commercializing their own LLM technologies: NEC’s generative AI is enhancing productivity by assisting with tasks, while SoftBank plans to launch its LLM for a wider range of applications.
Culturally sensitive, multilingual AI chatbots
Japanese researchers believe that a high-quality AI chatbot that’s sensitive to Japanese culture could accelerate scientific research. They also believe it could be helpful for international collaboration by helping people to learn Japanese or conduct research on Japan.
Language is intertwined with culture. The design of a database for a large language model, particularly in the context of multilingualism and cultural diversity, presents a challenge on multiple fronts. It’s not OpenAI’s fault that the public internet, from where it gets its data, is mostly in English. I don’t think ChatGPT needs to be fundamentally redesigned or retrained in support of Japanese, or any other language.
But if a database or AI system is being built specifically for multilingual use, then its design should account for cultural nuances and sensitivities in data representation and content. Vocabulary differences, like those I discussed earlier, need to be accounted for. But so do multilingual tokenization, translation variations, cross-language search, and language-dependent features.
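As one concrete example of a language-dependent indexing step: Japanese text mixes full-width and half-width forms of “the same” characters, and Unicode NFKC normalization (available in Python’s standard unicodedata module) folds them together so a single query can match both. This is a minimal sketch of one normalization pass, not a full cross-language search pipeline:

```python
import unicodedata

def normalize_for_index(text: str) -> str:
    """Fold half-width/full-width variants before indexing or search."""
    return unicodedata.normalize("NFKC", text)

# Half-width katakana folds to the standard full-width form,
# and full-width digits fold to ASCII digits.
print(normalize_for_index("ｶﾀｶﾅ"))    # → カタカナ
print(normalize_for_index("１２３"))   # → 123
```

Applying the same normalization to both stored text and incoming queries means users don’t have to know (or care) which width variant a document happened to use.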
The full potential of these models will continue to unfold as we use them. Perhaps in a few years we’ll see a product that successfully integrates training data and database designs from a leading American LLM and a leading Japanese LLM, fostering the meaningful cross-cultural collaboration that some Japanese researchers are hoping for.
