While discussions of AI bias have most often centered on the systematic bias large language models sometimes display against particular genders and races, it is becoming clear that models can also be biased toward one language over another.
In recent years, efforts have been made to curb this preference, with AI model developers such as Google and OpenAI creating translation models. Most recently, Google released TranslateGemma on Jan. 15, trained on 55 languages and 500 language pairs, meaning pairs of languages the model can translate between.
However, translation models fail to capture some of the nuances of spoken language. Enterprise AI platform vendor Articul8 says its LLM-IQ agent can measure those gaps. The multi-tiered agentic evaluation system scores models on five qualitative dimensions: fluency and naturalness, coherence, cultural norms, consistency and clarity.
With the framework, Articul8 found that many models failed on cultural appropriateness, suggesting that more work is needed for AI technology to be ready on a global scale.
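Articul8 has not published LLM-IQ's internals, so the following is only a rough sketch of what rubric-based scoring across those five dimensions might look like. The dimension names come from the article, but every data structure, function name and the 1-5 scale are invented for illustration.

```python
from dataclasses import dataclass

# Illustrative only: the dimension names mirror the article, but the data
# structures and the 1-5 scale are assumptions, not Articul8's LLM-IQ API.
DIMENSIONS = {"fluency_and_naturalness", "coherence", "cultural_norms",
              "consistency", "clarity"}

@dataclass
class DimensionScore:
    dimension: str
    score: float    # assumed 1-5 rubric score from a judge model or human rater
    rationale: str

def report_card(scores: list[DimensionScore]) -> dict[str, float]:
    """Collapse per-dimension scores into one report with an overall mean."""
    assert {s.dimension for s in scores} == DIMENSIONS, "score all five dimensions"
    card = {s.dimension: s.score for s in scores}
    card["overall"] = sum(card.values()) / len(card)
    return card

# A translation can be accurate overall yet fail on one cultural axis, as in
# the "accurate, but rude" feedback described in the interview below.
print(report_card([
    DimensionScore("fluency_and_naturalness", 4.5, "reads smoothly"),
    DimensionScore("coherence", 4.0, "logically ordered"),
    DimensionScore("cultural_norms", 1.5, "plain form where honorifics fit"),
    DimensionScore("consistency", 4.0, "terminology is stable"),
    DimensionScore("clarity", 4.5, "unambiguous"),
]))
```

Keeping the dimensions separate rather than averaging them away is what lets a failure on a single axis, such as cultural norms, stay visible in the report.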
In this Q&A, Articul8 CEO and founder Arun Subramaniyan discusses what led to the development of the framework and why having a culturally appropriate model is essential.
What led Articul8 to develop the LLM-IQ agent, and why did it focus on the nuances of translation in AI models?
Arun Subramaniyan: We have customers in Japan and in Korea. As we started deploying into those regions, we needed models that actually understood multiple languages.
When we deployed some of our systems early on, the customer was both happy and unhappy.
In Japan and Korea, they told us, “Your answer is accurate, but it’s rude.”
We said, "Okay," but we didn't know the difference.
It turns out that Japanese has multiple layers of complexity. A lot of languages have that. For example, in English, "you" is only "you." It's neither respectful nor disrespectful. Whereas in many languages, there is one "you" for people you're on the same level with, and if you're addressing elders, seniors or someone you respect, it's a different word. Those nuances are sometimes picked up, but most of the time they are not.
But in Japanese, there is one more level, where the context matters: what you are saying, who you are saying it to, who is saying it, and what outcome you want from that conversation. You can be direct, indirect, polite, overly polite or slightly harsh. Depending on the context, if you use the wrong, say, intonation, that's also considered wrong.
That's really what intrigued us, because this operates at the linguistic level. Even though it's not a technical domain, it is effectively a domain-specific language within Japanese.
After we did more research, we found the problem to be very systematic. All of the models were built predominantly on English or Latin-based languages, and even models from China missed this nuance completely. Their representation of Japanese in terms of digital content might be higher, but they were not trained to catch these nuances.
In what situations would it matter whether an LLM is polite or impolite?
Subramaniyan: For example, in a supply chain, you don't know whether somebody was giving a recommendation or issuing a directive, and that distinction has profound implications.
Getting it wrong can also carry serious costs.
If you have an automotive system generating a recommendation, the human in the loop reading it doesn't know whether the recommendation needs to be acted on with 100% certainty. That has profound implications in an industrial setting.
With the rise of sovereign AI, as more regional AI vendors address local issues with their own technology, why should a vendor outside a country like Japan be the one to deal with the language problem?
Subramaniyan: I see this more as someone with global insight versus somebody with only local insight. You need to be locally enabled, but globally optimized.
It's about global learning applied immediately in Japan, with localization that is uniquely Japanese. Yes, a local vendor knows the localization instantly, but imagine having to operate globally and gather all the data you need to do what you need to do.
For example, our energy models are based on global datasets. Our manufacturing models come from global partnerships. Our research partnership with Meta and our scaling partnership with AWS all come because we are a global operator. But we also operate with the deep understanding that even though we are global, we have to customize what we do.
Why do you think LLMs appear unable to catch the nuances of a language like Japanese?
Subramaniyan: The biggest fault is that all of the data sets are extremely biased. What I mean by "biased" is an asymmetric distribution of English versus non-English content. Even among Latin-based languages, the distribution is asymmetric: I'm talking 99% to 1%. It's not a slight difference.
Even digitized non-English content comes primarily from the West or from sources we don’t have access to, such as China.
All of that politeness, what is considered polite and impolite, what is considered near-natural human interaction, came from the West.
In developing this framework, was there a particular open source model that worked better than proprietary models?
Subramaniyan: We benchmarked against all open source models and all closed source models. But then we had to build these models from the ground up because we had to balance the data set. If you don't balance the data set, you're going to keep getting the same bias.
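The interview doesn't say how Articul8 balanced its data. As a hedged illustration, one common approach is to up-sample underrepresented languages until each matches the dominant one, roughly like this:

```python
import random

# Illustrative only: this up-sampling rule is one common balancing technique,
# not a description of Articul8's actual pipeline. The 99:1 split mirrors the
# asymmetry Subramaniyan describes above.
corpus = {
    "en": [f"en_doc_{i}" for i in range(99_000)],
    "ja": [f"ja_doc_{i}" for i in range(1_000)],
}

def balance(corpus: dict[str, list[str]], seed: int = 0) -> dict[str, list[str]]:
    """Up-sample every language to the document count of the largest one."""
    rng = random.Random(seed)
    target = max(len(docs) for docs in corpus.values())
    return {lang: docs + rng.choices(docs, k=target - len(docs))
            for lang, docs in corpus.items()}

balanced = balance(corpus)
print({lang: len(docs) for lang, docs in balanced.items()})
# {'en': 99000, 'ja': 99000}
```

In practice, teams often combine up-sampling with fresh data collection, since repeating the same minority-language documents only goes so far.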
We have a concept called Model Mesh, which enables us to orchestrate and decide at runtime which models to call for what. We don't necessarily need a large, general-purpose model that has to be fine-tuned for every task. We can have task-specific models that are independent and then make them work together as a system. The system then acts as a runtime reasoning engine.
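Model Mesh's internals aren't public. A minimal sketch of the runtime-routing idea, with every model name and routing rule invented for illustration, might look like this:

```python
from typing import Callable

# Hypothetical registry of independent task-specific models. In a real system
# the values would be model endpoints; here they are stub functions.
REGISTRY: dict[str, Callable[[str], str]] = {
    "ja_politeness": lambda text: f"[japanese-politeness model] {text}",
    "general": lambda text: f"[general-purpose model] {text}",
}

def route(task: str, text: str) -> str:
    """Decide at runtime which model handles a request, with a general fallback."""
    model = REGISTRY.get(task, REGISTRY["general"])
    return model(text)

print(route("ja_politeness", "Please confirm the delivery date."))
print(route("summarize", "Quarterly output rose 4%."))
```

The design choice Subramaniyan describes is that the routing layer, not any single model, carries the general-purpose burden, so each task-specific model can stay small and independent.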
Yes, we do use general-purpose models to acquire information about the world. But then, when it comes to Japan and the Japanese language, we have our own model.
The other question on people's minds would be, "Oh my god, do I need to build massive models for every single task?"
The answer is no, because we end up with a family of models that grow together. If a model does one task really, really well, that influences and improves performance across the board.
Editor's note: This interview has been edited for clarity and conciseness.