Why Humans Still Need to Be Involved in Language-Based AI
New, sophisticated AI models such as OpenAI’s GPT-3 are making headlines for their ability to mimic human-like language. Does this mean humans will be replaced with computers? Not so fast.
Despite the hype, these algorithms have major flaws. Machines still fall short of understanding the meaning and intent behind human conversation, and ethical concerns such as bias in AI remain far from resolved. For these reasons, humans need to stay in the loop in most practical AI applications, especially in nuanced areas such as language.
Humans remain the best way to understand context
New machine learning models like GPT-3 are highly complex systems trained on vast amounts of data, which allows them to perform relatively well on a variety of language tasks out of the box. Given just a few examples of a specific task, they can perform very well.
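As a rough illustration of what "a few examples of a specific task" can look like in practice, the sketch below assembles a few-shot translation prompt as plain text. The function name, the "French:"/"English:" labels, and the example pairs are all assumptions made for illustration; no model is called, and this is not necessarily the format used in the GPT-3 experiments.

```python
def build_few_shot_prompt(examples, query, source_lang="French", target_lang="English"):
    """Format worked examples, then the new input the model is asked to complete."""
    lines = []
    for source_text, target_text in examples:
        lines.append(f"{source_lang}: {source_text}")
        lines.append(f"{target_lang}: {target_text}")
        lines.append("")  # blank line separates one example from the next
    # The prompt ends with an open target-language slot for the model to fill in.
    lines.append(f"{source_lang}: {query}")
    lines.append(f"{target_lang}:")
    return "\n".join(lines)


# Hypothetical example pairs, written by hand for this sketch.
examples = [("Bonjour", "Hello"), ("Merci beaucoup", "Thank you very much")]
prompt = build_few_shot_prompt(examples, "Où est la gare ?")
print(prompt)
```

The resulting text would be sent to a language model, whose continuation after the final "English:" is taken as the answer.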
So far, beta testers have had striking results using GPT-3 for many applications, such as writing essays, creating chatbots for historical figures, and even machine translation. Despite being trained predominantly on English data, the researchers behind GPT-3 found that the model can translate from French, German, and Romanian to English with surprising accuracy.
It would be convenient if we could use a single AI system like GPT-3 for several tasks at once, such as answering and translating a customer’s question simultaneously. However, translation is essentially a serendipitous side effect of training such a large, powerful model. There is still a long way to go before we can comfortably rely on a model like this to provide customer-facing responses.
OpenAI’s CEO Sam Altman said on Twitter that despite the hype, GPT-3 “still has serious weaknesses and sometimes makes very silly mistakes.” GPT-3 experiments are still riddled with errors, some of them more egregious than others. Users don’t always get desirable answers on the first try, and therefore need to adjust their prompts to get correct answers. Machine learning algorithms cannot be expected to be 100% accurate. Humans are still required to differentiate acceptable responses from the unacceptable.
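One common way to keep a person in the loop is to send only high-confidence machine output directly to the customer and route everything else to a human reviewer. The sketch below assumes some upstream step has already attached a confidence score between 0 and 1 to each draft; the `Draft` class, the `route` function, and the 0.9 threshold are hypothetical and would need tuning for any real system.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    text: str
    confidence: float  # assumed score in [0, 1] from a hypothetical quality-estimation step


def route(draft, threshold=0.9):
    """Return 'send' for drafts at or above the threshold, 'human_review' otherwise."""
    return "send" if draft.confidence >= threshold else "human_review"


# Hand-written drafts with made-up scores, for illustration only.
drafts = [
    Draft("Could you please provide your payment details?", 0.97),
    Draft("Give me your credit card number.", 0.55),
]
decisions = [route(d) for d in drafts]
print(decisions)
```

A lower threshold sends more responses without review, so the setting is a trade-off between speed and the cost of an unacceptable answer reaching a customer.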
The power of context in translation
Part of determining what is acceptable is making judgments related to how language is interpreted in context, which is something humans excel at. We effortlessly know that if we ask a friend, “Do you like to cook?” and her response is “I like to eat,” she probably doesn’t enjoy cooking. Context is also the reason we would say, “Could you please provide your payment details?” to a customer rather than, “Give me your credit card number,” even though the two sentences have the same intent.
In settings where there is little margin for error, such as real-time customer service chats, humans occasionally need to correct machines’ mistakes. Local dialects and phrases can easily be misinterpreted by machine translation. It’s also critical that a translation system adheres to localized cultural norms — for example, speaking formally in a business setting in countries like Germany or Japan. So, for now, we still need humans to process the nuances of language.
GPT-3 is impressive, but still biased
Going beyond questions of context, humans also need to be involved in the development of these language models for ethical reasons. We know AI systems are often biased, and GPT-3 is no exception. In the GPT-3 paper, the authors conduct a preliminary analysis of the model’s shortcomings around fairness, bias, and representation, running experiments related to the model’s perception of gender, race, and religion.
After giving the model prompts such as “He was very”, “She was very”, “He would be described as”, and so on, the authors generated many samples of text and looked at the most common adjectives and adverbs present for each gender. They noted that females are more often described with words related to their appearance (“beautiful,” “gorgeous,” “petite”), whereas males are described with more varied terms (“personable,” “large,” “lazy”). In examining the model’s “understanding” of race and religion, the authors conclude that “internet-trained models have internet-scale biases; models tend to reflect stereotypes present in their training data.”
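The kind of co-occurrence count described above can be sketched in a few lines. This is a simplified stand-in, not the authors' method: it filters a small hand-picked stopword list instead of tagging adjectives and adverbs, and the completions are neutral hand-written placeholders rather than real model output. The names `SAMPLES`, `STOPWORDS`, and `top_words` are assumptions for illustration.

```python
import re
from collections import Counter

# Placeholder completions written by hand; NOT real model output.
SAMPLES = {
    "He was very": ["tall and quiet", "tall and tired", "tall and quiet"],
    "She was very": ["kind and quiet", "kind and tired", "kind and tired"],
}

# Tiny illustrative stopword list; a real analysis would use a part-of-speech tagger.
STOPWORDS = {"and", "the", "a", "an", "of", "to", "in", "but", "or"}


def top_words(completions, k=2):
    """Count non-stopword tokens across completions and return the k most frequent."""
    counts = Counter()
    for text in completions:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(k)


for prompt_text, completions in SAMPLES.items():
    print(prompt_text, top_words(completions))
```

Comparing the top words per prompt across many generated samples is what lets differences between groups show up as measurable patterns rather than anecdotes.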
None of this is novel or surprising, but investigating, identifying, and measuring biases in AI systems (as the GPT-3 authors did) are necessary first steps toward the elimination of these biases.
Keeping humans in the machine learning loop
To make tangible progress in mitigating these biases and their impact, we need humans. This goes beyond having them correct errors, augment datasets, and retrain models. Researchers from UMass Amherst and Microsoft analyzed nearly 150 papers related to “bias” in AI language processing, and found that many have vague motivations and lack normative reasoning. Often, they do not explicitly state how, why, and to whom the “biases” are harmful.
To understand the real impact of biased AI systems, they argue, we must engage with literature that “explores the relationship between language and social hierarchies.” We must also engage with communities whose lives are affected by AI and language systems.
After all, language is a human phenomenon, and as practitioners of AI, we should consider not only how to avoid offensive-sounding machine-generated text, but also how our models interact with and impact the societies in which we live.
In addition to bias, major concerns continue to surface about GPT-3’s potential for automated toxic language generation and fake news propagation, as well as the environmental impact of the raw computing power needed to build larger and larger machine learning models.
Here, the need for humans isn’t an issue of model performance, but of ethics. Who, if not humans, will ensure such technology is used responsibly?
GPT-3 can’t say, “I don’t know”
If the goal is to train AI to match human intelligence, or at least perfectly mimic human language, perhaps the largest issue is that language models trained solely on text have no grounding in the real world (although this is an active research area). In other words, they don’t truly “know” what they’re saying. Their “knowledge” is limited to the text they are trained on.
So, while GPT-3 can accurately tell you who the U.S. president was in 1955, it doesn’t know that a toaster is heavier than a pencil. It also thinks the correct answer to “How many rainbows does it take to jump from Hawaii to seventeen?” is two. Whether or not machines can infer meaning from pure text is up for debate, but these examples suggest that the answer is no — at least for now. To use AI-based language systems responsibly, we still need humans to be closely involved.
About the Author
Christine Maroti, AI Research Engineer at Unbabel, is originally from New York, and is often referred to as Tina, Tininha, or Tuna. She moved to Lisbon in the summer of 2018 to work in Applied AI at Unbabel. When she’s not training translation models, Tina enjoys scouring her new country for the best croquetes de carne.