Natural Language Processing, AI and Large Language Models: A Brief Recent History
For better or for worse, Large Language Models (LLMs) – the engines behind natural language processing in commercial AI Platform-as-a-Service (PaaS) subscription offerings – have become one of the first applied “big data” technologies to achieve crossover success in the AI marketplace.
From a big data perspective, LLMs are gigantic datasets or data models. In the world of AI, LLMs are huge neural networks whose size is measured by the number of parameters in the model. Parameters are values that are continually refined during training, and it is these refined values that produce the model’s AI-based predictions. The more parameters, the more the training data is distilled into structured information (organized around the parameters of the LLM) – enhancing the accuracy of the predictions the model generates.
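To make “parameters refined during training” concrete, here is a minimal toy sketch in Python: a single-parameter model fit by gradient descent. This is purely illustrative – a real LLM adjusts billions of such values with far more sophisticated machinery:

```python
# Toy illustration of a "parameter": a single weight w, refined step by step
# so the model's predictions fit the training data.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # inputs x with targets y = 2x
w = 0.0              # one model parameter, initialized arbitrarily
learning_rate = 0.05

for epoch in range(200):
    for x, y in data:
        prediction = w * x
        error = prediction - y
        w -= learning_rate * error * x   # nudge the parameter to reduce error

print(f"Learned parameter: {w:.3f}")     # converges toward 2.0
```

An LLM is, at this level of abstraction, the same loop scaled up: billions of parameters, trillions of words of training text, and vastly more elaborate update machinery.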
In April of 2020, the bleeding edge of innovation in this space was Blender, the Facebook chatbot released as open source with 9.4 billion parameters and an innovative structure for training on 1.5 billion publicly available Reddit conversations – supplemented by additional conversational datasets covering conversations that contained some kind of emotion, information-dense conversations, and conversations between people with distinct personas. Blender’s 9.4 billion parameters dwarfed those of Google’s Meena (released in January 2020) by almost 4X. (1)
OpenAI, a San Francisco-based research and deployment company, released GPT-3 in June of 2020 – and the results were instantly compelling: natural language processing (NLP) with a seeming mastery of language, generating sensible sentences and conversing with humans via chatbots. By 2021, the MIT Technology Review was proclaiming OpenAI’s GPT-3 a top 10 breakthrough technology and “a big step toward AI that can understand and interact with the human world.”
Even hard-core coders were impressed. 3D computer graphics pioneer John Carmack (Doom; currently consulting CTO at Oculus VR) was a bit shaken up by the GPT-3 results (see above). Mr. Carmack is in luck, as DeepMind also recently announced competitive performance results for its standalone AI NLP coding product, the new AlphaCode system.
“Internet-trained models have internet-scale biases.”
As Will Douglas Heaven reported in 2020, “OpenAI’s new language generator GPT-3 is shockingly good—and completely mindless. The AI is the largest language model ever created and can generate amazing human-like text on demand but won’t bring us closer to true intelligence.”
A vital human information processing step – information becoming insight or, heaven forbid, wisdom – continues to be a shortcoming of not only OpenAI’s GPT-3 but natural language processing generally. The final outputs of these systems, and their interpretations of the meaning of the language they generate, are at times poor. Ironically, no matter how large or parameter-stuffed, an LLM remains at the mercy of the biases embedded in – and the quality of – the language sourced for its training (usually aggregate text from the internet). As researchers noted as early as 2020: “Internet-trained models have internet-scale biases.”
Artificial Intelligence Natural Language Processing Platforms: The Class of 2021
GPT-3 has 175 billion parameters – more than 100 times the 1.5 billion parameters of its predecessor, GPT-2. “We thought we needed a new idea, but we got there just by scale,” said Jared Kaplan, a researcher at OpenAI, during a panel discussion at NeurIPS, a leading AI conference.
GPT-3 is a PaaS, subscription-based offering – essentially an API delivery platform via the cloud. This business model solves the pain point of the cost and complexity of an in-house build of an AI-based NLP platform of equal power. The downside is that it reserves AI NLP development for global technology behemoths, adequately financed startups, and government labs with vast computational power and resources: “…large models also take vast amounts of computing power to train, putting them out of reach of all but the richest organizations.” (2) GPT-3, for example, cost $12 million to train – 200X the development cost of GPT-2.
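To make the PaaS consumption model concrete, here is a minimal sketch of what “NLP as an API call” looked like using OpenAI’s Python client as it existed around this time (client versions, engine names, and pricing change frequently, so treat this strictly as an illustration, not current integration guidance):

```python
import os
import openai  # the openai Python client, v0.x era

# The subscription credential is the entire barrier to entry on the consumer
# side: no GPUs, no training pipeline, just an authenticated API call.
openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    engine="davinci",  # a GPT-3 engine name at the time; names change
    prompt="Summarize the OODA loop in one sentence:",
    max_tokens=60,
    temperature=0.7,
)
print(response.choices[0].text.strip())
```

The asymmetry of the business model is visible here: the hard part (training the 175-billion-parameter model) happens once, behind the API, at the provider’s expense.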
According to the MIT Technology Review: “…training a model with more than 100 billion parameters is a complex plumbing problem: hundreds of individual GPUs—the hardware of choice for training deep neural networks—must be connected and synchronized, and the training data must be split into chunks and distributed between them in the right order at the right time. Large language models have become prestige projects that showcase a company’s technical prowess. Yet few of these new models move the research forward beyond repeating the demonstration that scaling up gets good results.” (3) Energy consumption will also be an issue for the entire nascent AI NLP marketplace.
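The “plumbing problem” in the quote above – many synchronized GPUs, each fed its own chunk of the training data – is what distributed data-parallel frameworks automate. Below is a compressed, illustrative sketch using PyTorch’s DistributedDataParallel; the model and dataset are trivial stand-ins, and a real 100-billion-parameter run would also require model and pipeline parallelism that this sketch omits:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # Connect and synchronize the workers (one process per GPU); assumes a
    # single-node launch via: torchrun --nproc_per_node=<num_gpus> train.py
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Stand-in model; DDP keeps gradients synchronized across all replicas.
    model = DDP(torch.nn.Linear(512, 512).cuda(rank), device_ids=[rank])

    # Stand-in dataset; the DistributedSampler splits the training data into
    # chunks and hands each worker its own shard in the right order.
    data = TensorDataset(torch.randn(10_000, 512), torch.randn(10_000, 512))
    loader = DataLoader(data, batch_size=32, sampler=DistributedSampler(data))

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for x, y in loader:
        x, y = x.cuda(rank), y.cuda(rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # triggers the cross-GPU gradient all-reduce
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```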
Since GPT-3’s release in 2020, platforms introduced in 2021 have begun to make GPT-3 look like chump change:
- Beijing Academy of AI’s Wu Dao 2.0: A 1.75-trillion-parameter model.
- Google’s Switch Transformer and GLaM models: 1 trillion and 1.2 trillion parameters, respectively.
- Microsoft and NVIDIA’s Megatron-Turing Natural Language Generation model (MT-NLG): Introduced in October 2021 with 530 billion parameters. “We continue to see hyperscaling of AI models leading to better performance, with seemingly no end in sight,” wrote Microsoft researchers at the time the model was made available. MT-NLG is the successor to Turing NLG 17B and Megatron-LM.
- Google/DeepMind’s Gopher: Released in December 2021 with 280 billion parameters.
- Baidu and Peng Cheng Laboratory – PCL-BAIDU Wenxin: A model with 280 billion parameters. Baidu uses the model for internet search, news feeds, and smart speakers.
- Inspur (China) – Yuan 1.0: A 245-billion-parameter model.
- Huawei (China) – PanGu: A 200-billion-parameter language model.
- U.S. startup AI21 Labs’ Jurassic-1: A commercially available large language model with 178 billion parameters, launched in September 2021.
- South Korean internet search firm Naver – HyperCLOVA: A 204-billion-parameter model. (4)
What Next?
- InstructGPT: In January, a new build of GPT-3, InstructGPT, was made available by OpenAI. The company claims that some of the problems with the interpretation of language and final results have been solved in this new build: “The San Francisco-based lab says the updated model is better at following the instructions of people using it—known as ‘alignment’ in AI jargon—and thus produces less offensive language, less misinformation, and fewer mistakes overall—unless explicitly told not to do so.” (5)
- DeepMind’s Innovative “Retrieval-Enhanced Transformer” (RETRO) LLM Architecture: Released in December of 2021 (and technically a member of the class of 2021), we reserved mention of DeepMind’s RETRO because we did not want its comparatively small neural network (only 7 billion parameters) to be misinterpreted. The smaller size of the RETRO LLM is the core innovation: a data architecture that makes RETRO a development to watch. RETRO’s 7-billion-parameter neural network also references a separate information database, consulted during both training and text generation, allowing for models less than 5% the size of GPT-3 with performance on par with GPT-3’s 175 billion parameters (see the retrieval sketch following this list).
- RETRO May Fuel Exponential Disruption in Healthcare: Take a look at this blog post for a breakdown of how the ability to dynamically update the information database of the RETRO LLM data architecture may contribute to the exponential S-curve of AI innovation in healthcare.
- Is Meta, Inc. Swatting a Fly with an Elephant Gun?: It remains to be seen how Meta, Inc. leverages the new AI supercomputer it unveiled in January – one “that the company maintains will soon be the fastest in the world. The supercomputer, the AI Research SuperCluster, was the result of nearly two years of work, often conducted remotely during the height of the pandemic, and led by the Facebook parent’s AI and infrastructure teams. Meta…said its research team currently is using the supercomputer to train AI models in natural-language processing and computer vision for research. The aim is to boost capabilities to one day train models with more than a trillion parameters on data sets as large as an exabyte, which is roughly equivalent to 36,000 years of high-quality video.”
- Chip innovation (with plummeting prices and soaring performance) is not out of the question: GPT-3 requires roughly 20 GPUs for inference alone (see the back-of-the-envelope memory math following this list). Watch this space for updates on what is proving to be a dynamic chip manufacturing and innovation climate stateside, driven by geopolitical pressures. We should be able to revisit the AI LLM market’s dependence on GPU computational power in the next 6 to 18 months.
- Large Datasets as THE Design Paradigm for AI-based NLP – Don’t Believe the Hype: OODA Network Member Florian Wolf recently introduced the OODA Loop community to small data as an alternative that sidesteps some of the limitations and pitfalls of depending on large datasets as a cure-all for, well, everything (as the market implicitly suggests most of the time). Florian Wolf expanded on recent research by the Georgetown Center for Security and Emerging Technology (CSET), operationalized it, and shared with OODA Network members the real capabilities and real-world use cases of small data.
- Small Data vs. Big Data. What is Small Data? Small data refers to technologies that allow machines to learn from fewer data points – which cuts down on overall chip demand and power consumption – and may be a strategic opportunity for solving compute power and scarcity issues in the event of a severe chip shortage. Small data means “the ability of machine learning or AI systems to learn from small training data sets.” Florian noted that “It is important to have some idea of what the distribution across your growth is because if you make a mistake there, you are synthesizing data along the way. I think as long as you have some idea of what the distribution looks like, it’s really interesting.” Florian is busy running a company, but we will be closing the loop (no pun intended) soon on a co-authored post with Florian containing his working definition of small data – along with some framing of small data architectures and methodologies. We will update our research to include the implications for AI LLMs.
- For more, explore OODA Research and Analysis by searching ‘small data’ or ‘TinyML’: Recent OODA Loop News Briefs on related topics include Why optimizing machine learning models is important and TinyML is bringing deep learning models to microcontrollers.
- New Data Approaches: It is important that the innovation market not calcify too early into an “LLM-only” model for AI-based breakthroughs in performance. The OODA Network recently brainstormed about new data approaches, which are included in the November 2021 OODA Network Member Meeting notes. Ideas discussed were the framing and contextualization of data, ideas for policymakers, data boundaries – more creative use cases – and robotics and data automation. One question arose which applies to our conversation here: Are the technologies used in this market technically automation or data acquisition?
- The Ongoing Cybersecurity Crisis – Security Will Matter: Do not make any assumptions about the security of your source data and/or the storage of your uniquely trained LLM sitting on a cloud-based server in a PaaS model. Put some thought into the strategic risk faced by your data storage architecture. The current fate of Web3 marketplaces and exchanges may hold some tactical security lessons for us here. An AI NLP middleware ecosystem (based on the same Web 2.0 technologies that are being deployed without proper consideration for cybersecurity best practices) would replicate some of the cybersecurity problems DeFi and crypto marketplaces and exchanges have experienced recently (in the form of brash, frequent, and costly cryptocurrency heists in 2021, which have continued in earnest into 2022). What are the probability and impact of the unauthorized retrieval, damage, public release, ransoming, and/or loss of proprietary AI LLMs, source data repositories, and/or the uniquely designed parameter datasets and models? To say nothing of a breach of the outcomes: proprietary, AI-based predictive datasets and models?
- The BigScience initiative: A consortium led by AI company Hugging Face, made up of roughly 500 researchers (some of whom are from big tech firms). The group of volunteer researchers will be building and studying an open-source language model.
- Further Risk Awareness: For a compelling read about a Google researcher fired for pointing out the potential unintended consequences and ethical risk of LLMs, see The paper that forced Timnit Gebru out of Google.
- Large Scale System Design Case Studies are Instructional: In an effort to integrate more systems and design thinking into our research efforts, we recently provided case studies and analysis of a few behemoth software and hardware development projects, which may be instructional to review. Again, the conceit here is that not all big data projects need to default to commercial cloud architectures and data design paradigms based solely on market trends – especially where data security is concerned. Case studies include:
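As referenced in the RETRO item above, here is a highly simplified sketch of the retrieval idea behind a retrieval-enhanced LLM. This is not DeepMind’s implementation – the database, embedding function, and generation step are all hypothetical stand-ins – but it illustrates how a small model can lean on an external, dynamically updatable store of text rather than memorizing everything in its parameters:

```python
import numpy as np

# Hypothetical "database": text chunks with pre-computed embeddings. In RETRO
# this store holds trillions of tokens and can be updated without retraining.
DB_CHUNKS = ["GPT-3 has 175 billion parameters.",
             "RETRO retrieves text from an external database.",
             "Blender was trained on Reddit conversations."]
DB_EMBEDDINGS = np.random.default_rng(0).normal(size=(len(DB_CHUNKS), 64))

def embed(text: str) -> np.ndarray:
    """Stand-in embedding function; a real system uses a trained encoder."""
    r = np.random.default_rng(abs(hash(text)) % (2**32))
    return r.normal(size=64)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k database chunks nearest to the query embedding."""
    scores = DB_EMBEDDINGS @ embed(query)      # dot-product similarity
    top = np.argsort(scores)[::-1][:k]
    return [DB_CHUNKS[i] for i in top]

def generate(prompt: str) -> str:
    """Condition generation on retrieved chunks (generation itself stubbed)."""
    context = " ".join(retrieve(prompt))
    return f"[model output conditioned on: {context!r} and prompt: {prompt!r}]"

print(generate("How many parameters does GPT-3 have?"))
```

The design consequence is the one noted above: knowledge lives in the database rather than the weights, so the model itself can stay small and the knowledge can be refreshed on the fly.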
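And on the “roughly 20 GPUs for inference” figure cited above, a back-of-the-envelope calculation shows why serving a 175-billion-parameter model demands a multi-GPU cluster before a single token is generated. The assumptions here (16-bit weights, 40 GB of usable memory per GPU) are ours, for illustration, not a vendor specification:

```python
# Back-of-the-envelope GPU count for serving a 175B-parameter model.
# Assumptions (illustrative): 2 bytes per parameter (fp16 weights) and
# 40 GB usable memory per GPU. Real deployments also need memory for
# activations, caches, and framework overhead, pushing the count higher.

params = 175e9                 # GPT-3 parameter count
bytes_per_param = 2            # fp16
gpu_memory_bytes = 40e9        # e.g., a 40 GB accelerator

weights_bytes = params * bytes_per_param       # 350 GB of weights alone
min_gpus = weights_bytes / gpu_memory_bytes    # ~9 GPUs just to hold weights

print(f"Weights: {weights_bytes / 1e9:.0f} GB")
print(f"Minimum GPUs (weights only): {min_gpus:.1f}")
```

Holding the weights alone takes roughly 9 such GPUs; add runtime overhead and the widely cited figure of around 20 GPUs for inference becomes plausible.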
About the Author
Daniel Pereira
Daniel Pereira is research director at OODA. He is a foresight strategist, creative technologist, and an information communication technology (ICT) and digital media researcher with 20+ years of experience directing public/private partnerships and strategic innovation initiatives.