Just as the proprietary Unix operating systems (Solaris, Irix, AIX) gave way, over a 30- to 40-year period, to the Linux-led open source operating system and the broader open source software movement, the open source large language model “market” is poised to eat up market share in an exponentially faster time frame.
Hugging Face has created an “Open LLM-Perf Leaderboard” that your organization can use to track, quantify, and evaluate this exponential growth in LLM performance.
When evaluating large language models for enterprise use, several performance metrics should be considered. These metrics help assess the quality, effectiveness, and suitability of a language model for specific applications. Here are some key performance metrics:
1. Perplexity: Perplexity measures how well a language model predicts a sequence of words. It indicates the model’s ability to assign high probabilities to the actual next words in a given context. Lower perplexity values indicate better performance.
2. Accuracy: Accuracy is a common metric for classification tasks. It measures the proportion of correctly classified instances compared to the total number of instances. It’s particularly relevant for tasks such as sentiment analysis, intent recognition, or text categorization.
3. F1 Score: The F1 score is a metric commonly used for tasks involving precision and recall, such as named entity recognition, part-of-speech tagging, or information extraction. It balances precision (the proportion of correctly identified positive instances) and recall (the proportion of actual positive instances correctly identified).
4. BLEU Score: The BLEU (Bilingual Evaluation Understudy) score is often used for evaluating machine translation systems. It measures the similarity between the generated translation and one or more reference translations, taking into account n-gram overlap and precision.
5. ROUGE Score: The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is used for evaluating text summarization systems. It measures the overlap between the generated summary and one or more reference summaries, considering various measures like n-gram overlap and recall.
6. Latency and Throughput: For enterprise use, the speed and efficiency of language models are crucial. Latency measures the time it takes for the model to process a single input, while throughput indicates the number of inputs the model can handle in a given time frame.
7. Domain-Specific Metrics: Depending on the specific application, there may be additional metrics relevant to the domain. For example, in legal or medical applications, metrics related to precision, recall, or F1 score for specific entities or concepts might be important.
It’s important to note that the choice of performance metrics depends on the specific use case, and different metrics may be more relevant for different tasks. Evaluating a language model’s performance using a combination of these metrics can provide a comprehensive understanding of its capabilities and suitability for enterprise use. The sketches below illustrate how several of these metrics can be computed in practice.
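For instance, here is a minimal sketch of computing perplexity with a Hugging Face causal language model. It assumes `torch` and `transformers` are installed; the model name and sample text are placeholders, not a recommendation:

```python
# Minimal perplexity sketch using a Hugging Face causal LM.
# Assumes `pip install torch transformers`; "gpt2" is just an example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    # over the sequence; perplexity is exp(loss).
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```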
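Classification and generation metrics such as accuracy, F1, BLEU, and ROUGE can be computed with Hugging Face’s `evaluate` library. A sketch on toy data, assuming the library and the metrics’ extra dependencies (e.g., `rouge_score`) are installed:

```python
# Sketch of classification and text-generation metrics with the
# Hugging Face `evaluate` library. The inputs below are toy data.
import evaluate

# Accuracy and F1 for a binary classification task (e.g., sentiment).
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
preds, refs = [1, 0, 1, 1], [1, 0, 0, 1]
print(accuracy.compute(predictions=preds, references=refs))  # {'accuracy': 0.75}
print(f1.compute(predictions=preds, references=refs))        # {'f1': 0.8}

# BLEU for translation: each prediction may have multiple references.
bleu = evaluate.load("bleu")
print(bleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat sat on the mat", "a cat was on the mat"]],
))

# ROUGE for summarization: overlap between generated and reference summaries.
rouge = evaluate.load("rouge")
print(rouge.compute(
    predictions=["the model summarizes the article"],
    references=["the model summarizes the news article"],
))
```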
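Finally, latency and throughput can be roughly estimated by timing repeated generation calls. This is a crude sketch for intuition only; dedicated tools such as Optimum-Benchmark, discussed below, are far more rigorous:

```python
# Rough latency/throughput sketch: times repeated generation calls.
# Results vary widely with hardware, batch size, and sequence length.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = tokenizer("Hello, world", return_tensors="pt")
n_runs, new_tokens = 10, 32

start = time.perf_counter()
with torch.no_grad():
    for _ in range(n_runs):
        model.generate(**prompt, max_new_tokens=new_tokens,
                       pad_token_id=tokenizer.eos_token_id)
elapsed = time.perf_counter() - start

print(f"Mean latency per request: {elapsed / n_runs * 1000:.1f} ms")
print(f"Throughput: {n_runs * new_tokens / elapsed:.1f} generated tokens/sec")
```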
Hugging Face is an organization that focuses on natural language processing (NLP) and is well-known for developing open-source tools and libraries for working with NLP models. Their most prominent contribution is the Transformers library, which provides a high-level API and pre-trained models for a wide range of NLP tasks.
The Hugging Face Transformers library allows researchers and developers to easily access and use state-of-the-art models such as BERT, GPT, and many others. These models have been pre-trained on massive amounts of text data and can be fine-tuned on specific downstream tasks, such as text classification, named entity recognition, question answering, machine translation, and more.
Hugging Face also provides a platform called the Hugging Face Hub, where users can upload, share, and download pre-trained models and datasets. This hub serves as a collaborative space for the NLP community, enabling users to discover and experiment with the latest models and datasets.
In addition, Hugging Face offers a range of other NLP-related tools and libraries, including tokenizers for various languages, pipelines for common NLP tasks, and visualization tools. They have made significant contributions to the NLP field by democratizing access to state-of-the-art models and fostering a vibrant community of researchers and developers.
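As a quick taste of that API, here is a minimal sketch using the `pipeline` helper; when no model is specified, the library downloads a current default model for the task:

```python
# Minimal sketch of the Transformers pipeline API (pip install transformers).
# With no model specified, Hugging Face selects a default for the task.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Open source LLMs are improving at a remarkable pace."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```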
The Hugging Face Open Large Language Model Leaderboard
The Open LLM-Perf Leaderboard aims to benchmark the performance (latency and throughput) of Large Language Models (LLMs) across different hardware, backends, and optimizations using Optimum-Benchmark and Optimum flavors.
Anyone from the community can request a model or a hardware/backend/optimization configuration for automated benchmarking.
First, note that the Open LLM Leaderboard is actually just a wrapper around an open-source benchmarking library, the EleutherAI LM Evaluation Harness, created by EleutherAI, the non-profit AI research lab famous for creating The Pile and training GPT-J, GPT-NeoX-20B, and Pythia. A team with serious credentials in the AI space!
This wrapper runs evaluations using the EleutherAI harness on the spare cycles of Hugging Face’s compute cluster and stores the results in a dataset on the Hub, which are then displayed in the leaderboard’s online Space.
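For reference, a local run of the harness from Python looks roughly like the following sketch. The exact entry point varies across harness versions (this assumes the 0.4.x `simple_evaluate` API), and the model and task here are illustrative choices, not what the leaderboard itself runs:

```python
# Hedged sketch of invoking the EleutherAI lm-evaluation-harness
# (pip install lm-eval). API details differ across harness versions;
# this follows the 0.4.x `simple_evaluate` entry point.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any causal LM on the Hub (example)
    tasks=["hellaswag"],           # one of the leaderboard's benchmarks
    num_fewshot=10,                # the leaderboard uses 10-shot HellaSwag
)
print(results["results"]["hellaswag"])
```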
For the LLaMA models, the MMLU numbers obtained with the EleutherAI LM Evaluation Harness significantly differ from the MMLU numbers reported in the LLaMA paper. (2)
To get started with the leaderboard, go to The Hugging Face Open LLM-Perf Leaderboard.
A really actionable, intermediate/advanced discussion of the source code and build-out of the leaderboard by the Hugging Face team can be found at:
What’s going on with the Open LLM Leaderboard?
In a further illustration that the market uptake of generative AI and LLMs also has an “exponential” quality to it (for good or for ill?), the source and content of the following press release are very interesting.
The context: Business Wire, a Berkshire Hathaway company, leverages the Hugging Face Open LLM Leaderboard in a “Good Housekeeping Seal of Approval” (i.e., independent verification) kind of way, fused with the role that weekly reporting of American movie box office revenues used to play in American life, for the public relations campaign and marketing of a United Arab Emirates-based LLM platform.
ABU DHABI, United Arab Emirates–(BUSINESS WIRE)–Falcon 40B, the UAE’s first large-scale open-source, 40-billion-parameter AI model launched by Abu Dhabi’s Technology Innovation Institute (TII) last week, soared to the top spot on Hugging Face’s latest Open Large Language Model (LLM) Leaderboard. Hugging Face, an American company seeking to democratize artificial intelligence through open-source and open science, is considered the world’s definitive independent verifier of AI models.
Falcon 40B managed to beat back established models such as LLaMA from Meta (including its 65B model), StableLM from Stability AI, and RedPajama from Together to achieve the coveted ranking. The index utilizes four key benchmarks from the Eleuther AI Language Model Evaluation Harness, a consolidated framework that assesses generative language models on: the AI2 Reasoning Challenge (25-shot), a set of grade-school science questions; HellaSwag (10-shot), a test of common sense inference, which is easy for humans but challenging for SOTA models; MMLU (5-shot), a test to measure a text model’s multitask accuracy; and TruthfulQA (0-shot), a test to measure whether a language model is truthful in generating answers to questions.
Hugging Face’s Open LLM Leaderboard is an objective evaluation tool open to the AI community that tracks, ranks, and evaluates LLMs and chatbots as they are launched.
Trained on one trillion tokens, Falcon 40B marks a significant turning point for the UAE in its journey towards AI leadership, enabling widespread access to the model’s weights for both research and commercial utilization. The new ranking confirms the model’s prowess in making AI more transparent, inclusive, and accessible for the greater good of humanity.
With this latest development, TII has managed to secure the UAE a seat at the table when it comes to generative AI models, allowing it to join an exclusive list of countries that are working to drive AI innovation and collaboration.
TII has already embarked on work on its next version of Falcon: the 180B AI model. To learn more about the current open-source Falcon 40B AI model, please visit: FalconLLM.TII.ae. The initial announcement of Falcon 40B can be found here: UAE’s Technology Innovation Institute Launches Open-Source “Falcon 40B” Large Language Model for Research & Commercial Utilization.
For more information, visit www.tii.ae. (Source: AETOSWire)
https://oodaloop.com/archive/2023/05/05/we-have-no-moat-and-neither-does-openai-leaked-google-document-breaks-down-the-exponential-future-of-open-source-llms/