

What Leaders Need to Know About the State of Natural Language Processing (NLP)

This post seeks to illuminate major developments in computer language understanding in a way that can help enterprise and government leaders better prepare to take action on these incredible new capabilities.

The Big Deal

Major improvements in the ability of computers to understand what humans write, say and search are being fielded. These improvements are significant, and will end up changing just about every industry in the world. But at this point they are getting little notice outside a narrow segment of experts.

The History of Computer Language Processing

The potential power of computers that could understand natural language (the discipline of Natural Language Processing, or NLP) has been a formal goal pursued with vigor for over 60 years. From the 1950s to the 1990s, methods primarily involved hard-coding meaning and relationships into software. There were significant gains during those decades, but it wasn't until the late 1990s that powerful statistical methods of understanding language enabled computers to get a basic gist of the content of a document. Statistical methods might count the number of words and then infer some meaning about the document. For example, if a 1,000-word document has 20 mentions of the word "dog," it probably has something to do with dogs.

The rise of search engines resulted in significant investment in new methods of understanding text, but almost all were statistics-based. Even with the fielding of new machine learning approaches, the computers were fundamentally using methods based largely on counting words and phrases, counting links, or other statistical signals. Major search engines like Google and enterprise search capabilities like Lucene and Elasticsearch leveraged these and other techniques to infer meaning and relevance. There were continued breakthroughs and improvements, but progress was slow. By 2010, benchmark evaluations showed that even the most advanced of these approaches could not understand information well enough to answer questions over standardized data. Even when finely tuned and well trained, the best systems performed only at the level of a third grader on benchmark question-and-answer tests.
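As a rough illustration of the counting-based approach described above, here is a minimal sketch in Python; the toy document, tokenization, and stop-word list are invented purely for illustration:

    from collections import Counter
    import re

    # A toy document. Early statistical systems inferred a document's topic
    # largely from raw term frequency, as sketched here.
    document = ("The dog chased the ball. The dog barked at the mail carrier. "
                "Later the dog slept while another dog watched the yard.")

    # Crude tokenization: lowercase the text and keep only runs of letters.
    tokens = re.findall(r"[a-z]+", document.lower())

    # Count how often each word appears.
    counts = Counter(tokens)

    # A purely statistical "understanding": frequent content words are treated
    # as the topic once common filler words are removed.
    stopwords = {"the", "at", "a", "while", "another"}
    topic_terms = [(w, c) for w, c in counts.most_common() if w not in stopwords]
    print(topic_terms[:3])  # e.g. [('dog', 4), ...] -> probably about dogs

Counting like this says nothing about meaning: "canine" and "dog" would be treated as unrelated words, which is exactly the gap the newer methods described below close.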

By 2018, just eight years later, natural language processing capabilities were outperforming humans on those same benchmarks. That pace of change is the clearest proof of how quickly this field is moving.

Breakthroughs in Computer Language Understanding

Three major advances in algorithms and methods of analyzing text led to computers exceeding human performance in comprehending text on benchmark tests:

  • In 2013, a new method of analyzing data called word vectors was announced. It maps the meaning of a word in relation to other words based on how similar they are. For example, car is closer to truck than it is to penguin, and these relationships can be mapped in a multidimensional space covering all words and phrases ever written (a minimal sketch appears after this list). This method can run over data "unsupervised," meaning it does not require humans to guide it, which makes it very scalable. It has been proven to work over some of the world's largest datasets and is already delivering capability in several of Google's key search tools.
  • In 2017, a new method of understanding a full phrase of text was introduced, leveraging the very latest machine learning capabilities. This method, called Transformer Networks, allows for better understanding based on the sequential nature of language. Computers can now understand, on their own, the difference between "The President Briefs NATO" and "NATO Briefs The President." Both phrases are grammatically correct, but they have different meanings. Transformers enable computers to see this on their own rather than being guided by a human or pre-programmed to do so.
  • In 2018, research published by Google introduced BERT, which brought a new approach to pre-training language models in a way that helps computers comprehend text closer to the way humans do. It enables computers to understand the essence of information even if they don't grasp all the details. BERT and subsequent models have been trained over massive quantities of data in every language available, and training at that scale produced a capability that can more easily grasp the meaning of text in any language. It has also learned the pragmatics of language, recognizing which meanings are most likely in context. After this large-scale training, BERT can be fine-tuned on much smaller datasets with similar results, meaning it has enormous potential for improving the use of data by organizations of all sizes (see the second sketch after this list).
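To make the word-vector idea from the first bullet concrete, here is a minimal sketch; the three-dimensional vectors are invented purely for illustration (real embeddings such as word2vec are learned, unsupervised, from large corpora and typically have hundreds of dimensions):

    import numpy as np

    # Toy word vectors, invented for illustration only. Real embeddings are
    # learned automatically from large text corpora, not written by hand.
    vectors = {
        "car":     np.array([0.90, 0.80, 0.10]),
        "truck":   np.array([0.85, 0.75, 0.20]),
        "penguin": np.array([0.10, 0.20, 0.90]),
    }

    def cosine_similarity(a, b):
        # 1.0 means the vectors point the same way; values near 0 mean unrelated.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(vectors["car"], vectors["truck"]))    # ~0.99, very close
    print(cosine_similarity(vectors["car"], vectors["penguin"]))  # ~0.30, far apart

In a learned embedding space, distances like these are what let a search system recognize that a query about "cars" is relevant to documents about "trucks."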
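The second and third bullets can be illustrated together. The sketch below assumes the open-source Hugging Face transformers library, PyTorch, and the publicly released bert-base-uncased model are available; it shows a pre-trained model producing different representations for the two word orders in the example above, which is behavior that word-counting methods cannot deliver:

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Load a pre-trained BERT model; no task-specific training has happened yet.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(sentence: str) -> torch.Tensor:
        # Represent a sentence as one vector by mean-pooling BERT's token outputs.
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state.mean(dim=1).squeeze(0)

    a = embed("The President briefs NATO")
    b = embed("NATO briefs the President")

    # The two sentences use exactly the same words, but because the model
    # attends to word order, their representations are not identical.
    print(float(torch.nn.functional.cosine_similarity(a, b, dim=0)))

Fine-tuning, as described in the third bullet, typically starts from these same pre-trained weights and continues training briefly on an organization's own, much smaller labeled dataset (for example by swapping in the library's AutoModelForSequenceClassification class).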

There are still technical barriers to implementing these types of solutions outside of large ecommerce and cloud providers. But we have seen innovation in the startup world, including Haystack and ZIR Semantic Search, that will soon bring these types of capabilities to market for any organization, small or large (my friend, famed innovator and creator Amr Awadallah, just joined ZIR as CEO, which can be taken as an indicator of exciting things happening in this space).

All of this means the time is now for organizations outside of Silicon Valley to start thinking about how to implement solutions that embody these capabilities.

Potential Impact: A discussion of some use cases for improved NLP

Here are a few ways improved computer language understanding can improve outcomes:

Science in All Disciplines: Researchers will be able to leverage all results of all research, no matter the language, using conversational queries that return any relevant information even when previous reports used synonyms rather than exactly the same taxonomy of terms. Duplicate research can be minimized and forward progress accelerated.

Health and Healthcare: For the first time in history functionally interoperable patient medical records will be possible without the huge arguments about formats, taxonomies and ontologies for capturing patient medical information. And patients who desire to research their own health or nutrition can access information that is not biased towards spammy SEO tactics or paid placement.

Supply Chain: Global businesses rely on complex supply chains that can span multiple countries and companies, with data and information in multiple languages. New computer language capabilities will allow any authorized user to access information from across the entire supply chain, no matter what language it was originally captured in, and have it presented in the user's language, optimizing outcomes.

Ecommerce: Product reviews can be part of a buyer's decision, but currently they are so undermined by poor search capabilities, and in many cases so overwhelmed by spam, that they do not add the value to customers or sellers that they should. New natural language capabilities will enable customers to search reviews in any language and get relevant results even when the reviews use synonymous phrases, making market research easier.

Justice and Law: Lawyers are expected to know every law that has ever been written in every jurisdiction. That is obviously an impossible task, one currently addressed by hiring more lawyers and spending more time on research. New natural language capabilities will empower lawyers to research every law in every language through simple conversational dialogue. In most major cases, especially those involving business, lawyers accumulate and must search through large amounts of data in a process called discovery. This can include large troves of corporate data, but also witness statements, medical reports and many other records, in multiple languages. New natural language search will empower them greatly.

Government Service to Citizens: Citizens seeking to interact with government at the local, state and federal levels, across all branches, face a wide range of interfaces and search tools that are at this point suboptimal and time consuming to use. By modernizing search, discovery and information retrieval, governments at all levels will be able to better serve citizens with the information they need.

Diplomacy: Diplomats from open societies seek to understand the world. The ability to search, using advanced methods, all the information in the world no matter what language it is in, and to receive results based on relevance rather than antiquated statistical methods, will contribute directly to that mission.

Defense: The US DoD is the largest single enterprise in the world, with information search and retrieval needs spanning every conceivable use case, including HR, employee management, workforce training, retention, finance, acquisition, research, design, operations, strategy, doctrine and planning. These and every other use case for information retrieval will benefit from modernization to enable truly optimized search and retrieval.

Personal Information Management: The last decade has seen an explosion of personal information generated on and by users, as well as the development of a new field of personal note-taking and research tools designed to capture insights for personal use. None of these offers more than very basic search functionality, and none is comprehensive over all personal holdings. New computer language capabilities will enable optimization of personal information retrieval and analysis.

All Businesses: What business does not have a failed enterprise search technology? All businesses are in need of a system that empowers administration, HR, legal, operations, business intelligence and frontline knowledge workers with better information retrieval that spans all internal holdings in all languages. Doing this before the competition will be a competitive advantage.

Here are some initial suggestions for action today

  • Leaders in commercial organizations should engage their leadership team and workforce to get an understanding of how internal enterprise search and information retrieval is currently working (or not working, as the case may be). You probably already have many anecdotal examples of how search is failing, but the front-line workers on your team may have critical insights you are missing about how they cannot get the information they need in a timely way to do analysis, make decisions or land business.
  • Leaders in government agencies should similarly collect insights on how improved search and information retrieval capabilities can improve support to missions.
  • The actual methods being leveraged to improve computer language processing get very technical very quickly. Leaders should ensure the details of these new approaches are well understood by the technical team. We have linked to key academic papers below which dive deep into what is happening here. Your organization will be ready to move more quickly in this space if the right people are fluent in these concepts.
  • All organizations should establish not only requirements but a new vision for how search, discovery and information retrieval is done across the enterprise. Thanks to the breakthroughs referenced above, this vision can include the use of natural language processing capabilities to ensure the entire workforce has rapid access to the information they need, from any language, translated into their own, and delivered so fast that it seems instantaneous. Client- or citizen-facing organizations should establish a vision that ensures all customers have access to the information they need to make decisions, in any language they prefer, using search methods as easy as asking a human assistant to retrieve the information.

Concluding Thought

There is a new wave of innovation underway that holds the potential to kick off many unforeseen positive impacts on humanity. This new wave of innovation is in the domain of computer understanding of human languages. The time to prepare for it is now.

For more reading on computer understanding of language

Efficient Estimation of Word Representations in Vector Space: This 2013 work enabled unsupervised training of models in a manner that maps semantic relationships between words into a multidimensional meaning space where the distance between concepts can be measured. This allows information retrieval to take into account the relationships between different words and sentences.

Attention Is All You Need: This 2017 paper describes the methods that enable transformer networks. Transformers rely entirely on attention mechanisms, capturing the advantages of recurrent neural networks while accounting for the sequential nature of language.

Open Sourcing BERT: State of the Art Pre-Training for Natural Language Processing: BERT introduced deeply bidirectional, unsupervised pre-training of language models, an approach that builds context in a way loosely analogous to how our brains process language.

SQuAD: 100,000+ Questions for Machine Comprehension of Text: The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. It can be used as a benchmark for machine comprehension of text and natural language processing capabilities.
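As a rough illustration of the extractive question answering that SQuAD measures, the sketch below assumes the Hugging Face transformers library and one of its publicly available SQuAD-trained models (distilbert-base-cased-distilled-squad); the passage and question are invented for illustration:

    from transformers import pipeline

    # "question-answering" pipelines perform SQuAD-style extractive QA:
    # the answer returned is a span copied from the supplied passage.
    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    passage = (
        "The Stanford Question Answering Dataset (SQuAD) contains over "
        "100,000 questions written by crowdworkers about Wikipedia articles."
    )

    result = qa(question="How many questions does SQuAD contain?",
                context=passage)
    print(result["answer"])  # expected to be a span such as "over 100,000"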


About the Author

Bob Gourley

Bob Gourley is an experienced Chief Technology Officer (CTO), Board Qualified Technical Executive (QTE), author and entrepreneur with extensive past performance in enterprise IT, corporate cybersecurity and data analytics. He is CTO of OODA LLC, a unique team of international experts that provides board advisory and cybersecurity consulting services. OODA publishes OODALoop.com. Bob has been an advisor to dozens of successful high-tech startups and has conducted enterprise cybersecurity assessments for businesses in multiple sectors of the economy. He was a career Naval Intelligence Officer and is the former CTO of the Defense Intelligence Agency.