On multiple occasions, OODA Network member Florian Wolf has presented to the membership on the topic of small data. An evangelist and subject matter expert on the topic, Wolf is the CEO of Mergeflow, a company he founded in 2007, where he is responsible for company strategy and product design. Wolf has a Ph.D. in Cognitive Sciences from MIT and is a former research associate in Computer Science and Genetics at the University of Cambridge. Some of his work at MIT was funded by DARPA, and he is a member of the Global Panel at MIT Technology Review. Mergeflow initially developed analytics software for hedge fund investors, including some of what today is called "alternative data" (news, blogs, and social media).
Small Data, at times conflated with tinyML, began to crop up in the OODA Loop Daily Pulse as early as October 2021, with the Scientific American analysis of "small data" for machine learning, as well as TechTarget's coverage of tinyML at the very edge of IoT.
To start, we should differentiate between the working definitions of small data and tinyML:
tinyML: An AI software design 'movement' led by the tinyML Foundation, which defines tiny machine learning as "a fast-growing field of machine learning technologies and applications including hardware, algorithms, and software capable of performing on-device sensor data analytics at extremely low power, typically in the mW range and below, and hence enabling a variety of always-on use-cases and targeting battery operated devices." (1)
Small Data: A working definition of small data is more complicated, but it differs from tinyML in that small data is not organized around a device endpoint and does not specifically require on-device performance concerned with low-power, "always-on," or battery-operated use cases.
From a systems design perspective, small data is a movement away from the assumption that all machine learning has "big data" as its cornerstone – that is, away from system designs that rely solely on large-scale, predominantly cloud-based datasets. There is a localization aspect to small data, but it differs greatly from that of tinyML in that it is not tied to a hardware-layer "endpoint."
Florian Wolf describes small data as "the ability of machine learning or AI systems to learn from small training data sets." Small data technologies include transfer learning and one-shot or few-shot learning. In transfer learning, you use a model trained with (lots of) data from one domain and transfer it to a different but related problem. One-shot or few-shot learning aims to learn from one or a few labeled data points. Typically, some form of prior knowledge is incorporated into one-shot or few-shot learning models.
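To make the few-shot idea concrete, here is a minimal sketch (ours, not Wolf's or Mergeflow's) of nearest-prototype classification, the intuition behind prototypical networks: with only three labeled examples per class, a new point is assigned to the class whose average embedding it sits closest to. All vectors below are synthetic stand-ins for embeddings from a pretrained model.

```python
import numpy as np

# Few-shot classification via class "prototypes" (the idea behind
# prototypical networks). Data is synthetic; in practice the vectors
# would come from a pretrained embedding model.
rng = np.random.default_rng(0)

n_shots, dim = 3, 16  # only 3 labeled "shots" per class
class_a = rng.normal(loc=0.0, scale=1.0, size=(n_shots, dim))
class_b = rng.normal(loc=3.0, scale=1.0, size=(n_shots, dim))

# A prototype is simply the mean embedding of a class's few examples.
prototypes = {"A": class_a.mean(axis=0), "B": class_b.mean(axis=0)}

def classify(x):
    """Assign x to the class with the nearest prototype."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# A query drawn near class B is classified correctly from just three
# labeled examples per class.
query = rng.normal(loc=3.0, scale=1.0, size=dim)
print(classify(query))  # "B"
```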
Scientific American provided this context as a working definition of small data:
“When people hear “artificial intelligence,” many envision “big data.” There’s a reason for that: some of the most prominent AI breakthroughs in the past decade have relied on enormous data sets. Image classification made enormous strides in the 2010s thanks to the development of ImageNet, a data set containing millions of images hand sorted into thousands of categories. More recently GPT-3, a language model that uses deep learning to produce humanlike text, benefited from training on hundreds of billions of words of online text.
So it is not surprising to see AI being tightly connected with “big data” in the popular imagination. But AI is not only about large data sets, and research in “small data” approaches has grown extensively over the past decade—with so-called transfer learning as an especially promising example.
Also known as “fine-tuning,” transfer learning is helpful in settings where you have little data on the task of interest but abundant data on a related problem. The way it works is that you first train a model using a big data set and then retrain slightly using a smaller data set related to your specific problem. For example, by starting with an ImageNet classifier, researchers in Bangalore, India, used transfer learning to train a model to locate kidneys in ultrasound images using only 45 training examples. Likewise, a research team working on German-language speech recognition showed that they could improve their results by starting with an English-language speech model trained on a larger data set before using transfer learning to adjust that model for a smaller data set of German-language audio.” (2)
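As a rough sketch of what such fine-tuning looks like in practice (illustrative only, assuming PyTorch and torchvision rather than the cited teams' actual code), the snippet below freezes an ImageNet-pretrained backbone and retrains only a small new classification head – which is why a few dozen labeled images can suffice:

```python
import torch
import torch.nn as nn
from torchvision import models

# Transfer learning: start from a ResNet pretrained on ImageNet,
# freeze the feature extractor, and retrain only a new 2-class head.
model = models.resnet18(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False  # freeze the pretrained backbone

# Replace the 1000-class ImageNet head (e.g., "kidney" vs. "no kidney").
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch; a real loader would supply the ~45 labeled images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))

model.train()
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()  # only the new head's weights change
```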
Scientific American authors Husanjot Chahal and Helen Toner go on to reference the Center for Security and Emerging Technology (CSET) October 2021 report, Small Data's Big AI Potential, which clarifies the subcategories of small data as a computer science methodology, broken down in terms of five rough categories of "small data" approaches:
Transfer Learning: As described above, transfer learning works by first learning how to perform a task in a setting where data is abundant, then "transferring" what it has learned to a task where much less data is available. This is useful in settings where only a small amount of labeled data is available for the problem of interest, but a large amount of labeled data is available for a related problem.
Data Labeling: A category of approaches that starts with limited labeled data but abundant unlabeled data. Approaches in this category use a range of methods to make sense of the unlabeled data available, such as automatically generating labels (automated labeling) or identifying data points for which labels would be especially useful (active learning).
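As a small illustration of the active learning variant (our sketch, not the report's), the snippet below trains a classifier on a handful of labels, then flags the unlabeled points it is least certain about as the best candidates to send to a human labeler; all data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Active learning by uncertainty sampling, on synthetic data.
rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(10, 4))            # 10 labeled examples
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_pool = rng.normal(size=(500, 4))              # 500 unlabeled examples

clf = LogisticRegression().fit(X_labeled, y_labeled)

# Uncertainty = how close the predicted probability is to 50/50.
uncertainty = np.abs(clf.predict_proba(X_pool)[:, 1] - 0.5)

# The pool points nearest the decision boundary are the most
# informative ones to have a human label next.
query_idx = np.argsort(uncertainty)[:5]
print("label these next:", query_idx)
```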
Artificial Data Generation: A category of approaches that seeks to maximize how much information can be extracted from a small amount of data by creating new data points or applying other related techniques. This can range from simply making small changes to existing data (e.g., cropping or rotating images in an image classification dataset) to more complex methods that aim to infer the underlying structure of the available data and extrapolate from there.
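A minimal sketch of the simple end of that spectrum, assuming an image classification task (the file path below is hypothetical): each labeled image yields several randomly flipped, rotated, and cropped copies.

```python
import random
from PIL import Image

def augment(img: Image.Image, n_copies: int = 4) -> list:
    """Return slightly altered copies of one labeled image."""
    copies = []
    for _ in range(n_copies):
        out = img
        if random.random() < 0.5:                    # random horizontal flip
            out = out.transpose(Image.FLIP_LEFT_RIGHT)
        out = out.rotate(random.uniform(-15, 15))    # small random rotation
        w, h = out.size
        dx, dy = int(0.05 * w), int(0.05 * h)        # crop ~90%, resize back
        out = out.crop((dx, dy, w - dx, h - dy)).resize((w, h))
        copies.append(out)
    return copies

# original = Image.open("labeled_example.png")       # hypothetical path
# training_set = [original] + augment(original)
```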
Bayesian Methods: A large class of approaches to machine learning and statistics that have two features in common. First, they try to explicitly incorporate information about the structure of the problem—so-called “prior” information—into their approach to solving it. (7) This contrasts with most other approaches to machine learning, which tend to make minimal assumptions about the problem in question.
By incorporating this "prior" information before improving further based on the available data, Bayesian methods are better suited to some contexts where data is limited but it is possible to write out information about the problem in a useful mathematical form. Second, Bayesian approaches focus on producing well-calibrated estimates of the uncertainty of their predictions. This is helpful in settings with limited data because Bayesian approaches to estimating uncertainty make it easier to identify the data points that, if collected, would be most valuable in reducing uncertainty.
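To ground both features – priors and calibrated uncertainty – here is a small worked example (ours, with invented numbers) using a conjugate Beta-Binomial model: a prior belief about a defect rate is updated with only ten observations, and the posterior yields a credible interval rather than a bare point estimate.

```python
from scipy import stats

# Prior: we believe the defect rate is low, around 5% (Beta(2, 38)).
alpha_prior, beta_prior = 2, 38

# Small data: 10 inspected parts, 1 defective.
n, defects = 10, 1

# Conjugacy makes the posterior another Beta distribution.
posterior = stats.beta(alpha_prior + defects, beta_prior + n - defects)

print(f"posterior mean defect rate: {posterior.mean():.3f}")
low, high = posterior.interval(0.9)  # calibrated 90% credible interval
print(f"90% credible interval: ({low:.3f}, {high:.3f})")
```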
Reinforcement Learning: Reinforcement learning is a broad term that refers to machine learning approaches in which an agent (the computer system) learns how to interact with its environment via trial and error. Reinforcement learning is often used to train game-playing systems, robots, and autonomous vehicles.
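As a toy illustration of learning from a reward signal instead of labeled data (our sketch, not from the report), the snippet below runs tabular Q-learning on a five-cell corridor; the agent discovers by trial and error that walking right reaches the goal.

```python
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # value table, learned from rewards
alpha, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0                                        # start at the left end
    while s != n_states - 1:                     # until the goal is reached
        # Explore on ties or occasionally; otherwise act greedily.
        if rng.random() < epsilon or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # Standard Q-learning update from the observed transition.
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)  # learned values favor moving right in every state
```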
What we are calling “small data” is not a clean-cut category, and therefore does not have a single, formal, agreed-upon definition. Academic writings discuss small data in relation to the application area under consideration, often tying it to the size of the sample, for instance, kilobytes or megabytes versus terabytes of data. (3)
Popular media articles attempt to describe small data in relation to varied factors like its usability and human comprehension, or as data that comes in a volume and format that makes it accessible, informative, and actionable, especially for business decisions. (4) Many references to data often end up treating it as an all-purpose resource. However, data is not fungible, and AI systems in different domains call for distinct types of data and distinct types of approaches depending upon the problem at hand. (5)
These categories…are imperfect…our aim in delineating the categories…[is] to give the reader a sense of some of the rough conceptual approaches that make it possible to train AI systems without large, pre-labeled datasets. The categories we use are not cleanly separable in practice, and they are neither mutually exclusive nor collectively exhaustive. (6)
“They assume that you don’t have three examples, but that you have a million examples. But try getting a million hand-labeled chest X-rays. Good luck.”
Daniel Pereira: How is Mergeflow doing?
Florian Wolf: Things are great for the company. We just expanded internationally and will be opening an office in the U.S. So thank you for asking. We are a bootstrapped operation, and things are going great.
Pereira: And OODA Loop? Since you joined the network, how has the community and the site been as a tool for you as a CEO?
Wolf: It is great. It is tough to find great thought leadership, but with the OODA Loop community, the value of what I read on OODA Loop is hard to beat.
Pereira: When did you become a member of the OODA Network?
Wolf: It had been on my radar for quite a while because, you know, when I was in Boston at MIT, I was in a DARPA-funded program. So the community was on my radar for a long time, but the real trigger was COVID, because before that all the meetings were in person. Right? And that was difficult. And then COVID came, and as bad as it was, one thing it really opened up is that it is so much easier to reach out globally now. It doesn't mean it replaces in-person meetings, but you can do a lot online. To me, COVID-19 was the trigger. But I am looking forward to meeting the OODA Loop community in person at OODAcon.
Pereira: Absolutely. So your “first contact” with small data? Was it something that you and the Mergeflow team put together and then sought out a community of practice? What is the backstory of your relationship to small data?
Wolf: So, well, small data is just a label. It might not even be the best label. In my experience, small data is the reality check for machine learning. It just came out of experience. What we saw all the time is that you talk to someone, and they say, "We have all this data. We need to learn something from this data. We need to analyze it. We need to build models. We need to do all kinds of stuff." And then you ask them, "Okay, so where is your data? How much data do you have?" And then it turns out, well, you cannot access the data for various reasons: it might be confidentiality. It might be politics. It might be technical reasons. It might be all kinds of things. Or you have a lot of data, but it is not homogeneous at all, because it is real-life stuff.
Small data is not necessarily about the amount of data, but it’s more about the homogeneity or lack thereof. When you see something in real life, it’s never homogeneous. And you have like three examples of one condition. And you have all those “academic algorithms” that people write about, but never really test against real-life problems. They assume that you don’t have three examples, but that you have a million examples. But try getting a million hand-labeled chest x-rays. Good luck.
“I think it’s important to always try to do the simplest, stupidest thing first before you move on to sophisticated things.”
Pereira: Right. So is there a community of practice forming around small data? Because tinyML has a dot-org that is pushing its perspective. And I am always looking for the innovation pattern or the inflection point – like ImageNet, deep learning, and the whole reorientation of machine learning by DeepMind, Geoffrey Hinton, Yann LeCun, and Andrew Ng. How are you thinking about small data in terms of where it is on a hype cycle or a trajectory of some sort?
Wolf: So, tinyML is something different. TinyML is about how you can do machine learning with as little energy as possible, which is extremely important if you consider just the electricity bill you would run up if you trained something like GPT-3. So it is clearly important, but I would say it is orthogonal to small data. But small data? I don't think there is a community or anything around it yet. And, to be honest, part of the reason I wrote a blog article about it some time ago was that I was just angry. The playbook you see all the time is: a big organization buys a lot of expensive toys to do machine learning and AI. They might even buy hardware, they buy tons of software, they get all those things, and then they notice: oh, we don't have the data.
And then they realize: we don't even know what problem we are trying to solve. And it turns out, oh man, now that we are thinking about the problem, it is not one problem. It is 10,000 problems. Or it is not the problem we thought it was, or it is much more complicated – or any of those things. To me, it is fascinating and shocking at the same time to see how often people bring huge machinery to shoot at a problem without first trying the simplest things.
Wolf: It is extremely easy to not do the simple things first. It might not even be machine learning. My favorite example is if you do anything with text, for example, or any sort of sequence data, have you tried regular expressions – which is the cockroach of algorithms in this space? It is very simple. It is not like you are going to make it onto the cover of Wired magazine if you use it. You don’t get headlines, but it might be very effective.
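A trivial sketch of the kind of thing Wolf means (the report line and field names below are invented): one regular expression pulls structured fields out of semi-structured text, with no model or training data at all.

```python
import re

# A made-up, semi-structured report line.
report = "UNIT: 3rd Battalion GRID: 38SMB4567 TIME: 0430Z 12OCT"

pattern = re.compile(
    r"UNIT:\s*(?P<unit>.+?)\s+"
    r"GRID:\s*(?P<grid>\w+)\s+"
    r"TIME:\s*(?P<time>\w+\s+\w+)"
)

match = pattern.search(report)
if match:
    print(match.groupdict())
    # {'unit': '3rd Battalion', 'grid': '38SMB4567', 'time': '0430Z 12OCT'}
```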
I remember when I was at MIT – when I first started working on this DARPA project – someone on the project told us that before that project, when they were at SRI, they were trying to build a system that analyzes military intelligence reports. They used sophisticated expert systems with ontologies and all kinds of stuff, and the thing just never worked, but they had to solve the problem. So they locked themselves in a room for a few days and built something with regular expressions. Of course, it did not solve all the problems. It was an 80/20 approach, but it was robust, and it solved a lot of the problems. And I think it is important to always try the simplest, stupidest thing first before you move on to sophisticated things.
“In almost every case I’ve seen, the labeled data just does not exist. Or the data just does not exist in general. If you try to build a system that does anything, for example if you want to analyze decision-making in an enterprise, there is no such thing as homogeneous data.”
Pereira: But let's place that in the context of the lean startup model and the customer development model. The problem solving you just described – where does that map in terms of going through the customer development model without leading with small data as "the product," the thing you are trying to market? Some management teams fixate on the latest business strategy term, to the point that you would think that is what they are marketing, but they are not doing the analysis of the "pain point": what they are trying to solve, where the value creation is, or whether they are entering an existing market or creating a new market, which is crucial. So what has been your experience with that?
Wolf: We started doing this out of necessity. When you first start out, no one will give you a budget of five million to try to solve something. You must start with a small proof of concept, and the budget, if you are lucky, might be $50,000. And you have to deliver something. And so you are desperate. You cannot do any of those fancy things. And then, you know, you lock yourself up in a room like at SRI – and you build a "try whatever works within a week" kind of model. Which is productive. And that is how it started, really. It is budgetary constraints, but not just budgetary constraints. In almost every case I have seen, the labeled data just did not exist.
Or the data just does not exist in general. If you try to build a system that does anything, for example if you want to analyze decision-making in an enterprise, there is no such thing as homogeneous data. I mean, every decision is different, right? It is not that you have the same decision over and over, which is what you would need for the more classical machine learning approaches.
“Look at financial risk modeling. If you assume that all this data is distributed in a Gaussian distribution, then you are…going to run into trouble because, in real life, it’s probably not.”
Pereira: That allows a smooth transition to a discussion of the CSET report on small data. You mention specifically that labeled data simply does not, and will not, exist for most ML projects. But, tell me if I am wrong: in the CSET report, they list high-level ML methodologies and use language suggesting these methodologies are specific to small data, which I found a bit confusing. These techniques are the building blocks of ML as a discipline, but the authors offer them as 'categories' of small data.
In your working definition of small data, and that of the Scientific American article, transfer learning is central. So, let’s clear that up: am I wrong in my reading of the CSET report? And then let’s discuss transfer learning and why it is robust for small data – and then go through the other methodologies and their applicability to small data.
Wolf: My reading of what they are trying to do in the CSET report on small data is to give you a toolbox for a situation where you do not have tens of thousands of examples of homogeneous labeled data. And then you have these different tools you can use which they mentioned, like data labeling, transfer learning, data generation, all those things.
And it really depends on the situation you are in which one is probably going to work. So for transfer learning, the classical example I guess now would be something like GPT-3 if you have a language problem that’s related to what the model does. There is GPT-3 and your chances of success are pretty good if you fine-tune the model. We have started using GPT-3 ourselves. The very expensive question, of course, is “what does ‘related problem’ mean”? Is it still related if you are trying to analyze protein structures, which you could think of as sequences of text as well? It could be, but I don’t know. You must try that out.
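For readers who want the shape of such fine-tuning, here is a minimal sketch using a small open model via Hugging Face's transformers library as a stand-in for GPT-3 (the model choice, labels, and four-example dataset are all our illustrative assumptions):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Fine-tune a pretrained language model on a deliberately tiny labeled set.
model_name = "distilbert-base-uncased"   # stand-in for a larger model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

data = Dataset.from_dict({
    "text": ["protein folding breakthrough", "quarterly earnings call",
             "new ligand binding assay", "stock buyback announced"],
    "label": [1, 0, 1, 0],               # 1 = biomedical, 0 = finance
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=32))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()  # the pretrained weights do most of the work
```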
Then data labeling. Let’s say you need a thousand examples of something – and I guess there is probably overlap between data labeling and artificial data generation or synthetic data – you can use a model to generate training data and then, as a human, go through the results and edit the ones that the model got wrong.
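Sketched in code (ours, with synthetic data), that workflow might look like the following: a model trained on a few hand labels proposes labels for the rest, confident predictions are accepted, and the remainder is routed to a human editor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_seed = rng.normal(size=(20, 4))                       # 20 hand-labeled
y_seed = (X_seed[:, 0] + X_seed[:, 1] > 0).astype(int)
X_pool = rng.normal(size=(1000, 4))                     # 1,000 unlabeled

clf = LogisticRegression().fit(X_seed, y_seed)
confidence = clf.predict_proba(X_pool).max(axis=1)

auto = confidence >= 0.9     # accept confident pseudo-labels as-is
to_review = ~auto            # send the rest to a human for editing
print(f"auto-labeled: {auto.sum()}, needs human review: {to_review.sum()}")
```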
This assumes that you have a very good understanding of your problem domain. Because if you don’t have that – and if the assumptions that you and your model make are different from what is going on in your problem – then you have a problem. For example, look at financial risk modeling. If you assume that all this data is distributed in a Gaussian distribution, then you are probably going to run into trouble because, in real life, it’s probably not. It’s something different.
An example of a Gaussian distribution: you know that there are no humans who are a hundred feet tall. Human height behaves like a Gaussian, so you can use that as a reasonable assumption. But if you use that assumption to model, for example, hedge fund blow-up risks, you might be in for a surprise, because in that domain you can have things that are the equivalent of meeting someone who is a thousand feet tall. You know – the Black Swan events. You do not have those with people. But you do have them in finance – and in many other domains.
Or if you are trying to model the structural integrity of an airplane wing – that is a different distribution as well, perhaps a Weibull distribution. So you must know what domain you are in to be able to use different kinds of approaches in a way that makes sense. And then the question becomes, "How do you easily generate data without having to do everything manually?" That is something you can try to solve with those other small data ML approaches.
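A quick worked example of why the distributional assumption matters (our illustration, with arbitrary numbers): the probability of an extreme "six-sigma" event under a Gaussian versus under a heavy-tailed Student-t distribution differs by roughly six orders of magnitude.

```python
from scipy import stats

sigma_level = 6.0  # an extreme, "thousand-foot-tall person" event

p_gaussian = stats.norm.sf(sigma_level)   # P(X > 6 sigma), Gaussian
p_heavy = stats.t(df=3).sf(sigma_level)   # same threshold, fat tails

print(f"Gaussian tail probability:   {p_gaussian:.2e}")  # ~1e-9
print(f"Student-t (df=3) tail prob.: {p_heavy:.2e}")     # ~5e-3

# The heavy-tailed model puts millions of times more probability on
# the Black Swan event that the Gaussian effectively rules out.
```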
“…with reinforcement learning, if you can think of your problem as some kind of game where you have a clear definition of winning and losing that is independent of labeling data, and also the rules of the game, you can probably use reinforcement learning. It is like learning how to play Pong or Go…”
Defining these approaches comes with the caveat that this field develops rapidly and things keep changing. But I think with reinforcement learning, if you can think of your problem as some kind of game where you have a clear definition of winning and losing that is independent of labeling data, and you also have the rules of the game, you can probably use reinforcement learning. It is like learning how to play Pong or Go. For example – and I am not sure if I am hopelessly oversimplifying this – if you try to guide a spacecraft through re-entry into the atmosphere, I think there is a clear definition of winning and losing in that case, which is: if you win, you make it; if you lose, you burn.
Wolf: I don't have the subject matter experience in this area, so I might be completely wrong, but from the outside it looks to me like this could be a reinforcement learning problem – where you understand what the permissible parameter ranges are, you are measuring all kinds of temperatures and speeds and that kind of thing, and you are running through the simulation repeatedly to see where you have to be so that you "win."
Isn't that what they did with the DARPA AlphaDogfight Trials, where an AI pilot beat a human fighter pilot? I think it is something like that. And when I read about that simulation, everyone was surprised the machine could do this, but it was not surprising to me, because that sounded like a reinforcement learning problem. It is like a game too. You win or you lose. It is like learning how to play Pong or Go.
“…you would have to hand label all these combinations, which is brutal. It is the equivalent of modeling every Go or Chess move – which is impossible.”
Pereira: So let's tease that out a bit. For the spacecraft re-entry example, the big data set you are "hitting" initially is some massive modeling data set on the NASA back-end, for example, and then the team looks at that and says, "Here is our system design for the 'reinforcement learning as small data' that is specific to our problem set." One doesn't need to assume that, because you are using a reinforcement learning model as the big data set, each of the small data sets must be reinforcement learning. You could apply any of these ML approaches, or categories as described by the CSET report authors, to that small data set?
Wolf: For the spacecraft re-entry example, I think it’s a bit different. I think that, for example, if you think about this “guiding a spacecraft through re-entry into the atmosphere problem” as a data labeling problem, you would create data sets that use, for example, “Sensor X has to be at X temperature range for this win/lose thing to work”, and you would have to hand label all these combinations, which is brutal. It is the equivalent of modeling every Go or Chess move – which is impossible.
So you think about the problem very differently. You don’t even think about it in terms of labeled data or small data. You think about it as “here are the parameters, the rules of the game for chess, for example: which piece can make which kinds of movements?” The combinations are infinite. But the rules are not. So you give the system the rules and there is no data labeling at all.
And I am saying this with caution because I don’t know enough about spacecraft re-entry. But it seems to me you can have a set of rules, like the temperatures inside the spacecraft cannot be a thousand degrees.
“…the examples I know with Bayesian models are really trying to explain how humans learn. How does a baby figure out things using basic physics principles, like how things fall, or the difference between solid and liquid things?”
Pereira: But of the conceptual abstractions that we could just choose for this discussion, it serves a great purpose here to focus on the structural challenges of data.
Wolf: It's really just different ways of thinking about a problem. And with Bayesian methods vs. reinforcement learning, people will probably think I am crazy if I say these are related, but I think there is some relation there. The Bayesian approach would be: you know something about the world, and you are trying to use that knowledge for learning. Take gravity, for example: things fall down, and you can use that knowledge to reduce your problem space dramatically, because you never have to consider situations where things don't fall – or fall upward. You don't have to consider that because you have that knowledge.
I am not familiar with current production models that use this approach, but I just might not be aware of them. The examples I know with Bayesian models are really trying to explain how humans learn. How does a baby figure things out using basic physics principles, like how things fall, or the difference between solid and liquid things? Once you have those kinds of concepts – like the concept of something that is solid versus something that is liquid – you can use them to explain a lot of things. If something falls onto something solid, it will behave differently than if it falls onto something liquid. And this is your prior knowledge that you use for learning in Bayesian models.
Big data, small data, artificial intelligence, machine learning, and deep learning will be discussed at OODAcon 2022 – The Future of Exponential Innovation & Disruption – on the following panels:
Disruptive Futures: Digital Self Sovereignty, Blockchain, and AI
Fireside chat with Futurist and Author Karl Schroeder
You are big data. Every day the technology you own, use, and otherwise interact with (often unintentionally) collects rich data about every element of your daily life. This session provides a quick overview of how this data is collected, stored, and mined but then shifts direction to look at what technologies might empower users to better collect, access, and authorize the use of their own data through blockchain, digital autonomous corporations, and smart contracts.
Swimming with Black Swans – Innovation in an Age of Rapid Disruption
Dawn Meyerriecks, Former Director of CIA Science and Technology Directorate
If Yogi Berra were to evaluate today’s pace of global change, he might simply define it as “the more things change, the more they change”. Are we living in an exponential loop of global change or have we achieved escape velocity into a “to be defined” global future? Experts share their thoughts on leading through unprecedented change and how we position ourselves to maintain organizational resiliency while simultaneously reaping the benefits of new technologies and global realities.
The Future Hasn’t Arrived – Identifying the Next Generation of Technology Requirements
Neal Pollard, Former Global CISO at UBS | Partner, E&Y
Bobbie Stempfley, Former CIO at DISA | Former Director at US CERT | Vice President at Dell
Bill Spalding, Associate Deputy Director of CIA for Digital Innovation
In an age when the cyber and analytics markets are driving hundreds of billions of dollars in investments and solutions is there still room for innovation? This panel brings together executives and investors to identify what gaps exist in their solution stacks and to define what technologies hold the most promise for the future.
Postponing the Apocalypse: Funding the Next Generation of Innovation
What problem sets and global risks represent strategic investment opportunities that help reduce those risks, but also ensure future global competitiveness in key areas of national defense? This session will provide insights from investors making key investments in these technologies and fostering future high-value innovation.
Open the Pod Bay Door – Resetting the Clock on Artificial Intelligence
Mike Capps, CEO at Diveplane | Former President at Epic Games
Sean Gourley, CEO and Founder at Primer.AI
Artificial intelligence is like a great basketball headfake. We look towards AI but pass the ball to machine learning. But, that reality is quickly changing. This panel taps AI and machine learning experts to level-set our current capabilities in the field and define the roadmap over the next five years.
To register for OODAcon, go to: OODAcon 2022 – The Future of Exponential Innovation & Disruption
Join us here at OODA Loop in our exploration of the future of small data. Florian provided the following formative reading list as a start:
CSET Report, “Small Data’s Big AI Potential”: https://cset.georgetown.edu/publication/small-datas-big-ai-potential/
Mergeflow, “Small data: Machines that learn more from less” (authored by Florian): https://scope.mergeflow.com/small-data/
Tech Xplore, “A math idea that may dramatically reduce the dataset size needed to train AI systems”, https://techxplore.com/news/2020-10-math-idea-dataset-size-ai.html
MIT Technology Review, “A radical new technique lets AI learn with practically no data”, https://www.technologyreview.com/2020/10/16/1010566/ai-machine-learning-with-tiny-data/
Tokyo Tech News, “Successful application of machine learning in the discovery of new polymers”, https://www.titech.ac.jp/english/news/2019/044593
MIT News, “Bridging the gap between human and machine vision”, https://news.mit.edu/2020/bridging-gap-between-human-and-machine-vision-0211
Wired, “Deepfakes Are Getting Better, But They’re Still Easy to Spot”, https://www.wired.com/story/deepfakes-getting-better-theyre-easy-spot/
Josh Tenenbaum, long (1h 35min): “MIT AGI: Building machines that see, learn, and think like people”, https://www.youtube.com/watch?v=7ROelYvo8f0&ab_channel=LexFridman
Josh Tenenbaum, short (5min): “The mathematics of natural intelligence”, https://www.youtube.com/watch?v=N3hUOJ-rUzQ&ab_channel=WorldEconomicForum
All of the following papers are open access on arxiv.org:
“GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain” https://arxiv.org/abs/2109.02555: Transfer learning has its limits. The example here is transferring a model from a more-general domain to a more-specific biomedical domain.
“Few-shot Decoding of Brain Activation Maps” https://arxiv.org/abs/2010.12500: Using a few-shot approach for learning how to tell the cognitive state of a person by looking at their brain activity.
“ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition” https://arxiv.org/abs/2104.03841: A database for few-shot-teaching machines (such as robots) how to recognize various objects across various conditions (lighting conditions, for instance).
“A Comparison of Few-Shot Learning Methods for Underwater Optical and Sonar Image Classification” https://arxiv.org/abs/2005.04621: They argue that for their task, few-shot learning worked better than transfer learning.
“Towards Few-Shot Fact-Checking via Perplexity” https://arxiv.org/abs/2103.09535: Transfer learning for fact-checking claims related to COVID-19.
(3) Chi Chen and Shyue Ping Ong, “AtomSets – A Hierarchical Transfer Learning Framework for Small and Large Materials Datasets,” arXiv preprint arXiv:2102.02401 (2021), https://arxiv.org/pdf/2102.02401.pdf; H. James Wilson and Paul R. Daugherty, “Small Data Can Play a Big Role in AI,” Harvard Business Review, February 17, 2020, https://hbr.org/2020/02/small-data-can-play-a-big-role-in-ai; Rafael S. Pereira, Alexis Joly, Patrick Valduriez, and Fabio Porto, “Hyperspherical embedding for novel class classification,” arXiv preprint arXiv:2102.03243 (2021), https://arxiv.org/pdf/2102.03243.pdf.
(4) Ahmed Banafa, “Small Data vs. Big Data: Back to the Basics,” BBVA OpenMind, July 25, 2016, https://www.bbvaopenmind.com/en/technology/digital-world/small-data-vs-big-data-back-to-the-basics/; “What is small data (in just 4 minutes),” Wonderflow Blog, April 1, 2019, https://www.wonderflow.ai/blog/what-is-small-data; Ben Clark, “Big Data vs. Small Data – What’s the Difference?,” iDashboards, December 19, 2018, https://www.idashboards.com/blog/2018/12/19/big-data-vs-small-data-whats-the-difference/; Priya Pedamkar, “Small Data vs Big Data,” Educba, accessed August 2021, https://www.educba.com/small-data-vs-big-data/.
(5) Husanjot Chahal, Ryan Fedasiuk, and Carrick Flynn, “Messier than Oil: Assessing Data Advantage in Military AI” (Center for Security and Emerging Technology, July 2020), https://cset.georgetown.edu/publication/messier-than-oil-assessing-data-advantage-in-military-ai/.
(7) Please note that non-Bayesian methods may also incorporate information about the structure of the problem. However, Bayesian methods have another advantage relevant to small data: well-calibrated uncertainty.