

In Part 1 of this interview, we checked in with Mergeflow CEO and OODA Network Member Florian Wolf about all things machine learning and small data.

In Part 2, we continue the conversation with Florian across a wide range of topics: the real-life lessons of machine learning innovation, the centrality of data, problem-solving over raw performance as the competitive advantage in developing ML systems, leading a team, what keeps him up at night, what he is tracking, and what he is excited and hopeful about for the future.

“The idea is that, rather than having to brainstorm with a bunch of people about what the structure of your topic could be like, to get started, you can use something like this AI-based mindmap generator.  That is where we use it in real life.”

Daniel Pereira:  At Mergeflow, what has been a use case or deliverable using small data approaches that would be of interest to OODA Loop members?

Florian Wolf: The most recent one, which we just finished building last week, we call the mindmap generator. I can show you how it works. The idea is this: imagine you’re searching for a topic, and you want a breakdown of the topic, examples, categories that it falls into, and so on.

1 – Florian’s demo of Mergeflow’s Mindmap Generator search for Transfer Learning

Wolf: Okay, so this is our software. Let’s use “transfer learning” as an example. I type in “transfer learning.” Then I start the mindmap generator, which uses GPT-3 as a backbone; we’ve built fine-tuned models on top of it. And this is what it produces for my query. It gives me a definition or a description of what it is, and a few other things, for example, that transfer learning falls under the categories of machine learning and artificial intelligence. It gives you examples of companies. That is all generated by this model.

You can do this for all kinds of stuff. Like agricultural drones, for example; I’m just thinking of this right now, just so you see it is not really domain-specific or anything. The idea is that, rather than having to brainstorm with a bunch of people about what the structure of your topic could be, to get started you can use something like this AI-based mindmap generator. That is where we use it in real life.

2 – Florian’s demo of Mergeflow’s Mindmap Generator search for Agricultural Drones
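For readers who want to see the mechanics, here is a minimal sketch of how a mindmap generator along these lines might call a large language model. Mergeflow’s fine-tuned models and prompts are not public, so the model name, prompt templates, and category list below are illustrative stand-ins, written against the OpenAI completions API as it existed at the time of this conversation.

```python
import os

import openai  # pre-1.0 openai-python client, current when this interview took place

openai.api_key = os.environ["OPENAI_API_KEY"]

# Each bold-faced mindmap category would be served by its own fine-tuned model.
# Those models are not public, so a stock GPT-3 model with a plain prompt per
# category stands in for them here.
CATEGORY_PROMPTS = {
    "description": "Write a one-paragraph description of the technology topic '{topic}'.",
    "categories": "List the broader fields that the topic '{topic}' falls under.",
    "companies": "List companies that are active in the area of '{topic}'.",
}


def generate_mindmap(topic: str) -> dict:
    """Generate one branch of text per mindmap category for the given topic."""
    branches = {}
    for category, template in CATEGORY_PROMPTS.items():
        response = openai.Completion.create(
            model="text-davinci-002",  # stand-in for a domain-specific fine-tune
            prompt=template.format(topic=topic),
            max_tokens=150,
            temperature=0.5,
        )
        branches[category] = response.choices[0].text.strip()
    return branches


if __name__ == "__main__":
    for branch, text in generate_mindmap("transfer learning").items():
        print(f"## {branch}\n{text}\n")
```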

Pereira:  So, specific to a small data approach, break down the small data design elements of what you are showing us.

Wolf: So how is this small data? This is transfer learning. It is transfer learning because we are using this huge language model, GPT-3, from OpenAI, and we built our own fine-tuned models on top of it that are specific to this domain. We could have built something else. We could have built something that authors sarcastic poems.

“No – people did not write this. This is generated on the fly by a machine.”

Pereira: So, again, the maddening default behavior is that no one ever evolved a request beyond leveraging GPT-3 as a big dataset from the cloud: no data sub-stratification, no instinct for data structure. Clients want to just source the big model generated from a huge dataset. And, again, they would point to their wealth of data which, upon further engagement, was either non-existent or encumbered by access issues.

Wolf: Right. And in this case, what we did is, for each of the different bold-faced categories, we built a different small data model. For example, for the description, you must train a model that writes descriptions, and for the companies, you must build or fine-tune a model that produces company names relevant to a certain technology. So that’s how you build and design with small data.
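To make the “one small model per category” idea concrete, here is a hedged sketch of preparing fine-tuning data for a single category, using the prompt/completion JSONL format and CLI that OpenAI’s fine-tuning service used in 2022. The example records and company names are invented for illustration; they are not Mergeflow’s training data.

```python
import json

# A handful of curated examples for ONE category ("companies"). The small,
# carefully designed dataset is the "small data" part; the pretrained base
# model contributes general language ability via transfer learning.
training_examples = [
    {"prompt": "Technology: computer vision\nCompanies:",
     "completion": " Clarifai, Scale AI, Roboflow\n"},
    {"prompt": "Technology: agricultural drones\nCompanies:",
     "completion": " DJI, PrecisionHawk, DroneDeploy\n"},
    # ...a few hundred examples is often enough for a narrow task
]

# OpenAI fine-tuning (2022-era API) expected JSON Lines: one
# {"prompt": ..., "completion": ...} object per line.
with open("companies_finetune.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")

# The file was then submitted with the 2022-era CLI, for example:
#   openai api fine_tunes.create -t companies_finetune.jsonl -m davinci
# which returns a fine-tuned model ID you can call like any other model.
```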

Wolf:  And the outcome is surprisingly human.  It’s never perfect, but humans aren’t either.

Pereira: With your cognitive science background, I do not take that insight lightly. Machine learning moved away from the “replicating human behavior” framework into deep learning and the evolving model of “ML models-as-a-service” with things like GPT-3. So the fact that you, as a cognitive scientist, are returning to describing the results of small data as “human-like” is really interesting.

Wolf: It probably does something in the background that is nowhere near the way humans process or generate language. But the outcome is impressive. When I first showed this to people, they asked me for how many topics we had generated those descriptions, because they were assuming that people had written them. I have to tell them, “No – people did not write this. This is generated on the fly by a machine.”

“Throw data at the problem or throw an algorithm at the problem. What does that even mean?”

Pereira: Is small data adoption an opportunity for advantage at speed and scale?

Wolf: I think the opportunity for advantage is that small data is not a business model; small data is a way of thinking, right? And the small data way of thinking is not about methods, but about the problem you’re trying to solve. Try to know what you’re doing. Be reasonable, in the true sense of the word. That’s it, really.

In terms of scale and speed, yes, I would say it helps you make progress toward both, because you can start by doing some amazingly simple, basic stuff first rather than building up a huge infrastructure for big data computation, whatever that might be. The hard part is that you must really think through your problem. That is uncomfortable, and very often it is also seen as mundane.

And I guess the frustration that I mentioned  earlier in the conversation came after I heard for the umpteenth time someone say something like “Throw data at the problem or throw an algorithm at the problem.”  What does that even mean?

“…even a broken clock is right twice a day…it is exactly that type of thing.”

Pereira: It seems like we are really running the risk of being “stuck on stuck”: social media algorithms are running riot while, at the same time, we put algorithms on a pedestal as a cure-all. That is the space we are in, and that is a problem.

Wolf: You cannot just “use some data”, because you are always going to get some outcome if you run some algorithm over some data. But if you don’t understand what made that outcome come about, then you have not really made much progress. And so I think the real opportunity for advantage is that if you really understand your outcomes and how you got there, you are going to be one of the very few people doing that. Because everyone else is using the “throw some stuff at something else” playbook.

Pereira: There is all this talk of big data, but when you are in the trenches, it is “no data.” What you are saying is very similar to the message Landing AI Founder and CEO Andrew Ng has been delivering for the last year or two about the importance of data structure over everything else. Ng describes it as a movement from model-centric AI to data-centric AI.

I think there is overlap with the conclusions Ng has reached, and with the way he has shifted gears in terms of the business model of his current company. He is also talking a lot about data prep.

So, let’s talk process. You and your team walk into the conference room with a blank whiteboard, and you choose, either through client request or contextual necessity, the big data you will be using. Let’s say you’re going to lean on GPT-3. Then, in terms of process, you take these five ML methodologies, and then what? Would you and the team click through each and talk about what you are going to fine-tune?

Wolf: We would do other things that are even a lot simpler first, though. Like simple string matching, simple counting of stuff, simple statistics, the most basic stuff. For example, imagine you had a language where 90% of the words you see are the word “the”, and your task is to predict the next word. If you just always predicted “the”, without any model, you would be 90% accurate as a baseline.

Pereira: … even a broken clock is right twice a day…

Wolf: …it is exactly that type of thing. So you must do those simple things first. Make sure you understand what the baseline is. Because you’ll look stupid if you use all the sophisticated machinery and then someone shoots you out of the pool with something as simple as that.
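Wolf’s “always predict ‘the’” example is easy to make concrete. This minimal sketch, using an invented toy corpus, computes the model-free majority baseline that any real model must beat:

```python
from collections import Counter

# Toy corpus standing in for Wolf's skewed language: ~90% of tokens are "the".
corpus = ("the " * 9 + "cat ") * 100
tokens = corpus.split()

# Model-free baseline: always predict the single most frequent word.
most_common_word, _ = Counter(tokens).most_common(1)[0]

# Score the baseline on next-word prediction over the whole corpus.
hits = sum(1 for next_word in tokens[1:] if next_word == most_common_word)
accuracy = hits / (len(tokens) - 1)
print(f"Always predicting {most_common_word!r}: {accuracy:.0%} accuracy")
# Prints roughly 90%: sophisticated machinery has to beat this number first.
```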

“What really makes a difference is how you think about your problem.”

Wolf: It’s like you are trying to build a small cabin in the woods, and recently you’ve read all these things about how they build skyscrapers. So now you’re thinking: oh, I need this tool; I need this crane; I need this and that and all the other things they used to build a skyscraper. But at some point you realize: I am trying to build a cabin. None of that is applicable.

It always surprises me when you talk to people, especially those who come out of academia having studied machine learning. They are often obsessed with algorithms and those kinds of things, but they never look at raw data because it is mundane. But you must look at the data, and you must start with the data.

You must really get your hands dirty with doing that kind of stuff. Forget about the algorithms at first, because that’s not really going to make the biggest difference. What really makes a difference is how you think about your problem.

“…very often, the difference in performance between algorithms is not really that great…where you really make a difference is when you understand the problem you’re trying to solve.”

Pereira:  Let’s shift gears into geopolitical futures: should small data be more firmly integrated into strategic thinking on innovation and national competitiveness?

Wolf:  Absolutely.  Because if you do the thing that the usual suspects always do for the millionth time, which is get the big machinery, get someone who knows this and that algorithm, but you don’t think about the problem – then you are doing exactly what everyone else is doing.

I forgot who it was. Someone said, in one of the OODA Network member meetings, that they don’t think that what differentiates their company is the algorithms they use, but the way they deal with data or label data. And I was thinking:  that is exactly the point because the algorithms can be pretty interchangeable. And, very often, the difference in performance is not really that great. But where you really make a difference is when you understand the problem you’re trying to solve.

“…to put a model into enterprise-scale production…you must deal with all these integrations and all the things you encounter in real life. Like this controller cannot talk to that one over there, all those things.”

Pereira: This is interesting because, between Eric Schmidt, the Special Competitive Studies Project (SCSP), and America’s Frontier Fund, the private sector really has its sleeves rolled up in a historically unprecedented way. I’m reminded that Silicon Valley has historically flinched on the deep tech stuff, and it really was as simple as not wanting to take on the risk of the CAPEX these projects require.

It is like NASA versus the space industry ecosystem now. Silicon Valley has not been willing to get into the $10B to $100B range; that was left to the government. Now, Eric Schmidt is arguing that the private sector must play a role in the capitalization of things like carbon capture innovation, for example. So they want to “go big” now, but, consistent with the fixation on “big data”, will a lot of value be left on the table for solutions at the small to medium scale?

Wolf:  Let me give you another example:  There is an organization called Wellcome Leap which I find interesting. It was started by Regina E. Dugan, who was head of DARPA previously, and they are trying to do “DARPA for Life Sciences”.  DARPA does Life Sciences too, of course.

And they do things like: imagine you could do a clinical trial within some crazy timeframe, like four weeks rather than the six months or whatever it takes now. Even if you don’t reach the crazy timeframe, you will probably still substantially reduce the time a clinical trial takes. My understanding is that the SCSP will be similar: they pick a big problem, like trying to reduce CO2 emissions in steel production by 50%, for example. I don’t know about this specific problem; I just made it up as an example.

So, if you pick a target like that, you are necessarily going to run into all these other things, including small data. It is just going to happen. I remember a conversation I had with an investor in DeepMind, which was sold to Google, and one of the most famous DeepMind case studies was about energy savings in Google data centers.

They told me that getting the initial model working, which was not implemented in a real data center, took a team of maybe five or so people a few weeks. But putting that model into production, so that you can really run it in the data center, took two hundred people eight months, because then you must deal with all the integrations and all the things you encounter in real life. Like, this controller cannot talk to that one over there, all those things. So I think this is similar to the things SCSP does.

“Let’s look at data ownership.  If a person says “the world is flat” and the second person says “I don’t think so”, this phrase, “I don’t think so”, who does that belong to?”

Pereira: But back to Andrew Ng’s concern that there is a bigger challenge with data ahead. Let’s move away from the big data/small data dichotomy and get to the question of what else you are tracking or concerned with in how data is discussed, or issues of “data writ large.” What is blipping on your radar screen?

Wolf: One thing, when you think about transfer learning or, really, any kind of learning, is how all these approaches are somehow domain-specific. That is not an easy question to answer. “Domain-specific” is getting wider. With language models, for example, it used to be that if you could do something for newspaper texts versus, I don’t know, operating instructions for machinery, you would have to start all over again with your model.

That’s not necessarily the case anymore, but then the question is how far can you extend the domain? And how could you have an algorithm that knows when it breaks down? Like some sort of self-awareness, like “I’m out of my area of expertise here.” If an algorithm could detect that, that would be fantastic. So that is something that I find interesting.
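There is no settled answer to Wolf’s question, but one crude, standard proxy for this kind of self-awareness is abstention: refuse to answer when the model’s top predicted probability is low. The sketch below illustrates the idea on an invented two-class text task; the tiny dataset, classifier, and threshold are all illustrative choices, not a production recipe.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented in-domain task: tech news vs. operating instructions.
texts = [
    "startup raises funding for cloud platform",
    "chipmaker announces new processor line",
    "press the reset button and hold for five seconds",
    "tighten the bolts before starting the machine",
]
labels = ["news", "news", "instructions", "instructions"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Self-awareness proxy: if the top-class probability is low, the input is
# likely outside the model's "area of expertise", so abstain instead.
THRESHOLD = 0.6


def predict_or_abstain(text: str) -> str:
    probs = model.predict_proba([text])[0]
    top = probs.max()
    if top < THRESHOLD:
        return f"out of domain, abstaining (top probability {top:.2f})"
    return f"{model.classes_[probs.argmax()]} (top probability {top:.2f})"


print(predict_or_abstain("vendor announces new cloud processor"))
print(predict_or_abstain("the soup needs more garlic and thyme"))
```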

So that’s the first thing. The second thing is the unpredictable. I remember a time when I thought, like many other people, that machine translation would never reach near-human levels. But when you look at machine translation now, that is clearly wrong: it is near-humanlike. So that’s the second thing: what is out there that I just don’t know about? There’s a great book by Eugene Charniak called Introduction to Deep Learning. In the preface to the book – he is a professor of machine learning at Brown University, Ph.D. from MIT, one of the best people in the field – he talks about how he was caught by surprise by deep learning. He didn’t expect it to work that well, and many other people didn’t either. It caught a lot of people by surprise. So big developments might happen sooner than we think.

And the third thing I think about is regulation, particularly regulations with unintended bad consequences. Like anything that has to do with privacy, data ownership, or the ethics of AI. There is a lot of stuff going on in the EU that I am really worried about.

For example, let’s look at data ownership.  If a person says “the world is flat” and the second person says “I don’t think so”, this phrase, “I don’t think so”, who does that belong to?  It refers to what the other person said, so it depends on that context. Does the first person then get “partial ownership” of “I don’t think so”?  So, regulations in that space are something that worries me.

“What do you do if, for some reason, you simply cannot access any data anymore?  And that is a possibility.  You can regulate yourself into that kind of a risk.”

Pereira: Regulation and privacy. Thank you for that. Those three answers were remarkably specific, and fascinating. I just put in the chat the book Genius Makers by Cade Metz. It is the whole origin story of deep learning: when Hinton met this guy, then met that guy, the whole early phase of DeepMind. The book is about the players and their personal stories. So it is a really human narrative about people grappling with the emergence of a transformative idea and the impact of that idea on their lives, the market, and the AI discipline.

O.K.  With all the challenges ahead, what keeps you up at night?

Wolf:…I think it might be regulation…

Pereira:…regulation and privacy…

Wolf:  What do you do if, for some reason, you simply cannot access any data anymore?  And that is a possibility.  You can regulate yourself into that kind of risk.

“…there are people who say, “I want to own my data” and I’d ask them, “What do you mean, exactly?””

Pereira: Matt and I have discussed OODA Loop research on digital sovereignty, specifically using the speculative design process to ask: “What would a digital sovereignty platform-as-a-service or managed service look like?” The core design principle is ownership of data by the individual. I then did a literature review of some things that Matt had captured recently in his Global Frequency newsletter, and the discourse has shifted to digital sovereignty as an extension of nation-state sovereignty. It is only, I swear to you, in the last few weeks that the phrase has been used in the context of the nation-state.

Our working assumption for our research continues to be:  What would a blockchain-based platform look like that is designed for personal data ownership?  What would the stack look like? What are the cybersecurity implications?

These issues of nation-state digital sovereignty had not even blipped on our map until now and I was surprised to see that the nation-state regulatory bodies had poached the phrase.  The EU and some other regulatory bodies are talking about digital sovereignty in terms of localization, the provenance of local storage, and other issues, which really muddy the waters.

Wolf: To me, this would be a serious step backward, and it would also deny the complexity of the issue. Particularly here in Europe, there are all kinds of people who say, “I want to own my data”, and I’d ask them, “What do you mean, exactly?”

I then give the example I just gave you where someone says something and someone else gives a response. Who owns the response  – because the response is not a standalone data point. It refers to something else. How do you deal with that? If you think about a conversation, do I own the half that I said, and you own the half that you said?  It doesn’t make sense.

Pereira: The disservice of the Web3 marketing and branding is that these policymakers cannot separate the promise of blockchain from the turbulence and creative destruction of crypto. They are spooked. They also have to prioritize digital asset and digital currency policy, as the traditional economic framing of value capture and storage is really where their self-interest lies in the end. They do not realize, or simply do not care, that blockchain provenance extends to the future of the individual’s relationship to the dataset as well. It does not start or stop at the nation-state level.

“…everyone knows we cannot be perfect, but we should at least try to do the best we can.  Even if it means that you must tell someone no sometimes, right? – which is hard to do.”

Pereira: On to the next question. Former GE CEO Jeff Immelt tells a story about managing a high-performance team. He learned that you cannot come in frequently and throw the gauntlet down simply because you are the CEO. Immelt felt he could do that maybe twice a year: walk into the room and say, “Nothing is working. This is what’s going to happen.” With that in mind, what have you built into the foundation of Mergeflow, from a risk and data strategy perspective, so that you don’t have to come “top-down” as the CEO to fix something?

Wolf: It is: try to understand what you are doing. That goes a long way. You must really think through the problem you’re trying to solve. You must think in terms of time scales, cost, all kinds of resources, all kinds of things. Try to understand. Do not do black boxes. And I say “try to understand” because, when you think about the cybersecurity aspect, for example, everyone knows we cannot be perfect, but we should at least try to do the best we can. Even if it means that you must tell someone no sometimes, right? Like, can we do this quickly? And you might have to say no, because if we did that, we would have to cut corners, and we cannot do that. Which is hard to do.

“That is the stuff that we humans do. We know what context is and what is appropriate for one context versus others.”

Pereira:  Strategically, culturally, tell me if I’m wrong, but that means you are not structured around DevOps and Agile? You are a machine learning research shop first, and you must give the team the leeway to understand what they are doing.  As a company, you must have a completely innovative organizational structure.  What do you think about DevOps and Agile from that perspective?

Wolf: The way I see these terms used a lot is that people seem to think “agile” is the equivalent of “fast”, and so agile is trying to do the same stuff you used to do – but just faster. But that is not it.  There is a quote by an Italian writer, Italo Calvino:  “Lightness goes with precision and determination, not with vagueness and the haphazard.” I like that.  To me, that’s the perfect definition of agile.  “Agile” means “you are in control”. And when you’re in control, things become “light”.

Pereira: I totally get that quote, and I think a lot of the readers are going to get it too. It reminds me of the intensity and clarity of insight of French theorist Paul Virilio, who is in the OODA Loop theory pantheon and has a seminal quote: “When you invent the ship, you also invent the shipwreck; when you invent the plane you also invent the plane crash; and when you invent electricity, you invent electrocution… Every technology carries its own negativity, which is invented at the same time as technical progress.” I think he and Calvino were contemporaries. There is probably a Calvino/Virilio conference we should track down.

Uncertainty and exponential disruption? You were talking about transfer learning and this break-off point where it used to not be able to learn from itself, but now it does. Exponential organizations and exponentials, as a business framework, have two core drivers: First, once any domain, discipline, technology, or industry becomes information-enabled and powered by information flows, its price/performance begins doubling approximately annually. Second, once that doubling pattern starts, it doesn’t stop. We use current computers to design faster computers, which then build faster computers, and so on.

That is what I heard you describe with the growth of transfer learning.  Exponentiality is happening in this space.  That makes small data a tangible manifestation of exponential growth patterns in AI and ML.

Wolf:  It is still an open question.  That’s why this question of context shift is so important. How far can you shift an algorithm out of context – so it still “knows” what it is doing? That is the stuff that we humans do. We know what context is and what is appropriate for one context versus others. Machines cannot do that yet, but I learned not to make predictions about those things.

“…to me, these developments are a great thing because, if you use them correctly, you can really kind of elevate your intelligence.”

Pereira:  So uncertainty, exponential disruption, and reasons for optimism and hope.  What has crossed your desk recently where you thought “Oh my God, that is actually a critical path to the solution of a really disturbing problem, there is a light at the end of the tunnel?”

Wolf: It is many things, but to pick one on my home turf, software: once we started experimenting with what I showed you before, the mindmap generation, it really hits you. When you first see it, you think, “Wow, this is categorically different.” Because software used to be about number crunching, and now you can use software as a sparring partner for thinking.

That is fascinating because, if you think about what that means and its applicability to all kinds of things, and to the way we will write software, it could transform software engineering. Another thing that hit me a few weeks ago: I looked on LinkedIn and saw the first people with “Prompt Engineer” as a job title. A prompt engineer is a person who tells those big language models what you want from them. Wow. This is cool, because this could really be the beginning of a new era.

Think about how you write now, which is completely different from just two years ago, because you can use software to help you think. If you’re stuck writing something, you can use something like GPT-3 or any of these different models and say: just suggest some different wording for this. It’s not just copy+paste, though. You’re not using the model to tell you what to do; it triggers a thought process.
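As a concrete illustration of this sparring-partner use, here is a minimal sketch of asking a GPT-3-era completion model for alternative wordings. The prompt, model name, and helper function are assumptions for illustration, not any specific product’s feature.

```python
import os

import openai  # pre-1.0 openai-python client, current when this interview took place

openai.api_key = os.environ["OPENAI_API_KEY"]


def suggest_rewordings(sentence: str, n: int = 3) -> list:
    """Ask the model for alternative wordings of a sentence you are stuck on."""
    prompt = (
        f"Suggest {n} different wordings for the following sentence, "
        f"one per line:\n\n{sentence}\n"
    )
    response = openai.Completion.create(
        model="text-davinci-002",  # any instruction-following GPT-3 model works
        prompt=prompt,
        max_tokens=150,
        temperature=0.9,  # higher temperature yields more varied suggestions
    )
    lines = response.choices[0].text.strip().splitlines()
    return [line.strip(" -") for line in lines if line.strip()]


# The suggestions are raw material for your own thinking, not copy+paste output.
for option in suggest_rewordings("Small data forces you to understand your problem."):
    print(option)
```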

Pereira:  From a humanities perspective, it’s like software as a Socratic method?

Wolf: You know, I see that as a good thing. There are all kinds of dystopian visions for AI, and I think there is justification for some of them too, but to me these developments are a great thing because, if you use them correctly, you can really elevate your intelligence.

Big data, small data, artificial intelligence, machine learning, and deep learning will be discussed at OODAcon 2022 – The Future of Exponential Innovation & Disruption on the following panels:

Disruptive Futures: Digital Self Sovereignty, Blockchain, and AI

Fireside chat with Futurist and Author Karl Schroeder

You are big data. Every day the technology you own, use, and otherwise interact with (often unintentionally) collects rich data about every element of your daily life. This session provides a quick overview of how this data is collected, stored, and mined but then shifts direction to look at what technologies might empower users to better collect, access, and authorize the use of their own data through blockchain, digital autonomous corporations, and smart contracts.

Swimming with Black Swans – Innovation in an Age of Rapid Disruption

Dawn Meyerriecks, Former Director of CIA Science and Technology Directorate

If Yogi Berra were to evaluate today’s pace of global change, he might simply define it as “the more things change, the more they change”. Are we living in an exponential loop of global change or have we achieved escape velocity into a “to be defined” global future? Experts share their thoughts on leading through unprecedented change and how we position ourselves to maintain organizational resiliency while simultaneously reaping the benefits of new technologies and global realities.

The Future Hasn’t Arrived – Identifying the Next Generation of Technology Requirements

Neal Pollard, Former Global CISO at UBS | Partner, E&Y

Bobbie Stempfley, Former CIO at DISA | Former Director at US CERT | Vice President at Dell

Bill Spalding, Associate Deputy Director of CIA for Digital Innovation

In an age when the cyber and analytics markets are driving hundreds of billions of dollars in investments and solutions, is there still room for innovation? This panel brings together executives and investors to identify what gaps exist in their solution stacks and to define which technologies hold the most promise for the future.

Postponing the Apocalypse:  Funding the Next Generation of Innovation

What problem sets and global risks represent strategic investment opportunities that help reduce those risks, but also ensure future global competitiveness in key areas of national defense?  This session will provide insights from investors making key investments in these technologies and fostering future high-value innovation.

Open the Pod Bay Door – Resetting the Clock on Artificial Intelligence

Mike Capps, CEO at Diveplane | Former President at Epic Games

Sean Gourley, CEO and Founder at Primer.AI

Artificial intelligence is like a great basketball head fake: we look toward AI but pass the ball to machine learning. But that reality is quickly changing. This panel taps AI and machine learning experts to level-set our current capabilities in the field and define the roadmap for the next five years.

OODAcon 2022

To register for OODAcon, go to: OODAcon 2022 – The Future of Exponential Innovation & Disruption

Part 1 of this interview: https://oodaloop.com/archive/2022/09/22/mergeflow-ceo-and-ooda-network-member-florian-wolf-on-small-data-part-1-of-2/



About the Author

Daniel Pereira

Daniel Pereira is research director at OODA. He is a foresight strategist, creative technologist, and an information communication technology (ICT) and digital media researcher with 20+ years of experience directing public/private partnerships and strategic innovation initiatives.