Simplifying data management in the cloud

On Dec 6, 2019

Let’s face it. We’re painting ourselves into a corner given the number of special-purpose, cloud-native databases we’re placing into production. We’re doing so without regard to how we should be using those databases to allow easy access to, and understanding of, the data. Today, they are typically coupled to applications and serve a tactical rather than a strategic purpose.

This is not the purpose of data, and not the promise of the cloud. Keep in mind that data in the cloud was built up in our minds to make data more accessible and centralized. Finally, we would be able to do “wonderful things with our data.”

This is not to say that data and databases have stayed expensive to obtain and operate. They have become cheaper, and that alone has been a major advantage of public clouds. You can go from “need a database” to “have a database” in a day or less thanks to the wonderful world of on-demand cloud infrastructure.

But the ease of obtaining cloud-native databases, and thus building net new databases, has led to a data complexity issue, with a few core downsides:

Typically, there is no common understanding of all enterprise data and the context of that data. Data still is largely siloed, perhaps even worse than it was 10 years ago when our journey to the public cloud began.
Now we face unintended consequences, such as not having the understanding needed to deal with security, data governance, or even leveraging a “single source of truth.”

We do have ongoing projects tackling the issue of data complexity, such as the Linked Open Data Cloud. The Linked Open Data Cloud provides a loosely coupled collection of data, information, and knowledge that’s accessible by any human or machine with access to the Internet. The intent is to create an abstraction layer provided by the web. It permits both basic and sophisticated lookup-oriented access using either the SPARQL query language or SQL, to provide access to structured and unstructured data much in the same way that we have accessed web pages to get at image and text pages since the web started.
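To make “lookup-oriented access” a bit more concrete, here is a minimal sketch of the idea behind a SPARQL-style query: data is held as subject/predicate/object triples, and a query is a pattern with some positions left open. This is a toy in-memory stand-in, not the Linked Open Data Cloud or a real SPARQL engine, and the `ex:` names and the `match` helper are hypothetical, made up for illustration.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical triples; the identifiers are illustrative, not real linked data.
TRIPLES: List[Triple] = [
    ("ex:Customer42", "rdf:type", "ex:Customer"),
    ("ex:Customer42", "ex:name", "Acme Corp"),
    ("ex:Order7", "rdf:type", "ex:Order"),
    ("ex:Order7", "ex:placedBy", "ex:Customer42"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern.

    A position left as None is a wildcard, playing the role that a
    variable such as ?s plays in a SPARQL query.
    """
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Roughly the shape of: SELECT ?s WHERE { ?s rdf:type ex:Customer }
customers = [s for s, _, _ in match(TRIPLES, predicate="rdf:type", obj="ex:Customer")]

# Follow a link from an order to whoever placed it.
placed_by = match(TRIPLES, subject="ex:Order7", predicate="ex:placedBy")
```

The point of the sketch is only that the same simple pattern lookup works no matter which source a triple came from, which is what makes the loosely coupled model attractive.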

Of course, an array of technology providers offer solutions as well, such as master data management, data virtualization, and other technologies that allow you to manage complex data in improved ways. In other words, they provide data semantics and metadata management outside of the databases, cloud or not.

What I understand now is that this approach to dealing with cloud, cross-cloud, or hybrid cloud database management just won’t scale or age well. Looking at the requirements now, as well as in the near and far future, the data will get more complex and more technologically diverse, given the rapid pace of innovation.

Attempting to leverage the approaches and tools we use today will add complexity until the systems eventually collapse from the weight of it. Just think of the number of tools in your data center today that cause you to ask “what were they thinking?” Indeed, they were thinking much the same way we’re thinking today: looking for tactical solutions that will eventually stop providing the value they once did, and in some cases will provide negative value.

I’ve come a long way to make a pitch to you, but as I’m thinking about how we solve this issue, an approach seems to pop up over and over as the best likely solution. Indeed, it’s been kicked around in different academic circles. It’s the notion of self-identifying data.

I’ll likely hit this topic again at some point, but here’s the idea: Take the autonomous data concept a few steps further by embedding more intelligence with the data and more knowledge about the data itself. We would gain the ability to have all knowledge around the use of the data available from the data itself, no matter where it’s stored or where the information is requested. This could reduce data complexity substantially, but only if we all implement the concept in the same way.
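As a rough illustration of what “knowledge about the data traveling with the data” might look like, here is a minimal sketch of a record that carries its own schema, owner, sensitivity level, and lineage in the same envelope as the payload. Everything in it is an assumption made for illustration: the `SelfDescribingRecord` class, its field names, and the three-level classification scheme are hypothetical, not an existing standard or product.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

# Hypothetical sensitivity levels, ordered from least to most restricted.
LEVELS = ["public", "internal", "confidential"]


@dataclass
class SelfDescribingRecord:
    payload: Dict[str, Any]       # the data itself
    schema: Dict[str, str]        # what each payload field means
    owner: str                    # who is accountable for the data
    classification: str           # one of LEVELS
    lineage: List[str] = field(default_factory=list)  # systems it passed through

    def to_json(self) -> str:
        """Serialize payload and metadata together as one envelope."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "SelfDescribingRecord":
        """Rebuild the record, metadata included, from its envelope."""
        return cls(**json.loads(text))

    def readable_by(self, clearance: str) -> bool:
        """True if the given clearance is at or above the record's level."""
        return LEVELS.index(clearance) >= LEVELS.index(self.classification)


record = SelfDescribingRecord(
    payload={"customer_id": 42, "balance": 1250.0},
    schema={"customer_id": "integer key", "balance": "USD amount"},
    owner="finance-team",
    classification="confidential",
    lineage=["billing-db", "nightly-export"],
)

# Moving the record between stores is modeled here as a JSON round trip;
# the meaning, ownership, and access rules arrive along with the values.
restored = SelfDescribingRecord.from_json(record.to_json())
```

In this sketch any consumer, in any cloud, can ask the record itself what a field means or whether a given reader may see it, rather than consulting a separate catalog. The value depends on everyone agreeing on the envelope, which is the hard part.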

Some of the unique value would include:

The end game for any company is to take advantage of all the data it has at its disposal, make it accessible to anybody in the company who needs it to gain insights, and use those insights to make the business run better.
With increasing automation, data teams can work on higher order, more valuable problems. They can focus on the data, not the data platform, to achieve data success.
This is a core issue that complex data management needs to solve. Self-identifying data provides the specifics needed to address the issue of data complexity.

Lots of work needs to get done to make this a reality. I’m not sure we have that much time; missing something like this is hurting how effectively we use cloud computing today. Time to get to work.