Data is produced by every organization, making it the common denominator across industries as we look to advance how cloud and AI are incorporated into our operations and daily lives.
Before the potential of cloud-powered data science and AI is fully realized, however, we first face the challenge of grappling with the sheer volume of data. This means figuring out how to turn its velocity and mass from an overwhelming firehose into an organized stream of intelligence.
To capture all the complex data streaming into systems from various sources, businesses have turned to data lakes. These storage repositories, often hosted in the cloud, hold enormous amounts of data, raw or refined, structured or unstructured, until it's ready to be analyzed.
This concept seems sound: the more data companies can collect, the less likely they are to miss important patterns and trends coming from their data. However, a data scientist will quickly tell you that the data lake approach is a recipe for a data swamp, and there are a few reasons why.
First, a good amount of data is often hastily stored, without a consistent strategy in place around how to organize, govern and maintain it. Think of a junk drawer at home: items get thrown in at random over time until whatever you're looking for is buried and nearly impossible to find.
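The junk-drawer problem is, at its core, a missing-metadata problem: data lands in the lake with no record of where it came from, who owns it, or what it contains. As a minimal sketch (with hypothetical names throughout, not any particular catalog product), recording a few fields at ingestion time is what keeps datasets discoverable later:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetRecord:
    """Minimal metadata captured when a dataset enters the lake."""
    name: str
    source: str            # e.g. an object-store path
    owner: str             # team accountable for the data
    tags: set = field(default_factory=set)
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class DataCatalog:
    """Toy in-memory catalog; a real lake would use a managed
    metadata service, but the principle is the same: no dataset
    is stored without a record of what it is."""

    def __init__(self):
        self._records = {}

    def register(self, record: DatasetRecord):
        self._records[record.name] = record

    def find_by_tag(self, tag: str):
        # Discovery: list dataset names carrying a given tag.
        return [r.name for r in self._records.values() if tag in r.tags]


catalog = DataCatalog()
catalog.register(DatasetRecord("clickstream_raw", "s3://lake/raw/clicks",
                               owner="web-team", tags={"raw", "events"}))
catalog.register(DatasetRecord("sales_curated", "s3://lake/curated/sales",
                               owner="finance", tags={"curated"}))
print(catalog.find_by_tag("raw"))  # → ['clickstream_raw']
```

Without even this thin layer of bookkeeping, every query against the lake starts with a search through the drawer.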