Be Water, My Friend - Data Is A Liquid
By: Brady Bastian
Be Water, My Friend
I don't think Bruce Lee had data in mind when he developed his philosophy, but I find that the idea to "be like water" is a great way to look at data. His wonderful suggestion to "be shapeless, formless" - to be able to take any shape - can apply to data engineering in surprising, interesting, and useful ways. Bruce tells us that water can "flow, or it can crash". Likewise, data can flow, crash, transform, or represent, but the essential element, the integrity of the data itself, is always maintained.
Be Formless, Shapeless
When we think of data, what comes to mind? Likely you're thinking of spreadsheets, tables, or charts. This is close to correct, but these are not the data themselves; they are the shapes of the data. They are representations, like the shadow of a hand puppet projected on a wall. Depending upon how you alter the light source you can make the shadow bigger or smaller, move it across the surface, or even change its shape. Nonetheless, the shadow is still faithful to your hand and cannot exist without it. The light is not the source of the shadow, but it transforms the shadow of your hand into the projection on the wall.
When we refer to data we call them "elements". What does this mean, exactly? Well, what is an element? A quick Wikipedia search reveals that "A chemical element is a chemical substance that cannot be broken down into other substances." We believe that data is matter, that it has substance, and that it has essential qualities or properties at an atomic level which cannot be broken down further without destroying the substance of that data element.
When you have atomic-level data, there exists a unique combination of data elements that faithfully represents an object (or schema) in its entirety. These are the data "compounds": elements which, when properly combined, form a complete "object" (such as a customer transaction). The transaction cannot exist without each necessary data element being present and correct. If you remove or alter one of the elemental pieces of data you risk nullifying the entire object. But you can transform these elemental data in different ways to produce alternative understandings of the data compounds.
That is very abstract and theoretical, so let's go through an example. An individual customer order consists of several data elements without which the order "could not exist": a positive identification of the customer, the items purchased, the currency used, the payment type and amount, and so on. If any of those are missing, the order object is not an effective unit of information. However, let's say you're missing an element, such as the customer id. You know for sure that this transaction occurred, but you can't identify the customer. The order itself is likely corrupted, but the remaining data elements are still relevant. You still have sales amount, items purchased, taxes paid, currency, and others. Even though the order object itself is unfaithful, this doesn't preclude using the elements to build metrics.
Bruce Lee tells us that water (or any liquid) can take any shape, and indeed "when you pour water into a cup, it becomes the cup". In the example above, when the data elements are poured into the order object, they become the order. If the shape of our order is broken, then it becomes useless as an order in and of itself, but the data elements, the water, are still water. Each one still exists as a faithful representation of the element. In our example, we cannot effectively tie the order to the customer, but the other data points still exist and are still useful. For example, we could still form an understanding of total revenue over time because the sales amount and transaction date are present. The data has become the cup, the total revenue calculation, but the total revenue calculation is made up of the individual data elements of all orders transacted against the database. You have changed the shape of the data, but you have not altered its essential qualities. The data are elemental.
Indeed, some of the most important work we do as data engineers is not only to identify what exists, but also to identify what is missing. We create the shape of the cup, but the cup is meaningless without data to fill it.
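To make the order example a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the field names (`customer_id`, `items`, `currency`, `amount`, `order_date`), the sample orders, and the helper functions `missing_elements` and `revenue_by_date` are hypothetical, and all amounts are assumed to share one currency. It shows an order that is broken as an order (no customer id) whose elements still pour cleanly into a different cup, revenue by date.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical set of elements an order needs in order to "exist" as an order.
REQUIRED_FIELDS = ("customer_id", "items", "currency", "amount", "order_date")

# Made-up sample data; the second order is missing its customer id.
orders = [
    {"customer_id": "C1", "items": ["book"], "currency": "USD",
     "amount": Decimal("12.50"), "order_date": "2024-01-01"},
    {"customer_id": None, "items": ["pen", "ink"], "currency": "USD",
     "amount": Decimal("7.25"), "order_date": "2024-01-01"},
    {"customer_id": "C2", "items": ["lamp"], "currency": "USD",
     "amount": Decimal("30.00"), "order_date": "2024-01-02"},
]


def missing_elements(order):
    """Return the names of the required elements this order lacks."""
    return [field for field in REQUIRED_FIELDS if order.get(field) is None]


def revenue_by_date(orders):
    """Sum amounts per date, using every order that has an amount and a date.

    An order that is broken as an order (no customer id) still contributes,
    because the two elements this metric needs are intact.
    """
    totals = defaultdict(Decimal)  # Decimal() is zero
    for order in orders:
        if order.get("amount") is None or order.get("order_date") is None:
            continue
        totals[order["order_date"]] += order["amount"]
    return dict(totals)


# Identify what is missing...
incomplete = [order for order in orders if missing_elements(order)]
# ...and still pour the surviving elements into a different cup.
daily_revenue = revenue_by_date(orders)
```

Here `incomplete` holds only the second order, while `daily_revenue` still counts its 7.25 toward the first day. The cup changed, the water did not.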
Flow and Crash
Lee concludes by telling us that "water can flow, or it can crash". Water is very powerful, always moving, and shapes our world. In the oceans, water flows along currents from one end of the earth to another. At certain points the water crashes into cliff sides or onto beaches with tremendous force. These flows and crashes can create breathtaking scenery and landscapes. Similarly, data flows from a source to a destination along given currents or pathways. It can occasionally change shape, or change temperature, but it always remains essentially pure.
We build data flows using a number of tools, from Spark to Airflow. We take the data in its current state and move it, transform it, enrich it, or aggregate it, but never alter its essential qualities. Our goal is to move the data into its proper position without changing its elemental nature. The tools and techniques we use, such as chained transformations, change the shape of the data or alter its flow, but are never used to misrepresent the data. Data are elements of information and are extremely powerful, but their essential qualities must be maintained. Data is a liquid.
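As a rough illustration of chained transformations, here is a small sketch in plain Python rather than Spark or Airflow. The `pipeline` helper, the step functions, and the sample records are all hypothetical, and amounts are assumed to be whole cents. Each step returns a new shape and leaves the source records untouched, which is the property described above.

```python
from functools import reduce


def pipeline(*steps):
    """Chain steps left to right; each step receives the previous step's output."""
    return lambda records: reduce(lambda acc, step: step(acc), steps, records)


def keep_paid(records):
    # Filter: changes which records flow onward, not the records themselves.
    return [r for r in records if r["status"] == "paid"]


def add_total_with_tax(records):
    # Enrich: build new dicts instead of mutating the source dicts.
    return [{**r, "total": r["amount"] + r["tax"]} for r in records]


def sum_total(records):
    # Aggregate: pour the elements into a single number.
    return sum(r["total"] for r in records)


# Made-up source data, with amounts in cents.
source = (
    {"order_id": 1, "status": "paid", "amount": 100, "tax": 8},
    {"order_id": 2, "status": "refunded", "amount": 40, "tax": 3},
    {"order_id": 3, "status": "paid", "amount": 50, "tax": 4},
)
snapshot = [dict(r) for r in source]  # copy, to check the source afterwards

paid_total = pipeline(keep_paid, add_total_with_tax, sum_total)(source)
```

After the run, `source` still matches `snapshot`: the flow produced a new shape (a single total) without touching the elements it started from.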
Data flows must end somewhere and must take the shape of whatever destination they reach. Water crashes into a cliff face or beach. Data crashes into our phones, company board meetings, or Bloomberg terminals. Water can crash violently and dramatically alter our world. Water can flow and bring nourishment to remote regions of the world, or it can bring warmer water and change a region's climate. Data flows from customer orders on a website to a company sales dashboard, and quickly represents the success or failure of a campaign or initiative. Data flows from an influencer's camera and crashes onto our social media feeds, helping us to understand products and guiding our purchasing decisions. Data flows into a banker's dashboard and determines whether a family qualifies to purchase a home.
The flows and crashes of data have the power to shape our world. Not only does data shape our world; eventually, data becomes our world.
Be Water, My Friend.