Data Engineering In The Age of AI
By: Brady Bastian
Data engineering is reaching an inflection point as more artificial intelligence (AI) tools emerge. These tools give software developers, including data engineers, a valuable way to accelerate code development and improve engineering outcomes. Although AI-generated code is generally competent, it seldom covers the entirety of a problem and rarely achieves complete accuracy. The data engineering profession must evolve with this technology to get the most out of it. Rather than avoiding or rejecting the AI trend outright, data engineers should embrace the technology and work to improve its implementation and capabilities. It is often these engineers who build the datasets that power AI solutions. Data engineers must immerse themselves in understanding how to support AI infrastructure both now and in the future. This proactive approach ensures that data engineers not only remain relevant but can also rise to leadership positions within the field.
To Make Better AI Products You Will Need Better Humans
Which came first, the chicken or the egg? This old riddle poses a philosophical question about causality. Likewise, we can ask: which came first, the AI or the user? As AI products become more sophisticated and users become more engaged, a unique and synergistic relationship arises. The user and the AI begin to train each other to operate more effectively together. Products engage with their users to present more meaningful, relevant content; think of Netflix's recommendation system. As the user interacts with the AI, the AI presents better results. An AI psychologist can help the recommendation system produce better results by better interpreting human choices and behaviors.
The AI psychologist can help devise better ways for AI to interpret human behavior in nuanced and unique ways. This can go beyond simple feature engineering and involve actual inference of unknown features based on an inference of human behavior. For example, a user might watch a sci-fi movie such as Lucy. The recommender can interpret this as "maybe they like Scarlett Johansson," maybe they like sci-fi, maybe they like action movies, and so on, but if this data is combined with other sources, such as the user's education, then a clearer picture might emerge. If the user also watches The Matrix and anime like Ghost in the Shell, then we can infer that the user is most interested in sci-fi and might enjoy adjacent titles, such as Dark City or A Clockwork Orange. As users engage with the AI more, the results get better and can adapt as users try new things.
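To make this concrete, here is a minimal sketch of the kind of signal-combining described above. The titles, genre tags, and scoring are invented for illustration; a production recommender would learn these representations from behavior at scale rather than hand-code them.

```python
# Toy sketch: combine viewing signals into a genre-level interest
# profile, then score unseen titles against it. All tags and weights
# are illustrative assumptions, not real catalog data.
from collections import Counter

# Hypothetical genre tags for titles the user has already watched.
watch_history = {
    "Lucy": ["sci-fi", "action"],
    "The Matrix": ["sci-fi", "action", "cyberpunk"],
    "Ghost in the Shell": ["sci-fi", "anime", "cyberpunk"],
}

# Hypothetical candidate titles the system could recommend.
candidates = {
    "Dark City": ["sci-fi", "noir", "cyberpunk"],
    "A Clockwork Orange": ["sci-fi", "dystopia"],
    "The Notebook": ["romance", "drama"],
}

# Build an interest profile by counting tag occurrences in the history.
profile = Counter(tag for tags in watch_history.values() for tag in tags)

# Score each candidate by the overlap between its tags and the profile.
scores = {
    title: sum(profile[tag] for tag in tags)
    for title, tags in candidates.items()
}

for title, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{title}: {score}")
# The sci-fi titles outscore the romance title, mirroring the
# inference described above.
```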
With self-driving cars, the problems become more nuanced and difficult to control while the stakes are simultaneously much higher. A great example is a self-driving car at a busy stop sign with many pedestrians trying to cross. The car is in a very difficult situation. On one hand, it is absolutely vital not to hit a pedestrian who is trying to cross the road. On the other hand, if the car doesn't assert itself, it might never move through the intersection. The optimal solution requires the car not only to precisely calculate the trajectories of the humans attempting to cross, but also to examine their behavior and interpret whether a pedestrian will stop if the car tries to proceed. The AI must infer human intent by examining human behavior. An AI psychologist is required to achieve this in a consistent, reliable, and accurate manner. The AI psychologist will train the AI on how to interpret human behavior, and also on how best to demonstrate its own intentions to humans so that it can be a useful participant on the road.
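As a toy illustration of intent inference, consider extrapolating a pedestrian's recent positions and checking for deceleration, a common yielding cue. This is only a hedged sketch: real autonomous-vehicle stacks rely on learned trajectory-prediction models, and the threshold and two-frame velocity estimate here are arbitrary assumptions.

```python
# Deliberately simple intent inference at a crosswalk: is the
# pedestrian slowing sharply (a yielding cue) or striding through?

def will_likely_yield(positions, dt=0.1, decel_threshold=0.5):
    """Guess whether a pedestrian is yielding, from recent 1-D
    positions (meters along their crossing path) sampled every dt
    seconds. Returns True on strong deceleration."""
    if len(positions) < 3:
        return False  # not enough history to judge intent
    v_prev = (positions[-2] - positions[-3]) / dt
    v_curr = (positions[-1] - positions[-2]) / dt
    decel = (v_prev - v_curr) / dt
    return decel > decel_threshold

# A pedestrian who slows from ~1.5 m/s to ~0.5 m/s between frames.
print(will_likely_yield([0.00, 0.15, 0.20]))  # True: decelerating
# A pedestrian striding at a constant ~1.5 m/s.
print(will_likely_yield([0.00, 0.15, 0.30]))  # False: no slowing
```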
This extends to LLMs such as ChatGPT. Humans should be able to interact with ChatGPT and train it to produce better results. For example, if a particular user is looking for drier answers, such as scientific facts, then it wouldn't make sense for ChatGPT to return results in a "flavorful" style. Likewise, if you are using ChatGPT as a liaison for a company's customer service office, then a more personable tone provides a higher degree of service. An AI psychologist would be well positioned to train these models to behave appropriately while still delivering a high-quality service or product, and would be instrumental in training chat AIs to fulfill various services and achieve the goals of the customer.
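In practice, much of this tone-shaping can be done with a system prompt. The sketch below uses the OpenAI Python SDK; the model name and personas are placeholders, but the pattern, same question, different persona, different register, is the point.

```python
# Sketch of steering an LLM's tone with a system prompt via the
# OpenAI Python SDK. Model name and personas are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, persona: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

question = "Why is my package late?"

# Dry, factual register for users who want unadorned answers.
print(ask(question, "Answer tersely with facts only. No pleasantries."))

# Personable register for a customer-service liaison.
print(ask(question, "You are a warm, apologetic customer-service agent."))
```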
To Behave Like a Human, You Must Think Like One
Generative AI tools based on Large Language Models (LLMs), such as ChatGPT, are becoming a vital source of information for many users. LLMs work by using highly sophisticated pattern-matching techniques to interpret a user's input and generate a result rooted in, ideally, highly accurate and curated training data (such as peer-reviewed scientific literature). Additionally, LLMs such as ChatGPT are programmed to sound definitive and authoritative when responding to queries.
Programmers and engineers, myself included, have turned to ChatGPT for quick hints about how to solve particular coding problems. At this point, it isn't good enough to program highly advanced, never-before-seen features; it lacks creative potential. But it is very good at relaying tried-and-true solutions to common problems, which can be vital when you're programming and just need a quick reference. This comes at a price, however: ChatGPT can make mistakes. In fact, it makes so many mistakes that it has been banned from Stack Overflow.
Stack Overflow posts this justification:
"The primary problem is that while the answers which ChatGPT and other generative AI technologies produce have a high rate of being incorrect, they typically look like the answers might be good and the answers are very easy to produce. There are also many people trying out ChatGPT and other generative AI technologies to create answers, without the expertise or willingness to verify that the answer is correct prior to posting."
Humans are inherently lazy, and this is a problem when using tools like ChatGPT to accomplish tasks. Indeed, colleges are becoming increasingly frustrated with AI tools because students are using ChatGPT to finish assignments and even to generate college entrance essays. This is, obviously, a form of cheating, but how can you tell when a student is using AI? It is definitely not easy. Ironically, the best method for catching students who use AI to generate answers is to use AI to detect AI-generated solutions. Building AI detection tools is becoming just as valuable as building the AI tools themselves. But what happens when a student is flagged for cheating when they were not?
Students are doing themselves and society a disservice by relying solely on LLMs to fabricate solutions or complete work items. By relying heavily upon machines to do everything for you, aren't you becoming a machine yourself? Is AI programming higher-order thinking out of human consciousness? Is our reliance upon AI becoming a substitute for actual intelligence?
To build better AI detection tools, you need to understand how LLMs like ChatGPT think. You need to be able to recognize that an answer was generated by an AI and not a human. But LLMs are designed to mimic human dialog as closely as possible. It therefore takes a highly trained and specialized person to make an accurate judgment about whether a piece of text was generated by an AI. You need an AI Whisperer.
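One signal such a specialist might lean on is perplexity: text generated by a language model tends to look more "predictable" to a language model than human prose does. The sketch below scores a passage with GPT-2 via Hugging Face Transformers. To be clear, this is an illustrative heuristic, not a reliable detector; real detection tools combine many signals and are still easily fooled.

```python
# Perplexity scoring with GPT-2: lower perplexity means the text is
# more predictable under the model, which is weak (not conclusive)
# evidence of machine-generated text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("The cat sat on the mat."))
```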
Your Next Recall Will Send Your Car To A Shrink
We are fast approaching a point where AI products will become more impactful on our daily lives and even vital to them. Tesla, for example, recently switched its Full Self-Driving (FSD) stack to an AI-only approach focused on image recognition and video object detection, essentially eliminating its C++ code base. The result has been a level of performance impossible to achieve with hand-coded algorithms alone. And while we focus on the engineering aspect of the cars and the neural networks, another vital need has arisen: the need to understand what the car was thinking when it chose to make a certain decision. This requires not only an engineering and data science mindset but also a psychological understanding of what the car was thinking when it made a decision. This psychological analysis will be used to identify holes in the car's training, or logic, and will form the basis of training scenarios for the car to go through. These analyses are so important that we believe level 4 or 5 autonomy will be impossible to achieve without them.
In 2022 the National Highway Traffic Safety Administration (NHTSA) noticed that Tesla vehicles were rolling through stop signs and demanded a recall to fix the problem. The thing is, though, humans rarely come to a complete stop either. It is rare at best for most people to stop fully at a stop sign, especially when there are no other cars around. In fact, coming to a full stop can sometimes draw the ire of other people because, ironically, it is unexpected behavior. Humans don't stop at stop signs because they are balancing their need for safety against their need to get where they're going. Experienced drivers often take these (illegal) shortcuts because they know their chances of getting in an accident (or getting pulled over by law enforcement) in these scenarios are extremely low.
Self-driving cars, however, can be programmed to react to stop signs in a predictable way, and therefore, the NHTSA concluded, they must be programmed to obey the law without question. And indeed, Tesla issued a recall to fix the bug, which was essentially an over-the-air software patch sent to FSD vehicles. With hand-written C++ it would be possible to say: if stop sign, then stop. Sure, but what about an AI-first scenario in FSD 12?
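Rendered in Python for brevity, the hand-coded approach looks something like this. This is schematic, not Tesla's code: the rule is a single branch, but every exception the real world throws at it would demand another.

```python
# A rule-based planner encodes "if stop sign, then stop" literally.
# Each new edge case (a tailgater closing fast, an occluded sign, a
# waving traffic officer) would require yet another hand-written rule.

def plan_action(percepts: dict) -> str:
    if percepts.get("stop_sign_detected"):
        return "stop"          # the rule the NHTSA recall enforces
    if percepts.get("light") == "red":
        return "stop"
    return "proceed"

print(plan_action({"stop_sign_detected": True}))   # stop
print(plan_action({"light": "green"}))             # proceed
```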
Now we're in a different world, an AI-first world. What if the car detects that a vehicle behind it is speeding toward it without stopping? In that case, wouldn't it be smarter for the car to blow through the stop sign to prevent an accident? Such a scenario requires an understanding of an ambiguous and chaotic world, and it requires judgment to reach a reasonable and justified conclusion.

Tesla uses video of "safe drivers" to train FSD 12. But if humans rarely stop fully at stop signs, then you are in fact training FSD to be less human-like by teaching it to stop. And where will you find the video clips needed to train the model if safe humans rarely stop to begin with? Tesla has a solution for this as well: generate them on the fly. Use generative AI to "imagine" what stopping at the stop sign would look like, produce thousands of examples of the scenario, and then use these imagined scenarios to train the model. You are no longer programming the AI; you are training it, through psychological manipulation, to behave outside the norm in order to satisfy a rule that almost no one follows. You are brainwashing your AI.
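Tesla's actual generative pipeline is proprietary, but the idea can be sketched in miniature: if real driving logs rarely contain full stops, synthesize compliant stopping trajectories and mix them into the training set so the learned policy sees the behavior it is supposed to imitate. Everything below is an illustrative stand-in, not the real system.

```python
# Toy stand-in for synthetic scenario generation: fabricate compliant
# full-stop speed profiles to rebalance a dataset that lacks them.
import random

def synthetic_stop_trajectory(approach_speed_mps: float, steps: int = 20):
    """Fabricate a speed profile that decelerates smoothly to 0 at a
    stop line, with a little noise so examples are not identical."""
    profile = []
    for i in range(steps):
        frac = 1.0 - i / (steps - 1)          # 1.0 down to 0.0
        speed = approach_speed_mps * frac
        speed += random.gauss(0, 0.05) if speed > 0 else 0.0
        profile.append(max(speed, 0.0))
    profile[-1] = 0.0                          # guarantee a full stop
    return profile

# Thousands of "imagined" compliant examples to rebalance the dataset.
dataset = [synthetic_stop_trajectory(random.uniform(4, 12))
           for _ in range(10000)]
print(len(dataset), "synthetic full-stop examples")
```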
But it works. Tesla FSD's stop sign behavior is better than ever because of these techniques. Passengers are comfortable, if a little annoyed, and the NHTSA is satisfied with the car's behavior. We are now one small step closer to true level 4 or 5 autonomy.
The AI Psychologist Is The Most Valuable Position That Doesn't Exist Yet
Almost all industry and education systems focus solely on how to develop neural networks, deep learning, MLOps, advanced matrix and tensor mathematics, and the other engineering essential to developing AI systems, but very few focus on the psychology of intelligent systems. We believe this to be the next frontier of AI development. AI Psychologists will make AI products smarter, more efficient, and more powerful. As AI becomes a larger part of our daily lives, the AI Psychologist will ease that transition and improve the human-AI relationship. AI is powerful, and therefore it must also be responsible. The AI Psychologist will be necessary to train AI to recognize the impacts of its decisions. This goes far beyond the so-called "trust and safety" guardrails that govern corporate AI today: the AI Psychologist will train AI to recognize the human impact of any result it returns, based not only on criteria outlined in corporate governance mandates but on a genuine recognition of that impact.
The best AI Psychologists will not only provide a better connection between users and the AI; they will also improve the connection between the user and the company. This will improve customer service and drive revenue. Additionally, AI Psychologists will make AI safer and improve trust among regulators and corporate leadership. The AI Whisperer will not just be a nice-to-have; it will soon become essential for companies developing advanced AI products. We also see an entire cottage industry arising around education in AI Psychology, with advanced degree programs offered alongside data science and engineering degrees. AI Psychologists will provide complementary, and essential, services alongside data engineering when developing AI products and services. If the engineer is the brain of the AI, then the AI Whisperer would, undoubtedly, be the heart.
What is an AI Psychologist?
An AI psychologist is part data scientist, part psychologist, and part engineer: an expert in behavioral analytics, data analysis, and psychological analysis. An AI psychologist combines these skills to provide detailed psychological analyses of AI behavior and activities. These analyses are vital for identifying flaws in AI logic and recommending better approaches to training the AI system. AI psychologists are able to intuit AI behavior and recommend improvements that lead to better products.
In 2024, AI is rapidly becoming an intrinsic part of our daily lives and our world. We interact with AI in many different ways each day, from AI-generated search results to AI-generated images and video to self-driving cars. When we think of how and why these AIs are programmed and maintained, we often picture a wonky computer scientist creating extremely complicated neural networks that produce a solution from user prompts or inputs. Think of generative AI producing images of almost anything on demand, or generative adversarial networks creating lifelike images of people who never existed.
Inflection Point
The discipline of data engineering and its associated infrastructure is experiencing substantial advancements that are analogous to, or even surpassing, those observed in other engineering fields. The efficacy of Artificial Intelligence (AI) is contingent upon substantial datasets derived from a myriad of sources. As a result, there is an increasing investment in the development of datasets meticulously optimized for AI performance. This emergent emphasis underscores a growing divergence in the skill sets requisite for data engineers tasked with the creation of these specialized datasets.
AI-optimized datasets exhibit significant differences from those designed for human analytical purposes. Although AI possesses the capability to adapt to varying ingestion formats, the processing of non-optimized data structures necessitates augmented computational resources and larger quantities of data. This scenario precipitates elevated costs and diminishes returns on novel AI initiatives, thereby posing a risk to entire projects. Data engineers are increasingly responsible for the compilation of essential datasets and providing the requisite expertise to actualize AI solutions. The principal challenge resides in grasping the intricate requirements necessary for the effective development of AI-optimized datasets.
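What "AI-optimized" means varies by project, but one concrete reading is numeric encoding, normalization, and columnar storage, so training jobs can ingest the data cheaply. The column names and normalization choice below are illustrative assumptions, not a prescribed standard.

```python
# Turn a human-oriented table into model-ready numeric features
# stored in a columnar format. Illustrative sketch only.
import pandas as pd

# Human-friendly source table: mixed types, readable labels.
df = pd.DataFrame({
    "customer": ["alice", "bob", "carol"],
    "plan": ["basic", "pro", "pro"],
    "monthly_spend": [12.0, 55.0, 61.0],
})

# Encode categoricals as integers and z-score the numeric column so
# gradient-based models can ingest it without extra preprocessing.
df["plan_id"] = df["plan"].astype("category").cat.codes
spend = df["monthly_spend"]
df["spend_z"] = (spend - spend.mean()) / spend.std()

# Persist only the model-facing columns in a columnar format that
# training jobs can scan cheaply.
df[["plan_id", "spend_z"]].to_parquet("features.parquet")
```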
The complexity inherent in servicing advanced AI models exacerbates these challenges. An erroneous value within a matrix computation can engender cascading effects throughout a neural network, thus compromising the integrity of complete answer sets. Identifying the underlying root cause of such issues proves challenging, culminating in prolonged exchanges between data engineers and data scientists. This dynamic not only hampers technical progression and efficiency but also harbors the potential for team or project disintegration.
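The cascading effect is easy to demonstrate: a single NaN planted in a weight matrix contaminates one hidden value, and one matrix multiply later the entire output row is corrupted.

```python
# One bad value cascades: a single NaN in a weight matrix poisons
# every downstream computation that touches it.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
x = rng.normal(size=(1, 4))

W1[2, 1] = np.nan          # one corrupted weight

h = x @ W1                 # NaN appears in one hidden-layer position
y = h @ W2                 # ...and now the entire output row is NaN
print(h)
print(y)
```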
To mitigate these challenges, it is imperative to incorporate highly proficient data engineers as integral stakeholders in AI projects. It is crucial for data scientists and management to treat the AI lifecycle as a holistic ecosystem composed of various essential elements. Employing the analogy of the human body, one may conceive of the AI project as an integrated entity. In this context, data science functions as the brain while data engineering constitutes the remainder of the body. The brain devises strategies and comprehends the tools and scientific principles required to navigate the world, whereas data engineering operates as the heart of the project, circulating information to the brain and ensuring its survival and optimal functionality.
Elite athletes harness their cognitive faculties to devise efficacious strategies while simultaneously recognizing the need to nurture the muscles and organs essential for executing those strategies. Similarly, a robust heart is imperative for the mind to act upon its acquired knowledge. Hence, a symbiotic relationship between data scientists and data engineers is indispensable for the successful execution of AI projects.
Better Outcomes Through Collaboration
Currently, the majority of data science projects fail to reach production. What accounts for this disconnect? Is it that stakeholders are unable to recognize flawed ideas, or is it that good ideas falter due to fundamental yet nuanced issues? It is our belief that data science projects predominantly fail not due to poor conceptual development but due to deficiencies in execution. This process begins with data, the essential component of any project.
Data scientists and project managers often perceive data engineers as mere components within the project, tasked with the laborious and less glamorous work of integrating, cleaning, and assembling data before the more exciting tasks can commence. They view data engineers as another tool that accepts instructions, performs operations, and reports results. This perspective is akin to regarding the heart merely as an organ that performs a function without acknowledging its critical importance. Just as professional athletes recognize the heart's significance to their success, so too should data engineers' roles be appreciated within data science projects.
Instead of providing arbitrary and cryptic instructions to data engineers, integrating them fully into the data science project from inception to production is a superior approach. Data engineers possess intimate knowledge of the datasets, including the nuances, gaps, capabilities, and potential inherent in the data under consideration. They understand whether the parameters of a project are feasible given the constraints of the data and corporate infrastructure. Their expertise is not merely beneficial but essential in determining the viability and predictability of a project.
A highly trained data engineer will evaluate the potential project's feasibility within the data context and be instrumental in developing the AI-optimized datasets necessary to power the project. The role is evolving to combine technical, intellectual, and process-oriented pillars, creating the necessary data flow to power the project effectively and efficiently. Without this critical component, the project's success is unattainable.
Heart and Mind
The integration of artificial intelligence (AI) into data analysis has been recognized as an efficient method for processing large datasets. Nonetheless, the accuracy of AI-generated analysis is contingent upon the quality and currency of the underlying data. Thus, the efficacy of AI systems is predicated on the incorporation of high-quality data from diverse and timely sources. Returning to the body analogy, data is the blood necessary for cerebral function, which underscores the importance of robust data to AI performance.
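One way to operationalize that requirement is a quality gate that verifies freshness and null rates in code before data reaches the model. The thresholds below are arbitrary placeholders; the point is that quality and currency are checked, not assumed.

```python
# Sketch of a pre-ingestion quality gate: refuse data that is stale
# or riddled with nulls. Thresholds are illustrative placeholders.
from datetime import datetime, timedelta, timezone
import pandas as pd

MAX_AGE = timedelta(hours=24)
MAX_NULL_FRACTION = 0.05

def check_batch(df: pd.DataFrame, extracted_at: datetime) -> None:
    age = datetime.now(timezone.utc) - extracted_at
    if age > MAX_AGE:
        raise ValueError(f"batch is stale: {age} old")
    null_fraction = df.isna().mean().max()  # worst column
    if null_fraction > MAX_NULL_FRACTION:
        raise ValueError(f"too many nulls: {null_fraction:.1%}")

df = pd.DataFrame({"reading": [1.0, 2.0, None]})
check_batch(df, datetime.now(timezone.utc))  # raises: ~33% nulls
```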
Analogously, the role of the data engineer in the development of AI solutions and modern data architectures is akin to the function of the heart in the human body. Data engineers are fundamental to the lifecycle of AI projects, providing indispensable expertise in data management and integration. The notion of treating data engineers as ancillary components, easily interchangeable, is flawed. Similar to the complexities and risks associated with cardiac transplants, replacing key data engineering personnel mid-project can jeopardize the project's success.
The intrinsic value of the lead data engineer within AI initiatives necessitates a consideration comparable to the strategic importance of the heart in the human body. The process of substituting a data engineer is not only costly but also poses significant risks to project continuity and integrity. The assimilation of a new data engineer demands substantial time and resources, potentially impairing the project's momentum and increasing the likelihood of failure. An optimal approach to maintaining project efficacy involves fostering the development and optimization of existing data engineering resources. Drawing parallels to the rigorous training of professional athletes, enhancing the capabilities of data engineers can substantially improve project outcomes. A project wherein data engineers are fully integrated and regarded as essential contributors is more likely to achieve its full potential, reflecting the symbiotic relationship between the components of a well-functioning system.