
Developing a modern data strategy for AI

Aug 01, 2024

Insights

More companies than ever have big plans for Generative AI (GenAI) projects. However, the success of these projects depends on high-quality and well-governed data. That’s leading to a shift in how data engineers, analytics engineers, and others across the enterprise engage with data.

Let’s talk about current GenAI efforts, how GenAI is changing the jobs of data and analytics engineers, and what I think are some of the main principles behind a successful GenAI data effort.

The evolving role of the data engineer

Data from TDWI provides some insight into who’s using GenAI currently - and how.

Around 50% of the companies TDWI surveyed aren’t using GenAI currently - they’re still focused on self-service reporting and analytics. The other half are further along: 27% are applying Machine Learning and Natural Language Processing to enhance their analytics, and another 23% are on the cusp of predictive analytics - i.e., using current data to forecast future outcomes.

TDWI’s data also shows a lot of excitement around GenAI - i.e., combining pre-trained foundation models (like those behind ChatGPT) with a company’s own data sets to create new output. Teams are looking to GenAI for a range of use cases. Top examples include customer support chatbots, generating marketing content, writing code (a la GitHub Copilot), and onboarding new employees.

To achieve this, companies are using more than one data platform. Many are combining traditional relational DBMSes and data pipelines with data warehouses (cloud and on-prem), data lakes, and data lakehouses.

This push to GenAI, combined with the growth of data in general, has led to a lot of changes in the data-facing engineering roles. Along with data engineers, data scientists and MLOps engineers have joined the fray to assist in building out AI frameworks and the associated data pipelines. Meanwhile, the analytics engineer has risen to assist in creating clean data sets for end users, combining data transformation and coding skills with domain-level knowledge of a business’s data.

The role of the data engineer - now and in the future

In recent years—thanks largely to increased specialization in the data space—data engineers have been able to shift away from cranking out data transformation pipelines. Instead, they’ve focused more on creating the self-service infrastructure that other roles - analytics engineers, MLOps engineers, business analysts, etc. - can use to source data, create reports, and answer their own data-related questions.

However, with the explosion in GenAI, we’re seeing a bit of a reversion. Many data engineers surveyed in the TDWI report say they spend a lot of their time getting data into data warehouses, migrating data into data lakes, integrating with vendor data, or working on BI and analytics projects for their end users.

I think we’re in a transition phase where data engineers are being asked to get data into places where it can be used, for example, as context retrieved through mechanisms such as Retrieval Augmented Generation (RAG). Over time, we’ll see data engineers more focused on providing AI-oriented data in a self-service fashion to data consumers, rather than being focused on building data pipelines.

A lot of work in RAG right now involves curation of structured data. Most of this data transformation work continues to happen in SQL and Python. Given the ubiquity of SQL as a data access and transformation language, there’s a lot you can do to expose this data to data consumers in a way that’s easier for those without a deep knowledge of GenAI to access.
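
For example, here’s a rough sketch (in Snowflake-flavored SQL, with made-up table and column names) of what that curation step might look like - cleaning raw support tickets and collapsing them into a single text column that a retrieval step can consume later:

```sql
-- Hypothetical example: curate raw support tickets into a retrieval-ready table.
-- Table and column names are illustrative, not from any real schema.
create or replace table curated_support_docs as
select
    ticket_id,
    account_id,
    -- Collapse structured fields into one text column that an embedding or
    -- retrieval step can consume downstream.
    concat_ws(
        ' | ',
        'Subject: '    || subject,
        'Product: '    || product_name,
        'Resolution: ' || coalesce(resolution_notes, 'unresolved')
    ) as document_text,
    updated_at
from raw_support_tickets
where updated_at >= dateadd(month, -12, current_date);  -- keep it to recent, relevant records
```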

For example, say the business needs to extract sentiment from a set of survey responses. Data engineers can build SQL user-defined functions in platforms like Databricks and Snowflake that expose sentiment analysis via a simple SQL call. Analytics engineers can then use these functions in their queries to drive reports and apps that are useful to the business.
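
Here’s a hedged sketch of that pattern in Snowflake-flavored SQL. It assumes a platform-provided sentiment function (Snowflake’s SNOWFLAKE.CORTEX.SENTIMENT here; Databricks offers similar AI SQL functions) and made-up table and column names:

```sql
-- Data engineer: wrap the platform's sentiment scoring behind a simple UDF so
-- consumers never have to think about the GenAI machinery underneath.
create or replace function survey_sentiment(response_text string)
returns float
as
$$
    snowflake.cortex.sentiment(response_text)
$$;

-- Analytics engineer: use the UDF like any other SQL function.
select
    survey_id,
    avg(survey_sentiment(response_text)) as avg_sentiment
from survey_responses
group by survey_id;
```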

Another way I see the data engineering role evolving is in enabling the move of GenAI apps from prototypes to production. With the currently evolving crop of tools, it’s pretty easy to throw together a toy demo of how you expect a GenAI app to work. But there’s a huge gulf between that and getting the app in front of millions of users with the performance and reliability you need.

I don’t see this as a major shift for data engineers in terms of their existing skillset, though. There’s a lot we can do as an industry to de-mystify some of the concepts around GenAI. People throw around terms like “in-context learning” that sound ominous and foreboding at first - but they just end up meaning something simple (i.e., “add the context you retrieved as text to your LLM prompt”).
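
To make that concrete, here’s a rough sketch in Snowflake-flavored SQL - the table is hypothetical, and the completion function and model name are just assumptions for illustration. The “in-context learning” part is nothing more than string assembly before the model call:

```sql
-- "In-context learning" in practice: retrieve some relevant rows, paste them
-- into the prompt as plain text, and hand the whole thing to the model.
-- The table, the ILIKE filter, and the model name are all stand-ins.
with retrieved as (
    select document_text
    from curated_support_docs
    where document_text ilike '%billing%'   -- stand-in for a real retrieval step (vector or keyword search)
    limit 20
),

assembled as (
    select listagg(document_text, '\n') as context_text
    from retrieved
)

select snowflake.cortex.complete(
    'mistral-large',
    'Answer using only the context below.\n\nContext:\n' || context_text ||
    '\n\nQuestion: Why are customers contacting us about billing?'
) as answer
from assembled;
```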

A lot of what might first sound complex and hard to grok in GenAI just ends up being machine learning in a trenchcoat. More than ever, we’ll need the data engineer’s grounding in fundamental information retrieval techniques to help architect our GenAI apps so they’re production-ready.

The role of data products in enabling collaboration and governance

When we talk to customers about GenAI, we get one of two reactions. Half of our customers ask us pointedly, “Where is GenAI on your roadmap?” The other half want us to sign a contract promising we will never, under any circumstances, turn on any GenAI for their datasets.

This is a phenomenon we’ve seen time and again with data. We saw it when cloud data storage first became popular and companies worried about shared compute and storage models. It’ll take some time for some companies—and their security teams—to become okay with the notion of GenAI features operating over their data.

That leads us to the larger questions of data governance and data collaboration with GenAI. Having data engineers, data scientists, MLOps engineers, and analytics engineers all in the mix creates several challenges, including:

  • Enabling these roles to collaborate seamlessly—particularly, enabling downstream engineers to consume the work of the data team
  • Providing a way for data consumers (MLOps engineers, data scientists, etc.) to find existing, high-quality data sets so they don’t reinvent the wheel
  • Preventing compliance issues unique to AI—e.g., an LLM making inferences based on protected attributes such as race, gender, or age

I don’t feel, however, that most of these challenges are particularly unique to GenAI.

To be sure, there are some issues of which we need to be acutely aware. Data engineers need to remain very careful about what data they supply to GenAI engines or RAG databases. In this sense, good data governance techniques - anonymization, strong and consistent data labeling and classification, etc. - are more important than ever. It’s hard work. It’s time-consuming. And it’s absolutely essential.

However, in another sense, there’s nothing new here. The best way to get great results from GenAI is the same way you get great results from any data-driven project: produce high-quality data. This means going back to your data sources and ensuring that your data is sound. Do you know where it’s coming from? Is it reliable? It also means having robust data quality tests and metrics to ensure your data meets your criteria for accuracy, completeness, consistency, and timeliness.
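
In dbt, for example, one way to encode those criteria is a singular test - a SQL query that returns the rows that violate your rules, so the test fails whenever any come back. The model name, columns, and rules below are illustrative:

```sql
-- tests/assert_survey_responses_quality.sql
-- A dbt "singular test": dbt treats any rows this query returns as failures.
select *
from {{ ref('survey_responses') }}
where response_text is null                  -- completeness: every response should have text
   or survey_id is null                      -- consistency: every response ties back to a survey
   or submitted_at > current_timestamp()     -- accuracy: no responses dated in the future
```

Timeliness is usually handled separately - for instance, with dbt’s source freshness checks.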

One way to ensure both quality and collaboration is through data products. With data products, data engineers can treat a data set like a software engineering team would a software release, defining versioned, documented data contracts for each new iteration. Teams can then discover and use these data sets - e.g., one team can use the data set for BI or analytics, while another uses it for Machine Learning.
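
As a loose sketch of what that interface can look like (the model and columns below are made up): the data product exposes an explicit, typed column list, and a dbt model contract - declared in the model’s YAML configuration - pins those names and types down so consumers can rely on them from version to version:

```sql
-- models/marts/customer_360_v2.sql (hypothetical)
-- The explicit column list is the data product's interface; a dbt model
-- contract pins these names and types for downstream consumers.
select
    customer_id::varchar           as customer_id,      -- stable key consumers join on
    lifetime_value::number(38, 2)  as lifetime_value,
    churn_risk_score::float        as churn_risk_score, -- used by the ML team
    last_order_at::timestamp       as last_order_at,    -- used by BI dashboards
    current_timestamp()            as refreshed_at
from {{ ref('int_customer_metrics') }}
```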

I see data products playing an essential role in enabling data producers and consumers to work together. That’s why we’ve built support for data products directly into dbt via features like data contracts and why we support data discovery and lineage via features such as dbt Explorer.

Like I said, there’s nothing new here. These are solved problems. We don’t need to resolve them for GenAI—we just have to apply the lessons we’ve already learned.

GenAI is changing the way that we interact with data and, along with it, the role of the data engineer. The good news is that most data engineers already have the skills and tools to help companies manage this evolution. With high-quality data, a focus on governance, and a collaboration mindset, companies can create a strong data foundation on which they can build the next generation of GenAI apps.

Get started on developing a modern data strategy for AI with this on-demand webinar.

Last modified on: Oct 15, 2024
