The work of data engineers has changed radically in the past 10 to 15 years. The advent of the cloud and cloud-based data warehouses such as Amazon Redshift, the shift from ETL to ELT, and the birth of data lakes and data lakehouses mean that data engineering teams can process larger volumes of data faster than ever before.
On the other hand, data engineers still struggle to meet the demand for large-volume, high-quality data. This pressure has only grown stronger with the emergence of Generative AI (GenAI) use cases, which require large and accurate data sets to produce useful results.
The good news is that we can use GenAI itself to help meet this demand.
GenAI is poised to disrupt how data engineers work by automating many routine and even complex analytics workflow tasks. Implemented correctly, AI-assisted workflows let engineers ship more data products, more quickly, and with higher quality.
We’ll look at what AI data engineering is, how it can improve every stage of your data workflows, and the processes and systems you need to make it successful.
What is AI data engineering?
AI data engineering leverages GenAI technology to assist in the creation or modification of various assets associated with your data workflows. It combines the power of Large Language Models (LLMs), AI models trained on massive amounts of data, with data from your existing pipelines—database schemas, data models, tests, documentation, metrics—to produce a first draft of artifacts based on natural language descriptions that your engineers can refine, test, and deploy.
Raw data is never ready out-of-the-box for use in analytics or AI workloads. It requires transformation, cleaning, testing, and documenting to mold it into a format suitable for asking questions and driving business decisions.
Creating these data pipelines and the infrastructure that supports them is the heart of data engineering. This work can frequently become a chokepoint for creating new production-ready data sets, as it takes time to create and test a high-quality data pipeline.
AI data engineering cuts this workload by automating the often tedious creation of the assets that constitute a data pipeline. It doesn’t replace a data engineer; it augments them. Think of AI data engineering less as a robot and more as an exoskeleton or a mecha suit that enhances the powers of its wearer.
Similar uses of GenAI in other areas of software engineering have yielded impressive results. For example, GitHub found that 55% of engineers who used its Copilot feature reported faster task completion, with 50% reporting faster time-to-merge when readying their changes for deployment.
How AI data engineering assists the analytics code process
AI data engineering can provide a boost to data engineers, analytics engineers, analysts, and business users in five critical areas:
- Coding
- Testing
- Documentation
- Metrics and semantic models
- Discovering data
Let’s look at each in detail.
Coding
Data transformations require selecting data from one or more sources and reshaping it into a format suitable for querying for a specific business use case. This involves writing code—usually in SQL or Python—that consolidates data from multiple tables into fewer tables and corrects underlying issues such as malformed fields and missing values.
As in software engineering, AI data engineering can provide a boost by generating the base SQL statements a data engineer needs for their transformations. Engineers can generate simple or complex SQL statements and intricate regular expression patterns, and even perform bulk edits on existing SQL or Python code.
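For example, given a prompt like "combine customers with their order totals and treat blank emails as missing," an assistant might draft a starting point along these lines. This is a minimal sketch; the table and column names are hypothetical:

```sql
-- Hypothetical draft transformation: consolidate customers and orders
-- into one table, correcting malformed fields along the way
select
    c.customer_id,
    nullif(trim(c.email), '') as email,        -- treat blank emails as missing
    coalesce(sum(o.amount), 0) as lifetime_value,
    count(o.order_id) as order_count
from raw.customers as c
left join raw.orders as o
    on o.customer_id = c.customer_id
group by 1, 2
```

The engineer still reviews and refines the draft, but the boilerplate is already in place.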
This assistance can be particularly helpful for junior engineers, but it benefits even senior engineers who need to write complex queries. By issuing instructions in natural language and letting AI write the code, engineers don't have to waste time looking up the peculiarities of SQL syntax.
Testing
Sadly, code doesn't always work the way we intend it to. It may also work under normal conditions but fail on edge cases—values outside expected ranges, malformed values, and the like.
Building tests for data transformation code that make assertions about the resulting data provides confidence that our transformations work as expected under a variety of circumstances. These tests can take several forms:
- Unit tests (validate small portions of data model logic)
- Data tests (ensure generated data is sound: all required fields have values, no values are malformed, and so on)
- Integration tests (test the entire project end-to-end)
Testing is one of those areas in software engineering that gets short shrift come crunch time. Everyone knows they should be doing it, but the overhead involved means it sometimes gets left out in the rush to ship.
AI data engineering can generate basic tests for a new or revised data model, eliminating much of this upfront coding. That reduces the psychological barrier to creating an adequate test suite. It also frees engineers to focus on refining the tests and adding the checks that bring true value to the quality of each data set.
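As a sketch of what that might look like, here is the kind of baseline dbt test coverage an assistant could draft for a hypothetical orders model (the model name, columns, and accepted values are all illustrative):

```yaml
# models/schema.yml -- hypothetical AI-drafted data tests, ready for review
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique        # every order appears exactly once
          - not_null      # no order is missing its key
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: amount
        tests:
          - not_null      # every order has a total
```

An engineer can accept these as-is, tighten the accepted values, or layer on unit and integration tests where the model's logic warrants it.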
Documentation
Documentation is another one of those assets that everyone knows they should write but sometimes don't. And that’s a shame because good documentation is critical for the discoverability and usability of datasets.
Documentation tells downstream consumers where data comes from, how to use it, and how various calculations were derived. This makes data more usable and provides confidence in its validity and accuracy.
If you use dbt for your data transformations, you already get some documentation for free in the form of automated data lineage generation. This is useful for tracing the origin of data back to its source, which increases data confidence and also assists in troubleshooting issues.
AI data engineering goes further, creating descriptions for your tables and their fields based on their names, context, and similar data assets in your projects. This is extremely useful when you have hundreds of fields to document.
GenAI-generated documentation can provide a first cut of descriptions for all tables and fields. Engineers can then check these into source control, where they and other team members can gradually improve the documentation over time.
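In a dbt project, that first cut might land as plain YAML alongside your models. The example below is illustrative; the model and the wording of the descriptions are hypothetical:

```yaml
# models/schema.yml -- hypothetical AI-drafted descriptions for review
version: 2

models:
  - name: orders
    description: "One row per order, cleaned and deduplicated from the raw orders feed."
    columns:
      - name: order_id
        description: "Primary key; unique identifier for each order."
      - name: amount
        description: "Order total in USD, net of discounts."
```

Because the descriptions live in source control next to the code, improving them becomes part of the normal review workflow rather than a separate chore.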
Metrics and semantic models
Another area where AI can help is in defining consistent metrics. Providing global metrics for key values—revenue, for example—that are available across your organization makes it easier for everyone to access key data without reinventing the wheel or introducing subtle errors.
A semantic layer is a framework that defines a common representation of your data in standard business terminology, so stakeholders can work with familiar business terms rather than raw SQL or Python. Besides ensuring consistency, this democratizes access to data by making key values available to all data stakeholders.
Implementing a semantic layer requires tools for defining and exposing metrics globally. dbt Cloud’s Semantic Layer is one implementation of this concept, supporting both metrics and semantic models that ship with your data transformations. With AI data engineering, you can not only generate these models automatically—you can also ask the GenAI engine to recommend useful metrics based on your data transformation definitions.
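As a rough sketch, a generated revenue metric in dbt's Semantic Layer might look something like this. The names and values are illustrative, and this abbreviates the full spec, so treat it as a shape rather than a recipe:

```yaml
# models/semantic_models.yml -- illustrative semantic model and metric
semantic_models:
  - name: orders
    model: ref('orders')
    defaults:
      agg_time_dimension: ordered_at
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum           # sum the amount column per time grain
        expr: amount

metrics:
  - name: revenue
    label: Revenue
    description: "Total order amount."
    type: simple
    type_params:
      measure: order_total
```

Once defined, the revenue metric is computed the same way for every consumer, whether they reach it from a BI tool, a notebook, or an API.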
Discovering data
Not all business stakeholders have an advanced or even basic knowledge of languages like SQL. That can prevent them from answering their own questions about data, or limit what they can do in visual BI reporting tools.
AI data engineering can help here as well. Using AI, any stakeholder can query their data, not with a programming language, but with simple natural language expressions. The AI engine handles converting these into the required code under the hood. This further democratizes data and reduces the support requests that stakeholders make to the engineering team.
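For example, a stakeholder might ask, "What was our monthly revenue by region last year?" and the engine might translate it into something like the following (a hypothetical schema, and one plausible rendering among many):

```sql
-- Hypothetical SQL generated from a natural language question
select
    date_trunc('month', ordered_at) as order_month,
    region,
    sum(amount) as revenue
from orders
where ordered_at >= date '2024-01-01'
  and ordered_at < date '2025-01-01'
group by 1, 2
order by 1, 2
```

The stakeholder never sees this query unless they want to; they simply get their answer.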
Benefits and challenges of AI data engineering
AI data engineering can markedly decrease the time required to create new data assets. By using natural language queries, engineers of all levels can spend less time looking up programming syntax and hunting down the source of minor formatting issues in their code.
AI data engineering also breaks down barriers to implementing artifacts—such as tests and documentation—that are an essential part of data quality. GenAI can take on the grunt work involved in these tasks, leaving human engineers to focus on those parts where they can truly add value.
Additionally, an AI data engineering solution that uses your organization’s data as input can produce more consistent output. A GenAI engine can be instructed, for example, to follow certain best practices when generating SQL code.
The challenge is that AI isn’t a silver bullet. AI-generated code isn’t always accurate. Sometimes, in fact, it’s dead wrong. One developer, for example, found that only one LLM could successfully generate working code for a battery of tasks, such as creating a WordPress plugin or finding a bug.
Another study found that, while AI can help enforce best practices in some areas, it might encourage engineers to skip following them in other areas, such as security.
An AI data engineering assistant (with guardrails)
An AI copilot can be a revolutionary addition to data engineering as part of a mature, end-to-end analytics workflow such as the Analytics Development Lifecycle (ADLC). Processes like the ADLC ensure alignment with business objectives and establish checkpoints for code quality and conformance to best practices.
dbt Cloud acts as your data control plane, providing a consistent approach to developing, testing, deploying, and documenting data in line with the ADLC, no matter where your data lives. Using dbt Cloud, you can:
- Model your data using a combination of SQL and YAML code
- Promote it to production with a rigorous testing and verification process supported by the platform
- Monitor the performance and usage of your data
dbt Copilot integrates with every step of your data engineering workflow, using your own data—its relationships, metadata, and lineage—to automate routine tasks like testing, documentation, and SQL formatting that are essential for delivering high-quality data products.
Besides generating artifacts for your data pipelines, dbt Copilot can enforce code consistency using a custom style guide. You can use Copilot out of the box with OpenAI or bring your own OpenAI key.
Ask us for a demo to learn more about how dbt Cloud with dbt Copilot can transform your data engineering practice.