
Demystifying AI: 7 simple tips to get you started

Aug 20, 2024


An interesting thing is happening with AI today.

As a co-founder at dbt Labs, I have the good fortune of getting to speak with dbt Cloud customers and dbt Community members frequently. And here’s something I’ve noticed: even as some organizations are already seeing real value from AI initiatives, many more are struggling with how to get started.

The era of AI was thrust upon us quite suddenly, and the transition to leveraging AI can be daunting. There’s a lot to consider: Where does using AI even make sense? What does good governance look like? What changes might you need to make to your data architecture?

But I’m here as the bearer of some good news: it doesn’t have to be this complicated! My goal with this post is to demystify the process of rolling out reliable AI, in the form of some actionable tips gleaned from working with customers who are already seeing AI success.

… But first, let’s get specific: what are we talking about here?

There are two fundamental ways that AI is changing data workflows today:

  1. By making it easier and faster to build new data products. With the help of copilot-like experiences, it’s now simpler than ever to model, test, and document data with speed and precision.
  2. By making it easier and faster to consume or analyze data. That is, making organizational data accessible to more people in more scenarios through the use of natural language interfaces.

These are both tremendously exciting, and we’re doing a lot of thinking about both here at dbt Labs. However, I want to focus today on the latter one, as it arguably represents a larger paradigm shift for data teams. There’s also a ton of demand for it: I’m seeing overwhelming, almost universal interest in unlocking self-serve data exploration powered by AI.

Business users want to be able to self-serve their own data insights. Now with the help of LLMs, they're on the cusp of being able to ask questions about their data as easily as they might search for a recipe online.

If you’re seeking to make that possible at your organization, here's my advice to you.

7 simple tips for building a reliable AI interface to your data

1. Remember: LLMs aren’t magic

You should generally approach LLMs like you would any other tool in your technology stack. There are some tasks that LLMs are well-suited for (e.g., generating or summarizing text, following text-based instructions) and some tasks that LLMs are less well-suited for (e.g., forecasting or anomaly detection at scale). In this way, LLMs are just tools, and you get to choose how and when you wield them.

LLMs are (generally) good at:

  • Generating text
  • Summarizing text
  • Extracting sentiment from text
  • Following text-based instructions

LLMs are (generally) bad at:

  • Classification
  • Forecasting
  • Anomaly detection
  • "Classical ML"-style problems

This doesn’t mean you can’t solve the above problems with LLMs today! They just do not perform quite as effectively as “classical ML” approaches like regression and clustering.

It’s always a good idea to be problem-obsessed, and it’s never a good idea to fall in love with a particular solution. LLMs are very good at powering conversational interfaces (e.g., generating and summarizing text), so a chatbot would be a great application for an LLM. For other types of problems, you shouldn't assume that an LLM is the right shape of solution to solve the task at hand. Just as you would anywhere else, remain focused on selecting the right tool for the job.

2. Supply relevant context

LLMs are trained, basically, on the Internet. LLMs aren't trained on your organization’s data (with some possible exceptions). Fortunately, you can supply additional information to an LLM in the form of a well-crafted prompt.

Generally, you can think of an LLM as a reasonably smart person that you just hired off the street. They can read a lot and work very quickly, but they don’t know anything about your organization in particular. The instructions and supplemental documentation that you'd give to such a new employee should also be present in the prompt that you supply to an LLM.

Example:

A business user at your organization might ask: “How many customers did we add last month?”

If your LLM prompt is simply “You are a helpful assistant,” you're likely not going to get a great response. It may just make up a number.


However, if you instead supplement the context in the prompt itself, you're likely to get a more helpful response. This prompt might look like “You are a helpful assistant. Use the context below to answer the user’s question.” This technique of providing additional context is called RAG (Retrieval Augmented Generation) – it’s a very important and popular technique for such use cases.
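As a sketch of that pattern, here's how the augmented prompt might be assembled in code. This assumes a hypothetical retrieval step has already fetched the relevant rows; the function names and data shapes are illustrative, not from any particular library:

```python
def build_rag_prompt(question: str, context_rows: list[dict]) -> list[dict]:
    """Assemble a chat-style prompt that grounds the LLM in retrieved data.

    `context_rows` would come from a retrieval step (vector search, a
    warehouse query, etc.) -- hypothetical here.
    """
    context = "\n".join(str(row) for row in context_rows)
    system = (
        "You are a helpful assistant. Use the context below to answer "
        "the user's question. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# Ground the "new customers" question in a real number instead of a guess
rows = [{"month": "2024-07", "new_customers": 142}]
messages = build_rag_prompt("How many customers did we add last month?", rows)
```

The key point is that the answer ("142") is now in the prompt itself, so the LLM can quote it rather than fabricate one.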


3. Take governance seriously

LLMs are sponges for context. The more context you give them, the more accurate and helpful their responses will be. And they're very good at synthesizing this available context into their responses. While this is often a strength, it also presents as a weakness (and a risk).

You must make sure that the data you provide to an LLM is appropriate for your given use-case. If you are building a customer support chatbot, build your application so that each ticket gets its own fresh context to avoid leaking information across customer chats. Additionally, make sure that the context made available to the LLM doesn't include sensitive or private or otherwise confidential information that shouldn't be communicated publicly.

That last point might sound obvious, but keep in mind that LLMs require a lot of context to generate high-quality responses. Make sure that you understand your data pipelines well enough to guarantee that the information made available to an LLM is appropriate for your use-case.

Example:

Maybe you supply information on customer support plans to your RAG question-answering pipeline. If a business user asks, “What customer support plan is my account on?”, they'll get a good answer.

You can also imagine that if we for some reason supplied payment info as part of the available context for the LLM, very bad things could happen. A user could ask, “What are the last 4 digits of the credit card on file?” The LLM would happily answer that question, so you need to be sure that the information you supply is appropriate for your use-case.
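One simple guardrail is to strip sensitive fields from every record before it ever reaches the prompt. A minimal sketch, with an illustrative (not exhaustive) blocklist:

```python
# Fields that must never reach the LLM's context window -- illustrative list;
# a real one would come from your governance policy
SENSITIVE_FIELDS = {"credit_card_last4", "ssn", "billing_address"}

def redact_context(record: dict) -> dict:
    """Drop sensitive keys before a record is added to an LLM prompt."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

ticket = {
    "account": "Acme Corp",
    "support_plan": "Enterprise",
    "credit_card_last4": "4242",
}
safe = redact_context(ticket)
# safe == {"account": "Acme Corp", "support_plan": "Enterprise"}
```

In practice you would enforce this upstream in your pipelines as well, but a last-mile check like this is cheap insurance.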

4. Don’t forget about data quality

Data quality is key. You can use data quality information to contextualize the answers returned by LLMs. You want to avoid using significantly delayed or stale data to answer user questions. You also want to avoid using datasets with known data quality issues like duplicated values, null values where you don't expect them, or problematic referential integrity.

Example:

A business user at your organization might ask how revenue has changed week over week. Back in the world of BI tools, you might have seen a line chart that looks like the one below. Any discerning, analytics-minded person would look at this chart and say, “Hey, that doesn't look quite right to me. There's probably some missing data and it's not an up-to-date dataset.”

Now imagine that the interface is conversational, and a user has asked your LLM, “How has our revenue changed week over week?” The answer from the available data might be that revenue is down 85 percent week over week. And that's an alarming answer!

An alternative answer you could imagine the LLM responding with is, “It looks like payment data is delayed by six days, so current data may not be accurate. At present, our reported revenue is down 85% since last week.” That's a much more contextual and helpful answer. A business user could look at that, see there's likely a data quality problem, and perhaps even check to ensure the data team is working on it.

In that way, we can use metadata signals for data quality or data freshness to contextualize the responses of an LLM. Given the right metadata cues, you might prefer that it pick a different dataset with similar information but higher quality. Or you might rather it simply say it can’t answer the given question right now, but that the user can follow along in the right channels to be alerted when the issue is fixed.

5. Document datasets thoughtfully

Utilizing documentation is essential for providing accurate LLM responses. To achieve this, it's important to include relevant information alongside your data to help LLMs comprehend its meaning. Keep in mind that as a documentation author, you have two types of readers: individuals such as data analysts and business analysts who will use the datasets and descriptions, as well as LLMs and systems that rely on this information.

When writing documentation, you should use the language of the business. That means spelling out what acronyms stand for and providing abbreviations or colloquial terms in your descriptions.

Similarly, you should denote the definitions of common synonyms. The term “rep” might mean sales director, account owner, or account executive at different organizations. By supplying synonyms and acronyms to contextualize the description of a dataset, you can help an LLM find the right data and answer the right questions. If you don’t do that, an LLM is going to have a hard time mapping arbitrary questions from business users to the right datasets.

Example:

Below is a screenshot from dbt Explorer. Here we have a table called dim_customers and a column called OWNER_NAME. The description of the column right now is “the user's first and last name,” which is a poor description for this column. We haven't enriched the owner name column with additional metadata to describe exactly what that means.

A much better description would be “the full name of the sales director (SD) assigned to this customer.” With this type of description, you can ask questions like, “Who is the sales director for Acme Corp?” The LLM will have a much better chance of answering the question correctly.
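In a dbt project, that description would live in a YAML schema file. To sketch why synonyms matter, here's a hypothetical metadata lookup in plain Python (the structure and helper are illustrative, not a dbt API):

```python
# Column metadata enriched with spelled-out acronyms and synonyms,
# so an LLM can map "rep" or "SD" questions to the right column
column_docs = {
    "dim_customers.OWNER_NAME": {
        "description": "The full name of the sales director (SD) "
                       "assigned to this customer.",
        "synonyms": ["sales director", "SD", "rep", "account owner"],
    },
}

def find_columns(term: str, docs: dict) -> list[str]:
    """Naive lookup: match a business term against descriptions and synonyms."""
    term = term.lower()
    return [
        col for col, meta in docs.items()
        if term in meta["description"].lower()
        or any(term == s.lower() for s in meta["synonyms"])
    ]

find_columns("rep", column_docs)  # -> ["dim_customers.OWNER_NAME"]
```

A real RAG pipeline would do something similar with embeddings rather than string matching, but either way the synonyms have to exist in your documentation first.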

6. Define your key metrics explicitly

An LLM could theoretically define revenue for your organization on your behalf. But you probably don’t want it to do that. It might very well come up with a new definition every time you ask.

That’s why for KPIs with exactly one definition, you should carefully document that definition, version control it, and code review any changes to it or to upstream tables that impact it. If you do this using the dbt Semantic Layer, you'll get structured information about the key metrics that power your business, along with the ability to access that consistently-defined metric from any connected BI or AI tool.

Example:

Below is another screenshot from dbt Explorer. Here you can see how a jaffle business might calculate what percentage of their revenue comes from food orders. There's probably some sort of metric tree like this applicable to your organization.

In this example, each metric is defined explicitly, in code inside of dbt, building off of existing transformation logic. You might notice we have a formula to very precisely define food_revenue_pct.

This means that when you ask an LLM how the percent of revenue coming from food orders has changed over time, you're not asking the LLM to define that metric for you. You're asking it to take well-understood, precisely defined metrics and format them in the way you want to see them.
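In the dbt Semantic Layer, metrics like this are defined declaratively in YAML; the underlying idea can be sketched in plain code as one canonical, reviewable definition that every consumer reuses (the figures below are made up for illustration):

```python
def food_revenue_pct(food_revenue: float, total_revenue: float) -> float:
    """The single, canonical definition of food_revenue_pct.

    In a real project this lives in version control, changes go through
    code review, and every consumer (BI dashboard or LLM) reuses it
    instead of re-deriving it.
    """
    if total_revenue == 0:
        return 0.0  # avoid dividing by zero for empty periods
    return 100.0 * food_revenue / total_revenue

food_revenue_pct(60_000.0, 80_000.0)  # -> 75.0
```

The LLM's job is then limited to selecting and presenting this metric, never inventing its formula.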

7. Build high-quality feedback loops

You’ve probably established feedback loops with your stakeholders already, even before LLMs entered the picture. But with the addition of LLMs, strong feedback loops become even more critical.

Feedback loops allow you to constantly improve your LLM’s usefulness over time. You can add logging and monitoring to understand end-user satisfaction or understand if an answer was helpful or unhelpful. Then you can use that information to prioritize new data products, improve documentation for your datasets, or iterate on your LLM’s prompt.
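A minimal sketch of such a logging hook (file-based here purely for illustration; a real implementation might write to your warehouse or an observability tool):

```python
import json
from datetime import datetime, timezone

def log_feedback(question: str, answer: str, helpful: bool,
                 path: str = "feedback.jsonl") -> None:
    """Append one feedback event as a JSON line.

    The data team can later aggregate these events to prioritize new
    datasets, documentation fixes, or prompt changes.
    """
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "helpful": helpful,
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```

Even a crude thumbs-up/thumbs-down signal tied to the question text is enough to surface the gaps described in the example below.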

Example:

A user might ask, “How have our new escalation policies impacted our customer satisfaction scores quarter over quarter?” However, perhaps this information isn't available to the LLM – perhaps because you haven’t loaded the data into your data platform just yet.

A good answer from the LLM would be, “Sorry, I don't see any information about CSAT in the database, but I can open a ticket for the data team to look into it. Is there any other context that you want me to add to the ticket?” The business might say, “Sure, I'm looking into the customer satisfaction score because it's part of our OKRs for next year.”

In the background, your data team can then open a ticket, and log that a person had a question that the LLM wasn’t able to answer. Now you can make an informed prioritization decision about whether it's a new dataset you want to load into your data platform. Or perhaps you learn you need to update your documentation because customer satisfaction score isn't labeled correctly in the database.

This approach will help you iteratively improve over time. As more people use this chat-style experience for data, your team will quickly learn more about the types of questions they’re asking, and where the experience is or isn’t meeting expectations.

Seamlessly integrate LLMs into your data strategy

To summarize, here are the seven things you need to remember to get started:

  1. LLMs aren’t magic
  2. Supply relevant context
  3. Take governance seriously
  4. Don’t forget about data quality
  5. Document datasets thoughtfully
  6. Define your key metrics explicitly
  7. Build high-quality feedback loops

That sounds like a lot, but the good news is that nothing on that list is actually new. These are things your data team has probably already been doing for some time. The best practices required for a successful AI initiative are not actually that different from the best practices required for a successful BI initiative.

You already know the drill—document your data, build a semantic model, measure and alert on data quality, govern access to sensitive information, and prioritize feedback from stakeholders. Chances are, your team is already using tools like dbt to do this, which means you're well-positioned to leverage LLMs. By building on your existing foundation, you can share controlled, trusted, and precise datasets across your organization without having to start from scratch.

Ultimately, the most important takeaway is to focus on work that drives your business forward—whether that means delighting customers or boosting efficiency. If LLMs can help you achieve these goals, you’re on the right track. Hopefully, these seven tips will help you make that happen. To explore how leading companies are using AI to enhance productivity and decision-making, check out this whitepaper on dbt Cloud.

Last modified on: Oct 15, 2024
