
How to hire data engineers

Apr 30, 2024


I find myself regularly having conversations with analytics leaders who are structuring the role of their team’s data engineers according to an outdated mental model. This mistake can significantly hinder your entire data team, and I’d like to see more companies avoid that outcome.

This post discusses when, how, and why you should hire data engineers as a part of your team.

What is a data engineer?

Data engineers are the people who move data from outside of your ecosystem into your ecosystem. They are responsible for your infrastructure and data plumbing.

Responsibilities for these folks might include: keeping your dbt instance on the latest version, managing Snowflake permissions, managing and writing Airflow pipelines, and maintaining the CI/CD pipeline for your repo.

What does a data engineer do?

Even with the availability of new tools that empower data analysts and scientists to build self-service pipelines, data engineers are still a critical part of any high-functioning data team. However, the tasks they should focus on have changed, as has the sequencing in which you hire them. I’ll discuss the “when” question in a later section; for now, let’s talk about what data engineers are responsible for on modern startup data teams.

Instead of building ingestion pipelines that are available off-the-shelf and implementing SQL-based data transformations, here’s what your data engineers should be focused on:

  • managing and optimizing core data infrastructure,
  • building and maintaining custom ingestion pipelines,
  • supporting data team resources with design and performance optimization, and
  • building non-SQL transformation pipelines.

Managing and optimizing core data infrastructure

While data engineers no longer need to manage Hadoop clusters or scale hardware for Vertica at VC-backed startups, there is still real engineering to do in this area. Making sure that your data technology is operating at its peak results in massive improvements to performance, cost, or both. That typically involves:

  • building monitoring infrastructure to give visibility into the pipeline’s status,
  • monitoring all jobs for impact on cluster performance,
  • running maintenance routines regularly,
  • tuning table schemas (i.e. partitions, compression, distribution) to minimize costs and maximize performance, and
  • developing custom data infrastructure not available off the shelf.

You can get most of your core infrastructure off-the-shelf today, but someone still needs to monitor it and make sure it’s performing. And if you’re truly a cutting-edge data organization, you’ll likely want to push the boundaries on existing tooling. Data engineers can help with both.
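As a minimal sketch of the monitoring work described above, here’s a freshness check that flags tables whose most recent load has breached an SLA. The table names and thresholds are hypothetical; in practice the timestamps would come from your warehouse’s metadata and the alert would go to your paging or chat tooling.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLAs per table; in practice these would live in config.
SLAS = {
    "orders": timedelta(hours=1),
    "sessions": timedelta(hours=6),
}

def stale_tables(last_loaded, now=None):
    """Return the tables whose most recent load breaches their freshness SLA.

    last_loaded maps table name -> timestamp of the latest successful load.
    Tables without an explicit SLA fall back to a 24-hour default.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        table
        for table, loaded_at in last_loaded.items()
        if now - loaded_at > SLAS.get(table, timedelta(hours=24))
    )
```

A scheduled job would call `stale_tables` against warehouse metadata and page the on-call engineer when the result is non-empty.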

Building and maintaining custom ingestion pipelines

While data engineers no longer need to hand-roll Postgres or Salesforce data transport, there are “only” about 100 integrations available off-the-shelf from the modern data integration vendors. Most of the companies we work with have off-the-shelf coverage of between 75 and 90% of the data sources they work with.

In practice, integrations are implemented in waves. Typically, the first phase covers the core application database and event tracking, with the second phase adding marketing systems like an ESP and advertising platforms.

These first two phases are available completely off the shelf today. Once you go deeper into your more domain-specific SaaS vendors, you’ll need data engineers to build and maintain these more niche data ingestion pipelines.
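What does one of these niche pipelines look like? At its core, it’s usually paginated extraction from a vendor API plus an idempotent load. Here’s a sketch in plain Python — `fetch_page` stands in for a hypothetical vendor client, and the dedupe step keeps reloads idempotent:

```python
def extract_pages(fetch_page, cursor=None):
    """Pull all records from a paginated API.

    fetch_page is vendor-specific: given a cursor (None for the first page),
    it returns (records, next_cursor), with next_cursor None on the last page.
    """
    records = []
    while True:
        page, cursor = fetch_page(cursor)
        records.extend(page)
        if cursor is None:
            return records

def dedupe_latest(records, key="id", version="updated_at"):
    """Keep only the newest version of each record so repeated loads stay idempotent."""
    latest = {}
    for rec in records:
        seen = latest.get(rec[key])
        if seen is None or rec[version] > seen[version]:
            latest[rec[key]] = rec
    return list(latest.values())
```

The hard, ongoing work is in the vendor-specific parts this sketch elides: authentication, rate limits, schema drift, and backfills — which is exactly why these pipelines need a data engineer to maintain them.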

Supporting data team resources with design and performance optimization for SQL transformations

One of the shifts we’ve seen in data engineering in the past five years is the rise of ELT: the new flavor of ETL that transforms the data after it’s been loaded into the warehouse instead of before. This shift has a tremendous impact on who builds these pipelines.

This shift to ELT means that data engineers don’t have to build most data transformation jobs. It also means that data teams without any data engineers can still get a long way with data transformation tools built for analysts. Data engineers still have a meaningful role to play in building these transformation pipelines, however. There are two key areas where data engineers should get involved:

  1. When performance is critical. Sometimes business logic requires some particularly heavyweight transformation, and it’s helpful to have a data engineer involved to assess the performance implications of a particular approach to building a table. Many analysts aren’t deeply experienced with performance optimization within MPP analytic databases, and this is a great opportunity for collaboration with someone more technical.
  2. When code gets complicated. Analysts are great at answering business questions with data but frequently aren’t trained to write extensible code. It’s very easy to start building tables in your warehouse and have the entire project get out of hand quickly. Get a data engineer involved in thinking through the overall architecture of your warehouse and doing design reviews on particularly pernicious transformations, or you’ll find yourself with a spaghetti bowl to clean up.

Building non-SQL transformation pipelines

While SQL can natively accomplish most data transformation needs, it can’t handle everything. One common need is to do geo enrichment by taking a lat/long and assigning a particular region. At the moment, this is not widely supported on modern MPP analytic databases (although this is starting to change!), so the best answer is often to write a Python-based pipeline that augments the data in your warehouse with region information.
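As an illustration of that kind of Python-based enrichment, here’s a deliberately simplified sketch that buckets coordinates into regions using hypothetical bounding boxes. A production pipeline would use real polygon data and a geospatial library such as Shapely or GeoPandas rather than rectangles:

```python
# Hypothetical rectangular regions: (min_lat, max_lat, min_lon, max_lon).
REGIONS = {
    "us_west": (32.0, 49.0, -125.0, -114.0),
    "us_east": (25.0, 47.5, -82.0, -67.0),
}

def assign_region(lat, lon):
    """Map a coordinate to the first region whose bounding box contains it."""
    for name, (lat0, lat1, lon0, lon1) in REGIONS.items():
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return name
    return "other"

def enrich(rows):
    """Augment warehouse rows with a derived region column."""
    return [
        {**row, "region": assign_region(row["lat"], row["lon"])}
        for row in rows
    ]
```

In practice, this would run as a scheduled job that reads rows from the warehouse, applies `enrich`, and writes the result back as a new table for SQL models to build on.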

The other obvious use case for Python (or other non-SQL languages) is for algorithm training. If you have a product recommender, demand forecast model, or churn prediction algorithm that takes data from your warehouse and outputs a series of weights, you’ll want to run that as a node at the end of your SQL-based DAG.

Most companies that are running either of these types of non-SQL workloads today are using Airflow to orchestrate the entire DAG. dbt is used for the SQL-based portion of the DAG and then non-SQL nodes are added on at the end. This approach gives a best-of-both-worlds outcome where data analysts can still be primarily responsible for the SQL-based transformations while data engineers can be responsible for production-grade ML code.
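To make the shape of that DAG concrete, here’s the pattern sketched in plain Python rather than Airflow’s API (a real deployment would use Airflow operators, e.g. one invoking `dbt run`; the task names below are illustrative):

```python
def run_dag(tasks, deps):
    """Run tasks in dependency order: each task runs after everything it depends on."""
    done, order = set(), []

    def visit(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            visit(upstream)
        tasks[name]()          # execute the task body
        done.add(name)
        order.append(name)

    for name in tasks:
        visit(name)
    return order

log = []
tasks = {
    "ingest": lambda: log.append("ingest"),       # off-the-shelf or custom EL
    "dbt_run": lambda: log.append("dbt_run"),     # SQL transformations, owned by analysts
    "train_model": lambda: log.append("train"),   # non-SQL node at the end of the DAG
}
deps = {"dbt_run": ["ingest"], "train_model": ["dbt_run"]}
```

The division of labor falls out of the structure: analysts own the models inside the `dbt_run` node, while data engineers own the orchestrator and the non-SQL nodes around it.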

Data engineer vs. Machine learning engineer

Data engineers build and maintain the infrastructure that allows organizations to collect, process, and store large-scale data efficiently. For example, a data engineer might develop a pipeline that pulls customer interactions from a website, cleans the data, and loads it into a warehouse like Snowflake for analysis.

ML engineers use this structured data to develop, train, and deploy machine learning models. Their focus is on turning raw data into predictive insights, such as building a recommendation system that suggests products based on customer behavior.

The skill sets for these roles differ. Data engineers work with ETL/ELT workflows, data pipelines, and infrastructure tools like Apache Spark, Airflow, and dbt. ML engineers specialize in model training, feature engineering, and deployment using TensorFlow, PyTorch, and MLflow.

When does my team need a data engineer?

This change in role also informs a rethinking of the sequencing of data engineer hires. The previously accepted wisdom was that you needed data engineers first because data analysts and scientists had nothing to work with if there wasn’t a data platform in place. Today, data analysts and scientists should self-serve and build the first version of their data stack using off-the-shelf tools. Hire data engineers as you start hitting scale points:

  • Scale point #1: consider hiring your first data engineer when you have 3 data analysts/scientists on your team.
  • Scale point #2: consider hiring your first data engineer when you have 50 active users of your BI platform.
  • Scale point #3: consider hiring your first data engineer when the biggest table in your warehouse hits 1 billion rows.
  • Scale point #4: consider hiring your first data engineer when you know you’ll need to build 3 or more custom data ingestion pipelines over the next few quarters and they’re all mission-critical.

The key thing to realize is that data engineers don’t provide direct business value—their value comes in making your data analysts and scientists more productive. Your data analysts and scientists are the ones working with stakeholders, measuring KPIs, and building reports and models—they’re the ones helping your business make better decisions every day.

Hire data engineers to act as a multiplier to the broader team: if adding a data engineer will make your four data analysts 33% more effective, that’s probably a good decision. Data engineers deliver business value by making your data analysts and scientists more productive.

Whom should you hire?

As the role of the data engineer changes, so too does the profile of the ideal candidate. My esteemed colleague Michael Kaminsky put it better than I ever could in an email we exchanged on this topic, so I’ll quote him here:

“The way I think about this shift is a change in data engineering’s role on the team. It’s gone from a builder-of-infrastructure to a supporting-the-broader-data-team role. That’s actually a pretty huge shift, and one that some data engineers (who want to focus on building infrastructure) aren’t always excited about. I actually think this is important for startups to appreciate: they need to hire a data engineer who is excited about building tools for the analytics / DS team. If you hire a data engineer who just wants to muck around in the backend and hates working with less technical folks, you’re going to have a bad time. I look for data engineers who are excited to partner with analysts and data scientists and have the eye to say “What you’re doing seems really inefficient, and I want to build something to make it better.””

I could not agree more with this sentiment. The best data engineers at startups today are support players that are involved in almost everything the data team does. They should be excited about that collaborative role and motivated to make the entire team successful.

How to write job descriptions for data engineers

Hiring, like sales and marketing, is all about the funnel; you are selling candidates on the opportunity of joining your team.

Very strong job descriptions are a crucial first step. I recommend job descriptions have five parts:

  • Background on the role;
  • Requirements;
  • Responsibilities;
  • Hiring process; and
  • A 30/60/90 day plan (How You’ll Ramp).

Many job descriptions don’t have these things, but candidates really appreciate them.

For data roles, it’s a job seeker’s market. Investing in thorough job descriptions will help you stand out from the crowd and help ensure a strong candidate pipeline.

Overview and background

Give the candidate an opportunity to understand the company, your goals for this role, and how they will fit into the team.

What does the team already have and what is the need that you are filling? Is this someone who’s going to focus on a specific domain or subject area? Let them know up front.

This is also a great time to paint a picture of your stack and sell the candidate on your business.

Requirements

What are the hard requirements for your role? (Hint: a college degree shouldn’t be one.) Do you need someone with experience with certain technologies or frameworks? Your requirements list should be as specific as needed but should not be a laundry list.

For example, if you use Airflow, Airflow experience doesn’t need to be a requirement; you might decide instead that orchestrator experience does, so a candidate who has used Luigi, Prefect, or Dagster is also one you’d consider. If that’s the case, call out “Experience with data orchestration tools” instead of “Experience with Airflow.”

Try to keep your list of requirements to 5 to 10 bullet points. A few firm requirements are better than a long list of interchangeable ones. If you have additional “Nice to Haves,” make that a separate list. Women are less likely to apply for jobs if they feel they don’t meet all of the requirements. Help ensure a strong, diverse pipeline by keeping your list of requirements to only requirements.

Responsibilities

What are the things that a candidate will actually do if they move into this role? Try to be as specific as possible. This is your opportunity to paint a picture for a candidate.

I always mention in interviews that we are looking for “floor sweepers” — people who are not afraid to pick up a broom and sweep a pile of dust on the floor if it’s in front of them, even though it’s not in their job description.

A list of responsibilities won’t capture everything the person will do, but it is your opportunity to show the candidate how exciting the role will be.

Hiring process

You should tell candidates upfront exactly what the steps in the interview process are. Let me emphasize: You should tell candidates upfront exactly what the steps in the interview process are. Nobody likes to be in the dark.

Tell candidates exactly how many calls they need to do, how long they will be, and who they will be with.

Is there a technical assessment? Include that information too. If you cannot write this before posting a role, you have not spent enough time thinking through your hiring process.

How you'll ramp (30/60/90)

Starting a new job is nerve-wracking. Laying out a 90-day plan on how candidates will ramp into a role helps establish standards for performance. It affirms to the candidate that you have thought about what success looks like and helps set their expectations.

As in data projects, the time to set clear measures of success is before you invest time and energy, not after. Hiring is no joke and is not a small amount of effort, but investing in the process up front is something that will pay long-term dividends.

Data engineer vs. ML engineer job descriptions

A data engineer primarily focuses on designing, building, and maintaining data pipelines that collect, transform, and store data efficiently for use across an organization. In contrast, an ML engineer specializes in developing, deploying, and optimizing machine learning models, often relying on the data infrastructure established by data engineers to ensure model performance and scalability.

The key distinction lies in their objectives: data engineers are responsible for enabling clean, reliable data access, while ML engineers apply that data to build and operationalize machine learning solutions. Additionally, data engineers emphasize skills in tools like Spark, Kafka, and SQL, whereas ML engineers prioritize expertise in algorithms, model training, and frameworks such as TensorFlow or PyTorch.

Example of data engineer job description

Job Overview

Included in all roles. This is specific to your company, your team, and your needs. This is your opportunity to sell yourself to the candidate.

Requirements

  • Experience creating production-grade ELT pipelines in Python
  • Hands-on experience with data orchestrators (we use Airflow)
  • Excellent written communicator who will enable async work

Responsibilities

  • Evolve our CI/CD strategy on the data team’s code base
  • Guide and implement architectural improvements to our data infrastructure
  • Maintain our Airflow infrastructure and ensure efficiency in our orchestration processes

Hiring Process

  1. Hiring Manager Resume Review
  2. Recruiter Screen (1 hr)
  3. Hiring Manager Interview (1 hr)
  4. Technical Assessment (done on own time, asked to limit to 4 hours) + 1 hr Technical Review with peer, scheduled upon submission
  5. Peer Interview (1 hr)
  6. Executive Interview (30 mins to 1 hr)

How you’ll ramp

  • 30 days: Be in the on-call rotation with named support
  • 60 days: Be contributing to internal conversations on data organization and structure
  • 90 days: Rolled out your first pipeline to support your team members with a new data source

Example of ML engineer job description

Job Overview

Included in all roles. This is specific to your company, your team, and your needs. This is your opportunity to sell yourself to the candidate.

Requirements

  • Experience with supervised and unsupervised ML and deep learning frameworks like Scikit-learn, TensorFlow, Keras, PyTorch
  • Strong understanding of data and analytics, including experience with Big Data, real-time, and batch data processing is preferred.

Responsibilities

  • End-to-end ownership of the engineering cycle, including deployment, for ML-based initiatives
  • Work directly with development teams to guide how features get built from ideation to production

Hiring Process

  1. Hiring Manager Resume Review
  2. Recruiter Screen (1 hr)
  3. Hiring Manager Interview (1 hr)
  4. Technical Assessment (done on own time, asked to limit to 6 hours) + 1 hr Technical Review with peer, scheduled upon submission
  5. Peer Interview (1 hr)
  6. Executive Interview (30 mins to 1 hr)

How you’ll ramp

  • 30 days: Comfortable deploying data that can be used by developers for feature engineering
  • 60 days: Helping guide workflows that move ML models from idea to production
  • 90 days: Have a model you worked on be driving a feature that’s now in production

How dbt can help

Hiring data engineers is a critical step in building a scalable, high-performing data team, and signing up for dbt Cloud is a great first step to learning why they're so valuable.

Experience how dbt Cloud empowers data engineers with a collaborative environment for developing, testing, and deploying analytics code.

Get a free 14-day trial for your team today.

Last modified on: Feb 11, 2025
