Add snapshots to your DAG
Related documentation
What are snapshots?
Analysts often need to "look back in time" at previous data states in their mutable tables. While some source data systems are built in a way that makes accessing historical data possible, this is not always the case. dbt provides a mechanism, snapshots, which records changes to a mutable tableIn simplest terms, a table is the direct storage of data in rows and columns. Think excel sheet with raw values in each of the cells. over time.
Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables. These Slowly Changing Dimensions (or SCDs) identify how a row in a table changes over time. Imagine you have an orders
table where the status
field can be overwritten as the order is processed.
id | status | updated_at |
---|---|---|
1 | pending | 2019-01-01 |
Now, imagine that the order goes from "pending" to "shipped". That same record will now look like:
id | status | updated_at |
---|---|---|
1 | shipped | 2019-01-02 |
This order is now in the "shipped" state, but we've lost the information about when the order was last in the "pending" state. This makes it difficult (or impossible) to analyze how long it took for an order to ship. dbt can "snapshot" these changes to help you understand how values in a row change over time. Here's an example of a snapshot table for the previous example:
id | status | updated_at | dbt_valid_from | dbt_valid_to |
---|---|---|---|---|
1 | pending | 2019-01-01 | 2019-01-01 | 2019-01-02 |
1 | shipped | 2019-01-02 | 2019-01-02 | null |
In dbt, snapshots are select
statements, defined within a snapshot block in a .sql
file (typically in your snapshots
directory). You'll also need to configure your snapshot to tell dbt how to detect record changes.
It is not possible to "preview data" or "compile sql" for snapshots in dbt Cloud. Instead, run the dbt snapshot
command in the IDE by completing the following steps.
When you run the dbt snapshot
command:
- On the first run: dbt will create the initial snapshot table — this will be the result set of your
select
statement, with additional columns includingdbt_valid_from
anddbt_valid_to
. All records will have adbt_valid_to = null
. - On subsequent runs: dbt will check which records have changed or if any new records have been created:
- The
dbt_valid_to
column will be updated for any existing records that have changed - The updated record and any new records will be inserted into the snapshot table. These records will now have
dbt_valid_to = null
- The
Snapshots can be referenced in downstream models the same way as referencing models — by using the ref function.
Example
To add a snapshot to your project:
- Create a file in your
snapshots
directory with a.sql
file extension, e.g.snapshots/orders.sql
- Use a
snapshot
block to define the start and end of a snapshot:
{% snapshot orders_snapshot %}
{% endsnapshot %}
- Write a
select
statement within the snapshot block (tips for writing a good snapshot query are below). This select statement defines the results that you want to snapshot over time. You can usesources
andrefs
here.
{% snapshot orders_snapshot %}
select * from {{ source('jaffle_shop', 'orders') }}
{% endsnapshot %}
-
Check whether the result set of your query includes a reliable timestamp column that indicates when a record was last updated. For our example, the
updated_at
column reliably indicates record changes, so we can use thetimestamp
strategy. If your query result set does not have a reliable timestamp, you'll need to instead use thecheck
strategy — more details on this below. -
Add configurations to your snapshot using a
config
block (more details below). You can also configure your snapshot from yourdbt_project.yml
file (docs).
$ dbt snapshot
Running with dbt=1.8.0
15:07:36 | Concurrency: 8 threads (target='dev')
15:07:36 |
15:07:36 | 1 of 1 START snapshot snapshots.orders_snapshot...... [RUN]
15:07:36 | 1 of 1 OK snapshot snapshots.orders_snapshot..........[SELECT 3 in 1.82s]
15:07:36 |
15:07:36 | Finished running 1 snapshots in 0.68s.
Completed successfully
Done. PASS=2 ERROR=0 SKIP=0 TOTAL=1
-
Inspect the results by selecting from the table dbt created. After the first run, you should see the results of your query, plus the snapshot meta fields as described below.
-
Run the
snapshot
command again, and inspect the results. If any records have been updated, the snapshot should reflect this. -
Select from the
snapshot
in downstream models using theref
function.
select * from {{ ref('orders_snapshot') }}
- Schedule the
snapshot
command to run regularly — snapshots are only useful if you run them frequently.
Detecting row changes
Snapshot "strategies" define how dbt knows if a row has changed. There are two strategies built-in to dbt — timestamp
and check
.
Timestamp strategy (recommended)
The timestamp
strategy uses an updated_at
field to determine if a row has changed. If the configured updated_at
column for a row is more recent than the last time the snapshot ran, then dbt will invalidate the old record and record the new one. If the timestamps are unchanged, then dbt will not take any action.
The timestamp
strategy requires the following configurations:
Config | Description | Example |
---|---|---|
updated_at | A column which represents when the source row was last updated | updated_at |
Example usage:
Check strategy
The check
strategy is useful for tables which do not have a reliable updated_at
column. This strategy works by comparing a list of columns between their current and historical values. If any of these columns have changed, then dbt will invalidate the old record and record the new one. If the column values are identical, then dbt will not take any action.
The check
strategy requires the following configurations:
Config | Description | Example |
---|---|---|
check_cols | A list of columns to check for changes, or all to check all columns | ["name", "email"] |
The check
snapshot strategy can be configured to track changes to all columns by supplying check_cols = 'all'
. It is better to explicitly enumerate the columns that you want to check. Consider using a surrogate keyA surrogate key is a unique identifier derived from the data itself. It often takes the form of a hashed value of multiple columns that will create a uniqueness constraint for each row. to condense many columns into a single column.
Example Usage
Hard deletes (opt-in)
Rows that are deleted from the source query are not invalidated by default. With the config option invalidate_hard_deletes
, dbt can track rows that no longer exist. This is done by left joining the snapshot table with the source table, and filtering the rows that are still valid at that point, but no longer can be found in the source table. dbt_valid_to
will be set to the current snapshot time.
This configuration is not a different strategy as described above, but is an additional opt-in feature. It is not enabled by default since it alters the previous behavior.
For this configuration to work with the timestamp
strategy, the configured updated_at
column must be of timestamp type. Otherwise, queries will fail due to mixing data types.
Example Usage
Configuring snapshots
Snapshot configurations
There are a number of snapshot-specific configurations:
Configuration best practices
Use the timestamp
strategy where possible
This strategy handles column additions and deletions better than the check
strategy.
Ensure your unique key is really unique
The unique key is used by dbt to match rows up, so it's extremely important to make sure this key is actually unique! If you're snapshotting a source, I'd recommend adding a uniqueness test to your source (example).
Snapshot query best practices
Snapshot source data.
Your models should then select from these snapshots, treating them like regular data sources. As much as possible, snapshot your source data in its raw form and use downstream models to clean up the data
Use the source
function in your query.
This helps when understanding data lineageData lineage provides a holistic view of how data moves through an organization, where it’s transformed and consumed. in your project.
Include as many columns as possible.
In fact, go for select *
if performance permits! Even if a column doesn't feel useful at the moment, it might be better to snapshot it in case it becomes useful – after all, you won't be able to recreate the column later.
Avoid joins in your snapshot query.
Joins can make it difficult to build a reliable updated_at
timestamp. Instead, snapshot the two tables separately, and join them in downstream models.
Limit the amount of transformation in your query.
If you apply business logic in a snapshot query, and this logic changes in the future, it can be impossible (or, at least, very difficult) to apply the change in logic to your snapshots.
Basically – keep your query as simple as possible! Some reasonable exceptions to these recommendations include:
- Selecting specific columns if the table is wide.
- Doing light transformation to get data into a reasonable shape, for example, unpacking a JSONJSON (JavaScript Object Notation) is a minimal format for semi-structured data used to capture relationships between fields and values. blob to flatten your source data into columns.
Snapshot meta-fields
Snapshot tablesIn simplest terms, a table is the direct storage of data in rows and columns. Think excel sheet with raw values in each of the cells. will be created as a clone of your source dataset, plus some additional meta-fields*.
Starting in 1.9 or with dbt Cloud Versionless, these column names can be customized to your team or organizational conventions via the snapshot_meta_column_names
config.
Field | Meaning | Usage |
---|---|---|
dbt_valid_from | The timestamp when this snapshot row was first inserted | This column can be used to order the different "versions" of a record. |
dbt_valid_to | The timestamp when this row became invalidated. | The most recent snapshot record will have dbt_valid_to set to null . |
dbt_scd_id | A unique key generated for each snapshotted record. | This is used internally by dbt |
dbt_updated_at | The updated_at timestamp of the source record when this snapshot row was inserted. | This is used internally by dbt |
*The timestamps used for each column are subtly different depending on the strategy you use:
For the timestamp
strategy, the configured updated_at
column is used to populate the dbt_valid_from
, dbt_valid_to
and dbt_updated_at
columns.
Details for the timestamp strategy
For the check
strategy, the current timestamp is used to populate each column. If configured, the check
strategy uses the updated_at
column instead, as with the timestamp strategy.