Contract Language Reference
A Soda data contract is a YAML document that contains data quality checks. The checks in a Soda data contract are similar to unit tests for testing software. Verifying a contract means evaluating the checks against the actual data. If contract verification fails, notifications can be sent out or the new data can be stopped from being released to users.
This page documents the supported check types, filters, and configuration options available in Soda's Data Contract Language.
Contract YAML document structure
Example
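Below is an illustrative, minimal contract. The data source (`postgres_ds`), database (`soda`), schema (`public`), and column names are placeholders, and exact key spellings may vary slightly between Soda versions:

```yaml
dataset: postgres_ds/soda/public/customers

columns:
  - name: id
    data_type: varchar
    checks:
      - missing:
  - name: email

checks:
  - schema:
  - row_count:
      threshold:
        must_be_greater_than: 0
```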
Dataset fully qualified name
Every contract must declare the fully qualified name of the dataset. Casing must match the casing in the data source.
A dataset fully qualified name is composed of 3 parts, all slash-separated:
The data source name: This is the name of the data source in which the dataset exists. The data source name is configured in a data source YAML configuration file or in Soda Cloud.
A list of prefixes: Prefixes represent the hierarchical elements of the data source, such as the database name and the schema name. Postgres has a database name and a schema name as prefixes; Databricks has a catalog and a schema. Prefix names are also slash-separated.
The dataset name: The name of the table or view.
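For example, for a hypothetical Postgres data source named `postgres_ds` with database `soda` and schema `public` as prefixes:

```yaml
dataset: postgres_ds/soda/public/customers
```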
Columns
A contract includes an optional list of columns. If a schema check is configured, the columns can be used to validate the presence, data type, and order of columns.
Each column entry includes:
- `name` (required): the column name
- `data_type` (optional): used by the schema check
- `character_maximum_length` (optional): used by the schema check
- `checks` (optional): a list of checks that apply only to that column
Example:
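A sketch of a `columns` section with placeholder column names:

```yaml
columns:
  - name: id
    data_type: varchar
    checks:
      - missing:
  - name: country_code
    data_type: varchar
    character_maximum_length: 2
```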
Checks
The `checks` section defines the quality rules and expectations for your dataset. Checks can be written at two levels:
Dataset-level: Rules that apply to the dataset as a whole (e.g., row count, schema, freshness, duplicates across multiple columns).
Column-level: Rules that apply to a specific column (e.g., missing, invalid, duplicate on a single column).
Example of a dataset-level `checks` block:
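(An illustrative sketch; column names are placeholders.)

```yaml
checks:
  - schema:
  - row_count:
      threshold:
        must_be_greater_than: 0
  - duplicate:
      columns: ['country', 'zip']
```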
Example of column-level checks:
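(Again a sketch with placeholder names.)

```yaml
columns:
  - name: email
    checks:
      - missing:
      - duplicate:
```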
Each check defines what “good data” looks like, ensuring that your dataset meets the expected standards.
You can customize each check by adding filters, thresholds, or variable references depending on your specific use case. Learn more about the different check types on this page.
Schema check
A Schema check verifies that the structure of a dataset matches the expectations defined in the contract—such as required columns, data types, and optional constraints. It ensures that changes to the dataset schema (like missing columns, unexpected data types, or reordered fields) are detected early, helping maintain stability and trust in your data.
Example:
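An illustrative schema check; the columns are placeholders, and data types must match what your data source's metadata reports:

```yaml
columns:
  - name: id
    data_type: varchar
  - name: created_at
    data_type: timestamp

checks:
  - schema:
      allow_extra_columns: true
```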
Column keys used by the schema check:
The schema check takes the columns specified in the contract as the expected columns and compares them with the columns measured in the dataset.
| Key | Description | Optional |
| --- | --- | --- |
| `name` | The name of the column. The name is case-sensitive. | No |
| `data_type` | The data type of the column as returned by the data source metadata. Data types are compared case-insensitively. | Yes |
| `character_maximum_length` | The maximum character length, as in data types like `VARCHAR(255)`. | Yes |
Check configuration keys:
| Key | Description | Optional |
| --- | --- | --- |
| `allow_extra_columns` | Specifies whether columns other than the ones listed in the contract are allowed in the dataset. Values are `true` or `false`. Default is `false`, which means that all columns in the dataset must be specified in the contract. | Yes |
| `allow_other_column_order` | Specifies whether column ordering must be exactly the same as in the contract YAML document. Values are `true` or `false`. Default is `false`, which means that columns must appear in the order given in the contract. | Yes |
| `name` | A human-readable check name. See Check names. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Row count check
A Row Count check verifies that the dataset contains the expected number of rows. It ensures that the data source has at least a minimum number of records and no unexpected gaps. This is a fundamental check to confirm the presence and completeness of data in a dataset.
Example:
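A sketch of a row count check requiring more than 100 rows:

```yaml
checks:
  - row_count:
      threshold:
        must_be_greater_than: 100
```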
Check configuration keys
| Key | Description | Optional |
| --- | --- | --- |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | Yes |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Freshness check
A Freshness check verifies that data is not too old by measuring the period between the current time and the most recent timestamp present in a given column.
**Example:**
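A sketch requiring the newest `created_at` value to be at most 6 hours old (the column name is a placeholder):

```yaml
checks:
  - freshness:
      column: created_at
      unit: hour
      threshold:
        must_be_less_than: 6
```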
Check configuration keys
| Key | Description | Optional |
| --- | --- | --- |
| `column` | Specifies a timestamp column that represents the time of the row. | No |
| `unit` | Specifies the unit of measure for the time period between the current time and the most recent timestamp in the column. Values are `hour`, `minute` and `day`. The default threshold is 1 hour. | Yes |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | Yes |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Missing check
A Missing check verifies that a specific column does not contain null or empty values beyond an acceptable threshold. It ensures that critical fields are populated and helps catch incomplete or corrupted data that could impact downstream processes.
Example:
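A sketch that fails when the email column has 5 or more missing values (mirroring the threshold example under Thresholds below):

```yaml
columns:
  - name: email
    checks:
      - missing:
          threshold:
            must_be_less_than: 5
```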
Check configuration keys
| Key | Description | Optional |
| --- | --- | --- |
| `missing_values` | The list of values that are considered missing. NULL is always considered a missing value and doesn't need to be included. | Yes |
| `missing_format` | A SQL regex that matches missing values. (Advanced) | Yes |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | Yes |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Configure extra missing values in a list
Configure a list of values that, apart from NULL, must also be considered missing values. Typical examples are `'-'`, `'No value'`, `'N/A'`, `'None'`, `'null'`, `-1`, and `999`.
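A sketch (the column name is a placeholder):

```yaml
columns:
  - name: status
    checks:
      - missing:
          missing_values: ['-', 'No value', 'N/A']
```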
The `missing_values` configuration is only supported on string and numeric data types.
Configure extra missing values with a regular expression
Configure a SQL regular expression that, apart from NULL, also matches values that are considered missing. This is only supported for columns with a text data type.
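A sketch, assuming `missing_format` takes the regex as its value and the check-level `name` describes the pattern:

```yaml
columns:
  - name: status
    checks:
      - missing:
          name: Dashes count as missing
          missing_format: '^-+$'
```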
The regex syntax must match the SQL engine of the data source. The `name` is a human-readable description of what the regex does. It helps explain the purpose of the pattern, making it easier for others to understand the intent of the check.
Invalid check
An Invalid check verifies that the values in a column conform to a set of allowed formats, values, or constraints. It helps catch data that does not meet the expected standards, such as incorrect formats, out-of-range values, or entries that violate business rules.
Example:
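A sketch that restricts a column to a fixed list of values (placeholder names):

```yaml
columns:
  - name: country_code
    checks:
      - invalid:
          valid_values: ['BE', 'NL', 'FR']
```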
Check configuration keys
| Key | Description | Optional |
| --- | --- | --- |
| `valid_values` | A list of valid values | No* |
| `valid_format` | A SQL regular expression that matches valid values | No* |
| `valid_reference_data` | References a dataset and column that contain the valid values | No* |
| `valid_min` | Valid values must be greater than or equal to the configured value | No* |
| `valid_max` | Valid values must be less than or equal to the configured value | No* |
| `valid_length` | The fixed length of valid values | No* |
| `valid_min_length` | The minimum length of valid values | No* |
| `valid_max_length` | The maximum length of valid values | No* |
| `invalid_values` | A list of invalid values | No* |
| `invalid_format` | A SQL regular expression that matches invalid values | No* |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | Yes |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
There are several configuration keys to configure the invalid values. (*) At least one validity configuration is required; multiple validity configurations can be combined.
Example with a regular expression
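A sketch, assuming `valid_format` takes the regex as its value and the check-level `name` describes it:

```yaml
columns:
  - name: email
    checks:
      - invalid:
          name: Email format
          valid_format: '^[^@]+@[^@]+\.[^@]+$'
```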
`valid_format` is only supported for columns with a string data type.
The regex syntax must match the SQL engine of the data source. The `name` is a human-readable description of what the regex does. It helps explain the purpose of the pattern, making it easier for others to understand the intent of the check.
Example with a reference dataset:
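A sketch validating values against a reference dataset in the same data source (all names are placeholders):

```yaml
columns:
  - name: country_code
    checks:
      - invalid:
          valid_reference_data:
            dataset: postgres_ds/soda/public/countries
            column: iso2_code
```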
`valid_reference_data` has 2 nested configurations:
- `dataset`: the fully qualified dataset name. Limitation: the reference dataset must be in the same data source.
- `column`: the name of the column containing the valid values
Example minimum and maximum values:
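A sketch constraining a numeric column to a range (placeholder names):

```yaml
columns:
  - name: age
    checks:
      - invalid:
          valid_min: 0
          valid_max: 120
```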
`valid_min` and `valid_max` can be used independently. They are supported on columns with numerical data types.
Example with minimum and maximum value lengths:
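A sketch constraining the length of string values (placeholder names):

```yaml
columns:
  - name: zip
    checks:
      - invalid:
          valid_min_length: 4
          valid_max_length: 10
```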
`valid_length`, `valid_min_length` and `valid_max_length` can be used independently.
Duplicate check
A Duplicate check ensures that values in a column or combination of columns are unique, helping prevent data integrity issues. It can be used at the column level or dataset level.
Single column duplicate check
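A sketch of a single-column duplicate check; by default, all values must be unique:

```yaml
columns:
  - name: id
    checks:
      - duplicate:
```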
Configuration keys:
| Key | Description | Optional |
| --- | --- | --- |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | Yes |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Multi-column duplicate check
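A sketch of a duplicate check over a column combination (placeholder column names):

```yaml
checks:
  - duplicate:
      columns: ['country', 'zip']
```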
Configuration keys:
| Key | Description | Optional |
| --- | --- | --- |
| `columns` | List of column names | No |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. The default duplicate check threshold requires that all values are unique. | Yes |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Aggregate check
An Aggregate check verifies that the result of an aggregate function, such as `avg`, `sum`, `min`, `max`, or `count`, meets the expected thresholds. It helps ensure that summary statistics over a column remain within acceptable ranges and that key metrics derived from your data stay accurate and consistent.
Example:
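A sketch keeping the average of a numeric column within a closed range (placeholder names):

```yaml
columns:
  - name: order_total
    checks:
      - aggregate:
          function: avg
          threshold:
            must_be_between:
              greater_than_or_equal: 10
              less_than: 500
```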
Check configuration keys
| Key | Description | Optional |
| --- | --- | --- |
| `function` | Supported by all data sources: `avg`, `sum`, `min`, `max`, `count`, `stddev`, `variance` | No |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | No |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Metric check
A Metric check validates the result of a custom SQL expression or query against a threshold. It supports complex business logic, calculated metrics, or cross-column relationships. Metric checks can be defined at the dataset level (across multiple columns) or column level (single column).
For better performance, we recommend SQL expressions over full SQL queries—they’re simpler to write and can be combined with other checks in a single query.
Verify a SQL expression value against a threshold
Example at the dataset level:
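A sketch comparing two column aggregates (the expression and column names are placeholders):

```yaml
checks:
  - metric:
      expression: SUM(amount_paid) - SUM(amount_invoiced)
      threshold:
        must_be: 0
```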
Example at the column level:
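A column-level sketch:

```yaml
columns:
  - name: amount
    checks:
      - metric:
          expression: AVG(amount)
          threshold:
            must_be_greater_than: 0
```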
Check configuration keys
| Key | Description | Optional |
| --- | --- | --- |
| `expression` | A numeric SQL expression that produces the numeric metric value that is compared with the threshold. Either `expression` or `query` must be specified. | No |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | No |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Verify a SQL query value against a threshold
Example at the dataset level:
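A sketch using a full query with a join (table names are placeholders):

```yaml
checks:
  - metric:
      query: |
        SELECT COUNT(*)
        FROM soda.public.orders o
        LEFT JOIN soda.public.customers c ON o.customer_id = c.id
        WHERE c.id IS NULL
      threshold:
        must_be: 0
```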
Example at the column level:
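A column-level sketch:

```yaml
columns:
  - name: amount
    checks:
      - metric:
          query: |
            SELECT MAX(amount) FROM soda.public.orders
          threshold:
            must_be_less_than: 1000000
```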
Check configuration keys

| Key | Description | Optional |
| --- | --- | --- |
| `query` | A SQL query that produces a single numeric metric value that is compared with the threshold. Either `expression` or `query` must be specified. | No |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | No |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Failed rows check
A Failed Rows check identifies rows that violate a specific condition, such as a filter or SQL expression. It helps pinpoint problematic data like outliers or rule violations and can save references for further review.
Failed Rows checks can apply at the dataset level (testing conditions across multiple columns) or at the column level (validating individual column values).
We recommend using SQL expressions over full SQL queries for simplicity and efficiency—they allow combining multiple checks into a single query for better performance.
Verify failed rows with a SQL expression
Example at the dataset level:
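A sketch, assuming rows for which the expression evaluates to true are reported as failed (column names are placeholders):

```yaml
checks:
  - failed_rows:
      expression: amount_paid > amount_invoiced
```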
Example at the column level:
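A column-level sketch under the same assumption:

```yaml
columns:
  - name: discount_pct
    checks:
      - failed_rows:
          expression: discount_pct < 0 OR discount_pct > 100
```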
Check configuration keys
| Key | Description | Optional |
| --- | --- | --- |
| `expression` | A SQL expression that defines the failed-row condition; rows matching the expression are reported as failed. Either `expression` or `query` must be specified. | No |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | Yes |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Verify and visualize failed rows with a SQL query
**Example at the dataset level:**
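A sketch returning the offending rows via a join (table names are placeholders):

```yaml
checks:
  - failed_rows:
      query: |
        SELECT o.*
        FROM soda.public.orders o
        LEFT JOIN soda.public.customers c ON o.customer_id = c.id
        WHERE c.id IS NULL
```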
**Example at the column level:**
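A column-level sketch:

```yaml
columns:
  - name: email
    checks:
      - failed_rows:
          query: |
            SELECT id, email
            FROM soda.public.customers
            WHERE email NOT LIKE '%@%'
```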
Check configuration keys
| Key | Description | Optional |
| --- | --- | --- |
| `query` | A SQL query that produces any tabular data indicating failures in the data. A query returning zero rows indicates there is no problem with the data. If rows are returned, they can be visualized in Soda Cloud to help diagnose the issue. Joins and Common Table Expressions (CTEs) are allowed. Either `expression` or `query` must be specified. | No |
| `name` | A human-readable check name. See Check names. | Yes |
| `threshold` | The pass/fail threshold. See Thresholds. | Yes |
| `filter` | A SQL condition limiting the rows the check applies to. See Configure a check filter. | Yes |
| `qualifier` | Distinguishes multiple checks of the same type. See Check qualifiers. | Yes |
Common check configurations
Configure a check filter
Most checks support a `filter` to limit the rows on which the check applies:
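A sketch, assuming the filter is a SQL condition (placeholder names):

```yaml
columns:
  - name: email
    checks:
      - missing:
          filter: country = 'BE'
```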
Note: If a dataset filter is configured, both the dataset and the check filter will be applied.
Thresholds
Every check (except the schema check) produces a metric value that is evaluated into a check outcome. The threshold is the part of the check that specifies which metric values make the check pass or fail.
An example threshold is: "The missing count metric value must be less than 5."
Use one of the following threshold configuration keys:
| Key | Description |
| --- | --- |
| `must_be` | The metric value must be equal to the configured value |
| `must_not_be` | The metric value must be different from the configured value |
| `must_be_greater_than` | The metric value must be greater than the configured value |
| `must_be_greater_than_or_equal` | The metric value must be greater than or equal to the configured value |
| `must_be_less_than` | The metric value must be less than the configured value |
| `must_be_less_than_or_equal` | The metric value must be less than or equal to the configured value |
| `must_be_between` | The metric value must be between two boundaries (closed range, see below) |
| `must_be_not_between` | The metric value must be outside two boundaries (open range, see below) |
Use a closed range to ensure metric values are between 2 boundaries.
Example:
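A sketch requiring a row count from 100 up to and including 200:

```yaml
checks:
  - row_count:
      threshold:
        must_be_between:
          greater_than_or_equal: 100
          less_than_or_equal: 200
```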
- Use `greater_than` or `greater_than_or_equal` to specify the lower bound
- Use `less_than` or `less_than_or_equal` to specify the upper bound
Use an open range to ensure metric values are outside 2 boundaries.
**Example:**
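A sketch requiring a row count below 100 or above 200:

```yaml
checks:
  - row_count:
      threshold:
        must_be_not_between:
          less_than: 100
          greater_than: 200
```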
- Use `less_than` or `less_than_or_equal` to specify the lower bound
- Use `greater_than` or `greater_than_or_equal` to specify the upper bound
Set a threshold as a percentage
To specify a threshold as a percentage of the total, add `metric: percent` to the threshold.
The `metric` property has 2 possible values: `count` (the default) and `percent`.
Percentage-based thresholds are only available on checks where the metric value can be expressed as a percentage of a total value. These are missing, invalid, duplicate, and failed_rows (expression) checks.
Note that the total value refers to the total number of rows checked. In case of a dataset filter or a check filter, the total refers to the number of rows passing the filters.
Example:
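A sketch failing when 1% or more of the rows have a missing email:

```yaml
columns:
  - name: email
    checks:
      - missing:
          threshold:
            metric: percent
            must_be_less_than: 1
```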
Check names
The `name` property provides a human-readable description for a check. It helps explain the purpose of the check in plain language, making it easier for users to understand the intent behind each rule, especially when reviewing results in Soda Cloud. For example, instead of relying on technical details like a regex pattern or threshold, a well-written `name` offers a simple summary, such as:
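(A sketch; the regex shown is illustrative.)

```yaml
columns:
  - name: email
    checks:
      - invalid:
          name: Email format
          valid_format: '^[^@]+@[^@]+\.[^@]+$'
```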
In this example, the `name` ("Email format") clarifies the purpose of the regex. This is particularly helpful for non-technical users reviewing the contract or verification results.
Best practices for check names:
- Write check names in plain language.
- Focus on the intent or business rule the check enforces.
- Avoid repeating technical details already visible in the check configuration.
The `name` property is optional but highly recommended for improving clarity and collaboration.
By default, the check name is the check type.
Check qualifiers
When sending results to Soda Cloud, check qualifiers ensure that the checks in the source YAML document can be correlated with the checks in Soda Cloud.
Most checks don't need a qualifier because there is only one check of a given type in a `checks` section. If you have multiple checks of the same type in a `checks` section, you need to ensure that each check has a distinct string `qualifier` value.
Keep in mind that if you change the qualifier, the historic information of this check on Soda Cloud is lost.
We want to ensure that all contracts are Soda Cloud ready so we have made it mandatory to specify qualifiers where needed.
If two checks have the same identity, contract verification fails with a `Duplicate identity` error.
Configure distinct qualifiers to create unique check identities:
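A sketch with two invalid checks on the same column, the second carrying a qualifier:

```yaml
columns:
  - name: size
    checks:
      - invalid:
          valid_min: 0
      - invalid:
          qualifier: upper
          valid_max: 100
```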
Any text or numeric value is allowed as the qualifier. A check without a qualifier is considered distinct from checks that have one, as in the example above.
Configure a filter for all checks
A dataset filter ensures that all checks in the contract are only applied to a subset of the rows in the dataset.
This is most commonly used to apply all checks to the latest time partition of the data. Each time new data is appended to an incremental dataset, the contract verification should evaluate the checks only on the rows in the latest time partition.
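A sketch, assuming a top-level `filter` key holding a SQL condition (the exact key name may differ in your Soda version):

```yaml
dataset: postgres_ds/soda/public/orders

filter: |
  created_at >= '${var.PARTITION_START}'

checks:
  - row_count:
```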
Make contracts dynamic with variables
Variables allow dynamic substitution of values in contracts. They help you:
Parameterize values that differ across environments, datasets, or schedules.
Reuse values in multiple places within the same contract to reduce duplication and improve maintainability.
Variables can be used in filters, checks, thresholds, and more.
Declare variables at the top of the contract:
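A sketch, assuming each variable maps to an optional nested `default`:

```yaml
variables:
  COUNTRY:
    default: BE
  PARTITION_START:
```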
The default value is optional. Variables without a `default` are required when verifying a contract (see below). If you don't specify a default value, don't forget the colon (`:`) after the variable name.
Use variables in other places in the contract with syntax: ${var.VARIABLE_NAME}
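For example, in a check filter (a sketch):

```yaml
checks:
  - row_count:
      filter: country = '${var.COUNTRY}'
```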
Variables are case sensitive. We recommend always using upper case and underscores for variable names.
Specify variable values when verifying a contract:
Variables can be overridden at verification time:
Setting a variable overrides the default value configured in the contract.
Soda variables
`${soda.NOW}` is a built-in variable that provides the current timestamp at the moment of scan execution, allowing you to create dynamic filters relative to the time of scan execution, for example, to check for records in the last 24 hours.
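A sketch of such a filter; the timestamp arithmetic shown is Postgres-flavored, and the filter placement is assumed:

```yaml
filter: |
  created_at > TIMESTAMP '${soda.NOW}' - INTERVAL '24 hours'
```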
Schedule a contract verification
Soda Cloud can execute the contract verification on a time schedule.
First, configure the time schedule in the contract.
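The exact schedule keys are not documented here; as a purely hypothetical sketch, a cron-based schedule might look like:

```yaml
# Hypothetical key names; check the Soda Cloud documentation for the exact spelling.
schedule:
  cron: '0 6 * * *'   # verify every day at 06:00
```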
This defines a cron-based schedule that triggers contract verification. It requires the data source to be configured in Soda Cloud with a Soda Agent.
Second, publish the contract to Soda Cloud.