

Quickstart


This quickstart shows how Soda detects unexpected data issues using AI-powered anomaly detection, and how it prevents future problems with data contracts, directly on Databricks.

Scenario

This tutorial uses a demo dataset called regional_sales.

You are a data engineer at a retail company maintaining the regional_sales dataset, which your team uses to manage regional sales data from hundreds of stores across the country. The dataset feeds executive dashboards and downstream ML models for inventory planning. Accuracy and freshness are critical, so you need both:

  • Automated anomaly detection on key metrics (row counts, freshness, schema drift)

  • Proactive enforcement of business rules via data contracts

Sign up

Sign up for free on Soda Cloud: https://beta.soda.io/create-account

After signing up, you'll be guided through the setup flow with an in-product tour.

You can complete the setup by following the tour; the same steps are also described here in case you'd like to revisit them later.

Add a Data Source

Soda Cloud’s no-code UI lets you connect to any Unity-Catalog–backed Databricks SQL Warehouse in minutes.

  1. In Soda Cloud, click Data Sources → + New Data Source.

  2. Name your data source "Databricks Demo" under Data Source Label.

  3. Switch to the Connect tab and fill in the following credentials to connect your Soda instance to Databricks:

| Field | Value |
| --- | --- |
| Data Source Type | Databricks |
| Catalog | unity_catalog |
| Host | dbc-6f631120-27ee.cloud.databricks.com |
| HTTP Path | /sql/1.0/warehouses/005e2ef93ecca4b1 |
| Token | ${secret.demo_datasource_secret_name} |

The secret is automatically created during account creation.

  4. Click Connect. This tests the connection and moves to the next step.

  5. Select the datasets you want to onboard in Soda Cloud; in this case, regional_sales.

  6. Enable Monitoring and Profiling. By default, Metric Monitoring is enabled to automatically track key metrics on every dataset you onboard and alert you when anomalies are detected, powered by built-in machine learning that compares current values against historical trends.

  7. Click Finish to onboard your datasets. Soda Cloud will now spin up its Soda-hosted Agent and perform an initial profiling and historical metric collection scan. This usually takes only a few minutes.

Part 1: Review Anomaly Detection Results

Congratulations, you’ve onboarded your first dataset! Now let’s make sure you always know what’s happening with it.

That’s where Metric Monitoring comes in. It automatically tracks key metrics like volume, freshness, and schema changes, with no manual setup required. You’ll spot anomalies, detect trends, and catch unexpected shifts before they become problems.

Step 1: Open the Metric Monitors dashboard

  1. Go to Datasets → select regional_sales.

  2. We’ve already spotted something interesting: one of your metrics shows an unusual change compared to its historical trend. To learn more, navigate to the Metric Monitors tab.

  3. You'll immediately see that key metrics are monitored automatically by default, helping you detect pipeline issues, data delays, and unexpected structural changes as they happen. No setup needed, just visibility you can trust.

Step 2: View “Partition row count” anomalies

In this guide, we focus on Partition row count. The panel shows that it was expected to fall in the range 7200 to 7210, but the recorded value at scan time was 6300. To take a closer look:

  1. Click the Partition row count block.

  2. On the monitor page you’ll see:

    • Measured value vs. expected range

    • Any red-dot anomalies flagged by the model

    • Buttons to Mark as expected, Create new incident, etc.

  3. Flag an outlier as "expected" or investigate it further.

Soda’s engine was built in-house (no third-party libraries) and optimized for high precision. It continuously adapts to your data patterns, and it incorporates your feedback to reduce false alarms. Designed to minimize false positives and missed detections, it shows a 70% improvement in detecting anomalous data quality metrics compared to Facebook Prophet across hundreds of diverse, internally curated datasets containing known data quality issues.

The Anomaly Detection Algorithm offers complete control and transparency in the modeling process to allow for interpretability and adaptations. It features high accuracy while leveraging historical data, delivering improvements over time.
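To build intuition for what the detector is doing, the comparison of a fresh measurement against its historical trend can be sketched in miniature. The following is a toy illustration only, not Soda's actual in-house model (which is adaptive and considerably more sophisticated): it derives an expected band from recent history using a rolling mean and standard deviation, then flags values that fall outside it.

```python
from statistics import mean, stdev

def expected_band(history, k=3.0):
    """Derive an expected range from recent history: mean +/- k standard deviations.
    Toy sketch only; Soda's model adapts to trends and incorporates user feedback."""
    m, s = mean(history), stdev(history)
    return m - k * s, m + k * s

def is_anomaly(value, history, k=3.0):
    """Flag a measurement that falls outside the expected band."""
    lo, hi = expected_band(history, k)
    return not (lo <= value <= hi)

# Daily partition row counts hovering around ~7200 rows:
history = [7201, 7205, 7203, 7208, 7202, 7206, 7204]
print(is_anomaly(7207, history))  # inside the band -> False
print(is_anomaly(6300, history))  # far below the band -> True, like the drop in this guide
```

With a tight historical band around 7200, the 6300 measurement from the panel above is an obvious outlier; a real model additionally accounts for seasonality and trend so that expected weekly dips are not flagged.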

Part 2: Attack the Issues at Source (No-Code)

Our automated anomaly detection has just done the heavy lifting for you, identifying unusual patterns and potential data issues without any setup required.

But let’s go further than catching issues after the fact: let’s define exactly what your data should look like, every column, every rule, every expectation.

That’s where Data Contracts come in. They let you proactively set the standards for your data, so problems like this are flagged or even prevented before they impact your business.

Step 1: Create a Data Contract

Create a new data contract to define and enforce data quality expectations.

  1. In the regional_sales Dataset Details page, go to the Checks tab.

  2. Click Create Contract.

  3. When creating a data contract, Soda connects to your dataset and builds a contract template based on the dataset schema. From there, you can add both dataset-level and column-level checks, as well as define a verification schedule or a partition.

  4. Toggle View Code if you’d like to inspect the generated SodaCL/YAML. This gives you access to the full contract code.

  5. You can copy the following full example and paste it into the editor, then toggle back to the no-code view to see and edit the checks there.

dataset: databricks_demo/unity_catalog/demo_sales_operations/regional_sales
filter: |
  order_date >= ${var.start_timestamp}
  AND order_date < ${var.end_timestamp}
variables:
  start_timestamp:
    default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP))
  end_timestamp:
    default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP)) + INTERVAL '7 days'
checks:
  - row_count:
  - schema:
columns:
  - name: order_id
    data_type: INTEGER
    checks:
      - missing:
          name: Must not have null values
  - name: customer_id
    data_type: INTEGER
    checks:
      - missing:
          name: Must not have null values
  - name: order_date
    data_type: DATE
    checks:
      - missing:
          name: Must not have null values
      - failed_rows:
          name: Cannot be in the future
          expression: order_date > DATE_TRUNC('day', CAST('${soda.NOW}' AS TIMESTAMP)) +
            INTERVAL '1 day'
          threshold:
            must_be: 0
  - name: region
    data_type: VARCHAR
    checks:
      - invalid:
          valid_values:
            - North
            - South
            - East
            - West
          name: Valid values
  - name: product_category
    data_type: VARCHAR
  - name: quantity
    data_type: INTEGER
    checks:
      - missing:
          name: Must not have null values
      - invalid:
          valid_min: 0
          name: Must be higher than 0
  - name: price
    data_type: NUMERIC
    checks:
      - invalid:
          valid_min: 0
          name: Must be higher than 0
      - missing:
          name: Must not have null values
  - name: payment_method
    data_type: VARCHAR
    checks:
      - missing:
          name: Must not have null values
      - invalid:
          threshold:
            metric: count
            must_be: 0
          filter: region <> 'north'
          valid_values:
            - PayPal
            - Bank Transfer
            - Cash
            - Credit Card
          name: Valid values in all regions except North
      - invalid:
          name: Valid values in North
          filter: region = 'north'
          valid_values:
            - PayPal
            - Bank Transfer
            - Credit Card
          qualifier: ABC124

That’s right, with Soda, you can edit a contract using either a no-code interface or directly in code. This ensures an optimal experience for all user personas while also providing a version-controlled code format that can be synced with a Git repository.
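The `filter` and `variables` at the top of the contract confine verification to the current week's partition: `start_timestamp` defaults to the start of the week containing `soda.NOW`, and `end_timestamp` to seven days later. As a sanity check, here is the same window computed in plain Python, assuming `DATE_TRUNC('week', ...)` resolves to Monday 00:00 as it does in Databricks SQL:

```python
from datetime import datetime, timedelta

def week_window(now: datetime):
    """Compute the [start, end) partition window the contract's defaults describe:
    Monday 00:00 of the week containing `now`, plus seven days."""
    start = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return start, start + timedelta(days=7)

start, end = week_window(datetime(2025, 6, 18, 15, 30))  # a Wednesday afternoon
print(start, end)  # 2025-06-16 00:00:00 2025-06-23 00:00:00
```

Because the filter is half-open (`>= start` and `< end`), rows landing exactly at the next Monday's midnight fall into the following partition rather than being counted twice.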

Step 2: Publish & verify

  1. Click Test to verify the contract executes as expected.

  2. When you are done with the contract, click Publish.

  3. Click Verify. Soda will evaluate your rules against the current data.

Step 3: Review check results

Review the outcomes of the contract checks to confirm whether the data meets expectations. You can drill into those failures in the Checks tab.
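Conceptually, each column check in the contract is a predicate evaluated per row, and a check fails when the count of violating rows breaches its threshold. A toy evaluation of two of the checks above, purely for illustration (this is not Soda's engine, and the helper names are made up):

```python
# Hypothetical illustration of what the `region` and `quantity` checks assert.
VALID_REGIONS = {"North", "South", "East", "West"}

def region_is_valid(row):
    # `invalid` check on `region`: value must be one of the declared valid_values
    return row["region"] in VALID_REGIONS

def quantity_is_valid(row):
    # `missing` + `invalid` checks on `quantity`: present and >= valid_min (0)
    return row["quantity"] is not None and row["quantity"] >= 0

rows = [
    {"region": "North", "quantity": 3},
    {"region": "Central", "quantity": 2},   # violates the region check
    {"region": "South", "quantity": None},  # violates the quantity check
]
failed = [r for r in rows if not (region_is_valid(r) and quantity_is_valid(r))]
print(len(failed))  # 2 rows violate the contract
```

In the real verification, Soda pushes these predicates down as SQL against the partition, so only the aggregated counts (and optional failed-row samples) leave your warehouse.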

Part 3: Attack the Issues at Source (Code)

You can trigger contract verification programmatically as part of your pipeline, so your data gets tested every time it runs.

We’ve prepared an example notebook that shows how it works; you'll find the link at the end of this page.

In your Python environment, first install the Soda Core library:

pip install -i https://pypi.dev.sodadata.io/simple -U soda-core

Then create a file with your API keys, which are needed to connect to Soda Cloud. You can generate them from your Profile: Generate API keys.

import os

soda_cloud_config = f"""
soda_cloud:
  host: beta.soda.io
  api_key_id: {os.getenv("api_key_id")}
  api_key_secret: {os.getenv("api_key_secret")}
"""

with open("soda-cloud.yml", "w") as f:
    f.write(soda_cloud_config)

print("✅ soda-cloud.yml written.")

Now you are ready to trigger verification of the contract. Provide the identifier of your dataset and the path to the configuration file you created in the previous step. This triggers a verification via the Soda Agent and returns the logs.

from soda_core import configure_logging
from soda_core.contracts import verify_contracts_on_agent

configure_logging(verbose=False)

res = verify_contracts_on_agent(
    dataset_identifiers=["databricks_demo/unity_catalog/demo_sales_operations/regional_sales"],
    soda_cloud_file_path="soda-cloud.yml",
)

print(res.get_logs())

You can learn more about the Python API here: Python API

You’ve completed the tutorial and are now ready to start catching data quality issues with Soda.

What’s Next?

  • Explore Profiling in the Discover tab to curate column selections for deeper analysis.

  • Set up Notification Rules (bell icon → Add Notification Rule) to push alerts to Slack, Jira, PagerDuty, etc.

  • Dive into Custom Monitors via scan.yml or the UI for even more tailored metrics.

Open the following notebook example: https://colab.research.google.com/drive/1zkV_2tLJ4ohdzmKGS3LgdFDDnTNTUXew?usp=sharing