Quickstart
This quickstart shows how Soda detects unexpected data issues using AI-powered Anomaly Detection, and prevents future problems by applying data contracts directly in Databricks.
This tutorial uses a demo dataset called regional_sales.
A data engineer at a retail company needs to maintain the regional_sales dataset so their team can manage regional sales data from hundreds of stores across the country. The regional_sales dataset feeds executive dashboards and downstream ML models for inventory planning. Accuracy and freshness are critical, so you need both:
Automated anomaly detection on key metrics (row counts, freshness, schema drift)
Proactive enforcement of business rules via data contracts
Sign up for free on Soda Cloud.
After signing up, you'll be guided through the setup flow with an in-product tour.
You can complete the setup by following the tour; the same steps are described here in case you'd like to revisit them later.
Soda Cloud’s no-code UI lets you connect to any Unity-Catalog–backed Databricks SQL Warehouse in minutes.
In Soda Cloud, click Data Sources → + New Data Source
Name your data source "Databricks Demo" under Data Source Label
Switch to the Connect tab and fill in the following credentials to connect your Soda instance to Databricks:
Data Source Type: Databricks
Catalog: unity_catalog
Host: dbc-6f631120-27ee.cloud.databricks.com
HTTP Path: /sql/1.0/warehouses/005e2ef93ecca4b1
Token: ${secret.demo_datasource_secret_name}
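If you later connect programmatically rather than through the no-code UI, the same details go into a Soda configuration file. The exact keys below are an assumption based on Soda Core's Spark/Databricks connection style; check the Soda data source reference for the authoritative schema:

```yaml
# Sketch of a Soda Core data source config for Databricks (key names assumed).
data_source databricks_demo:
  type: spark
  method: databricks
  catalog: unity_catalog
  host: dbc-6f631120-27ee.cloud.databricks.com
  http_path: /sql/1.0/warehouses/005e2ef93ecca4b1
  token: ${secret.demo_datasource_secret_name}  # keep tokens in secrets, not in the file
```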
Click Connect. This will test the connection and move to the next step.
Select the datasets you want to onboard in Soda Cloud. In this case, regional_sales.
Enable Monitoring and Profiling. By default, Metric Monitoring is enabled: it automatically tracks key metrics on every dataset you onboard and alerts you when anomalies are detected, powered by built-in machine learning that compares current values against historical trends.
Click Finish to onboard your datasets. Soda Cloud will now spin up its Soda-hosted Agent and perform an initial profiling & historical metric collection scan. This usually takes only a few minutes.
Congratulations, you’ve onboarded your first dataset! Now let’s make sure you always know what’s happening with it.
That’s where Metric Monitoring comes in. It automatically tracks key metrics like volume, freshness, and schema changes, with no manual setup required. You’ll spot anomalies, detect trends, and catch unexpected shifts before they become problems.
Go to Datasets → select regional_sales
.
We’ve already spotted something interesting: one of your metrics shows an unusual change compared to its historical trend. To learn more, navigate to the Metric Monitors tab.
You'll immediately see that key metrics are automatically monitored by default, helping you detect pipeline issues, data delays, and unexpected structural changes as they happen. No setup needed, just visibility you can trust.
In this guide, we will focus on the Partition row count. The panel shows that it was expected to fall in the range 7200–7210, but the recorded value at scan time was 6300. To take a closer look:
Click the Partition row count block.
In the monitor page you’ll see:
measured value vs. expected range,
any red-dot anomalies flagged by the model,
buttons to Mark as expected, Create new incident, etc.
Flag an outlier as "expected" or investigate it further.
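Conceptually, the flag on this monitor comes down to an out-of-range comparison (the real model derives the expected range with machine learning; the numbers here are taken from the panel in this demo):

```python
# Values from the Partition row count panel.
expected_low, expected_high = 7200, 7210  # expected range from the historical trend
measured = 6300                           # value recorded at scan time

# The monitor flags an anomaly when the measurement falls outside the range.
is_anomaly = not (expected_low <= measured <= expected_high)
print(is_anomaly)  # True
```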
Soda’s engine was built in-house (no third-party libraries) and optimized for high precision. It continuously adapts to your data patterns, and it incorporates your feedback to reduce false alarms. Designed to minimize false positives and missed detections, it shows a 70% improvement in detecting anomalous data quality metrics compared to Facebook Prophet across hundreds of diverse, internally curated datasets containing known data quality issues.
The Anomaly Detection Algorithm offers complete control and transparency in the modeling process to allow for interpretability and adaptations. It features high accuracy while leveraging historical data, delivering improvements over time.
Our automated anomaly detection has just done the heavy lifting for you, identifying unusual patterns and potential data issues without any setup required.
But let’s make sure you’re not just catching issues after the fact, but defining exactly what your data should look like, every column, every rule, every expectation.
That’s where Data Contracts come in. They let you proactively set the standards for your data, so problems like this are flagged or even prevented before they impact your business.
Create a new data contract to define and enforce data quality expectations.
In the regional_sales Dataset Details page, go to the Checks tab.
Click Create Contract.
When creating a data contract, Soda will connect to your dataset and build a data contract template based on the dataset schema. From this point, you can start adding both dataset-level checks and column-level checks, as well as defining a verification schedule or a partition.
Toggle View Code if you’d like to inspect the generated SodaCL/YAML. This gives you access to the full contract code.
You can copy the following full example and paste it into the editor. You can toggle back to no-code view to see and edit the checks in the no-code editor.
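As an illustration of what such a contract can look like, here is a minimal sketch. The column names, check types, and thresholds below are assumptions based on this tutorial's scenario, not the exact example from the hosted demo; consult the Soda data contract reference for the supported check types:

```yaml
# Illustrative data contract sketch for the demo dataset (names assumed).
dataset: regional_sales
columns:
  - name: store_id
    checks:
      - type: missing   # every sale must reference a store
  - name: sale_amount
    checks:
      - type: invalid   # e.g. reject negative amounts via a validity rule
checks:
  - type: row_count     # dataset-level volume expectation
```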
That’s right, with Soda, you can edit a contract using either a no-code interface or directly in code. This ensures an optimal experience for all user personas while also providing a version-controlled code format that can be synced with a Git repository.
Click Test to verify the contract executes as expected.
When you are done with the contract, click Publish.
Click Verify. Soda will evaluate your rules against the current data.
Review the outcomes of the contract checks to confirm whether the data meets expectations. You can drill into those failures in the Checks tab.
You can trigger contract verification programmatically as part of your pipeline — so your data gets tested every time it runs.
We’ve prepared an example notebook to show you how it works:
In your Python environment, first install the Soda Core library
Then create a file with your API keys, which are necessary to connect to Soda Cloud. You can create one from your Profile: Generate API keys
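The configuration file is YAML. Assuming the standard Soda Cloud configuration shape (a host plus the two API key values; the host may differ by region, so check the docs), it looks roughly like this:

```yaml
# Soda Cloud connection config (shape assumed; verify against the Soda docs).
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_CLOUD_API_KEY_ID}          # generated from Profile → API Keys
  api_key_secret: ${SODA_CLOUD_API_KEY_SECRET}  # keep secrets in env vars, not in Git
```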
Now you are ready to trigger the verification of the contract. To do that, provide the identifier of your dataset and the path to the configuration file you created in the previous step. This triggers a verification using the Soda Agent and returns the logs.
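As a rough sketch of wiring this into a pipeline step: the command name and flags below are placeholders, not the real Soda interface (the example notebook and the Python API docs show the actual entry points); the pattern is simply to build the invocation from the dataset identifier and config path, run it, and fail the job on a non-zero exit:

```python
def build_verification_command(dataset: str, config_path: str) -> list[str]:
    # Placeholder CLI shape -- consult the Soda docs for the real command and flags.
    return [
        "soda", "verify-contract",           # hypothetical subcommand
        "--dataset", dataset,                # e.g. "regional_sales"
        "--soda-cloud-config", config_path,  # file with the API keys from the previous step
    ]

cmd = build_verification_command("regional_sales", "soda_cloud.yml")
print(" ".join(cmd))
# In a real pipeline: subprocess.run(cmd, check=True) to fail the run on errors.
```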
You can learn more about the Python API here: Python API
You’ve completed the tutorial and are now ready to start catching data quality issues with Soda.
Explore Profiling in the Discover tab to curate column selections for deeper analysis.
Set up Notification Rules (bell icon → Add Notification Rule) to push alerts to Slack, Jira, PagerDuty, etc.
Dive into Custom Monitors via scan.yml or the UI for even more tailored metrics.
Open the following Notebook example: