What is Soda?

Soda helps data teams deliver trustworthy data by making it easy to detect, investigate, and resolve data issues.

You can use Soda to:

  • Monitor production data with automated, ML-powered observability that surfaces unexpected changes without requiring you to define every rule up front.

  • Define data contracts, making expectations explicit and enabling producers and consumers to collaborate on reliable data at the source.

  • Test data earlier in the pipeline, as part of CI/CD workflows or during development, to prevent bad data from reaching production.

Soda enables teams to catch issues early, resolve them faster, and build confidence in data across the organization.

What is data quality?

Data quality refers to how well a dataset meets the expectations of completeness, accuracy, timeliness, uniqueness, and consistency. High-quality data supports business goals, drives confident decision-making, and underpins successful data products.

Poor data quality causes failed pipelines, incorrect reports, and broken AI models. Managing data quality means proactively validating assumptions and reactively monitoring for drift or degradation.

Soda helps you answer questions like:

  • Is the data fresh and complete?

  • Are there unexpected values or duplicates?

  • Did values shift outside of expected ranges?

  • Are schema or contract changes causing breakage?

  • Are data quality metrics changing over time?
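
To make these questions concrete, here is a minimal sketch in plain Python of how completeness, uniqueness, and freshness could be measured for a small batch of records. It is an illustration only and does not use or represent Soda's API; the function name quality_metrics and the fields order_id and loaded_at are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def quality_metrics(rows, key, timestamp_field, now):
    """Compute simple completeness, uniqueness, and freshness metrics.

    Hypothetical helper for illustration; not part of Soda.
    """
    # Completeness: count rows where the key column is missing.
    missing = sum(1 for row in rows if row.get(key) is None)

    # Uniqueness: count non-null key values that repeat an earlier value.
    seen = set()
    duplicates = 0
    for row in rows:
        value = row.get(key)
        if value is None:
            continue
        if value in seen:
            duplicates += 1
        seen.add(value)

    # Freshness: age of the most recent record, or None for an empty batch.
    newest = max((row[timestamp_field] for row in rows), default=None)
    freshness = now - newest if newest is not None else None

    return {
        "row_count": len(rows),
        "missing_count": missing,
        "duplicate_count": duplicates,
        "freshness": freshness,
    }


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "loaded_at": now - timedelta(hours=3)},
    {"order_id": 2, "loaded_at": now - timedelta(hours=2)},
    {"order_id": 2, "loaded_at": now - timedelta(hours=1)},
    {"order_id": None, "loaded_at": now - timedelta(hours=5)},
]
metrics = quality_metrics(rows, "order_id", "loaded_at", now)
print(metrics)
```

Tracking values like these over successive runs is what makes it possible to see whether data quality metrics are changing over time.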

Key Concepts

Data Observability

Data observability is a reactive approach to monitoring data in production and catching unexpected issues as they emerge. It helps answer the question: What is happening with my data right now, and how is that changing over time?

Use data observability to:

  • Detect anomalies in data quality metrics such as freshness, row counts, null values, or custom metrics

  • Monitor metric trends and seasonality

  • Identify late-arriving or missing records

  • Get alerted when values deviate from historical norms
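
As a rough illustration of what "deviating from historical norms" can mean, the sketch below flags a daily row count that lies more than a chosen number of standard deviations from its recent history. This simple z-score rule is an assumption made for the example; it is not Soda's anomaly detection algorithm, and the name is_anomalous and the threshold of 3.0 are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` deviates strongly from `history`.

    Uses a plain z-score; a stand-in for illustration only.
    """
    if len(history) < 2:
        # Not enough data to estimate a spread.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is constant, so any change is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Row counts observed on previous days (made-up values).
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(is_anomalous(daily_row_counts, 400))   # sharp drop in volume
print(is_anomalous(daily_row_counts, 1008))  # within the usual range
```

A fixed z-score ignores trends and seasonality, which is why a production monitor generally needs a model of those patterns rather than a single threshold.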

Data Testing

Data testing is a proactive approach that validates known expectations about your data during development, deployment, or transformation. It helps you catch issues before they reach production, break reports, or impact downstream systems.

Use data testing to:

  • Align on what “good data” looks like through data contracts

  • Verify that your data meets those expectations, including schema, values, and transformations

  • Test data at every step of the pipeline to prevent bad data from reaching downstream systems

  • Integrate with CI/CD workflows for continuous quality checks during development

Data Contracts

Data contracts define what a dataset should look like, including its schema, data types, value ranges, and other constraints. They establish a shared agreement between data producers and consumers about what’s expected and what must be upheld.

Both testing and observability play a role in upholding data contracts:

  • Testing validates that data meets the contract during development, during pipeline execution, and on a schedule.

  • Observability monitors contract adherence in production and detects unexpected issues.
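
The idea of a contract as an explicit, checkable agreement can be sketched in a few lines of Python. The example below is a hypothetical, simplified model and is not Soda's contract language (see the Contract Language Reference for that); the ColumnRule class, the verify function, and the column names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnRule:
    """One column expectation in a hypothetical contract."""
    name: str
    dtype: type
    required: bool = True
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def verify(contract, rows):
    """Return human-readable failures for rows that break the contract."""
    failures = []
    for i, row in enumerate(rows):
        for col in contract:
            value = row.get(col.name)
            if value is None:
                if col.required:
                    failures.append(f"row {i}: {col.name} is missing")
                continue
            if not isinstance(value, col.dtype):
                failures.append(
                    f"row {i}: {col.name} has type {type(value).__name__}, "
                    f"expected {col.dtype.__name__}"
                )
                continue
            if col.min_value is not None and value < col.min_value:
                failures.append(f"row {i}: {col.name} {value} is below minimum {col.min_value}")
            if col.max_value is not None and value > col.max_value:
                failures.append(f"row {i}: {col.name} {value} is above maximum {col.max_value}")
    return failures


contract = [
    ColumnRule("order_id", int),
    ColumnRule("amount", float, min_value=0.0),
    ColumnRule("coupon", str, required=False),
]
rows = [
    {"order_id": 1, "amount": 19.99, "coupon": None},
    {"order_id": 2, "amount": -5.0},
    {"order_id": "3", "amount": 7.5},
]
failures = verify(contract, rows)
for failure in failures:
    print(failure)
```

In a CI/CD job, a non-empty list of failures would typically fail the build, which is how testing keeps contract violations from reaching production.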

Data Observability vs Data Testing

While data testing and observability differ in when and how they operate, they work best together as a unified strategy.

Data Testing

  • Timing: Proactive and preventative. Runs pre-production, during development or CI/CD.

  • Use case: Prevent breakages before they happen. Validate known rules and enforce contracts.

Data Observability

  • Timing: Reactive and adaptive. Runs in production as runtime monitoring.

  • Use case: Monitor data behavior and changes over time with automated detection of anomalies, schema changes, and other unexpected issues.

Together, they enable end-to-end data quality management: testing prevents problems, and observability detects those that escape prevention. Observability also helps prioritize which issues to address and shows where to shift left so that they are resolved upstream.

Data quality at scale across the enterprise

Divide and conquer

Managing data quality across hundreds or thousands of datasets requires a scalable, federated approach. Soda enables this through:

  • Metadata-driven observability that adapts checks to each dataset's structure and context

  • Role-based collaboration so teams can take ownership of the data they know best

  • An interface for both engineering and business users, enabling collaboration through code, UI, or APIs, depending on user preference and role

  • Integration with existing tools and workflows, such as data catalogs and incident management systems

  • Pipeline and CI/CD integration to automate data quality checks

Data quality as a team sport

Reliable data depends on collaboration across roles:

  • Data engineers embed tests and monitor pipelines to catch issues early.

  • Data producers and consumers align on expectations through data contracts.

  • Data consumers report issues and collaborate with producers to interpret metrics and resolve problems.

  • Governance teams define and enforce data quality standards.

  • Platform teams deploy, manage, and secure the underlying infrastructure.

Soda Cloud acts as the shared workspace where these roles collaborate, triage incidents, and resolve issues.

Deployment options

Soda offers three deployment models, depending on your infrastructure and data privacy needs: Soda Core, a fully hosted Soda Agent, and a self-deployed Soda Agent.

Soda Core

  • Description: Open-source Python library (with commercial extensions) and CLI for running Data Contracts in your pipelines.

  • Ideal for: Data engineers integrating Soda into custom workflows.

  • Key features: Full control over orchestration, in-memory data support, contract verification.

  • Considerations: No observability features. Required for in-memory sources (e.g., Spark, DataFrames). Data source connections are managed at the environment level.

Soda Agent

  • Description: Managed version of Soda that runs observability features and executes and schedules Data Contracts. Either fully hosted or self-deployed.

  • Ideal for: Teams seeking a simple, managed solution for data quality.

  • Key features: Centralized data source access, no setup required, observability features enabled. Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.

  • Considerations: Required for observability features. Cannot scan in-memory sources like Spark or DataFrames.

Read more about Deployment options

Supported data sources and integrations

Soda integrates with the modern data stack:

  • Data warehouses and databases: Databricks, Snowflake, BigQuery, Redshift, PostgreSQL, MySQL, Spark, Presto, DuckDB, and more.

  • Orchestration platforms: Airflow, Dagster, Prefect, Azure Data Factory.

  • Metadata tools: Atlan, Alation, Collibra, data.world, Zeenea.

  • Cloud providers: AWS, Google Cloud, Azure.

  • BI tools: Looker, Tableau, Power BI.

  • Messaging and ticketing: Slack, Microsoft Teams, Jira, PagerDuty, ServiceNow, Opsgenie.

What’s next?

  • To get started with Soda, check out the end-to-end Quickstart guide.
