All Data Engineering posts

Re: DAB best practices suggestion

balajij8 — Sat, 27 Jun 2026 17:07:22 GMT

You can build Databricks Asset Bundles that are decoupled by domain, managed through multi-target declarations in the bundle configuration, and driven by immutable, versioned artifacts stored securely in Unity Catalog Volumes. Rely on explicit CI/CD gating and dynamic, scoped resource names rather than monolithic, hardcoded infrastructure definitions.

Bundle Structure & Domain Isolation

Decoupled Domain Bundles - Group configurations into small, focused bundles aligned to specific data products or business domains instead of a monolithic setup.
Shared Lifecycles - Ensure that a single bundle contains only the resources (jobs, pipelines, dashboards) that share a unified deployment lifecycle and ownership domain boundary.
Target Definitions - You can maintain all target definitions (dev, uat, prod) within a single yml per bundle to guarantee environmental structural parity. More details here

Multi-Target Environment Strategy

Development - Configure for feature-branch agility. Implement dynamic resource renaming using built-in metadata expressions (such as ${workspace.current_user.short_name}) to enforce isolation within shared or personal workspaces. Route all computation to development catalogs and schemas.
Staging/User Acceptance Testing - Trigger automated deployments on pull request merges to the main branch. This layer must run full integration suites and validation workflows against pre-production catalogs, mirroring production configurations identically.
Production - Guard production workloads with manual approval workflows and strict role-based access control (RBAC) on the target production Unity Catalog environments.

CI/CD Orchestration (Azure DevOps)

Pull Request Verification - Enforce static analysis by running databricks bundle validate prior to any code merges to catch syntactical and structural anomalies early.
Continuous Deployment (UAT) - Compile code, version artifacts, stage them directly into Unity Catalog volumes and execute target-specific deployments sequentially.
Release Management (Prod) - Restrict production deployments to manual approval gates within Azure DevOps Environments. Re-use the identical, immutable artifacts verified in UAT to eliminate drift.

Artifact & Dependency Management

Unity Catalog Volumes - Store external dependencies (Python Wheels, JARs) inside secure, governed Unity Catalog Volumes rather than embedding large binaries directly into the bundle workspace.
Inter-Bundle Governance - Model complex cross-bundle dependencies explicitly within Azure DevOps YAML pipeline tasks rather than nesting configuration files. Fail pipeline execution immediately if upstream assets are absent.
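
One way to make the "immutable, versioned artifacts" idea concrete is to derive the Volume path from the package version and build ID, so a later build never overwrites a file that UAT has already verified. This is a minimal sketch: the function name and the catalog, schema, volume, and package values are hypothetical, and the directory layout is just one possible convention. Only the /Volumes/<catalog>/<schema>/<volume>/ prefix follows the Unity Catalog path format.

```python
from pathlib import PurePosixPath


def artifact_path(catalog, schema, volume, package, version, build_id, filename):
    """Return a per-build artifact path inside a Unity Catalog volume.

    Every build gets its own directory, so promoting to prod means pointing
    at the same path that UAT used rather than rebuilding or overwriting.
    """
    parts = (catalog, schema, volume, package, version, build_id, filename)
    for part in parts:
        # Reject empty values and embedded separators so the layout stays predictable.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return str(PurePosixPath("/Volumes", *parts))


# Hypothetical values for illustration only.
print(artifact_path("main", "ci", "artifacts", "sales_etl", "1.4.0",
                    "build-128", "sales_etl-1.4.0-py3-none-any.whl"))
```

The CI pipeline would upload the wheel once to this path in the UAT stage and pass the same string to the production deployment.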

DAB best practices suggestion

DazzaiDe — Sat, 27 Jun 2026 16:13:18 GMT

We're currently setting up Databricks Asset Bundles (DAB) with a CI/CD pipeline using Azure DevOps.

Our planned development workflow is as follows:

Main branch → Developer creates a feature branch → Implement changes → Create a Pull Request → Senior developers review and approve → Merge into the main branch → Deploy to UAT → After UAT sign-off, deploy to Production.

I would like to hear suggestions, especially on current best practices.

Re: Bundle deployment overwrites artifacts while jobs are running - best practices?

sudhaktr — Sat, 27 Jun 2026 14:54:51 GMT

Yes, I understand that part. If you have source_linked_deployment set to false, both developers deploy to the same location under the /.bundle directory, so the overwrite can happen.

If source_linked_deployment is set to True or not set (by default it is True), the workflow points to the source, that is, the respective developer's directory.

Re: PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

VikasM — Sat, 27 Jun 2026 14:51:39 GMT

Hello Balajij8,

I just wanted to let you know that the issue I posted regarding Spark not writing Parquet files was actually due to my own mistake.

I had mounted the data volume only in my Spark job (driver/scheduler) container instead of the Spark worker container. Since the worker executes the tasks and writes the output, the Parquet files were being stored inside the worker's filesystem. I was checking the job container, which only had the checkpoint and metadata directories because those were the only volumes I had mounted in my Docker Compose configuration.

Thank you for your time and for helping me investigate the issue. I really appreciate your guidance.

Re: Bundle deployment overwrites artifacts while jobs are running - best practices?

animeshjain — Sat, 27 Jun 2026 14:30:48 GMT

No, that's not what I'm talking about. It's like this: as in the picture, we can generate a build artifact and use it in the job.
So if a developer runs the deploy and is running their job, and at the same time a second deploy happens that overwrites the build, the first job is now looking for a whl that has been overwritten.

Re: Bundle deployment overwrites artifacts while jobs are running - best practices?

sudhaktr — Sat, 27 Jun 2026 14:09:58 GMT

Do you have source_linked_deployment set as false? That's probably causing it.

Bundle deployment overwrites artifacts while jobs are running - best practices?

animeshjain — Sat, 27 Jun 2026 11:51:17 GMT

Hi everyone,

I'm using Declarative Automation Bundles (DAB) to deploy data pipelines, and I've run into an issue with concurrent job runs and deployments.

What happened:

I started a job that depends on a wheel file built by the bundle (timestamped artifact in .bundle/.../artifacts/.internal/)
While the job was running, I ran databricks bundle deploy again
The deployment generated a new timestamped wheel file and removed the old one
My running job failed with ERROR_NO_SUCH_FILE_OR_DIRECTORY because it couldn't find the original artifact

My concern: This seems like it could be a problem in team environments. If two developers are working on the same bundle target:

Developer A starts a job from their deployment
Developer B deploys their changes to the same target
Developer A's running job fails due to missing artifacts

My questions:

Is this expected behavior, or am I misusing bundles?
What are the recommended patterns to prevent this in multi-developer teams?
Should each developer use personal bundle targets (dev_alice, dev_bob), or is there a better approach?
Does this same issue apply to production deployments? If so, how should we handle long-running jobs during deployment?
How should production CI/CD deployments be coordinated when scheduled or long-running jobs might be active? Should we check for active runs before deploying? Is there a built-in mechanism or recommended pattern to prevent breaking currently executing production jobs?

Any guidance on best practices for coordinating bundle deployments with active job runs would be appreciated!

Lakeflow connect Native connectors (tik, meta ads, Google Ads) - one table per account

GabeMatch — Fri, 26 Jun 2026 20:55:21 GMT

We want to leverage these connectors to pull in marketing spend data. But the docs seem to say that the destination must be unique based on accounts. For Tik, we have a hundred accounts... each account will have a destination table for each object. So like this...

ads_account1, campaign_account1
ads_account2, campaign_account2

Total tables is accounts * number of objects. 100 accounts and assuming 6 tables means 600 tables!

Is there a better solution? It would be great to ingest into only one table per object, so that all ads for all accounts feed into one Ads table.

Managed ingestion connectors don't support duplicate destination table names in the same schema

also, forum says:

> The message subject contains t*i*k*t*o*k, which is not permitted in this community. Please remove this content before sending your post.

makes it hard to make a post about its connector!

Streaming Amazon DocumentDB to Databricks in near real time - what's the best approach?

AustinBen — Fri, 26 Jun 2026 15:44:56 GMT

Hi everyone,

I'm looking for advice from anyone who has implemented near real-time ingestion from Amazon DocumentDB into Databricks.

Our current architecture is:

Application → Amazon DocumentDB
Python AWS Lambda functions capture changes from DocumentDB
Lambda continuously writes the data into Amazon Redshift
Redshift is then used as our data warehouse

This setup has been working well for us.

We're now evaluating Databricks as our analytics platform, but I'm not finding a straightforward way to stream data directly from DocumentDB into Databricks. I've heard that Databricks doesn't have a native connector or CDC support for Amazon DocumentDB.

My questions are:

Has anyone successfully implemented near real-time or real-time ingestion from Amazon DocumentDB into Databricks?
What architecture are you using?

I'm interested in production-proven architectures rather than proof-of-concept examples.

Thanks in advance!

Re: PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

VikasM — Fri, 26 Jun 2026 15:18:03 GMT

Thanks for your reply.

I investigated the output directories a bit further before trying another path. If my understanding is correct, the volume mount and read/write permissions do not seem to be the issue in my case. The reason I think this is that both the Docker container and my local machine continuously create and update the checkpoints and data directories. The checkpoint files, offsets, commits, and _spark_metadata are all being written successfully, which suggests that Spark can write to the mounted volume.

What confuses me is that _spark_metadata contains entries such as:

{"path":"file:///opt/spark/app/data/whale_alerts/part-00000-ac552411-0fa6-47c8-b120-4dfcc9227b09-c000.snappy.parquet","size":1125,"isDir":false,"modificationTime":1782477948968,"blockReplication":1,"blockSize":33554432,"action":"add"}

which indicates that Spark believes a Parquet file was committed. However, when I search both inside the container and on the host, the referenced part-*.snappy.parquet files do not exist—only the _spark_metadata directory is present. Could this indicate an issue during the file commit phase rather than a volume mount or permission problem? If so, are there any Spark or Hadoop configurations that you would recommend checking next?

Re: PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

balajij8 — Fri, 26 Jun 2026 13:37:55 GMT

Spark Structured Streaming generally writes to file sinks with a phased commit: it writes temporary files to the output directory, writes metadata that references them, and finishes by moving or renaming the temporary files to their final names.

Check for volume mount misconfigurations on the Docker side. Some Docker configurations use temporary filesystems that get cleaned up, or a background process removes the files, so the files are written but immediately deleted.

Also verify that /opt/spark/app/data is actually mounted to the host, and that the _spark_metadata directory and the other directories all have the same read/write permissions so Spark can perform every operation.

You can also change the code to write to a path where Spark definitely has read/write access and confirm whether the files appear.
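
To see which side is out of sync, the _spark_metadata log can be compared with what is actually on disk. This is a minimal sketch, assuming the log format shown elsewhere in this thread (a version line such as v1, then one JSON object per file with a file:// URI in "path"). The function name is hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import unquote, urlparse


def missing_committed_files(output_dir):
    """List files that _spark_metadata marks as added but that are absent here."""
    missing = []
    log_dir = Path(output_dir) / "_spark_metadata"
    for batch_file in sorted(log_dir.iterdir()):
        # Skip subdirectories and hidden checksum files such as .0.crc.
        if not batch_file.is_file() or batch_file.name.startswith("."):
            continue
        for line in batch_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line.startswith("{"):
                continue  # version header or blank line
            entry = json.loads(line)
            local_path = unquote(urlparse(entry["path"]).path)
            if entry.get("action") == "add" and not Path(local_path).exists():
                missing.append(local_path)
    return missing
```

Calling missing_committed_files("/opt/spark/app/data/whale_alerts") inside each container (the job container and every worker) shows which filesystem actually holds the committed part files.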

Re: PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

VikasM — Fri, 26 Jun 2026 13:18:25 GMT

Hello balajij8,

Before trying your suggestions, I decided to inspect the filesystem inside my Spark container once more.

I found something that has changed my understanding of the problem. There are no errors being reported by the streaming job, and the checkpoint and _spark_metadata directories are being updated continuously. I also found metadata entries that indicate Spark believes it has successfully written Parquet files.

However, I cannot find the actual part-*.snappy.parquet files in the output directory, even though the metadata references them. For example:

$ cd _spark_metadata
$ ls
0 1 2 3
$ cat 1
v1
{"path":"file:///opt/spark/app/data/whale_alerts/part-00000-ac552411-0fa6-47c8-b120-4dfcc9227b09-c000.snappy.parquet","size":1125,"isDir":false,"modificationTime":1782477948968,"blockReplication":1,"blockSize":33554432,"action":"add"}

But when I run:

find /opt/spark/app/data -name "*.parquet"

no Parquet files are found, either inside the container or on my host machine. Only the _spark_metadata files exist.

Since the streaming job is processing records successfully and the metadata is being written, I'm now wondering whether this is related to the file sink, filesystem, or Docker volume configuration rather than the upstream pipeline.

Before I start changing the Kafka configuration or thresholds, do you have any thoughts on why Spark would generate metadata entries without the corresponding Parquet files?

Re: PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

VikasM — Fri, 26 Jun 2026 12:33:26 GMT

Thank you, balajij8, for your suggestions. I really appreciate your time and guidance.

I'll try the different configurations you recommended and investigate further. Once I've tested them, I'll come back and share the results.

Thanks again for your help!

P.S. "Did you see the messages I have already sent... I still don't see them above?"

Re: PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

balajij8 — Fri, 26 Jun 2026 12:12:10 GMT

The sink configuration is correct; the issue is most likely upstream. The Parquet sink can only write files when it receives data from upstream. Validate the two key settings below.

startingOffsets - latest - The code skips all historical Kafka data and only processes messages that arrive after the stream starts. Set it to earliest and validate.
WHALE_THRESHOLD_USD 50000 - A typical value can be 5 - 10. Lower the threshold temporarily to validate, then set it back to 50000.

Even if Kafka has messages, the pipeline filters them out because of these settings.
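
The effect of the threshold can be checked outside Spark with a few sample records. This is a minimal sketch in plain Python: the field names (price, quantity) and the sample trades are hypothetical, and the real pipeline may compute the trade value differently.

```python
WHALE_THRESHOLD_USD = 50_000  # value quoted in the thread


def whale_trades(trades, threshold=WHALE_THRESHOLD_USD):
    """Keep trades whose notional value (price * quantity) reaches the threshold."""
    return [t for t in trades if t["price"] * t["quantity"] >= threshold]


# Hypothetical sample messages; real Kafka payloads will differ.
sample = [
    {"symbol": "BTCUSDT", "price": 60_000.0, "quantity": 0.02},   # about 1,200 USD
    {"symbol": "ETHUSDT", "price": 3_000.0, "quantity": 0.004},   # about 12 USD
]

print(len(whale_trades(sample)))                # 0: nothing reaches the sink
print(len(whale_trades(sample, threshold=10)))  # 2: both trades pass
```

If every incoming trade looks like these, the filtered stream is empty and the sink has nothing to write, even though the query keeps running and checkpointing.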

Re: Implementing Row Level Security using ABAC

Louis_Frolio — Fri, 26 Jun 2026 12:10:51 GMT

Hi @Rupa0503 ,

Yes, you can do row-level security across one table or many in Unity Catalog without copying data per role. @balajij8 pointed you in the right architectural direction (ABAC with governed tags, a reusable row-filter function, and centrally managed policies). Let me add the official requirements, a couple of corrections, and a simpler option if you only have a few tables.

The first thing to decide is your path, and it comes down to scale.

Path A is the classic row filter. It's the simplest approach and it's the best fit when you only have a few tables. No tags or policies needed. You attach a UDF directly to each table.

-- Function returns TRUE for rows the caller may see
CREATE OR REPLACE FUNCTION main.default.dept_filter(dept STRING)
RETURN is_account_group_member('data_admins')   -- admins see all
  OR is_account_group_member(dept);              -- else only your dept's rows

-- Attach to each table
ALTER TABLE main.default.employee_data
  SET ROW FILTER main.default.dept_filter ON (department);

is_account_group_member() is the role hook. It evaluates the querying user's group membership at runtime. You repeat the ALTER TABLE for each table you want filtered.

Path B is ABAC policies. This is the better fit when you want one definition to govern many tables, including ones that don't exist yet. There are four steps.

Define governed tags at the account level (Catalog Explorer under tag policies, or the REST API/Terraform) for the attributes that drive access, things like department, region, and sensitivity. One correction here: governed tags are not created with an inline CREATE TAG ... VALUES(...) statement. They're managed at the account level, then assigned to objects.
Assign tags to the relevant tables and columns.

ALTER TABLE main.default.employee_data SET TAGS ('department' = 'hr');

Create the row-filter UDF. It returns a BOOLEAN, same as Path A.
Create a policy on the catalog, schema, or table, bound to the tagged objects.

CREATE POLICY dept_rls
ON SCHEMA main.default
COMMENT 'Users see only their department''s rows'
ROW FILTER main.default.dept_filter
TO `account users` EXCEPT `data_admins`
FOR TABLES
MATCH COLUMNS has_tag('department') AS dept
USING COLUMNS (dept);

The payoff with Path B is that you define it once on the schema or catalog, and any current or future table carrying the department tag gets filtered automatically. No per-table wiring.

A few guardrails to get right before you go to production:

ABAC adds restrictions on top of access, it doesn't grant it. You still need the normal object-level GRANT on the table. ABAC only filters what's visible after access is granted.
Requirements and privileges: supported compute (serverless or DBR 16.4+), MANAGE on the securable, and EXECUTE on the UDF. ABAC policies are GA as of mid-2026.
Only one distinct row filter can resolve per user and table at runtime. If multiple different filters apply to the same user and table, Databricks errors out. So consolidate your role logic into one reusable function, or make sure your policies are mutually exclusive.
Add an admin or service-account escape hatch in the UDF (the data_admins check above), otherwise pipelines and owners can lock themselves out.
For complex role to data mappings, have the UDF run an EXISTS query against a lookup table (e.g. role_access_map) instead of chaining is_account_group_member() CASE branches. It's much easier to maintain as your roles grow.
If you also need to hide columns or values (masking an SSN, for example), use the sibling column mask feature. Same ABAC machinery, just COLUMN MASK instead of ROW FILTER.
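
The lookup-table guardrail above can be prototyped outside SQL before you write the UDF. This is a rough Python model of the decision logic only, not Databricks code: the role_access_map contents, its two-column shape (group, department), and the function name are assumptions for illustration.

```python
# Hypothetical stand-in for a role_access_map lookup table:
# each row grants one account group access to one department's rows.
ROLE_ACCESS_MAP = [
    ("hr_analysts", "hr"),
    ("finance_analysts", "finance"),
    ("people_ops", "hr"),
]


def row_visible(user_groups, dept, access_map=ROLE_ACCESS_MAP, admin_group="data_admins"):
    """Model of the row filter: True when the caller may see a row of `dept`."""
    if admin_group in user_groups:  # escape hatch for admins and pipelines
        return True
    # Plays the role of EXISTS (SELECT 1 FROM role_access_map WHERE ...).
    return any(group in user_groups and mapped == dept for group, mapped in access_map)


rows = [{"name": "Ana", "department": "hr"}, {"name": "Bo", "department": "finance"}]
visible = [r["name"] for r in rows if row_visible({"hr_analysts"}, r["department"])]
print(visible)  # ['Ana']
```

Adding a role then means adding a row to the mapping rather than another branch in the function, which is the maintenance benefit the lookup approach is after.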

The short version: for RLS across many tables without duplicating data, use Unity Catalog ABAC with governed tags and a single reusable row-filter UDF. If you only have a few tables, a direct SET ROW FILTER gets you there in two statements.

Docs:

ABAC in Unity Catalog: https://docs.databricks.com/aws/en/data-governance/unity-catalog/abac/
Create and manage row filter and column mask policies: https://docs.databricks.com/aws/en/data-governance/unity-catalog/abac/policies
Row filters and column masks: https://docs.databricks.com/aws/en/data-governance/unity-catalog/filters-and-masks
Requirements, quotas, and limitations: https://docs.databricks.com/aws/en/data-governance/unity-catalog/abac/requirements

Cheers, Louis.

Re: PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

VikasM — Fri, 26 Jun 2026 11:59:37 GMT

I have sent a reply to this message 5 times already. I don't know what is going on here.

Re: how to access snapshots in iceberg tables?

Louis_Frolio — Fri, 26 Jun 2026 11:55:39 GMT

@gaurang033 , I believe my solution gets you going in the right direction. Please give it a read and let me know. Cheers, Louis.

Re: Is there a way to deactivate genie auto correction

Ashwin_DSA — Fri, 26 Jun 2026 11:55:13 GMT

Hi @Félix_banqi,

Sorry you are facing this issue. That definitely doesn’t sound like the intended experience.

I would like to understand the issue better to give you a better steer. Is there an example you can share?

In the meantime, given that you have already tried the developer settings, a couple of things may help...

If you are using Genie Code in Agent mode, it can behave much more autonomously. Databricks documents that Agent mode is designed for multi-step workflows, while Chat mode is better for narrower help, such as explanations and simpler code generation, so switching to Chat mode is often the safest option when you want assistance without many automatic changes. See Use Genie Code.
It's also worth keeping approvals strict. The docs say Genie Code asks for approval before using tools like editing notebooks or running code, and "Ask every time" is the default behaviour. See Use Genie Code.
For notebook/code fixes specifically, Genie Code supports diff-based flows like /fix, where you can accept or reject the proposed change, and accepted code does not automatically run. See Get coding help from Genie Code.
Databricks also calls out that results can vary and usually improve when prompts are more explicit... for example, specifying the exact output you want, the library to use, or the format of the answer. See Tips to improve Genie Code responses.

If you already got into a bad state, another useful point is that Genie Code edits are tracked in revision history, and Databricks says you can roll back changes across notebooks, queries, files, and pipelines. See Introducing Genie Code.

There is in-product feedback built in... Databricks documents the Useful/Not useful controls directly under Genie Code answers in Use Genie Code. If this is causing code corruption or unexpected edits, I would also recommend raising it through their normal Databricks support channel with as much detail as possible.

Once I have additional information (such as an example), I'm happy to raise this internally.

Re: PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

VikasM — Fri, 26 Jun 2026 11:54:44 GMT

Thank you, balajij8, for your suggestion about enabling case-sensitive mode. It worked! The process now moves past the previous error, and Spark is successfully consuming data from Kafka.

However, it looks like I've run into another issue. Although the streaming job is consuming the data, it doesn't appear to be writing any Parquet files as expected.

I do see the checkpoint directories being created correctly, both inside the Spark container and on my local machine through the mounted volume, so it seems the streaming queries are running. The only thing missing is the Parquet output.

I'll investigate this next, but if you have any suggestions about what might cause Spark Structured Streaming to create checkpoints without writing any output files, I'd really appreciate your guidance.

following is my Parquet sink:

whale_query = (
    whale_df.writeStream
    .queryName("whale_alerts")
    .format("parquet")
    .outputMode("append")
    .option("path", "/opt/spark/app/data/whale_alerts")
    .option("checkpointLocation", "/opt/spark/app/checkpoints/whale_alerts")
    .trigger(processingTime="10 seconds")
    .start()
)

# ========================================================================
# KLINE PARQUET SINK
# ========================================================================
kline_query = (
    kline_df.writeStream
    .queryName("candlestick_history")
    .format("parquet")
    .outputMode("append")
    .option("path", "/opt/spark/app/data/candlesticks")
    .option("checkpointLocation", "/opt/spark/app/checkpoints/candlesticks")
    .trigger(processingTime="10 seconds")
    .start()
)

print("🚀 Whale detection pipeline running")
print("🚀 Candlestick pipeline running")
spark.streams.awaitAnyTermination()

Thank you again for your help!

Re: PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

balajij8 — Fri, 26 Jun 2026 11:03:07 GMT

Do check the other 2 options listed above too - upfront schema setup & field renaming