Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

shan-databricks
by Databricks Partner
  • 90 Views
  • 2 replies
  • 3 kudos

Lakeflow Connect: Data Ingestion from SQL Server to Databricks

We have a use case to ingest data from SQL Server into Databricks using Lakeflow Connect. There are 100 tables, and on a daily basis we will perform inserts, updates, and deletes based on CDC data. For this requirement, how can we enable multiple par...

Latest Reply
amirabedhiafi
New Contributor II
  • 3 kudos

Hello @shan-databricks! One additional point: I would also validate the expected load with the SQL Server DBA, because even if Lakeflow manages the parallelism internally, the source SQL Server still needs to handle those concurrent reads. For 100 tab...

1 More Replies
Darshan137
by New Contributor II
  • 121 Views
  • 2 replies
  • 1 kudos

Transitioning from ADF to Databricks Workflows: Best Practices in a Multi-Workspace (dev-prod)

Hi Community, we have a data processing framework running on Azure Databricks with Unity Catalog, and we're evaluating options to consolidate our orchestration entirely within the Databricks ecosystem. CURRENT ARCHITECTURE: ~20 use cases, each containin...

Latest Reply
amirabedhiafi
New Contributor II
  • 1 kudos

Hello @Darshan137! A few things I will add to @Lu_Wang_ENB_DBX's answer, based on a similar project I worked on. If ADF currently passes values such as environment, run date, catalog, schema, or business domain, define a clear parameter contract in Lakeflow Job...
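A minimal sketch of what reading such a parameter contract can look like inside a notebook task (the parameter names here are illustrative, not from the original thread):

```python
# Declare the contract as widgets so the same notebook runs interactively and as a job task.
dbutils.widgets.text("environment", "dev")
dbutils.widgets.text("run_date", "")
dbutils.widgets.text("catalog", "dev_catalog")
dbutils.widgets.text("schema", "bronze")

environment = dbutils.widgets.get("environment")
run_date = dbutils.widgets.get("run_date")
catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")

# Resolve fully qualified names once so downstream code never hardcodes a workspace or environment.
target_table = f"{catalog}.{schema}.my_table"
print(f"Running {environment} load for {run_date} into {target_table}")
```

Job parameters with the same names then override the widget defaults per environment, which keeps the dev and prod workspaces on identical code.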

1 More Replies
wschoi
by New Contributor III
  • 20250 Views
  • 17 replies
  • 17 kudos

How to fix plots and image color rendering on Notebooks?

I am currently running dark mode for my Databricks Notebooks, and am using the "new UI" released a few days ago (May 2023) and the "New notebook editor." Currently all plots (like matplotlib) are showing the wrong colors. For example, denoting:```... p...
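One workaround that is often suggested (a sketch, not an official fix) is to pin an explicit, opaque figure background so the plot no longer depends on the notebook theme:

```python
import matplotlib.pyplot as plt

# Force a white, opaque background so colors render the same in light and dark notebook themes.
fig, ax = plt.subplots(facecolor="white")
ax.set_facecolor("white")
ax.plot([1, 2, 3], [1, 4, 9], color="tab:blue", label="example series")
ax.legend()
plt.show()
```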

Latest Reply
griffen_kociela
New Contributor
  • 17 kudos

Still a problem when using Plotly visualizations.

16 More Replies
MikeGo
by Valued Contributor
  • 499 Views
  • 7 replies
  • 2 kudos

Table update trigger and File Arrival trigger latency

Hi team, when using a table update or file arrival trigger, what latency can I expect for the trigger? Does Databricks poll the source on some schedule? If yes, is the polling free? Thanks
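For reference, the polling cadence is influenced by the trigger settings on the job. A rough sketch of that block, expressed here as the Python dict you would send to the Jobs API (field names as I understand them, so verify against the current API docs; the storage URL is illustrative):

```python
# Illustrative fragment of a job definition with a file arrival trigger.
# min_time_between_triggers_seconds and wait_after_last_change_seconds are the latency-related knobs.
job_trigger = {
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {
            "url": "abfss://landing@mystorage.dfs.core.windows.net/incoming/",
            "min_time_between_triggers_seconds": 60,
            "wait_after_last_change_seconds": 60,
        },
    }
}
```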

Latest Reply
MikeGo
Valued Contributor
  • 2 kudos

Hi @Ashwin_DSA, I appreciate the further clarification. Let's make this even clearer. "the trigger hands your job a parameter payload with the updated table list and the most recent commit version" This is a good thing, but it likely cannot be used, ...

6 More Replies
Avinash_Narala
by Databricks Partner
  • 333 Views
  • 2 replies
  • 0 kudos

Data Loss in Incremental Batch Jobs Due to Latency in Delta File Writes to Blob

Hi everyone, I am facing a data consistency issue in my Databricks incremental pipeline where records are being skipped because of a time gap between when a record is processed and when the physical file is finalized in Azure Blob Storage (ABFS). Our A...

Latest Reply
balajij8
Contributor III
  • 0 kudos

You can handle it as below. Fix the Bronze Write - the 20+ minute commit gap suggests metadata contention or "small file issues" in the bronze Delta tables. You can optimize tables manually or enable Optimized Write and Auto Optimize if feasible. This...
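A minimal sketch of turning on those settings for a bronze table (the table name is illustrative):

```python
# Enable Optimized Write and Auto Compaction to reduce small files and commit latency on the bronze table.
spark.sql("""
    ALTER TABLE bronze.my_source_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

# One-off compaction of the small files that already exist.
spark.sql("OPTIMIZE bronze.my_source_table")
```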

1 More Replies
AdrianLobacz
by Databricks Partner
  • 160 Views
  • 1 reply
  • 0 kudos

Best option for parallel processing

I faced some challenges in my projects related to parallel processing in Databricks. In many cases, the issue was not the volume of data itself, but the overall execution time. I was processing a relatively small number of objects, but each object re...
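For context, the driver-side pattern being described usually looks roughly like the sketch below (the object names and per-object work are placeholders); every task is scheduled and tracked on the driver, which is where the bottleneck shows up:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_object(obj_name):
    # Placeholder per-object work: each object triggers its own read/transform/write.
    df = spark.read.table(f"source.{obj_name}")
    df.write.mode("overwrite").saveAsTable(f"target.{obj_name}")
    return obj_name

objects = ["obj_a", "obj_b", "obj_c"]  # relatively few objects, each long-running

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(process_object, o) for o in objects]
    for f in as_completed(futures):
        print(f"finished {f.result()}")
```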

Latest Reply
balajij8
Contributor III
  • 0 kudos

The Driver was the bottleneck in the Thread Pool approach. By moving to Serverless Workflows, you can shift the orchestration weight to the Databricks Control Plane. Eliminate Driver Saturation: Serverless compute for Workflows natively handles task d...

RodrigoE
by New Contributor III
  • 217 Views
  • 4 replies
  • 0 kudos

Ingest data from REST endpoint into Databricks

Hello, I'm looking for the best option to retrieve between 1 and 1.5 TB of data per day from a REST API into Databricks. Thank you, Rodrigo Escamilla

Latest Reply
rohan22sri
New Contributor III
  • 0 kudos

Hi Rodrigo, One simple approach I've used is calling the REST API directly from a Databricks notebook using standard Python libraries, with no extra setup or tools required. The idea is to keep it minimal: generate the API signature, call the endpoint, and l...
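A minimal sketch of that pattern, assuming a paginated JSON endpoint (the URL, auth header, pagination fields, and target table are all illustrative):

```python
import requests

BASE_URL = "https://api.example.com/v1/records"   # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}      # replace with your API's real auth/signature scheme

def fetch_page(page):
    resp = requests.get(BASE_URL, params={"page": page, "page_size": 1000}, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json().get("items", [])

page = 1
rows = fetch_page(page)
while rows:
    # Land each page into a bronze Delta table; dedup and transforms happen downstream.
    spark.createDataFrame(rows).write.mode("append").saveAsTable("bronze.api_records")
    page += 1
    rows = fetch_page(page)
```

At 1 to 1.5 TB per day you would likely parallelize the page fetches (or partition the pulls by date) rather than loop on a single thread, but the request-then-append structure stays the same.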

3 More Replies
Oumeima
by New Contributor III
  • 833 Views
  • 5 replies
  • 2 kudos

Resolved! Lakeflow Connect - SQL Server - Database Setup step keeps failing

Hello, I am trying to ingest data from an Azure SQL Database using Lakeflow Connect. - I'm using a service principal for authentication (created the login and user in the DB I am trying to ingest) - The utility script was executed by a DB owner === Install...

Latest Reply
Oumeima
New Contributor III
  • 2 kudos

We finally figured out the issue! We checked the database SQL audit logs and noticed that there was a particular query that was taking too long (4 min) for the ingestion user. This was causing a timeout. This query is very simple and usually takes a c...

4 More Replies
ashutoshacharya
by New Contributor
  • 164 Views
  • 1 reply
  • 1 kudos

Resolved! Unable to see Lakeflow Designer option in my Free Edition Databricks account

I am unable to see the Lakeflow Designer option in my Databricks account. Even the Previews option is not there ... Please let me know how I can access it.

Latest Reply
Ashwin_DSA
Databricks Employee
  • 1 kudos

Hi @ashutoshacharya, Right now, Lakeflow Designer is in Public Preview, and it isn’t fully rolled out to Databricks Free Edition yet, which is why you don’t see it in the UI or under Previews. On full (paid or trial) workspaces, a workspace admin can...

maikel
by Contributor II
  • 176 Views
  • 1 reply
  • 0 kudos

Uploading a file to a volume and starting an ingestion job

Hello Community! I am writing to you with my idea about a data ingestion job which we have to implement in our project. The data we have is in CSV file format and, depending on the case, it differs a little bit. Before uploading we pivot the CSV file...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 0 kudos

Hi @maikel, You don't have to build a custom solution for this. Databricks now has native components that align very well with what you want. If you want the job to start as soon as new files land in a volume, the recommended approach is to use file-...
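For the ingestion step itself, one common companion to such a trigger is Auto Loader reading directly from the Volume. A minimal sketch with illustrative paths and table names:

```python
# Pick up new CSV files landing in a Unity Catalog Volume and append them to a bronze table.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/ingest/_schema")
    .load("/Volumes/main/raw/ingest/landing/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/ingest/_checkpoint")
    .trigger(availableNow=True)  # process everything new, then stop, so it fits a triggered job
    .toTable("main.bronze.csv_ingest")
)
```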

murtadha_s
by Databricks Partner
  • 126 Views
  • 1 reply
  • 0 kudos

What is the maximum size to read using dbutils.fs.head?

Hi, what is the maximum size to read using dbutils.fs.head()? Is there a limit? AI says 10 MB, but I couldn't find useful info in the documentation, while when I tried it myself it was only limited by the driver memory. Thanks in advance.

Latest Reply
DivyaandData
Databricks Employee
  • 0 kudos

dbutils.fs.head() itself does not have a documented hard cap like 10 MB. From the official dbutils reference, the signature is: dbutils.fs.head(file: String, max_bytes: int = 65536): String “Returns up to the specified maximum number of bytes in t...
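A quick way to check the behavior yourself (the path is illustrative):

```python
# Default call returns up to 65536 bytes (the max_bytes default from the signature above).
preview = dbutils.fs.head("/Volumes/main/raw/ingest/sample.json")
print(len(preview))

# Ask for a larger chunk explicitly; driver memory is the practical limit, as observed in the question.
bigger = dbutils.fs.head("/Volumes/main/raw/ingest/sample.json", 10 * 1024 * 1024)
print(len(bigger))
```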

DavidKxx
by Contributor
  • 215 Views
  • 2 replies
  • 1 kudos

Resolved! Data in Unity Catalog that can't be previewed

This is a small deficiency, but a fix would be nice to have. For a long time now, the Sample Data previewer in the Unity Catalog explorer has been unable to show tables that contain a certain kind of column. Instead of showing sample rows of the tabl...

Latest Reply
DavidKxx
Contributor
  • 1 kudos

Yes, my vector space is commonly of dimension 4000 or 8000. I don't write any dense vectors to a table; I can't speak to what happens previewing that type. Thanks for taking up the issue!

1 More Replies
vidya_kothavale
by Contributor
  • 391 Views
  • 6 replies
  • 7 kudos

Resolved! Managed Delta table: time travel blocked after automatic VACUUM

Hi, on a managed Delta table I get: SELECT * FROM abc VERSION AS OF 25; Error: DELTA_UNSUPPORTED_TIME_TRAVEL_BEYOND_DELETED_FILE_RETENTION_DURATION Cannot time travel beyond delta.deletedFileRetentionDuration (168 HOURS). Audit logs show VACUUM START/END...

Latest Reply
balajij8
Contributor III
  • 7 kudos

VACUUM will never delete files referenced by the latest version, even if Version 10 was not accessed or modified, as it represents the current state of the table. VACUUM targets files that are no longer referenced by the most recent version. It identifies files that were re...
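If longer time travel is needed going forward, a minimal sketch of extending the retention window on that table (the durations are illustrative, and note this cannot bring back files VACUUM has already deleted):

```python
# Keep removed data files and table history around longer so VERSION AS OF reaches further back.
spark.sql("""
    ALTER TABLE abc SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 30 days',
        'delta.logRetentionDuration' = 'interval 30 days'
    )
""")

# Check which versions are still listed in the table history.
spark.sql("DESCRIBE HISTORY abc").show(truncate=False)
```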

5 More Replies
Muralidharan_A
by New Contributor
  • 118 Views
  • 1 reply
  • 0 kudos

Supporting file not recognized in DLT pipeline.

We have a DLT pipeline which creates some tables based on some transformations, and those transformations are kept inside a function in a separate file; those files were used via import. We are deploying those changes...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 0 kudos

Hi @Muralidharan_A, To your question about whether retry_on_failure does more than a manual refresh, the answer is yes! retry_on_failure (along with pipelines.numUpdateRetryAttempts and pipelines.maxFlowRetryAttempts) performs classified, timed retri...
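For reference, those retry settings go into the pipeline's configuration block. A rough sketch of the relevant fragment of a pipeline spec, expressed as a Python dict with illustrative values:

```python
# Illustrative fragment of a pipeline definition: the retry-related settings mentioned above
# are passed as configuration key/value pairs.
pipeline_settings = {
    "name": "my_pipeline",
    "configuration": {
        "pipelines.numUpdateRetryAttempts": "3",  # retry attempts for a failed update
        "pipelines.maxFlowRetryAttempts": "2",    # retry attempts for an individual flow
    },
}
```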

397973
by New Contributor III
  • 241 Views
  • 2 replies
  • 1 kudos

Resolved! Jobs & Pipelines: is it possible for "Run parameters" to display a value generated in code?

Hi. I'm testing out the "Run parameters" you see in Jobs & Pipelines. As far as I know, this value is set manually by "Job parameters" on the right side bar. Can I set the value within code though? Like if I want something dynamically generated depen...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 1 kudos

Hi @397973, Interesting question and I did not know the answer. So, I ran the test you described on my own workspace. Sharing what I found in case it saves you time. The short answer is that the task values won't populate the Run parameters column. V...
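For anyone who lands here later, the task values mechanism being discussed (which passes dynamically generated values between tasks, even though they never appear in the Run parameters column) looks roughly like this; the task name "ingest" is illustrative:

```python
# In an upstream task: compute a value at run time and publish it.
run_label = f"load_{spark.sql('SELECT current_date()').first()[0]}"
dbutils.jobs.taskValues.set(key="run_label", value=run_label)

# In a downstream task: read it back (taskKey is the upstream task's name).
label = dbutils.jobs.taskValues.get(taskKey="ingest", key="run_label", default="unknown", debugValue="dev_run")
print(label)
```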

1 More Replies