Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I have an API that triggers Spark calculations. The API is hosted in a Python 3.12 pod in AKS and connects to a Databricks cluster using Databricks 18.1.1. Initially I was using a getOrCreate call on my API requests, and it all works. But the problem is - as Spark sessi...
I have a Python 3.12 pod in AKS using DatabricksConnect 18.1.1 connecting to a Databricks cluster 18.1. All works great, and normally I see no issues running a series of Spark queries. But once in a while, even without any load on the dedicated cluster we have, quer...
I’ve been exploring a metadata-driven approach to data engineering through a project called Data Engineering Copilot. The idea is to treat Source-to-Target Mapping (STTM) documents as structured metadata rather than static documentation. Instead of man...
A lot of AI announcements these days start to sound similar after a while. A new model is better. A new agent is faster. A new framework can do more. And most of the time, the conversation stays focused on the tool itself. That is why Omnigent caught ...
I have some JSON files in a specific volume. When I try to search for them, they don't appear, but when I query the volume using Python I am able to get them and read their content. Any help?
Hi. The global workspace search won't return results for files stored in Unity Catalog Volumes. Its indexing is focused on workspace assets and catalog-managed objects, rather than the underlying files within a Volume. To locate files in a Volume, navi...
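For the programmatic route the question mentions, here is a minimal sketch, assuming the code runs on Databricks compute where the volume is reachable as a regular directory under a `/Volumes/...` path. The helper name `find_json_files` and the catalog, schema, and volume names in the comment are hypothetical.

```python
from pathlib import Path


def find_json_files(volume_root: str) -> list[str]:
    """Recursively list .json files under a directory, sorted by path."""
    root = Path(volume_root)
    return sorted(str(p) for p in root.rglob("*.json") if p.is_file())


# Hypothetical volume path; substitute your own catalog, schema, and volume.
# find_json_files("/Volumes/my_catalog/my_schema/my_volume")
```

This only walks the directory tree; it does not depend on workspace search indexing.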
Hi. I need to compare the sizes of my Delta tables. What's the correct approach? The table size reported by the ANALYZE command? But how do I check the Delta log size? If I enable CDF, how do I know the CDF log size (the overhead it adds)? Kind of l...
Hi @RGSLCA, DESCRIBE DETAIL is the best starting point if you're comparing Delta table sizes, but it's important to understand what it reports. The sizeInBytes value represents only the latest active snapshot of the table, not the total storage consum...
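To collect the sizeInBytes figure described above for several tables at once, something like the following could work. It is a minimal sketch: the `snapshot_sizes` helper, the `run_sql` callable, and the table names are hypothetical, and the result covers only the latest snapshot of each table, as the reply notes.

```python
from typing import Callable, Iterable, Mapping


def snapshot_sizes(
    run_sql: Callable[[str], Mapping], tables: Iterable[str]
) -> dict:
    """Return {table_name: sizeInBytes} for the latest snapshot of each table.

    run_sql is assumed to execute one statement and return a single
    row-like object that supports row["sizeInBytes"].
    """
    sizes = {}
    for name in tables:
        detail = run_sql(f"DESCRIBE DETAIL {name}")
        sizes[name] = detail["sizeInBytes"]
    return sizes


# On a cluster, run_sql could be (assumption):
#   run_sql = lambda query: spark.sql(query).first()
# sizes = snapshot_sizes(run_sql, ["my_catalog.my_schema.table_a"])
```

Passing the SQL runner in as a parameter keeps the helper independent of how the Spark session is created.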
Hi Databricks Community, I am able to list the container from my Databricks workspace but unable to list the folders and files further. If I try to access the same files and folders from the Databricks UI, external location path, I am able to see all fil...
The following may be the causes: 1. Different authentication methods: the UI's external location uses Unity Catalog credentials; your dbutils.fs.ls() command uses the compute's Spark configurations; these may be using different credentials with differe...
Hi everyone, I’m working with around 22,000 Unity Catalog external Delta tables, and my requirement is to execute DESCRIBE HISTORY table_name LIMIT 1 for each table and append the latest record into a single consolidated table. I’ve already tried mul...
Hi. The reason your performance degrades so badly (4 mins for 2k tables, but 50 mins for 12k) is the Spark Driver. When you run spark.sql("DESCRIBE HISTORY...") inside a ThreadPoolExecutor, every single one of those 22,000 queries has to be...
Is there any difference between pyspark.RDD.foreachPartition and pyspark.sql.DataFrame.foreachPartition under the hood? The PySpark documentation describes pyspark.sql.DataFrame.foreachPartition as "a shorthand for df.rdd.foreachPartition()". If DataFra...
Although the PySpark documentation states that DataFrame.foreachPartition() is a shorthand for df.rdd.foreachPartition(), there is an important difference when running on Databricks shared clusters (especially with Unity Catalog and Spark Connect). D...
Hi All, We are facing issues, though not every time, while reading a storage account that holds stream data from Dataverse, accessed in Unity Catalog through an external table. It was running fine with Hive. An error occurred while calling o393.sql.: org.apache.spark.SparkEx...
This issue appears to be related to Azure Storage access through Unity Catalog rather than the data itself, especially since the same workload was working fine with Hive and the failure is intermittent. A few areas worth checking: 1. Storage Credential...
Hi, I have created a multi-page dashboard in Databricks. I want to download all the pages of the dashboard as a single PDF file. But when I export the dashboard, I get it only in .json format. Is there a way to download all the pages as a PDF file?
Dashboards provide a Download as PDF capability for published dashboards. You can distribute a multi-page dashboard as a PDF with all pages, and you can configure a scheduled email subscription that includes all dashboard pages in the generated PDF. You can follow...
Lakebase just went GA. Here's what a production migration actually looks like. For most of the last decade, our data infrastructure lived in two separate worlds. On one side: a transactional database handling operational workloads - the writes, the loo...
Hi, I use Genie Code extensively for research, planning, and development when building ETL scripts and code migrations. As far as I know, Databricks manages the backend LLM models for the Genie Code agent. I wanted to try Genie Code with frontier models for my...
Hi @Mailendiran, From what’s publicly documented, Genie Code already uses frontier models behind the scenes, but it isn’t exposed as a bring-your-own-model or manual model-selection experience. Databricks describes Genie Code as an agentic system tha...
I define all clusters as variables in separate files, so I can re-use them. Then I am accessing them in jobs as: The issue is that I want to change just the custom_tags in the cluster when instantiating it for a job, because my tags are different for each ...
Yes, you can achieve this seamlessly, but not by overriding the custom_tags inside the cluster variable. Instead, you define your specific tags at the Job level, and Databricks automatically merges them with your cluster variable's tags. Because compl...