Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

kenmyers-8451
by Contributor
  • 3881 Views
  • 2 replies
  • 0 kudos

Long runtimes on simple copying of data

Hi, my team has been trying to identify areas where we can improve our processes. We have some long runtimes on processes that have multiple joins and aggregations. To create a baseline, we have been running tests on a simple select-and-write operation...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your slow Spark runtime and unexpectedly long WholeStageCodeGen compute times are likely tied to a mix of Delta Lake features (especially deletion vectors), Spark’s physical plan, and partition handling. Here’s a detailed breakdown and advice based o...

saadi
by New Contributor
  • 3693 Views
  • 1 replies
  • 0 kudos

Resolved! Could not connect Self Hosted MySQL Database in Azure Databricks

Hi, I am trying to connect a self-hosted MySQL database in Databricks but keep encountering errors. Database setup: the MySQL database is hosted on a VM. We use DBeaver or Navicat to query it. Connection to the database requires an active Azure VPN Client...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To connect a self-hosted MySQL database (on a VM, Azure VPN required) to Databricks, you need several components to align: network access from Databricks to MySQL, proper JDBC connector configuration, and correct authentication. This setup is common ...
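For the JDBC piece specifically, a minimal sketch, assuming a hypothetical private IP, database, table, and credentials (network reachability over the VPN still has to be solved separately, e.g. via VNet peering):

```python
# Hedged sketch: JDBC options for a self-hosted MySQL instance. All values
# below are hypothetical placeholders; the MySQL Connector/J driver must be
# installed on the cluster for the commented read to work.
mysql_host = "10.0.0.5"  # private IP reachable from the Databricks network
jdbc_url = f"jdbc:mysql://{mysql_host}:3306/mydb"

options = {
    "url": jdbc_url,
    "driver": "com.mysql.cj.jdbc.Driver",
    "dbtable": "orders",
    "user": "analyst",
    "password": "<from-a-secret-scope>",  # e.g. dbutils.secrets.get(scope, key)
}

# In a notebook: df = spark.read.format("jdbc").options(**options).load()
```

If the read times out rather than failing authentication, the problem is almost always the network path, not the options above.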

nishg
by New Contributor III
  • 3499 Views
  • 1 replies
  • 0 kudos

Upgraded cluster to 16.1/16.2 and uploading data (append) to Elastic index is failing

I have updated the compute cluster to both Databricks versions 16.1 and 16.2 and run the workflow to append data into an Elastic index, but it started failing with the below error. The same job works fine with Databricks version 15. Let me know if anyone co...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your error is a known issue that appears after upgrading Databricks clusters to versions 16.1 and 16.2, specifically when running workflows that append data into an Elasticsearch index. This error ("Path must be absolute: myindex/_delta_log") indicates a ch...

Sujith_i
by New Contributor
  • 3584 Views
  • 1 replies
  • 1 kudos

databricks sdk for python authentication failing

I am trying to use the Databricks SDK for Python to do some account-level operations, like creating groups. I created a Databricks config file locally and provided the profile name as an argument to AccountClient, but authentication keeps failing. The same con...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Authentication for account-level operations with Databricks SDK for Python requires more than just referencing the profile name in your local .databrickscfg file. While the CLI consults .databrickscfg for profiles and can use them directly, the SDK's...
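As a hedged sketch of what an account-level profile typically needs (all identifiers below are placeholders): unlike a workspace profile, the host must point at the accounts console endpoint, and an account_id is required alongside the credentials:

```ini
; Hypothetical ~/.databrickscfg profile for account-level operations.
; The host varies by cloud (e.g. https://accounts.cloud.databricks.com on AWS,
; https://accounts.azuredatabricks.net on Azure).
[ACCOUNT_PROFILE]
host          = https://accounts.azuredatabricks.net
account_id    = 00000000-0000-0000-0000-000000000000
client_id     = <service-principal-application-id>
client_secret = <oauth-secret>
```

With a profile like this, AccountClient(profile="ACCOUNT_PROFILE") can resolve account-level credentials; a profile containing only a workspace host and token will work for the CLI against that workspace but fail for account-level calls.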

AvneeshSingh
by New Contributor
  • 3573 Views
  • 2 replies
  • 1 kudos

Autoloader Data Reprocess

Hi, if possible can anyone please help me with some Autoloader options? I have 2 open queries: (i) Let's assume I am running some Autoloader stream and my job fails; instead of resetting the whole checkpoint, I want to run the stream from a specified timest...

Data Engineering
autoloader
Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

In Databricks Autoloader, controlling the starting point for streaming data after a job failure requires careful management of checkpoints and configuration options. By default, Autoloader uses checkpoints to remember where the stream last left off, ...
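As a hedged illustration of one such option: when a stream is started with a brand-new checkpoint, Auto Loader's modifiedAfter file filter can approximate "resume from this timestamp" without replaying everything (the path and cutoff below are hypothetical placeholders):

```python
# Hedged sketch: options for a *new* Auto Loader stream that skips input
# files older than a cutoff. The cutoff timestamp is a placeholder.
autoloader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.includeExistingFiles": "true",
    "modifiedAfter": "2025-01-15T00:00:00.000000Z",  # ignore older files
}

# In a notebook (a fresh checkpointLocation is required -- an existing
# checkpoint's file ledger takes precedence over these options):
# df = (spark.readStream.format("cloudFiles")
#         .options(**autoloader_options)
#         .load("/mnt/landing/events"))
```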

Nidhig
by Contributor
  • 270 Views
  • 1 replies
  • 2 kudos

Resolved! Global Parameter at the Pipeline level in Lakeflow Job

Hi, is there any workaround, or could Databricks enable a global parameters feature at the pipeline level in Lakeflow Jobs? Currently I am working on migrating an ADF pipeline schedule setup to Lakeflow Jobs.

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

Databricks Lakeflow Declarative Pipelines do not currently support truly global parameters at the pipeline level in the same way that Azure Data Factory (ADF) allows, but there are workarounds that enable parameterization to streamline migration from...
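A common workaround, sketched here with hypothetical keys: put key/value pairs in the pipeline's configuration block and read them back in pipeline code, so that repeating the same keys across pipelines approximates an ADF-style global parameter:

```json
{
  "name": "my_pipeline",
  "configuration": {
    "env": "prod",
    "source_path": "abfss://raw@mystorage.dfs.core.windows.net/"
  }
}
```

In the pipeline's source code, values set this way are typically read back with spark.conf.get("env"); the pipeline name and keys above are illustrative, not prescribed.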

VaDim
by New Contributor III
  • 192 Views
  • 1 replies
  • 0 kudos

Resolved! transformWithStateInPandas. Invalid pickle opcode when updating ValueState with large (float) array

I am getting an error when the entity I need to store in a ValueState is a large array (over 15k-20k items). No error (and works correctly) if I trim the array to under 10k samples. The same error is raised when using it as a value for MapState or as...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The error you’re facing, specifically PySparkRuntimeError: Error updating value state: invalid pickle opcode, usually points to a serialization (pickling) problem when storing large arrays in Flink/Spark state such as ValueState, ListState, or MapStat...
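One hedged workaround (not from the thread itself) when a single state value is too large: shard the pickled bytes across several smaller entries, e.g. a MapState keyed by chunk index, and reassemble on read. The chunking logic on its own, self-contained:

```python
import pickle

CHUNK_BYTES = 4096  # hypothetical per-entry size budget for the state store

def to_chunks(values):
    """Pickle a large object and split the bytes into fixed-size chunks,
    suitable for storing as multiple MapState entries keyed by offset."""
    blob = pickle.dumps(values)
    return {i: blob[i:i + CHUNK_BYTES] for i in range(0, len(blob), CHUNK_BYTES)}

def from_chunks(chunks):
    """Reassemble the chunks in key order and unpickle."""
    return pickle.loads(b"".join(chunks[k] for k in sorted(chunks)))

# Round-trip an array around the size that reportedly triggered the error.
data = [float(i) for i in range(20_000)]
assert from_chunks(to_chunks(data)) == data
```

This only demonstrates the sharding itself; wiring the chunks into transformWithStateInPandas state handles is left to the pipeline code.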

SamAdams
by Contributor
  • 139 Views
  • 1 replies
  • 0 kudos

Time window for "All tables are updated" option in job Table Update Trigger

I've been using the Table Update Trigger for some SQL alert workflows. I have a job that uses 3 tables with an "All tables updated" trigger:
Table 1 was updated at 07:20 UTC
Table 2 was updated at 16:48 UTC
Table 3 was updated at 16:50 UTC
-> Job is trig...

Data Engineering
jobs
TableUpdateTrigger
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

There is no fixed or documented “window” time for the interval between updates to all monitored tables before a job with an "All tables updated" trigger runs in Databricks. The job is triggered as soon as every table in the set has seen at least one ...

ak5har
by New Contributor II
  • 3331 Views
  • 9 replies
  • 2 kudos

Databricks connection to on-prem cloudera

Hello, we are trying to evaluate the Databricks solution to extract data from an existing Cloudera schema hosted on a physical server. We are using the Databricks serverless compute provided by the Databricks Express setup, and we assume we will not need t...

Latest Reply
Adrian_Ashley
New Contributor II
  • 2 kudos

I work for a Databricks partner called Cirata. Our Data Migrator offering allows both data and metadata replication from Cloudera to be delivered to the Databricks environment, whether this is just delivering it to the ADLS2 object storage or to ...

pepco
by New Contributor II
  • 213 Views
  • 2 replies
  • 2 kudos

Resolved! Environment in serverless

I'm playing a little bit with the Databricks Free environment and I'm super confused by the documentation vs. actual behavior. Maybe you could help me understand better. For the workspace I can define a base environment which I can use in serverless ...

Data Engineering
base environment
serverless
Latest Reply
K_Anudeep
Databricks Employee
  • 2 kudos

Hello @pepco, is it possible to use environments with notebook tasks? Yes, but only in a very specific way. Notebook tasks can use base environments, but you don't attach them in the job's YAML. You pick the base env in the notebook's Environment sid...

KKo
by Contributor III
  • 446 Views
  • 1 replies
  • 0 kudos

On Prem MS sql to Azure Databricks

Hi all, I need to ingest data from on-prem MS SQL tables into the Azure cloud using Databricks. For the ingest, I previously used notebooks, JDBC connectors, reading SQL tables and writing into Unity Catalog tables. Now I want to experiment with Databricks connectors f...

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

This feature is good to go... I can't think of any disadvantages. Here is a guide: https://landang.ca/2025/01/31/simple-data-ingestion-from-sql-server-to-databricks-using-lakeflow-connect/

Suheb
by New Contributor III
  • 124 Views
  • 1 replies
  • 0 kudos

How have you set up a governance structure (data access control, workspace management, cluster policies)?

If your company uses Databricks with many people, how do you manage security, organize teams, and control costs — and what tools do you use to make it all work smoothly?

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

Please take a look here to get some initial ideas. https://medium.com/databricks-unity-catalog-sme/a-practical-guide-to-catalog-layout-data-sharing-and-distribution-with-databricks-unity-catalog-763e4c7b7351  

him
by New Contributor III
  • 25678 Views
  • 14 replies
  • 10 kudos

I am getting the below error while making a GET request to a job in Databricks after successfully running it

{"error_code": "INVALID_PARAMETER_VALUE", "message": "Retrieving the output of runs with multiple tasks is not supported. Please retrieve the output of each individual task run instead."}

Latest Reply
Octavian1
Contributor
  • 10 kudos

Hi @Debayan, I'd suggest also mentioning this explicitly in the documentation of the workspace client for get_run_output. One has to pay extra attention to the example run_id=run.tasks[0].run_id, otherwise it can easily be missed.

alhuelamo
by New Contributor II
  • 10606 Views
  • 5 replies
  • 1 kudos

Getting non-traceable NullPointerExceptions

We're running a job that's issuing NullPointerExceptions without traces of our job's code. Does anybody know what would be the best course of action when it comes to debugging these issues? The job is a Scala job running on DBR 11.3 LTS. In case it's rel...

Latest Reply
Amora
New Contributor II
  • 1 kudos

You could try enabling full stack traces and checking the Spark executor logs for hidden errors. NullPointerExceptions in Scala on DBR often come from lazy evaluations or missing schema fields during I/O. Reviewing your DataFrame transformations a...

Phani1
by Databricks MVP
  • 4811 Views
  • 4 replies
  • 2 kudos

Convert EBCDIC (Binary) file format to ASCII

Hi Team,How can we convert EBCDIC (Binary) file format to ASCII in databricks? Do we have any libraries in Databricks?

Latest Reply
amulight
New Contributor II
  • 2 kudos

Hi Phani1, were you able to do that successfully? Can you share the details and steps, please? Thanks.

