Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

I-am-Biplab
by New Contributor II
  • 1809 Views
  • 4 replies
  • 4 kudos

Is there a Databricks Spark connector for Java?

Is there a Databricks Spark connector for Java, just like we have for Snowflake (reference of Snowflake Spark connector - https://docs.snowflake.com/en/user-guide/spark-connector-use)? Essentially, the use case is to transfer data from S3 to a Databric...

Latest Reply
sandeepmankikar
Databricks Partner
  • 4 kudos

You don't need a separate Spark connector; Databricks natively supports writing to Delta tables using standard Spark APIs. Instead of using JDBC, you can use df.write().format("delta") to write data from S3 to Databricks tables efficiently.

3 More Replies
turagittech
by Contributor
  • 1741 Views
  • 5 replies
  • 1 kudos

Reading different file structures for json files in blob stores

Hi All, We are planning to store some mixed JSON files in blob store and read them into Databricks. I am questioning whether we should have a container for each structure or if the various tools in Databricks can successfully read the different types. I ha...

Latest Reply
sandeepmankikar
Databricks Partner
  • 1 kudos

Organize files by schema into subfolders (e.g., /schema_type_a/, /schema_type_b/) in the same container. Avoid putting all JSON types in one folder.
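As a plain-Python sketch of that layout (the folder names follow the suggestion above; the file contents and field names are made up), each subfolder holds exactly one structure, so whatever reads a given folder, json.loads locally or spark.read.json / Auto Loader in Databricks, never sees mixed schemas:

```python
import json
import tempfile
from pathlib import Path

# One subfolder per JSON schema inside the same container (illustrative names).
root = Path(tempfile.mkdtemp())
(root / "schema_type_a").mkdir()
(root / "schema_type_b").mkdir()

# Two documents with deliberately different structures.
(root / "schema_type_a" / "1.json").write_text(json.dumps({"id": 1, "name": "x"}))
(root / "schema_type_b" / "1.json").write_text(json.dumps({"event": "click", "ts": 1700000000}))

def read_folder(folder: Path) -> list:
    """Read every JSON file under one schema folder."""
    return [json.loads(p.read_text()) for p in sorted(folder.glob("*.json"))]

# Each folder can now be read with a single, consistent schema.
records_a = read_folder(root / "schema_type_a")
records_b = read_folder(root / "schema_type_b")
print(records_a, records_b)
```

The same idea carries over to Spark: point one read (or one Auto Loader stream) at each subfolder, instead of asking a single read to infer a schema over mixed documents.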

4 More Replies
LearnDB123
by New Contributor
  • 3274 Views
  • 2 replies
  • 0 kudos

Saving a file to /tmp is not working after migration to Unity Catalog

Hi, We upgraded our runtime cluster to Unity Catalog recently, and since then some of the code that was working fine earlier has been failing. We used to save files to "/tmp/" and then move them from temp into our blob storage; however, since the migration t...

Latest Reply
Rahul6
New Contributor II
  • 0 kudos

Hi @filipniziol, could we use Volumes for this temp processing rather than using S3?

1 More Replies
utkarshamone
by New Contributor III
  • 1344 Views
  • 1 reply
  • 0 kudos

Internal errors when running SQL queries

We are running Databricks on GCP with a classic SQL warehouse on the current version (v 2025.15). We have a pipeline that runs DBT on top of the SQL warehouse. Since the 9th of May, our queries have been failing intermittently with internal errors f...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 0 kudos

Hi @utkarshamone, the error messages you've shared, such as:
- [INTERNAL_ERROR] Query could not be scheduled: HTTP Response code: 503
- ExecutorLostFailure ... exited with code 134, sigabrt
- Internal error
indicate that your Databricks SQL warehouse o...
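While the underlying capacity issue is being addressed, intermittent 503-style failures can often be absorbed client-side by retrying with exponential backoff. A generic sketch in plain Python (run_with_retries, flaky_query, and the RuntimeError stand-in are illustrative, not a Databricks or DBT API):

```python
import time

def run_with_retries(query_fn, max_attempts=4, base_delay=0.01):
    """Retry a flaky call with exponential backoff (base, 2x, 4x, ...)."""
    for attempt in range(max_attempts):
        try:
            return query_fn()
        except RuntimeError:  # stand-in for a transient 503 / internal error
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

# Demo: a fake query that fails twice with a 503-style error, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("HTTP Response code: 503")
    return "ok"

result = run_with_retries(flaky_query)
print(result)
```

In real code you would catch only the specific transient exception your client raises, and keep the base delay on the order of seconds rather than milliseconds.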

I-am-Biplab
by New Contributor II
  • 2150 Views
  • 3 replies
  • 1 kudos

Is there a Databricks Spark connector for Java?

Is there a Databricks Spark connector for Java, just like we have for Snowflake (reference of Snowflake Spark connector - https://docs.snowflake.com/en/user-guide/spark-connector-use)? Essentially, the use case is to transfer data from S3 to a Databric...

Latest Reply
Shua42
Databricks Employee
  • 1 kudos

Hey @I-am-Biplab, if running locally, it is going to be difficult to tune the performance up that much, but there are a few things you can try: 1. Up the partitions and batch size as much as your machine will allow. Also, running repartition() coul...

2 More Replies
jeremy98
by Honored Contributor
  • 11203 Views
  • 11 replies
  • 4 kudos

Resolved! ImportError: cannot import name 'AnalyzeArgument' from 'pyspark.sql.udtf'

Hello community, I installed the Databricks extension in my VS Code IDE. How do I fix this error? I created the environment to run my notebooks locally and selected the available remote cluster to execute my notebook; what else? I have this error: ImportError...

Latest Reply
jeremy98
Honored Contributor
  • 4 kudos

@unj1m Yes, as Alberto said, you don't need to install pyspark; it is included in your cluster configuration.

10 More Replies
Prajit0710
by New Contributor II
  • 674 Views
  • 1 reply
  • 0 kudos

Resolved! Authentication issue in HiveMetastore

Problem Statement: When I execute the code below as part of the notebook, both manually and in a workflow, it works as expected: df.write.mode("overwrite").format('delta').option('path', ext_path).saveAsTable("tbl_schema.Table_name") but when I integr...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 0 kudos

Hi @Prajit0710 This is an interesting issue where your Delta table write operation works as expected when run directly, but when executed within a function, the table doesn't get recognized by the HiveMetastore. The key difference is likely related to ...

tebodelpino1234
by New Contributor
  • 3552 Views
  • 1 reply
  • 0 kudos

Can I view __ALLOW_EXPECTATIONS_COL in Unity Catalog?

I am developing a DLT pipeline that manages expectations, and it works correctly, but I need to see the columns __DROP_EXPECTATIONS_COL, __MEETS_DROP_EXPECTATIONS, and __ALLOW_EXPECTATIONS_COL in Unity Catalog. I can see them in the Delta table that the DLT generat...

Latest Reply
kamal_ch
Databricks Employee
  • 0 kudos

Materialization tables created by DLT include these columns to process expectations but they might not propagate to Unity Catalog representations such as views or schema-level metadata unless explicitly set up for such lineage or column-level exposur...

KS12
by New Contributor
  • 4072 Views
  • 1 reply
  • 0 kudos

Unable to get s3 data - o536.ls.

Error while executing display(dbutils.fs.ls(f"s3a://bucket-name/")). bucket-name has read/list permissions. shaded.databricks.org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://bucket-name/ com.amazonaws.SdkClientException: Unable to ex...

Latest Reply
kamal_ch
Databricks Employee
  • 0 kudos

To start with, enable SSL debugging by passing the JVM option -Djavax.net.debug=ssl in the cluster configuration. This helps identify whether the handshake is failing due to missing certificates or invalid paths. Also check the cluster initialization sc...
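For reference, that JVM option can be added under the cluster's Spark config (driver and executor shown here; which side you need depends on where the S3 calls are failing):

```
spark.driver.extraJavaOptions -Djavax.net.debug=ssl
spark.executor.extraJavaOptions -Djavax.net.debug=ssl
```

Remember to remove it afterwards; SSL debug logging is very verbose.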

minhhung0507
by Valued Contributor
  • 4440 Views
  • 1 reply
  • 0 kudos

Error Listing Delta Log on GCS in Databricks

I am encountering an issue while working with a Delta table in Databricks. The error message is as follows: java.io.IOException: Error listing gs://cimb-prod-lakehouse/bronze-layer/dbd/customer_info_update_request_processing/_delta_log/ This issue occ...

Latest Reply
kamal_ch
Databricks Employee
  • 0 kudos

Ensure that the Databricks workspace has the necessary permissions to access the GCS bucket. Check if the service account used for Databricks has "Storage Object Viewer" or a similar role granted. Verify that the path "gs://cimb-prod-lakehouse/bronze...
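As one possible way to grant that role at the bucket level (a sketch; the service account email is a placeholder for whichever account your Databricks workspace/cluster actually uses, and roles/storage.objectViewer covers both get and list on objects):

```shell
# Placeholder service account; substitute the one attached to your Databricks deployment.
gcloud storage buckets add-iam-policy-binding gs://cimb-prod-lakehouse \
  --member="serviceAccount:databricks-sa@your-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
```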

DaPo
by New Contributor III
  • 4018 Views
  • 2 replies
  • 0 kudos

DLT Fails with Exception: CANNOT_READ_STREAMING_STATE_FILE

I have several DLT pipelines writing to a schema in Unity Catalog. The storage location of the Unity Catalog is managed by the Databricks deployment (on AWS). The schema and the DLT pipelines are managed via Databricks Asset Bundles. I did not cha...

Latest Reply
mani_22
Databricks Employee
  • 0 kudos

Hi @DaPo, have you made any code changes to your streaming query? There are limitations on which changes to a streaming query are allowed between restarts from the same checkpoint location; refer to the documentation. The checkpoint location appears to ...

1 More Replies
oscarramosp
by New Contributor II
  • 1834 Views
  • 3 replies
  • 1 kudos

DLT Pipeline upsert question

Hello, I'm working on a DLT pipeline to build what would be a data warehouse/data mart. I'm facing issues trying to "update" my fact table when the dimensions that are outside the pipeline fail to be up to date at my processing time, so on the next r...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

The error encountered, "Cannot have multiple queries named catalog.schema.destination_fact for catalog.schema.destination_fact. Additional queries on that table must be named," arises because Delta Live Tables (DLT) disallows multiple unnamed queries...

2 More Replies
Zeruno
by New Contributor II
  • 4156 Views
  • 1 reply
  • 0 kudos

UDFs with modular code - INVALID_ARGUMENT

I am migrating a massive codebase to PySpark on Azure Databricks, using DLT pipelines. It is very important that the code be modular; that is, for the time being I am looking to make use of UDFs that use modules and classes. I am receiving the following...

Latest Reply
briceg
Databricks Employee
  • 0 kudos

Hi @Zeruno. What you can do is package up your code and pip install it in your pipeline. I had the same situation: I developed some code which ran fine in a notebook, but when used in a DLT pipeline, the deps were not found. Packaging them up an...
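A rough sketch of that workflow, with a hypothetical package name and wheel path: give the shared modules a minimal build config, build a wheel (e.g., with python -m build), and install it at the top of the pipeline notebook.

```
# pyproject.toml (minimal; "my_shared_code" is a made-up package name)
[project]
name = "my_shared_code"
version = "0.1.0"

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

# Then, in the first cell of the DLT pipeline notebook (path is a placeholder):
# %pip install /Workspace/path/to/my_shared_code-0.1.0-py3-none-any.whl
```

Once the code is installed as a package, UDFs can import it the same way on every worker, instead of depending on notebook-scoped modules.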

jlynlangford
by New Contributor
  • 1144 Views
  • 1 reply
  • 0 kudos

collect() in SparkR and sparklyr

Hello, I'm seeing a vast difference in performance between SparkR::collect() and sparklyr::collect(). I have a somewhat complicated query that uses WITH ... AS (CTE) syntax to get the data set I need; there are several views defined and joins required. The final data...

Latest Reply
niteshm
New Contributor III
  • 0 kudos

@jlynlangford This is a tricky situation, and multiple resolutions can be tried to address the performance gap. Schema Complexity: If the DataFrame contains nested structs, arrays, or map types, collect() can become significantly slower due to complex...

thomas_berry
by Databricks Partner
  • 1854 Views
  • 3 replies
  • 2 kudos

Resolved! federated queries on PostgreSQL - TimestampNTZ option

Hello, I am trying to migrate some Spark reads away from JDBC to federated queries based on Unity Catalog. Here is an example of the Spark read command that I want to migrate: spark.read.format("jdbc").option("driver", "org.postgresql.Driver").opt...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 2 kudos

Thanks @thomas_berry I hope so 

2 More Replies