Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by Garrus990 (New Contributor II)
  • 490 Views
  • 1 reply
  • 1 kudos

How to run a python task that uses click for CLI operations

Hey, in my application I am using Click to facilitate CLI operations. It works locally, in notebooks, and when scripts are run locally, but it fails in Databricks. I defined a task that, as an entrypoint, accepts the file where the click-decorated functio...

Latest Reply
VZLA
Databricks Employee
  • 1 kudos

The SystemExit issue you’re seeing is typical with Click, as it’s designed for standalone CLI applications and automatically calls sys.exit() after running a command. This behavior can trigger SystemExit exceptions in non-CLI environments, like Datab...
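As a rough sketch of that workaround (the entrypoint and arguments below are illustrative, not from the thread): with real Click you can call `entrypoint.main(args=..., standalone_mode=False)` so that no `sys.exit()` is raised at all; the generic fallback is to catch `SystemExit` and convert it into a plain return code.

```python
import sys

def invoke_cli(entrypoint, argv):
    """Run a Click-style entrypoint without letting its sys.exit() kill the job.

    With real Click, `entrypoint.main(args=argv, standalone_mode=False)` avoids
    the exit entirely; this wrapper is the generic fallback for any callable
    that raises SystemExit when it finishes.
    """
    try:
        entrypoint(argv)
        return 0
    except SystemExit as exc:  # Click raises this after every command
        return int(exc.code or 0)

# Stub standing in for a @click.command entrypoint (hypothetical):
def my_command(argv):
    print(f"processing {argv}")
    sys.exit(0)  # Click does this implicitly in standalone mode

rc = invoke_cli(my_command, ["--input", "/tmp/data"])
print(rc)  # 0
```

Returning normally instead of letting `SystemExit` propagate lets the Databricks task finish with a success status.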

by Dp15 (Contributor)
  • 318 Views
  • 1 reply
  • 0 kudos

Databricks JDBC Insert into Array field

Hi, I am trying to insert some data into a Databricks table which has Array<String> fields (field1 & field2). I am using JDBC for the connection, and my POJO class looks like this: public class A { private Long id; private String[] field1; priv...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

The error you're encountering, [Databricks][JDBC](11500) Given type does not match given object: [Ljava.lang.String;@3e1346b0, indicates that the JDBC driver is not recognizing the Java String[] array as a valid SQL array type. This is a common issue...

by Vivek_Singh (New Contributor III)
  • 250 Views
  • 1 reply
  • 0 kudos

Getting error :USER_DEFINED_FUNCTIONS.CORRELATED_REFERENCES_IN_SQL_UDF_CALLS_IN_DML_COMMANDS_NOT_IMP

Hello folks, need help: I implemented row-level security at Unity Catalog and it is working as expected; however, while deleting a record I get the error enclosed in detail [USER_DEFINED_FUNCTIONS.CORRELATED_REFERENCES_IN_SQL_UDF_CALLS_IN_DML_COMMANDS_NOT_IM...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

The correlated subqueries within SQL User-Defined Functions (UDFs) used for row-level security are currently not supported for DELETE operations in Unity Catalog. You will need to adjust your row_filter_countryid_source_table UDF to avoid correlated ...

by SankaraiahNaray (New Contributor II)
  • 1780 Views
  • 1 reply
  • 1 kudos

default auth: cannot configure default credentials

I'm trying to use dbutils from WorkspaceClient, and I tried to run this code from a Databricks notebook, but I get this error: ValueError: default auth: cannot configure default credentials. Code: from databricks.sdk import WorkspaceClient; w = Workspac...

Latest Reply
VZLA
Databricks Employee
  • 1 kudos

To resolve the ValueError: default auth: cannot configure default credentials error when using dbutils from WorkspaceClient in a Databricks notebook, follow these steps: Ensure SDK Installation: Make sure the Databricks SDK for Python is installed. ...
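A minimal sketch of the "explicit credentials" step (the environment-variable names are the standard ones the SDK itself reads; the helper function is illustrative, not part of the SDK):

```python
import os

def explicit_auth_kwargs():
    """Build explicit credentials for WorkspaceClient instead of relying on the
    default-auth chain, which raises 'cannot configure default credentials'
    when no supported auth method resolves.

    DATABRICKS_HOST / DATABRICKS_TOKEN are the standard environment variables
    the Databricks SDK for Python reads.
    """
    host = os.environ.get("DATABRICKS_HOST")
    token = os.environ.get("DATABRICKS_TOKEN")
    if not host or not token:
        raise ValueError(
            "Set DATABRICKS_HOST and DATABRICKS_TOKEN (or configure another "
            "supported auth method) before constructing WorkspaceClient"
        )
    return {"host": host, "token": token}
```

With the SDK installed you would then construct the client as `w = WorkspaceClient(**explicit_auth_kwargs())` rather than leaving the default chain to guess.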

by SakuraDev1 (New Contributor II)
  • 342 Views
  • 1 reply
  • 0 kudos

autoloader cache and buffer utilization error

Hey guys, I'm encountering an issue with a project that uses Auto Loader for data ingestion. The production cluster is shutting down due to the error: The Driver restarted - possibly due to an OutOfMemoryError - and this stream has been stopped. I've i...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

The error message is sometimes generic ("possibly due to an OutOfMemoryError"). There is indeed memory pressure, but try to correlate those graph metrics with the driver's STDOUT file content and check whether the GC/full GCs are able to work properly and rec...

by SakuraDev1 (New Contributor II)
  • 299 Views
  • 1 reply
  • 0 kudos

autoloader cache and buffer utilization error (follow-up)

Link to post: (autoloader cache and buffer utilization error) by SakuraDev1 https://community.databricks.com/t5/data-engineering/autoloader-cache-and-buffer-utilization-error/m-p/94927#M39000 Hey guys, I'm encountering an issue with a project that use...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

To address the resource scheduling and code-specific optimizations for your Auto Loader data ingestion pipeline, consider the following suggestions: Resource Scheduling Dynamic Allocation: Enable dynamic allocation in your cluster configuration. Thi...
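The ingestion-rate part of this advice can be sketched as a set of Auto Loader options that cap how much each micro-batch pulls in, which bounds driver-side buffering (the values are illustrative starting points, not tuned recommendations):

```python
# Rate-limit each micro-batch so the driver does not buffer unbounded file
# metadata between triggers. Both options are real Auto Loader settings;
# the format and values here are assumptions for illustration.
autoloader_options = {
    "cloudFiles.format": "json",              # source format (assumption)
    "cloudFiles.maxFilesPerTrigger": "1000",  # cap on files per micro-batch
    "cloudFiles.maxBytesPerTrigger": "10g",   # soft cap on bytes per micro-batch
}

# Applied to a stream (requires a Spark session; shown for context only):
# (spark.readStream.format("cloudFiles")
#      .options(**autoloader_options)
#      .load("s3://bucket/path"))   # hypothetical source path
```

Smaller micro-batches trade some throughput for a flatter driver-memory profile, which is usually the right trade when the stream dies with OutOfMemoryError restarts.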

by pesky_chris (New Contributor III)
  • 445 Views
  • 1 reply
  • 0 kudos

Resolved! Support of Dashboards in Databricks Asset Bundles

Hello Databricks & Fellow Users, I noticed that support for Dashboards in DABs is coming soon (per the recent Databricks CLI pull request). Does anyone know if there are additional features planned to enhance the dashboard lifecycle? Currently, Git Fo...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

I can see that Git/Repos support in Lakeview Dashboards is already in development. There is no ETA yet for when this will be GA, but we can guarantee it is in progress.

by cool_cool_cool (New Contributor II)
  • 507 Views
  • 1 reply
  • 0 kudos

Databricks Workflow is stuck on the first task and doesn't do any workload

Heya, I have a workflow in Databricks with 2 tasks. They are configured to run on the same job cluster, and the second task depends on the first. I have a weird behavior that has happened twice now: the job takes a long time (it usually finishes within 30...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Given the provided context, the suggestion is to capture thread dumps from both the Spark Driver and any Active Executor when the task seems to be hung. Ideally, you should also be able to find in the Spark logs for the active executor with the hung ...

by Dave_Nithio (Contributor)
  • 483 Views
  • 1 reply
  • 0 kudos

Production vs Development DLT Schema

My organization is currently ingesting data utilizing a Delta Live Table pipeline. This pipeline points to a production Storage location and Target schema. This means that whenever we make changes to this pipeline, it directly impacts the production ...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

To test changes to your Delta Live Table (DLT) pipeline without impacting production data, you can point to a different storage location and target schema. This does not require creating a completely separate DLT pipeline. Here are the steps: Create...
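A minimal sketch of the "separate storage location and target schema per environment" idea; the paths and schema names below are hypothetical placeholders, and in practice these values would go into the DLT pipeline's settings:

```python
def pipeline_settings(env: str) -> dict:
    """Derive environment-specific DLT pipeline settings so a dev run never
    writes to production. Storage path and schema names are placeholders."""
    assert env in ("dev", "prod"), "unknown environment"
    return {
        # Separate storage root per environment (hypothetical account/container):
        "storage": f"abfss://dlt@mystorageacct.dfs.core.windows.net/{env}",
        # Separate target schema the pipeline publishes tables into:
        "target": f"sales_{env}",
    }

dev = pipeline_settings("dev")
prod = pipeline_settings("prod")
```

Keeping the two settings derived from a single `env` parameter makes it hard to accidentally mix a dev storage location with the prod target schema.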

by adhi_databricks (New Contributor III)
  • 265 Views
  • 1 reply
  • 0 kudos

DATABRICKS CLEANROOMS

Hi Team, I have a few questions regarding Databricks Cleanrooms: For onboarding first-party data, does the collaborator need a Databricks account with an enabled UC workspace? How is it useful for activating data for retargeting or prospecting use cases...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

For onboarding first-party data, the collaborator does need a Databricks account with an enabled Unity Catalog (UC) workspace. This is necessary to map system tables into its metastore and to observe non-UC governed assets. Activating data for retarg...

by sanket-kelkar (New Contributor II)
  • 431 Views
  • 1 reply
  • 0 kudos

Auto OPTIMIZE causing a data discrepancy

I have a Delta table in Azure Databricks that gets MERGEd every 10 minutes. In the attached screenshot, in the version history of this table, I see a MERGE operation every 10 minutes, which is expected. Along with that, I see the OPTIMIZE operation aft...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Can you please provide more context about this, specifically with respect to the DBR Release and reproducibility of this scenario? Any metrics or plan change differences between both select statements, while the Optimize was in progress and after? Th...

by AcrobaticMonkey (New Contributor II)
  • 286 Views
  • 1 reply
  • 0 kudos

Cannot Get Query Results in SQL Alerts

Example query: SELECT name, date FROM errors; Now I want to trigger an alert if the count is greater than 1, and a notification should be sent to Slack with the output rows (name and date values). Even if I use {{QUERY_RESULT_ROWS}}, it only gives value after ...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Note I have not tried this myself, but can you try the following and let me know if it helps: Create the query (SELECT name, date FROM errors;). Set up the alert, with the condition set to trigger when the count of rows is greater than 1. Create ...

by jonathanjone (New Contributor)
  • 264 Views
  • 1 reply
  • 0 kudos

Facing Some Issues with Tablet PC and Databricks Product – Any Advice?

Hello everyone, I'm having some trouble using Databricks SQL Analytics v2.1 on my tablet PC, and I was wondering if anyone here has had similar experiences or could offer some advice. The main issues I'm facing are: Performance slowdowns: When I run com...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Hi @jonathanjone, 1 - Performance slowdowns could be because of the warehouse size and the query count: the warehouse has a limit of 10 queries in parallel, beyond which you see queries being queued. You could also check if the q...

by guangyi (Contributor III)
  • 296 Views
  • 1 reply
  • 1 kudos

Resolved! Has the numUpdateRetryAttempts property been deprecated?

I noticed there is a numUpdateRetryAttempts property mentioned in the document https://learn.microsoft.com/en-us/azure/databricks/delta-live-tables/properties used for configuring the retry count of any DLT pipeline, but I cannot find it in the DL...

Latest Reply
VZLA
Databricks Employee
  • 1 kudos

According to the Delta Live Tables properties reference, pipelines.numUpdateRetryAttempts is a recognized configuration parameter. It specifies the maximum number of attempts to retry an update before failing the update when a retryable failure occur...
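For reference, the property sits in the pipeline's configuration block alongside other Spark-style settings; a sketch as a plain mapping (the value 3 is just an example, and like other confs it is passed as a string):

```python
# Fragment of a DLT pipeline "configuration" block, expressed as a dict.
# Keys and string values mirror how pipeline settings are specified.
pipeline_configuration = {
    "pipelines.numUpdateRetryAttempts": "3",  # retry a failed update up to 3 times
}
```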

by Viswanth (New Contributor II)
  • 761 Views
  • 3 replies
  • 0 kudos

Implementing Conditional Logic for Dependent Tasks Using SQL Output and Task Values

Hi team, I'm working on setting up a workflow with task dependencies where a subsequent task should execute conditionally, based on the result of a preceding SQL task. Specifically, I need to evaluate an if/else condition on the output of the SQL quer...

Latest Reply
Ramana
Contributor
  • 0 kudos

This feature is in Private Preview.

2 More Replies
