Data Engineering
Forum Posts

Kratik
by New Contributor III
  • 966 Views
  • 1 reply
  • 0 kudos

Spark submit job running python file

I have a spark-submit job which runs one Python file called main.py. The other file is alert.py, which is imported in main.py. main.py also uses multiple config files. alert.py is passed in --py-files and the other config files are passed as ...

Data Engineering
pyfiles
spark
submit
Latest Reply
Kaniz
Community Manager
  • 0 kudos

Hi @Kratik, To run the Spark submit job in Databricks and pass the --py-files and --files options, you can use the dbx command-line tool.
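In plain Apache Spark terms, the setup described in the thread corresponds to a spark-submit invocation along these lines (main.py and alert.py are from the post; the config file names are placeholders):

```shell
# Ship alert.py to the executors' Python path and attach the config
# files to each node's working directory; config names are illustrative.
spark-submit \
  --py-files alert.py \
  --files app.conf,db.conf \
  main.py
```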

TimB
by New Contributor II
  • 2742 Views
  • 1 reply
  • 0 kudos

Create external table using multiple paths/locations

I want to create an external table from more than a single path. I have configured my storage creds and added an external location, and I can successfully create a table using the following code: create table test.base.Example using csv options ( h...

Latest Reply
Kaniz
Community Manager
  • 0 kudos

Hi @TimB, you can import data from multiple paths using wildcards or similar patterns when creating an external table in Databricks. To import data from multiple paths using wildcards, you can modify the location parameter in the CREATE TABLE stateme...
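A sketch of the wildcard approach the reply describes, using the glob support of Spark's file sources via the path option (table name follows the post; the storage path and CSV options are placeholders):

```sql
-- Spark's file-based sources accept glob patterns in the path option,
-- so one external table can span several sibling directories.
CREATE TABLE test.base.Example
USING CSV
OPTIONS (
  header = 'true',
  path 'abfss://container@account.dfs.core.windows.net/landing/2023-*/'
);
```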

marcuskw
by Contributor
  • 1453 Views
  • 2 replies
  • 1 kudos

Resolved! whenNotMatchedBySourceUpdate ConcurrentAppendException Partition

ConcurrentAppendException requires a good partitioning strategy; here my logic works without fault for "whenMatchedUpdate" and "whenNotMatchedInsert" logic. When using "whenNotMatchedBySourceUpdate", however, it seems that the condition doesn't isolate...

Latest Reply
Kaniz
Community Manager
  • 1 kudos

Hi @marcuskw, based on the provided information and the given code snippet, it seems that the condition in the whenNotMatchedBySourceUpdate clause does not isolate the specific partition in the Delta table. This can lead to a ConcurrentAppendExc...
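A sketch of one common fix: repeat the partition predicate in the whenNotMatchedBySourceUpdate condition so the whole transaction stays inside one partition. Everything here is an assumption, not from the thread: delta-spark >= 2.3, an active SparkSession named spark, a source DataFrame named source, and a target table partitioned by region.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

target = DeltaTable.forName(spark, "main.default.example")

(target.alias("t")
 .merge(source.alias("s"),
        # Pin the partition on both sides so Delta can prove the
        # transaction only touches one partition.
        "t.region = 'EU' AND s.region = 'EU' AND t.id = s.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 # whenNotMatchedBySourceUpdate scans target rows with no source match,
 # so the partition predicate has to be repeated here as well.
 .whenNotMatchedBySourceUpdate(
     condition="t.region = 'EU'",
     set={"active": F.lit(False)})
 .execute())
```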

1 More Replies
Ajay-Pandey
by Esteemed Contributor III
  • 2719 Views
  • 5 replies
  • 0 kudos

How can we send Databricks logs to Azure Application Insights?

Hi All, I want to send Databricks logs to Azure Application Insights. Is there any way we can do it? Any blog or doc will help me.

Latest Reply
floringrigoriu
New Contributor II
  • 0 kudos

Hi @Debayan, in https://learn.microsoft.com/en-us/azure/architecture/databricks-monitoring/application-logs there is a GitHub repository mentioned: https://github.com/mspnp/spark-monitoring. That repository is marked as maintenance mode. Just...

4 More Replies
pvm26042000
by New Contributor III
  • 1728 Views
  • 4 replies
  • 2 kudos

Benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?

benefit of using vectorized pandas UDFs instead of the standard Pyspark UDFs?

Latest Reply
Sai1098
New Contributor II
  • 2 kudos

Vectorized pandas UDFs offer improved performance compared to standard PySpark UDFs by leveraging the power of pandas and operating on entire columns of data at once, rather than row by row. They provide a more intuitive and familiar programming inter...
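The row-at-a-time vs. batch-at-a-time difference can be illustrated outside Spark with plain pandas (a minimal sketch; in a real pandas_udf the function receives a whole pd.Series per Arrow batch in the same way):

```python
import pandas as pd

def plus_one_rowwise(values):
    # Standard PySpark UDF behaviour: one Python-level call per row.
    return [v + 1.0 for v in values]

def plus_one_vectorized(s: pd.Series) -> pd.Series:
    # pandas UDF behaviour: one call per column batch; the arithmetic
    # runs in compiled code over the whole Series.
    return s + 1.0

data = [1.0, 2.0, 3.0]
assert plus_one_rowwise(data) == plus_one_vectorized(pd.Series(data)).tolist()
```

The per-row Python call overhead (plus serialization in real Spark) is what the vectorized form avoids.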

3 More Replies
pranavyadavbugy
by New Contributor
  • 1710 Views
  • 2 replies
  • 0 kudos

Regarding Discount on certifications for students

Hi team, I'm a student. Are there any student discounts on certification? If yes, please let me know. Thanks

Latest Reply
FeliciaWilliam
New Contributor III
  • 0 kudos

Exciting news for students! Enjoy special discounts on certifications. If you need more study resources, check out Chegg Study alternatives on https://edureviewer.com/sites-like-chegg-study/ for extra support in the middle of your academic journey. It...

1 More Replies
User15787040559
by New Contributor III
  • 1035 Views
  • 2 replies
  • 0 kudos

How to translate Apache Pig FILTER statement to Spark?

If you have the following Apache Pig FILTER statement:

XCOCD_ACT_Y = FILTER XCOCD BY act_ind == 'Y';

the equivalent code in Apache Spark is:

XCOCD_ACT_Y_DF = (XCOCD_DF.filter(col("act_ind") == "Y"))

Latest Reply
FeliciaWilliam
New Contributor III
  • 0 kudos

Translating an Apache Pig FILTER statement to Spark requires understanding the differences in syntax and functionality between the two processing frameworks. While both aim to filter data, Spark uses a different syntax and approach, typically involvi...
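A sketch filling out the Spark side of the translation (assumes an existing DataFrame named XCOCD_DF; the col helper comes from pyspark.sql.functions, which the original excerpt leaves implicit):

```python
from pyspark.sql.functions import col

# Pig:   XCOCD_ACT_Y = FILTER XCOCD BY act_ind == 'Y';
# Spark: keep only rows whose act_ind column equals 'Y'.
XCOCD_ACT_Y_DF = XCOCD_DF.filter(col("act_ind") == "Y")
```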

1 More Replies
narvinya
by New Contributor
  • 1540 Views
  • 1 reply
  • 0 kudos

Resolved! What is the best approach to use Delta tables without Unity Catalog enabled?

Hello! I would like to work with Delta tables outside of a Databricks UI notebook. I know that the best option would be to use databricks-connect, but I don't have Unity Catalog enabled. What would be the most effective way to do so? I know that via JDBC ...

Latest Reply
Kaniz
Community Manager
  • 0 kudos

Hi @narvinya,
  • Delta tables can be accessed outside of the Databricks UI notebook without using databricks-connect or Unity Catalog.
  • Three options are available for working with Delta tables outside of the Databricks UI notebook:
    1. Using JDBC: Read and...

MUA
by New Contributor
  • 2157 Views
  • 2 replies
  • 1 kudos

OSError: [Errno 7] Argument list too long

Getting this error in Databricks and I don't know how to solve it: OSError: [Errno 7] Argument list too long: '/dbfs/databricks/aaecz/dev/w000aaecz/etl-framework-adb/0.4.31-20230503.131701-1/etl_libraries/utils/datadog/restart_datadog.sh'. If anyone can help...
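For context on the error itself: errno 7 is E2BIG, raised when the combined size of the arguments and environment handed to the kernel's exec call exceeds the ARG_MAX limit. A quick stdlib-only way to inspect that limit on the driver:

```python
import os

# E2BIG ("Argument list too long") fires when argv plus the environment
# passed to exec exceeds the kernel's ARG_MAX limit.
arg_max = os.sysconf("SC_ARG_MAX")
print(f"ARG_MAX on this machine: {arg_max} bytes")
```

A common workaround is to shorten what ends up on the command line, e.g. point the script at a config file instead of inlining long values.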

Latest Reply
jose_gonzalez
Moderator
  • 1 kudos

@MUA  Just a friendly follow-up. Did any of the responses help you to resolve your question? If it did, please mark it as best. Otherwise, please let us know if you still need help.

1 More Replies
lawrence009
by Contributor
  • 1177 Views
  • 4 replies
  • 2 kudos

Troubleshooting Spill

I am trying to troubleshoot why spill occurred during DeltaOptimizeWrite. I am running a 64-core cluster with 256 GB RAM, which I expect to handle this amount of data (see attached DAG).

IMG_1085.jpeg
Latest Reply
jose_gonzalez
Moderator
  • 2 kudos

You can resolve the spill to memory by increasing the shuffle partitions, but 16 GB of spill memory should not have a major impact on your job execution. Could you share more details on the actual source code that you are running?
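The knob the reply refers to is spark.sql.shuffle.partitions; a minimal sketch, assuming an active SparkSession named spark (400 is an arbitrary illustrative value, not a recommendation from the thread):

```python
# More shuffle partitions -> smaller partitions per task -> less chance
# that a task's sort buffer spills to disk.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```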

3 More Replies
JKR
by New Contributor III
  • 1686 Views
  • 4 replies
  • 1 kudos

Resolved! Got Failure: com.databricks.backend.common.rpc.SparkDriverExceptions$ReplFatalException error

Got the below failure on a scheduled job on an interactive cluster, and the next scheduled run executed fine. I want to know why this error occurred and how I can prevent it from happening again. And how do I debug these errors in the future? com.databricks.backend.commo...

Latest Reply
jose_gonzalez
Moderator
  • 1 kudos

@JKR Just a friendly follow-up. Did any of the responses help you to resolve your question? If it did, please mark it as best. Otherwise, please let us know if you still need help.

3 More Replies
mbejarano89
by New Contributor III
  • 630 Views
  • 1 reply
  • 1 kudos

Resolved! Cloning content of Repos into shared Workspace

Hello, I have a git repository on Databricks with notebooks that are meant to be shared with other users. The reason these notebooks are in git as opposed to the "shared" workspace already is because they are to be continuously improved and need sepa...

Latest Reply
User16539034020
Contributor II
  • 1 kudos

Hello, thanks for contacting Databricks Support. I presume you're looking to transfer files from external repositories to a Databricks workspace. I'm afraid there is currently no direct support for it. You may consider using the REST API, which allows for...

Bagger
by New Contributor II
  • 681 Views
  • 2 replies
  • 1 kudos

Databricks Connect - driver error

Hi, we are experiencing instability when executing queries using Databricks Connect. Sometimes we are unable to receive the full result set without encountering an error with the message "Driver is up but is not responsive...". When we run the same query...

Data Engineering
databricks connect
driver
Latest Reply
Kaniz
Community Manager
  • 1 kudos

Hi @Bagger,
  • Error message: "Driver is up but is not responsive..."
  • Potential causes:
    • Unreachable cluster
      • Check workspace instance name and cluster ID
      • Verify environment variables on the local development machine
    • Python ve...
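The environment-variable check can be scripted; a minimal sketch that verifies the variables databricks-connect commonly reads (these names are the usual ones for token-based Databricks Connect setups, an assumption; adjust to your auth method):

```python
import os

REQUIRED = ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")

def missing_vars(env=os.environ):
    # Return the required variables that are absent or empty.
    return [name for name in REQUIRED if not env.get(name)]

# Example with a fake environment: only the host is set,
# so the token and cluster ID are reported missing.
print(missing_vars({"DATABRICKS_HOST": "https://adb-123.azuredatabricks.net"}))
```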

1 More Replies
vgupta
by New Contributor II
  • 2945 Views
  • 5 replies
  • 4 kudos

DLT | Cluster terminated by System-User | INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds

Dear Community, hope you are doing well. For the last couple of days I have been seeing very strange issues with my DLT pipeline: every 60-70 minutes it fails in continuous mode with the error INTERNAL_ERROR: Communication lost with driver. Clu...

DLT_ERROR DLT_Cluster_events
Latest Reply
Reddy-24
New Contributor II
  • 4 kudos

Hello @Debayan, I am facing the same issue while running a Delta Live Table. This job is running in production, but it's not working in dev. I have tried to increase the worker nodes, but no use. Can you please help on this?

4 More Replies
alonisser
by Contributor
  • 3049 Views
  • 6 replies
  • 3 kudos

Resolved! Changing shuffle.partitions with spark.conf in a spark stream - isn't respected even after a checkpoint

Question about Spark checkpoints and offsets in a running stream: when the stream started I needed tons of partitions, so we set it with spark.conf to 5000. As expected, offsets in the checkpoint contain this info and the job used this value. Then we'...

Latest Reply
Leszek
Contributor
  • 3 kudos

@Jose Gonzalez​ thanks for that information! This is super useful. I was struggling with why my streaming job was still using 200 partitions. This is quite a pain for me, because changing the checkpoint will re-insert all data from the source. Do you know where this can...

5 More Replies