Data Engineering

Forum Posts

by TrinaDe • New Contributor II

07-15-2021 8:11:23 AM

2874 Views
2 replies
1 kudos

How can we join two PySpark DataFrames side by side (without using join, equivalent to pd.concat() in pandas)? I am trying to join two extremely large DataFrames, each on the order of 50 million.

My two dataframes look like new_df2_record1 and new_df2_record2 and the expected output dataframe I want is like new_df2: The code I have tried is the following: If I print the top 5 rows of new_df2, it gives the output as expected but I cannot pri...

Latest Reply

TrinaDe
New Contributor II

07-15-2021 8:21:19 AM

1 kudos

The code in a more legible format:

by PawanShukla • New Contributor III

05-26-2022 2:18:24 AM

687 Views
2 replies
0 kudos

Workflow pipeline in Azure Databricks throws error: EventHubsSourceProvider could not be instantiated

I am using the sample code available in the getting started tutorial. It simply reads a JSON file and moves the data into another table, but it throws an error related to EventHubsSourceProvider.

Latest Reply

Kaniz
Community Manager

06-05-2022 11:27:34 PM

0 kudos

Hi @Pawan Shukla, Can you try restarting the cluster?

by Maverick1 • Valued Contributor II

05-20-2022 3:37:02 AM

2734 Views
3 replies
6 kudos

Is there any way to overwrite a partition in a Delta table without specifying each and every partition in replaceWhere? For non-dated partitions, this is really a mess with Delta tables.

Is there any way to overwrite a partition in a Delta table without specifying each and every partition in replaceWhere? For non-dated partitions, this is really a mess with Delta tables. Most of my DE teams don't want to adopt Delta because of these gl...

Latest Reply

Anonymous
Not applicable

06-06-2022 5:57:43 AM

6 kudos

Hi @Saurabh Verma, following up: did you get a chance to check @Hubert Dudek's previous comments?

by Anonymous • Not applicable

05-20-2022 2:31:45 AM

961 Views
3 replies
1 kudos

Query silently failed

Hello all, I'm using the older 6.4 runtime and noticed that a query returned no result, whereas the same query on 10.4 provided the expected result. This is bad, because I got no error, simply no result at all. Are there some Spark settings on the clus...

Latest Reply

Kaniz
Community Manager

06-06-2022 6:04:19 AM

1 kudos

Hi @Alessio Palma, we haven't heard from you since my last response, and I was checking back to see if you have a resolution yet. If you have a solution, please do share it with the community, as it can be helpful to others. Otherwise...

by VM • Contributor

04-25-2022 6:23:22 AM

2418 Views
4 replies
2 kudos

Error using Synapse ML: JavaPackage object is not callable

I am using DBR version 10.1. I want to use the SynapseML package. I am able to install and import it by following the instructions at https://github.com/microsoft/SynapseML. However, when I try to run the code it gives me the error shown in the att...

Latest Reply

User16764241763
Honored Contributor

06-05-2022 9:24:27 PM

2 kudos

Hello @Vikram Mahawal, clusters need to be in the running state to install/uninstall libraries. Could you please start the cluster and try installing it? If you are still stuck, please file a support case with us so we can take a look. Thanks

by Vadim1 • New Contributor III

05-30-2022 5:57:10 AM

1321 Views
3 replies
1 kudos

Resolved! Connect from Databricks to an HBase HDInsight cluster.

Hi, I have a Databricks installation in Azure. I want to run a job that connects to HBase in a separate HDInsight cluster.
What I tried:
Created a peering between the HBase cluster and Databricks VNets.
I can ping the IPs of the HBase ZooKeeper nodes but I cannot acce...

Latest Reply

User16764241763
Honored Contributor

06-05-2022 8:37:30 PM

1 kudos

Vadim, Thank you for the response. Appreciate it.

by lizou • Contributor II

05-15-2022 8:55:31 PM

860 Views
2 replies
2 kudos

Merge into and data loss

I have a Delta table with 20 M rows. The table is updated dozens of times per day. MERGE INTO is used, and the merge worked fine for 1 year. But recently I began to notice that some data is deleted by the MERGE INTO without a delete being specified. Mer...

Latest Reply

lizou
Contributor II

06-05-2022 3:38:00 PM

2 kudos

I can't reproduce the issue anymore. For now, I am going to limit the number of MERGE INTO commands, as intermediate data transformations do not need version history. I am going to try to use combined views for each step, and do a one-time merge i...

by shan_chandra • Honored Contributor III

06-04-2022 12:11:17 PM

3180 Views
1 replies
1 kudos

Resolved! Insert query fails with error "The query is not executed because it tries to launch ***** tasks in a single stage, while maximum allowed tasks one query can launch is 100000;"

Py4JJavaError: An error occurred while calling o236.sql. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:201) at org.apache.spark.sql.execution.datasources.I...

Latest Reply

shan_chandra
Honored Contributor III

06-04-2022 12:21:57 PM

1 kudos

Could you please increase the config below (at the cluster level) to a higher value, or set it to zero?
spark.databricks.queryWatchdog.maxQueryTasks 0
The Spark config while it alleviates the issue.

by PradeepRavi • New Contributor III

08-01-2018 9:36:24 PM

25256 Views
6 replies
9 kudos

How do I prevent _success and _committed files in my write output?

Is there a way to prevent the _success and _committed files in my output? It's a tedious task to navigate to all the partitions and delete the files. Note: the final output is stored in Azure ADLS.

Latest Reply

shan_chandra
Honored Contributor III

06-04-2022 11:57:58 AM

9 kudos

Please find below the steps to remove the _SUCCESS, _committed and _started files.
spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile","false") to remove the success file.
Run the vacuum command multiple times until the _committed and _started files...

by auser85 • New Contributor III

06-03-2022 8:53:23 AM

957 Views
3 replies
1 kudos

dbutils.notebook.run() fails with job aborted but running the notebook individually works

I have a notebook that runs many notebooks in order, along the lines of:

```
%python
notebook_list = ['Notebook1', 'Notebook2']
for notebook in notebook_list:
    print(f"Now on Notebook: {notebook}")
    try:
        dbutils.notebook.run(f'{notebook}', 3600)
    e...
```

Latest Reply

auser85
New Contributor III

06-04-2022 5:12:07 AM

1 kudos

I found the problem. Even if a notebook creates and fully specifies a widget, the notebook run process, e.g., dbutils.notebook.run('notebook'), will not know how to use it. If I replace my widget with a non-widget provided value, the process works fine...
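One way to sketch the loop from the question so that values are passed to each child notebook explicitly, instead of relying on widgets being resolved in the child run. This is only an illustration, not the poster's actual fix: the helper name run_notebooks, the widget name my_widget and its value are hypothetical, and the runner is passed in as a parameter so the loop can be tried outside Databricks.

```python
from typing import Callable, Dict, List, Tuple


def run_notebooks(
    notebooks: List[str],
    run: Callable[[str, int, Dict[str, str]], str],
    arguments: Dict[str, str],
    timeout_seconds: int = 3600,
) -> Tuple[List[str], Dict[str, str]]:
    """Run notebooks in order and record which ones failed.

    `run` is expected to behave like dbutils.notebook.run(path,
    timeout_seconds, arguments); `arguments` is forwarded to every child.
    """
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for notebook in notebooks:
        print(f"Now on Notebook: {notebook}")
        try:
            run(notebook, timeout_seconds, arguments)
            succeeded.append(notebook)
        except Exception as exc:
            # Keep going so one aborted child does not hide the others.
            failed[notebook] = str(exc)
    return succeeded, failed
```

In a Databricks notebook this could be called as run_notebooks(['Notebook1', 'Notebook2'], dbutils.notebook.run, {"my_widget": "some value"}), where my_widget stands in for whatever widget name the child notebooks read.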

by pieseautoford • New Contributor

06-03-2022 10:49:54 PM

310 Views
0 replies
0 kudos

www.pieseford.ro

Hi, my name is Jerry Maguire and I'm an automatic engineer at Piese Ford. Original Ford Fiesta 2008-2012 parts

by jwilliam • Contributor

05-27-2022 2:08:07 AM

3002 Views
4 replies
2 kudos

Resolved! How to view the SQL Query History of traditional Databricks cluster (not Databricks SQL)?

I tried using the Spark cluster UI, but the queries are truncated.

Latest Reply

walkermaster12
New Contributor II

05-31-2022 3:09:44 AM

2 kudos

In Apache Spark prior to 2.1, once a SQL query was run, there was no way to re-run it; all history was lost. Spark SQL introduced the "replay" functionality in Spark 2.1.0, enabling users to re-run any query they have already run. You can run a query...

by Phani1 • Valued Contributor

06-01-2022 1:20:36 AM

2130 Views
3 replies
3 kudos

Resolved! Terminated with exception: Could not initialize class org.rocksdb.Options

Problem statement: When running Delta Live Tables, it gives an error.
Error message: Could not initialize class org.rocksdb.Options
org.apache.spark.sql.streaming.StreamingQueryException: Query cpicpg_us_tgt_amz_bronze [id = a42eec82-0ee8-41b4-9...

Latest Reply

Phani1
Valued Contributor

06-03-2022 10:25:08 AM

3 kudos

Hi Team, thanks for your response. I faced this issue while executing the Delta Live Tables pipeline. Initially I chose Core as the product edition and attached 4 notebooks to the pipeline; each notebook creates Bronze and Silver tables. duri...

by Phani1 • Valued Contributor

05-11-2022 11:44:54 PM

3698 Views
2 replies
0 kudos

Execute tasks in parallel to process multiple files in parallel

Hi all, if we have multiple tasks under a job, how do we invoke a specific task? Is there an API to invoke a specific task of a job instead of the whole job? Use case: When we receive multiple messages from the event hub, each underlying task in ...

Latest Reply

Phani1
Valued Contributor

06-03-2022 10:17:16 AM

0 kudos

Thanks for your response. My question is: if we have multiple tasks in a job, how can we invoke a specific task? I can see an API to invoke the job, but not a particular task in it. Please find the attachment for your reference.

by klllmmm • New Contributor II

05-24-2022 9:22:22 AM

2584 Views
3 replies
1 kudos

Error as no such file when reading CSV file using pandas

I'm trying to read a CSV file saved in data using the pandas read_csv function, but it gives a "No such file" error.
%fs ls /FileStore/tables/
df= pd.read_csv('/dbfs/FileStore/tables/CREDIT_1.CSV')
df= pd.read_csv('/dbfs:/FileStore/tables/CREDIT_1.CSV')...

Latest Reply

klllmmm
New Contributor II

06-03-2022 9:33:44 AM

1 kudos

Thanks to @Werner Stinckens for the answer. I understood that I have to use Spark to read data from clusters.
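For readers who still want to use pandas, the two path styles in the question can be told apart with a small helper. This is a sketch under assumptions, not the accepted answer: it assumes the cluster exposes DBFS to local file APIs under the /dbfs mount (which depends on the cluster type), and the helper name to_local_dbfs_path is hypothetical. Spark readers take the dbfs:/ form, while pandas needs a local path.

```python
def to_local_dbfs_path(path: str) -> str:
    """Rewrite a DBFS-style path to the local '/dbfs/...' form.

    Accepts 'dbfs:/a/b', '/dbfs/a/b', '/a/b' and the mixed form
    '/dbfs:/a/b' that appears in the question.
    """
    # Check the longest, most specific prefix first.
    for prefix in ("/dbfs:/", "dbfs:/", "/dbfs/"):
        if path.startswith(prefix):
            path = path[len(prefix):]
            break
    return "/dbfs/" + path.lstrip("/")
```

With this, pd.read_csv(to_local_dbfs_path('dbfs:/FileStore/tables/CREDIT_1.CSV')) would look for the file under /dbfs/FileStore/tables/. File names are case-sensitive, so the name has to match the listing from %fs ls exactly.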
