Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

RiyazAli
by Valued Contributor
  • 6566 Views
  • 3 replies
  • 3 kudos

Is there a way to concatenate two DataFrames along either axis (row/column) and transpose the DataFrame in PySpark?

I'm reshaping my dataframe as per a requirement, and I came across this situation where I'm concatenating 2 dataframes and then transposing them. I've done this previously using pandas; the pandas syntax goes as below: import pandas as pd df1 = ...

Latest Reply
RiyazAli
Valued Contributor
  • 3 kudos

Hi @Kaniz Fatma​, I no longer see the answer you've posted, but I see you were suggesting `union`. As per my understanding, `union` is used to stack DataFrames one upon another when they share a similar schema / column names. In my situation, I have 2 different...

2 More Replies
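The thread above contrasts the pandas idiom with PySpark. A minimal sketch of the pandas pattern the poster describes, with comments noting commonly used PySpark analogues (the frame contents and names here are hypothetical stand-ins for the poster's df1/df2):

```python
import pandas as pd

# Hypothetical small frames standing in for the poster's df1 / df2.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"c": [5, 6], "d": [7, 8]})

# Row-wise concat (pandas axis=0). PySpark analogue (Spark 3.1+):
#   df1.unionByName(df2, allowMissingColumns=True)
rows = pd.concat([df1, df2], axis=0)

# Column-wise concat (pandas axis=1). PySpark has no direct equivalent;
# a common workaround is to add a row index to each DataFrame
# (monotonically_increasing_id() or row_number()) and join on it.
cols = pd.concat([df1, df2], axis=1)

# Transpose. In PySpark this is usually done by collecting a *small*
# DataFrame to the driver first: spark_df.toPandas().T
transposed = cols.T
```

The column-wise join-on-index workaround only preserves ordering reliably when the index is generated deterministically (e.g. `row_number()` over an explicit ordering), which is worth keeping in mind at scale.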
PawanShukla
by New Contributor III
  • 1199 Views
  • 1 reply
  • 0 kudos

Workflow pipeline in Azure Databricks is throwing the error "EventHubsSourceProvider could not be instantiated"

I am using the sample code available in the getting started tutorial. It simply reads a JSON file and moves the data into another table, but it is throwing an error related to EventHubsSourceProvider.

Maverick1
by Valued Contributor II
  • 4835 Views
  • 3 replies
  • 6 kudos

Is there any way to overwrite a partition in a Delta table without specifying each and every partition in replaceWhere? For non-dated partitions, this is really a mess with Delta tables.

Is there any way to overwrite a partition in a Delta table without specifying each and every partition in replaceWhere? For non-dated partitions, this is really a mess with Delta tables. Most of my DE teams don't want to adopt Delta because of these gl...

Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hi @Saurabh Verma​, following up: did you get a chance to check @Hubert Dudek​'s previous comments?

2 More Replies
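One commonly used answer to the question above is dynamic partition overwrite mode, which replaces only the partitions that receive new data, with no replaceWhere predicate spelled out. A sketch, assuming a SparkSession `spark`, a partitioned Delta table, and a Delta Lake version that supports the option (roughly Delta 2.0 / DBR 11.1+); the table name is hypothetical:

```python
# Session-wide setting: overwrites touch only partitions present in `df`.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
   .format("delta")
   .mode("overwrite")
   .option("partitionOverwriteMode", "dynamic")  # per-write override
   .saveAsTable("my_table"))  # hypothetical table name
```

This is a config fragment that needs a live Spark + Delta environment to run; it is not a substitute for replaceWhere when you need to replace partitions that receive *no* new rows.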
Anonymous
by Not applicable
  • 1651 Views
  • 1 reply
  • 1 kudos

Query silently failed

Hello all, I'm using the older 6.4 runtime and noticed that a query returned no result, whereas the same query on 10.4 provided the expected result. This is bad, because I got no error, simply no result at all. Are there some Spark settings on the clus...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Alessio Palma​, following up: did you get a chance to check @Kaniz Fatma​'s previous comments?

Jack
by New Contributor II
  • 4341 Views
  • 1 reply
  • 1 kudos

Append an empty dataframe to a list of dataframes using for loop in python

I have the following 3 dataframes. I want to append df_forecast to each of df2_CA and df2_USA using a for-loop. However, when I run my code, df_forecast is not appended: df2_CA and df2_USA appear exactly as shown above. Here’s the code: df_list=[df2_CA,...

Latest Reply
User16764241763
Honored Contributor
  • 1 kudos

@Jack Homareau​ Can you try the union functionality with DataFrames? https://sparkbyexamples.com/pyspark/pyspark-union-and-unionall/ And then try to fill the NaNs with the desired values?

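The symptom in this thread (the loop runs but the list elements are unchanged) is often caused by rebinding the loop variable rather than assigning back into the list. A minimal pure-Python illustration of the pitfall, using integers as hypothetical stand-ins for the DataFrames:

```python
df_list = [10, 20]  # stand-ins for df2_CA and df2_USA

# Buggy pattern: `item` is rebound inside the loop; df_list is untouched.
for item in df_list:
    item = item + 1
assert df_list == [10, 20]  # nothing changed

# Working pattern: assign back by index. With real DataFrames this would be
#   df_list[i] = df_list[i].unionByName(df_forecast, allowMissingColumns=True)
for i in range(len(df_list)):
    df_list[i] = df_list[i] + 1
assert df_list == [11, 21]
```

Note that even a correct union leaves `df2_CA` and `df2_USA` (the original names) unchanged; only the list entries are rebound.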
VM
by Contributor
  • 3866 Views
  • 4 replies
  • 2 kudos

Error using Synapse ML: JavaPackage object is not callable

I am using DBR version 10.1 and want to use the Synapse ML package. I am able to install and import it by following the instructions at this link: https://github.com/microsoft/SynapseML. However, when I try to run the code, it gives me the error shown in the att...

Latest Reply
User16764241763
Honored Contributor
  • 2 kudos

Hello @Vikram Mahawal​, clusters need to be in the running state to install/uninstall libraries. Could you please start the cluster and try installing it? If you are still stuck, please file a support case with us so we can take a look. Thanks

3 More Replies
Vadim1
by New Contributor III
  • 2438 Views
  • 3 replies
  • 1 kudos

Resolved! Connect from Databricks to Hbase HDinsight cluster.

Hi, I have a Databricks installation in Azure. I want to run a job that connects to HBase in a separate HDInsight cluster. What I tried: created a peering between the HBase cluster and Databricks vNets. I can ping the IPs of the HBase ZooKeeper nodes but I cannot acce...

Latest Reply
User16764241763
Honored Contributor
  • 1 kudos

Vadim, Thank you for the response. Appreciate it.

2 More Replies
lizou
by Contributor II
  • 1390 Views
  • 2 replies
  • 2 kudos

Merge into and data loss

I have a Delta table with 20M rows. The table is updated dozens of times per day using MERGE INTO, and the merge worked fine for a year. But recently I began to notice that some data is deleted by the MERGE INTO without a DELETE clause specified. Mer...

Latest Reply
lizou
Contributor II
  • 2 kudos

I can't reproduce the issue anymore. For now, I am going to limit the number of MERGE INTO commands, since intermediate data transformations do not need versioning history. I am going to try to use combined views for each step and do a one-time merge i...

1 More Reply
shan_chandra
by Databricks Employee
  • 4843 Views
  • 1 reply
  • 1 kudos

Resolved! Insert query fails with error "The query is not executed because it tries to launch ***** tasks in a single stage, while the maximum allowed tasks one query can launch is 100000"

Py4JJavaError: An error occurred while calling o236.sql. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:201) at org.apache.spark.sql.execution.datasources.I...

Latest Reply
shan_chandra
Databricks Employee
  • 1 kudos

Could you please increase the below config (at the cluster level) to a higher value, or set it to zero: spark.databricks.queryWatchdog.maxQueryTasks 0. Setting this Spark config alleviates the issue.

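The config from the reply above, as it would look set at runtime in a notebook (it can equally go in the cluster's Spark config; this is a config fragment that needs a live Spark session to run):

```python
# Raise or disable the Query Watchdog task-count limit.
# "0" disables the check entirely; a large positive value raises the ceiling.
spark.conf.set("spark.databricks.queryWatchdog.maxQueryTasks", "0")
```

Disabling the watchdog removes a guardrail, so an alternative worth considering is reducing the task count itself, e.g. by coalescing or repartitioning the input before the insert.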
PradeepRavi
by New Contributor III
  • 33297 Views
  • 6 replies
  • 10 kudos

How do I prevent _success and _committed files in my write output?

Is there a way to prevent the _SUCCESS and _committed files in my output? It's a tedious task to navigate to all the partitions and delete the files. Note: the final output is stored in Azure ADLS.

Latest Reply
shan_chandra
Databricks Employee
  • 10 kudos

Please find the below steps to remove the _SUCCESS, _committed, and _started files: spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile","false") to remove the success file; run the VACUUM command multiple times until the _committed and _started files...

5 More Replies
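The settings from the reply above, collected as one runtime sketch (a config fragment for a live Databricks Spark session; the commit-protocol switch is the commonly cited way to stop new _committed/_started markers from being written):

```python
# Stop writing the _SUCCESS marker file.
spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false")

# Use the vanilla Spark commit protocol so _committed/_started markers
# are not produced for new writes (existing markers still need cleanup,
# e.g. via VACUUM as described above).
spark.conf.set(
    "spark.sql.sources.commitProtocolClass",
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
)

# Suppress Parquet _metadata / _common_metadata summary files.
spark.conf.set("parquet.enable.summary-metadata", "false")
```

Note that switching the commit protocol trades away the transactional-write guarantees the default Databricks protocol provides, so it is best scoped to the jobs that need clean output directories.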
auser85
by New Contributor III
  • 2076 Views
  • 3 replies
  • 1 kudos

dbutils.notebook.run() fails with job aborted but running the notebook individually works

I have a notebook that runs many notebooks in order, along the lines of: ```%python notebook_list = ['Notebook1', 'Notebook2'] for notebook in notebook_list: print(f"Now on Notebook: {notebook}") try: dbutils.notebook.run(f'{notebook}', 3600) e...

Latest Reply
auser85
New Contributor III
  • 1 kudos

I found the problem. Even if a notebook creates and specifies a widget fully, the notebook run process (e.g., dbutils.notebook.run('notebook')) will not know how to use it. If I replace my widget with a non-widget-provided value, the process works fine...

2 More Replies
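The finding above matches how dbutils.notebook.run works: the child notebook runs in a fresh context and does not inherit the parent's widget state, but values can be passed explicitly as arguments. A sketch (Databricks-only API, so it will not run outside a workspace; "run_date" is a hypothetical parameter name):

```python
# Pass values explicitly instead of relying on widgets set in the parent.
# Inside each child notebook, read the value with:
#   dbutils.widgets.get("run_date")
notebook_list = ["Notebook1", "Notebook2"]
for notebook in notebook_list:
    print(f"Now on Notebook: {notebook}")
    dbutils.notebook.run(notebook, 3600, {"run_date": "2022-01-01"})
```

The third argument is a dict of string arguments that populate the child's widgets of the same name, which sidesteps the widget-inheritance issue entirely.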
jwilliam
by Contributor
  • 4317 Views
  • 4 replies
  • 2 kudos

Resolved! How to view the SQL Query History of traditional Databricks cluster (not Databricks SQL)?

I tried using the Spark cluster UI, but the queries are truncated.

Latest Reply
walkermaster12
New Contributor II
  • 2 kudos

In Apache Spark prior to 2.1, once a SQL query was run, there was no way to re-run it; all history was lost. Spark SQL introduced the "replay" functionality in Spark 2.1.0, enabling users to re-run any query they have already run. You can run a query...

3 More Replies
Phani1
by Valued Contributor II
  • 3171 Views
  • 2 replies
  • 3 kudos

Resolved! Terminated with exception: Could not initialize class org.rocksdb.Options

Problem statement: when running Delta Live Tables, it gives the error. Error message: Could not initialize class org.rocksdb.Options. org.apache.spark.sql.streaming.StreamingQueryException: Query cpicpg_us_tgt_amz_bronze [id = a42eec82-0ee8-41b4-9...

Latest Reply
Phani1
Valued Contributor II
  • 3 kudos

Hi team, thanks for your response. I faced this issue while executing the Delta Live Tables pipeline. Initially I chose the Core product edition and attached 4 notebooks to the pipeline, each notebook creating Bronze and Silver tables. Duri...

1 More Replies
Phani1
by Valued Contributor II
  • 5207 Views
  • 1 reply
  • 0 kudos

Execute tasks parallel to process multiple files parallel

Hi all, if we have multiple tasks under a job, how do we invoke a specific task under the job? Do we have any API to invoke a job's specific tasks instead of the whole job? Use case: when we receive multiple messages from the Event Hub, each underlying task in ...

Latest Reply
Phani1
Valued Contributor II
  • 0 kudos

Thanks for your response. My question is: if we have multiple tasks in a job, how can we invoke a specific task? I can see an API to invoke the job but not a particular task in it. Kindly find the attachment for your reference.

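As noted in the thread, run-now triggers the whole job. One related capability worth knowing: the Jobs 2.1 repair endpoint can re-run only selected tasks of an existing run via rerun_tasks. A hedged sketch using requests (it needs a live workspace, so the host, token, run_id, and task key below are all placeholders):

```python
import requests

HOST = "https://<workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                 # placeholder

# Re-run only the named task(s) of an existing job run; other tasks
# in the run are left untouched.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/repair",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"run_id": 12345, "rerun_tasks": ["process_eventhub_msgs"]},  # hypothetical
)
resp.raise_for_status()
```

This repairs an existing run rather than starting a fresh one; for triggering independent units of work per Event Hub message, splitting the tasks into separate jobs is the more direct design.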

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group