Data Engineering

Forum Posts

BorislavBlagoev
by Valued Contributor III
  • 1554 Views
  • 2 replies
  • 2 kudos

Resolved! Converting dataframe to delta.

Is it possible to convert a DataFrame to a Delta table without saving the DataFrame to storage?

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

No, it only becomes a Delta table when you write it out.
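
A minimal sketch of that write step (table name and path are placeholders):

```python
# Persist the DataFrame as a managed Delta table; only at this point
# does it become a Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("events")

# Alternatively, write Delta files to a storage path directly.
df.write.format("delta").mode("overwrite").save("/mnt/datalake/events")
```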

1 More Replies
bluetail
by Contributor
  • 11443 Views
  • 6 replies
  • 5 kudos

Resolved! ModuleNotFoundError: No module named 'mlflow' when running a notebook

I am running a notebook on the Coursera platform. My configuration file, Classroom-Setup, looks like this:

%python
spark.conf.set("com.databricks.training.module-name", "deep-learning")
spark.conf.set("com.databricks.training.expected-dbr", "6.4")
...

Latest Reply
User16753724663
Valued Contributor
  • 5 kudos

Hi @Maria Bruevich​, from the error description it looks like the mlflow library is not present. You can use an ML cluster, as that type of cluster already has the mlflow library installed. Please check the document below: https://docs.databricks.com/release-notes/r...
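
If switching to an ML runtime is not an option, one common alternative (an assumption on my part, not something from this thread) is a notebook-scoped install:

```python
# Run this in its own notebook cell; it installs mlflow for the session.
%pip install mlflow
```

After that, `import mlflow` in the next cell should succeed.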

5 More Replies
DanVartanian
by New Contributor II
  • 3160 Views
  • 4 replies
  • 1 kudos

Resolved! Help trying to calculate a percentage

The image below shows what my source data is (HAVE) and what I'm trying to get to (WANT). I want to be able to calculate the percentage of bad messages (where formattedMessage = false) by source and date. I'm not sure how to achieve this in DatabricksS...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

You could use a window function over source and date with a sum of messageCount; this gives you the total per source/date repeated on every row. Then filter on formattedMessage == false and divide messageCount by that sum.
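
A quick PySpark sketch of that suggestion (the DataFrame and column names are taken from the question and may need adjusting):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Total message count per source/date, repeated on every row.
w = Window.partitionBy("source", "date")

bad_pct = (
    df.withColumn("total", F.sum("messageCount").over(w))
      # Keep only the bad messages; adjust if formattedMessage is a boolean.
      .filter(F.col("formattedMessage") == "false")
      .withColumn("badPercentage", 100 * F.col("messageCount") / F.col("total"))
)
```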

3 More Replies
SettlerOfCatan
by New Contributor
  • 1510 Views
  • 0 replies
  • 0 kudos

Access data within the blob storage without downloading

Our customer is using Azure’s blob storage service to save big files so that we can work with them using an Azure online service, like Databricks. We want to read and work with these files with a computing resource obtained from Azure directly, without d...
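
The thread has no replies, but the usual pattern here (an illustrative sketch with placeholder account, container, and secret names) is to point Spark at the blob path and read the data in place, with nothing downloaded first:

```python
# Authenticate to the storage account with an account key kept in a secret scope.
spark.conf.set(
    "fs.azure.account.key.myaccount.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-key"),
)

# Read the file directly from storage; the data stays in Azure.
df = spark.read.json("abfss://mycontainer@myaccount.dfs.core.windows.net/big/file.json")
```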

blob-storage Azure-ML filetypes blob
Azure_Data_Eng1
by New Contributor
  • 286 Views
  • 0 replies
  • 0 kudos

data=[['x', 20220118, 'FALSE', 3],['x', 20220118, 'TRUE', 97],['x', 20220119, 'FALSE', 1],['x'...

data = [['x', 20220118, 'FALSE', 3], ['x', 20220118, 'TRUE', 97],
        ['x', 20220119, 'FALSE', 1], ['x', 20220119, 'TRUE', 49],
        ['Y', 20220118, 'FALSE', 100], ['Y', 20220118, 'TRUE', 900],
        ['Y', 20220119, 'FALSE', 200], ['Y', 20220119, 'TRUE', 800]]
df = spark.creat...

prasadvaze
by Valued Contributor
  • 3307 Views
  • 8 replies
  • 2 kudos

Resolved! SQL endpoint is unable to connect to external hive metastore ( Azure databricks)

Using Azure Databricks, I have set up a SQL endpoint with connection details that match the global init script. I am able to browse tables from a regular cluster in the Data Engineering module, but I get the error below when trying a query using the SQL endpoint...

Latest Reply
prasadvaze
Valued Contributor
  • 2 kudos

@Prabakar Ammeappin​ @Kaniz Fatma​ Also, I found out that after the Delta table is created in the external metastore (with the table data residing in ADLS), I do not need to provide ADLS connection details in the SQL endpoint settings. I only provided...
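
For reference, a sketch of the kind of data access configuration a SQL endpoint needs for an external Hive metastore (the property names follow the standard external-metastore settings; every value is a placeholder):

```
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<server>.database.windows.net:1433;database=<metastore-db>
spark.hadoop.javax.jdo.option.ConnectionUserName <user>
spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/<scope>/<key>}}
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.sql.hive.metastore.version <hive-version>
spark.sql.hive.metastore.jars <jars-path-or-builtin>
```

Note, per the reply above, that no ADLS keys are needed here when the table's data location is already registered in the metastore.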

7 More Replies
Soma
by Valued Contributor
  • 2110 Views
  • 3 replies
  • 1 kudos

Resolved! AutoLoader with Custom Queue

Hi everyone, can someone help with creating a custom queue for Auto Loader, as given here? The default FlushWithClose event is not getting created when my data is uploaded to blob storage. As given in the Azure Databricks docs: cloudFiles.queueName — the name of the Azure queue. If...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

You need to set up the notification service for Blob/ADLS as described here: https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html#cloud-resource-management. setUpNotificationServices will return a queue name, which can later be used in au...
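
A sketch of wiring a pre-created queue into an Auto Loader stream (the format, queue name, and path are placeholders):

```python
# Use file-notification mode with an existing queue instead of letting
# Auto Loader create its own; the queue must already receive the
# storage account's Blob-created events.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.queueName", "my-event-queue")
    .load("abfss://container@account.dfs.core.windows.net/input")
)
```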

2 More Replies
prasadvaze
by Valued Contributor
  • 4710 Views
  • 2 replies
  • 4 kudos

Resolved! How to select from a very large column ( string ) of delta table ?

In one of my Delta tables, the string column "abc" has a value 1753484 characters long. I get an error while selecting or transforming this column value (in the downstream application). How do I solve this? SELECT ID, abc, length(abc) as ...

Latest Reply
Kaniz
Community Manager
  • 4 kudos

Hi @prasad vaze​, try using the CHAR_LENGTH function.
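
A sketch of the suggested query (the table name is a placeholder; ID and abc come from the question). Selecting only the length avoids pulling the 1.7-million-character value itself into the client:

```python
spark.sql("SELECT ID, CHAR_LENGTH(abc) AS abc_len FROM my_table").show()
```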

1 More Replies
sonali1996
by New Contributor
  • 1914 Views
  • 2 replies
  • 0 kudos

Resolved! Multithreading in SCALA DATABRICKS

Hi team, I was trying to call/run multiple notebooks from one notebook concurrently, but the called notebooks are executing one by one, whereas I need to run them all concurrently. I have also tried using threading in Scala Databri...

Latest Reply
Kaniz
Community Manager
  • 0 kudos

Hi @Sonali Bhatt​, this documentation might help you: https://databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache-spark-pipelines.html
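
The linked post covers notebook workflows; for the concurrency itself, one common pattern (sketched here in Python; the Scala version is analogous) is to run each blocking dbutils.notebook.run call on its own thread:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder notebook paths. Each dbutils.notebook.run call blocks until
# the child notebook finishes, so separate threads give true concurrency.
paths = ["/Shared/notebook_a", "/Shared/notebook_b", "/Shared/notebook_c"]

with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    results = list(pool.map(lambda p: dbutils.notebook.run(p, 3600), paths))
```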

1 More Replies
Soma
by Valued Contributor
  • 1148 Views
  • 4 replies
  • 2 kudos

Resolved! Query RestAPI end point in Databricks Standard Workspace

Do we have an option to query a Delta table using a Standard workspace as an endpoint, instead of JDBC?

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@somanath Sankaran​ - Would you be happy to mark @Hubert Dudek​'s answer as best if it resolved the problem? That helps other members who are searching for answers find the solution more quickly.

3 More Replies
MattM
by New Contributor III
  • 1746 Views
  • 5 replies
  • 5 kudos

Resolved! Schema Parsing issue when datatype of source field is mapped incorrect

I have a complex JSON file which has a massive struct column. We regularly have issues when we try to parse this JSON file by forming our case class to extract the fields from the schema. With this approach, the issue we are facing is that if one data type of...

Latest Reply
Anonymous
Not applicable
  • 5 kudos

Hey there, @Matt M​ - If @Hubert Dudek​'s response solved the issue, would you be happy to mark his answer as best? It helps other members find the solution more quickly.

4 More Replies
BorislavBlagoev
by Valued Contributor III
  • 2408 Views
  • 9 replies
  • 3 kudos

Resolved! Tring to create incremental pipeline but fails when I try to use outputMode "update"

def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")

    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO old o
        USING updates u
        ON u.id = o.id
        WHEN MATCHED THEN UPDATE SE...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

The Delta table/file version is too old. Please try to upgrade it as described here: https://docs.microsoft.com/en-us/azure/databricks/delta/versioning
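
A sketch of the upgrade step the linked page describes (the table name comes from the question's MERGE statement; the exact reader/writer versions depend on which Delta features you need, so treat these numbers as placeholders):

```python
# Raise the table's protocol versions so newer Delta features work.
spark.sql("""
    ALTER TABLE old SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5'
    )
""")
```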

8 More Replies
Alex_Persin
by New Contributor II
  • 3015 Views
  • 2 replies
  • 2 kudos

How can the shared memory size (/dev/shm) be increased on databricks worker nodes with custom docker images?

PyTorch uses shared memory to efficiently share tensors between its dataloader workers and its main process. However, in a Docker container the default size of the shared memory (a tmpfs file system mounted at /dev/shm) is 64MB, which is too small to ...

Latest Reply
mstuder
New Contributor II
  • 2 kudos

Also interested in increasing shared memory for use with Ray.

1 More Replies
hetadesai
by New Contributor II
  • 4385 Views
  • 3 replies
  • 4 kudos

Resolved! How to download zip file from SFTP location and put that file into Azure Data Lake and unzip there ?

I have a zip file on an SFTP location. I want to copy that file from the SFTP location, put it into Azure Data Lake, and unzip it there using a Spark notebook. Please help me solve this.
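
The accepted answer isn't shown above, but one workable approach (an illustrative sketch, not necessarily what the thread settled on; host, credentials, and paths are placeholders) is to fetch the file with paramiko, unzip it on the driver, and copy the result to the lake:

```python
import zipfile
import paramiko

# Download the zip from SFTP to the driver's local disk.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="user", password=dbutils.secrets.get("my-scope", "sftp-pw"))
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.get("/remote/data.zip", "/tmp/data.zip")
sftp.close()
transport.close()

# Unzip locally, then copy the extracted files to the data lake mount.
with zipfile.ZipFile("/tmp/data.zip") as zf:
    zf.extractall("/tmp/data")
dbutils.fs.cp("file:/tmp/data", "dbfs:/mnt/datalake/data", True)
```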

Latest Reply
Kaniz
Community Manager
  • 4 kudos

Hi @heta desai​, did our suggestions help you?

2 More Replies
Disney
by New Contributor II
  • 698 Views
  • 1 reply
  • 5 kudos

Resolved! We have hundreds of ETL processes (Informatica) with a lot of logic pulling various data from applications into a relational DB (target DB). Can we use Delta Lake as the target DB?

Hi DB Support, can we use Databricks' Delta Lake as our target DB? Here's our situation... We have hundreds of ETL jobs pulling from these sources (SAP, Siebel/Oracle, Cognos, Postgres). Our ETL process has all of the logic, and our target DB is an MPP syst...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 5 kudos

Hi, yes you can. The best approach is to create a SQL endpoint in a Premium workspace and just write to Delta Lake as you would to SQL. This is a community forum, not support; you can contact Databricks via https://databricks.com/company/contact or via AWS/Azure if you have su...
