I am trying to understand why running a job takes longer than running the notebook manually. And if I try to run jobs concurrently using a workflow or threads, is there a way to reduce job init time?
Hi @darshan doshi​, Jobs creates a job cluster in the backend before it starts the task execution, and this cluster creation may take extra time compared to running a notebook on an existing cluster. 1) If you run a multi-task job, you could selec...
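One common way to avoid paying the cluster start-up cost once per task is to share a single job cluster across the tasks of a multi-task job. Below is a minimal sketch of a Jobs API 2.1 payload submitted with the requests library; the workspace URL, token, notebook paths, and cluster settings are placeholders, not values from this thread.

```python
import requests

# Hypothetical workspace URL and token -- replace with your own.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Both tasks reference the same job_cluster_key, so one cluster is created
# and reused instead of one cluster per task.
job_spec = {
    "name": "shared-cluster-example",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "10.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "task_a",
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Repos/demo/task_a"},
        },
        {
            "task_key": "task_b",
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Repos/demo/task_b"},
        },
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())
```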
Hi Team, we are using the workspace config scripts below. Previously, when we created the workspace from an EC2 instance, we were able to create it without any issue, but when we try to run it through GitHub Actions, we get the error below: Erro...
@karthik p​ this can be fixed by setting a timeout. Please check this: https://kb.databricks.com/en_US/cloud/failed-credential-validation-checks-error-with-terraform
Hello folks! I am calling display() on a streaming query sourced from a Delta table. The display() output shows the new rows added to the source table, but as soon as the results hit 1000 rows, the output is not updated anymore. As a r...
An aggregate function followed by the timestamp field sorted in descending order did the trick: streaming_df.groupBy("field1", "time_field").max("field2").orderBy(col("time_field").desc()).display()
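For completeness, here is a self-contained sketch of the same pattern with the required import; the table name and column names are placeholders rather than the poster's actual schema.

```python
from pyspark.sql.functions import col

# Read the Delta table as a stream (table name is a placeholder).
streaming_df = spark.readStream.table("my_schema.source_table")

# Aggregating bounds the displayed result to the grouped keys, so display()
# keeps refreshing instead of stopping once the 1000-row cap is reached.
(
    streaming_df
    .groupBy("field1", "time_field")
    .max("field2")
    .orderBy(col("time_field").desc())
    .display()
)
```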
I have some code that uses RDDs and the sc.parallelize() and rdd.toDF() methods to get a DataFrame back out. The code works in a regular notebook (and if I run the notebook as a job), but fails if I do the same thing in a DLT pipeline. The error mess...
Thanks for your help, Alex. I ended up rewriting my code with Spark UDFs -- maybe there is a better solution with only the DataFrame API, but I couldn't find it. To summarize my problem: I was trying to un-nest a large JSON blob (the fake data in my f...
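DLT pipelines don't expose the RDD API, which is why sc.parallelize()/rdd.toDF() fails there. As a sketch of one DataFrame-only way to un-nest JSON, assuming the blob is a string column holding an array of structs (the schema, table, and column names here are illustrative assumptions, not the poster's data):

```python
from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, LongType

# Hypothetical schema for the nested JSON blob.
item_schema = ArrayType(StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
]))

df = spark.table("bronze.raw_events")  # placeholder source table

unnested = (
    df
    # Parse the JSON string column into an array of structs.
    .withColumn("items", from_json(col("json_blob"), item_schema))
    # Produce one row per array element.
    .withColumn("item", explode(col("items")))
    # Flatten the struct into top-level columns.
    .select(col("item.id").alias("id"), col("item.name").alias("name"))
)
```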
A/B Testing:- A/B testing is the process of comparing two variations, or two versions, of the same item and choosing the better of the two. Before doing A/B testing you need to focus on one problem that you want to resolve and ...
Happy August! On August 25th we are hosting another Community Social - we're doing these monthly! We want to make sure that we all have the chance to connect as a community often. Come network, talk data, and just get social! Join us for our August ...
As I understand, Databricks supports conversion from Iceberg format to Delta using the command below: CONVERT TO DELTA iceberg.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`; -- uses Iceberg manifest for metadata. However...
I'm using Azure Databricks. I have a custom table-valued function which takes a URL as a parameter and outputs a single-row table with certain elements from the URL extracted/labelled (I get search activity URLs, and when they are in a specific format I can retri...
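Since the question is cut off, here is only a minimal sketch of the kind of table-valued function described, created through spark.sql; the function name, returned columns, and parse_url extraction are illustrative assumptions, and this assumes a runtime that supports SQL table functions.

```python
# Create a hypothetical SQL table-valued function that pulls labelled
# elements out of a URL (names and extracted fields are made up).
spark.sql("""
    CREATE OR REPLACE FUNCTION url_parts(url STRING)
    RETURNS TABLE (host STRING, path STRING, query STRING)
    RETURN SELECT
        parse_url(url, 'HOST'),
        parse_url(url, 'PATH'),
        parse_url(url, 'QUERY')
""")

# Call it for a single URL.
spark.sql(
    "SELECT * FROM url_parts('https://example.com/search?q=databricks')"
).show(truncate=False)
```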
Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...
If the data is 100 MB, then I'd try a single-node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a job cluster.
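As a rough sketch of the workload itself (paths, schema, and view names are placeholders, not details from the thread):

```python
# Read the batch of Avro files (path is a placeholder).
src = spark.read.format("avro").load(
    "abfss://input@account.dfs.core.windows.net/batch/"
)

# Write them out as Parquet.
out_path = "abfss://output@account.dfs.core.windows.net/parquet/my_dataset/"
src.write.mode("overwrite").parquet(out_path)

# Create (or re-create) a persistent view over the Parquet output so it
# remains available after the job cluster is torn down.
spark.sql(f"""
    CREATE OR REPLACE VIEW analytics.my_dataset_view AS
    SELECT * FROM parquet.`{out_path}`
""")
```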
Attempting to install SparkR on the cluster; I have successfully installed other packages such as tidyverse via CRAN. The error is copied below, any help you can provide is greatly appreciated! Databricks Runtime 10.4 LTS. Library installation attempted on ...
Hi @Ross Hamilton​, I believe SparkR comes built in with Databricks RStudio and you don't have to install it explicitly. You can import it directly with library(SparkR), and from your comment above it works for you. The error message you see could be re...
The Next Databricks Office Hours: our next Office Hours session is scheduled for February 23, 2022 - 8:00 am PDT. Do you have questions about how to set up or use Databricks? Do you want to get best practices for deploying your use case or tips on data a...
I'd like to know what the design pattern is for ingesting data via HTTP API requests. The pattern needs to use the multi-hop architecture. Do we need to ingest the JSON output to cloud storage first (not the bronze layer), then use Auto Loader to process the data further? ...
API -> Cloud Storage -> Delta is the more suitable approach. Auto Loader helps you avoid losing any data (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees), enables schema inference, ev...
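Here is a minimal sketch of the landing-zone-to-bronze step with Auto Loader; the paths and table name are placeholders, and the API extraction itself is assumed to be a separate process writing raw JSON files to the landing path.

```python
# Raw JSON dropped by the API extraction process (placeholder paths).
landing_path = "abfss://landing@account.dfs.core.windows.net/api_output/"
checkpoint_path = "abfss://bronze@account.dfs.core.windows.net/_checkpoints/api_output/"

bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader stores the inferred schema and tracks discovered files here.
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(landing_path)
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(once=True)              # process the files available now, then stop
    .toTable("bronze.api_output")    # bronze Delta table
)
```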
I have data like below, and when reading it as CSV, I don't want to treat a comma as a separator when it's within quotes, even if the quotes are not immediately next to the separator (like record #2). Records 1 and 3 are fine if we use the separator, but it fails on the 2nd record...
Hi, I think you can use this option for the CSV reader: spark.read.options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER").csv("your_file.csv"), especially the unescapedQuoteHandling option. You can search for the other options at this l...
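Expanded into a self-contained sketch (the file path and the assumption that the file has a header row are mine, not from the thread):

```python
df = (
    spark.read
    .option("header", True)
    .option("sep", ",")
    # When an unescaped quote is found inside a value, treat everything up to
    # the next delimiter as part of that value instead of splitting on the comma.
    .option("unescapedQuoteHandling", "BACK_TO_DELIMITER")
    .csv("/path/to/your_file.csv")   # placeholder path
)
df.show(truncate=False)
```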
I am trying to read data from an Azure SQL database from Databricks. The Azure SQL database is created with a private link endpoint. Using a DBR 10.4 LTS cluster, and the expectation is that the connector is pre-installed as per the documentation. Using the below code to fetch...
It seems that .option("databaseName", "test") is redundant here, as you need to include the database name in the URL. Please verify that you use a connector compatible with your cluster's Spark version: Apache Spark connector: SQL Server & Azure SQL
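A minimal sketch of a read with the database name placed in the URL; the server name, credentials, and table are placeholders, and the com.microsoft.sqlserver.jdbc.spark format assumes the Spark connector for SQL Server is available on the cluster and matches its Spark version.

```python
# Database name lives in the JDBC URL itself rather than a separate option.
jdbc_url = (
    "jdbc:sqlserver://<your-server>.database.windows.net:1433;"
    "databaseName=test;encrypt=true;trustServerCertificate=false"
)

df = (
    spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")  # Spark connector for SQL Server / Azure SQL
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")             # placeholder table
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .load()
)
df.show(5)
```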