Data Engineering

Forum Posts

darshan
by New Contributor III
  • 767 Views
  • 3 replies
  • 1 kudos

job init takes longer than notebook run

I am trying to understand why running a job takes longer than running the notebook manually. And if I try to run jobs concurrently using a workflow or threads, is there a way to reduce job init time?

Latest Reply
Vivian_Wilfred
Honored Contributor
  • 1 kudos

Hi @darshan doshi, Jobs creates a job cluster in the backend before it starts the task execution, and this cluster creation may take extra time compared to running a notebook on an existing cluster. 1) If you run a multi-task job, you could selec...

2 More Replies
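As the reply suggests, one way to cut job start-up time is to point the job at an existing all-purpose cluster instead of letting it provision a job cluster. A minimal sketch using the Jobs 2.1 REST API; the host/token environment variables, cluster ID, and notebook path are placeholders, not details from the thread:

# Sketch: create a job that runs on an existing all-purpose cluster, so no
# job cluster has to be provisioned before the task starts. All identifiers
# below are hypothetical placeholders.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

job_spec = {
    "name": "reuse-existing-cluster-example",
    "tasks": [
        {
            "task_key": "main",
            "existing_cluster_id": "1234-567890-abcde123",  # hypothetical cluster ID
            "notebook_task": {"notebook_path": "/Users/me@example.com/my_notebook"},
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # {"job_id": ...}

Note the trade-off: an always-running all-purpose cluster starts tasks faster but is billed at a higher rate than a job cluster.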
karthik_p
by Esteemed Contributor
  • 1473 Views
  • 4 replies
  • 9 kudos

Resolved! Unable to create Databricks workspace using Terraform on AWS

Hi Team, we are using the workspace config scripts below. When we previously created the workspace from an EC2 instance, we were able to create it without any issue, but when we try to run through GitHub Actions, we get the error below: Erro...

Latest Reply
Prabakar
Esteemed Contributor III
  • 9 kudos

@karthik p, this can be fixed by setting a timeout. Please check this: https://kb.databricks.com/en_US/cloud/failed-credential-validation-checks-error-with-terraform

3 More Replies
MrsBaker
by New Contributor II
  • 808 Views
  • 1 reply
  • 1 kudos

display() not updating after 1000 rows

Hello folks! I am calling display() on a streaming query sourced from a Delta table. The output shows the new rows added to the source table, but as soon as the results hit 1000 rows, the output stops updating. As a r...

Latest Reply
MrsBaker
New Contributor II
  • 1 kudos

An aggregate function followed by the timestamp field sorted in descending order did the trick:

streaming_df.groupBy("field1", "time_field").max("field2").orderBy(col("time_field").desc()).display()

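A self-contained version of that one-liner, for context: col needs to be imported, and the source table and field names below are the poster's placeholders, not real names.

# Sketch of the workaround above: aggregate and sort newest-first so the
# freshest rows stay inside display()'s 1000-row window.
from pyspark.sql.functions import col

streaming_df = spark.readStream.format("delta").table("source_table")  # placeholder table

(streaming_df
    .groupBy("field1", "time_field")
    .max("field2")
    .orderBy(col("time_field").desc())
    .display())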
KateK
by New Contributor II
  • 945 Views
  • 3 replies
  • 1 kudos

How do you correctly access the spark context in DLT pipelines?

I have some code that uses RDDs, and the sc.parallelize() and rdd.toDF() methods to get a dataframe back out. The code works in a regular notebook (and if I run the notebook as a job) but fails if I do the same thing in a DLT pipeline. The error mess...

Latest Reply
KateK
New Contributor II
  • 1 kudos

Thanks for your help Alex, I ended up rewriting my code with Spark UDFs -- maybe there is a better solution with only the DataFrame API, but I couldn't find it. To summarize my problem: I was trying to un-nest a large JSON blob (the fake data in my f...

2 More Replies
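The thread doesn't show the original schema, but a generic sketch of un-nesting a JSON string column with the DataFrame API alone (no sc.parallelize()/rdd.toDF(), which DLT pipelines reject) might look like this; the schema and sample blob are made up:

# Sketch: parse a JSON array column with from_json and flatten it with
# explode, staying entirely within the DataFrame API.
from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

schema = ArrayType(StructType([          # hypothetical schema
    StructField("id", StringType()),
    StructField("value", StringType()),
]))

df = spark.createDataFrame(
    [('[{"id": "a", "value": "1"}, {"id": "b", "value": "2"}]',)],
    ["json_blob"],
)

flat = (df
    .withColumn("parsed", from_json(col("json_blob"), schema))
    .select(explode(col("parsed")).alias("item"))
    .select(col("item.id"), col("item.value")))

flat.show()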
palak231
by New Contributor
  • 363 Views
  • 0 replies
  • 0 kudos

A/B Testing: A/B testing is the process of comparing two variations, or two versions, of the same item and offering the better of t...

A/B Testing: A/B testing is the process of comparing two variations, or two versions, of the same item and offering the better of the two. Before doing A/B testing you need to focus on one problem that you want to resolve, and ...

Anonymous
by Not applicable
  • 489 Views
  • 1 reply
  • 4 kudos

Happy August! On August 25th we are hosting another Community Social - we're doing these monthly! We want to make sure that we all have...

Happy August! On August 25th we are hosting another Community Social - we're doing these monthly! We want to make sure that we all have the chance to connect as a community often. Come network, talk data, and just get social! Join us for our August ...

Latest Reply
Kaniz
Community Manager
  • 4 kudos

Wow! Super Exciting.

Rajendra
by New Contributor II
  • 860 Views
  • 0 replies
  • 2 kudos

Does Databricks support writing data in Iceberg format?

As I understand, Databricks supports conversion from Iceberg format to Delta using the command below:

CONVERT TO DELTA iceberg.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`; -- uses Iceberg manifest for metadata

However...

jakubk
by Contributor
  • 454 Views
  • 0 replies
  • 0 kudos

Databricks Spark SQL custom table-valued function + struct really slow (minutes for a single row)

I'm using Azure Databricks. I have a custom table-valued function which takes a URL as a parameter and outputs a single-row table with certain elements from the URL extracted/labelled (I get search activity URLs, and when in a specific format I can retri...

sage5616
by Valued Contributor
  • 9724 Views
  • 3 replies
  • 2 kudos

Resolved! Choosing the optimal cluster size/specs.

Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.

2 More Replies
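For reference, a single-node cluster is expressed in a Jobs API cluster spec by setting zero workers plus the singleNode profile. A sketch with a placeholder instance type, not a spec from the thread:

# Sketch: single-node job cluster spec (would go under "new_cluster" in a
# job definition). Node type is a placeholder; pick one for your cloud.
single_node_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "i3.xlarge",   # placeholder instance type
    "num_workers": 0,              # no workers: driver-only cluster
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}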
Ross
by New Contributor II
  • 1742 Views
  • 4 replies
  • 0 kudos

Failed to install cluster scoped SparkR library

Attempting to install SparkR on the cluster; other packages such as tidyverse installed successfully via CRAN. The error is copied below; any help you can provide is greatly appreciated! Databricks Runtime 10.4 LTS. Library installation attempted on ...

Latest Reply
Vivian_Wilfred
Honored Contributor
  • 0 kudos

Hi @Ross Hamilton, I believe SparkR comes built in with Databricks RStudio and you don't have to install it explicitly. You can import it directly with library(SparkR), and from your comment above it works for you. The error message you see could be re...

3 More Replies
Anonymous
by Not applicable
  • 344 Views
  • 1 reply
  • 1 kudos

The Next Databricks Office Hours: Our next Office Hours session is scheduled for February 23, 2022 - 8:00 am PDT. Do you have questions about how to set ...

The Next Databricks Office Hours: Our next Office Hours session is scheduled for February 23, 2022 - 8:00 am PDT. Do you have questions about how to set up or use Databricks? Do you want to get best practices for deploying your use case or tips on data a...

Latest Reply
Kaniz
Community Manager
  • 1 kudos

Great!

ftc
by New Contributor II
  • 1495 Views
  • 3 replies
  • 0 kudos

Resolved! Multi-hop architecture for ingesting data via HTTP API

I'd like to know the design pattern for ingesting data via HTTP API requests. The pattern needs to use a multi-hop architecture. Do we need to ingest the JSON output to cloud storage first (not the bronze layer), then use Auto Loader to process the data further? ...

Latest Reply
artsheiko
Valued Contributor III
  • 0 kudos

The API -> Cloud Storage -> Delta is the more suitable approach. Auto Loader helps not to lose any data (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees), enables schema inference ev...

2 More Replies
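A sketch of the API -> Cloud Storage -> Delta pattern the reply describes: some upstream process lands the raw JSON responses in cloud storage, and Auto Loader picks them up incrementally into a bronze Delta table. All paths and the table name are placeholders:

# Sketch: Auto Loader ingesting API dumps from cloud storage into bronze.
raw_path = "s3://my-bucket/raw/api-dumps/"            # placeholder landing zone
checkpoint = "s3://my-bucket/_checkpoints/bronze/"    # placeholder checkpoint path

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)  # enables schema inference/evolution
    .load(raw_path)
    .writeStream
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)                       # process new files, then stop
    .toTable("bronze_api_events"))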
Hubert-Dudek
by Esteemed Contributor III
  • 469 Views
  • 1 reply
  • 24 kudos

The new Databricks jobs matrix is awesome! But looking at it can be addictive.

The new Databricks jobs matrix is awesome! But looking at it can be addictive.

Latest Reply
Kaniz
Community Manager
  • 24 kudos

Thank you @Hubert Dudek for the fantastic post!

ASN
by New Contributor II
  • 7329 Views
  • 5 replies
  • 2 kudos

Python read CSV - don't treat a comma within quotes as a separator, even if the quotes are not immediately next to the separator

I have data like the below, and when reading it as CSV I don't want a comma within quotes to be treated as a separator, even if the quotes are not immediately next to the separator (like record #2). Records 1 and 3 are fine with the separator, but it fails on the 2nd record...

[Image: input and expected output]
Latest Reply
Pholo
Contributor
  • 2 kudos

Hi, I think you can use this option for the CSV reader:

spark.read.options(header = True, sep = ",", unescapedQuoteHandling = "BACK_TO_DELIMITER").csv("your_file.csv")

especially the unescapedQuoteHandling. You can search for the other options at this l...

4 More Replies
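A quick way to try that option end to end; the sample rows are invented, with record 2 having a closing quote that is not immediately followed by the delimiter:

# Sketch: write a tiny CSV to DBFS and read it back with
# unescapedQuoteHandling=BACK_TO_DELIMITER.
sample = 'id,name,flag\n1,"a, b",good\n2,"c, d" trailing,good\n'
dbutils.fs.put("/tmp/quotes_example.csv", sample, True)

df = (spark.read
    .options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER")
    .csv("dbfs:/tmp/quotes_example.csv"))
df.show(truncate=False)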
Rahul_Samant
by Contributor
  • 2232 Views
  • 4 replies
  • 1 kudos

Resolved! Spark SQL Connector

I am trying to read data from an Azure SQL database from Databricks. The Azure SQL database is created with a private link endpoint. Using a DBR 10.4 LTS cluster, and the expectation is that the connector is pre-installed as per the documentation. Using the below code to fetch...

Latest Reply
artsheiko
Valued Contributor III
  • 1 kudos

It seems that .option("databaseName", "test") is redundant here, as you need to include the DB name in the URL. Please verify that you use a connector compatible with your cluster's Spark version: Apache Spark connector: SQL Server & Azure SQL

3 More Replies
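A sketch of what the corrected read might look like with the Apache Spark connector, carrying databaseName in the JDBC URL as the reply suggests; the server, table, and credentials are placeholders:

# Sketch: read from Azure SQL via the Apache Spark connector.
jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "databaseName=test;encrypt=true;"
)

df = (spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")     # placeholder table
    .option("user", "my_user")             # placeholders; prefer secrets in practice
    .option("password", "my_password")
    .load())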