Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Isaac_Low
by New Contributor II
  • 2878 Views
  • 2 replies
  • 3 kudos
Latest Reply
Isaac_Low
New Contributor II
  • 3 kudos

All good. I just imported the training material manually using the dbc link. Didn't need repos for that.

1 More Replies
davidvb
by New Contributor II
  • 3139 Views
  • 2 replies
  • 1 kudos

I have a big problem creating a community account

It is impossible for me to create a Community account. I enter my details on the web form and, at the next step, when the site shows me the three sign-in options (Google, Amazon, etc.) and I click "Get started with community account", the site shows me an error. I have tried...

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi @david vazquez, it seems the website was down for maintenance. Next time, you can check the status page to see why the site is down: https://status.databricks.com/

1 More Replies
darshan
by New Contributor III
  • 2401 Views
  • 2 replies
  • 1 kudos

job init takes longer than notebook run

I am trying to understand why running a job takes longer than running the notebook manually. And if I try to run jobs concurrently using workflows or threads, is there a way to reduce the job init time?

Latest Reply
Vivian_Wilfred
Databricks Employee
  • 1 kudos

Hi @darshan doshi, Jobs creates a job cluster in the backend before it starts task execution, and this cluster creation can take extra time compared to running a notebook on an existing cluster. 1) If you run a multi-task job, you could selec...

1 More Replies
karthik_p
by Esteemed Contributor
  • 4237 Views
  • 3 replies
  • 9 kudos

Resolved! Unable to create Databricks workspace using Terraform on AWS

Hi team, we are using the workspace config scripts below. Previously, when we created the workspace from an EC2 instance, we could create it without any issue, but when we try to run the same through GitHub Actions we get the error below: Erro...

Latest Reply
Prabakar
Databricks Employee
  • 9 kudos

@karthik p, this can be fixed by setting a timeout. Please check this: https://kb.databricks.com/en_US/cloud/failed-credential-validation-checks-error-with-terraform

2 More Replies
MrsBaker
by Databricks Employee
  • 1786 Views
  • 1 reply
  • 1 kudos

display() not updating after 1000 rows

Hello folks! I am calling display() on a streaming query sourced from a Delta table. The output from display() shows the new rows added to the source table, but as soon as the output hits 1000 rows, it is not updated anymore. As a r...

Latest Reply
MrsBaker
Databricks Employee
  • 1 kudos

An aggregate function followed by the timestamp field sorted in descending order did the trick:

streaming_df.groupBy("field1", "time_field").max("field2").orderBy(col("time_field").desc()).display()

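A minimal sketch of that workaround, assuming the stream is read from a Delta table named source_delta_table (a placeholder) and the field names above are from the original post:

from pyspark.sql.functions import col

# Read the source Delta table as a stream (table name is a placeholder).
streaming_df = spark.readStream.table("source_delta_table")

# Aggregate, then sort by the timestamp so the newest rows stay within
# the 1000-row limit that display() renders.
(streaming_df
    .groupBy("field1", "time_field")
    .max("field2")
    .orderBy(col("time_field").desc())
    .display())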
KateK
by New Contributor II
  • 2986 Views
  • 2 replies
  • 1 kudos

How do you correctly access the spark context in DLT pipelines?

I have some code that uses RDDs, and the sc.parallelize() and rdd.toDF() methods to get a dataframe back out. The code works in a regular notebook (and if I run the notebook as a job) but fails if I do the same thing in a DLT pipeline. The error mess...

Latest Reply
KateK
New Contributor II
  • 1 kudos

Thanks for your help Alex, I ended up rewriting my code with Spark UDFs -- maybe there is a better solution using only the DataFrame API, but I couldn't find it. To summarize my problem: I was trying to un-nest a large JSON blob (the fake data in my f...

1 More Replies
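A rough sketch of an RDD-free version of this, assuming a hypothetical upstream DLT table raw_events with a JSON string column payload and a simplified schema; from_json and explode stand in for sc.parallelize()/rdd.toDF():

import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Assumed (hypothetical) shape of the JSON blob -- adjust to the real payload.
payload_schema = StructType([
    StructField("id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("value", StringType()),
    ]))),
])

@dlt.table(comment="Un-nested view of the raw JSON payload")
def flattened_events():
    raw = dlt.read("raw_events")  # hypothetical upstream table
    parsed = raw.select(F.from_json("payload", payload_schema).alias("p"))
    # explode() turns each array element into its own row, un-nesting the blob.
    return (parsed
            .select("p.id", F.explode("p.items").alias("item"))
            .select("id", "item.name", "item.value"))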
palak231
by New Contributor
  • 2204 Views
  • 0 replies
  • 0 kudos

A/B Testing: the process of comparing two variations or versions of the same item and offering the better of the two

A/B Testing: A/B testing is the process of comparing two variations or versions of the same item and offering the better of the two. Before doing A/B testing, you need to focus on the one problem that you want to resolve and ...

Rajendra
by New Contributor II
  • 1612 Views
  • 0 replies
  • 2 kudos

Does databricks support writing the data in Iceberg format?

As I understand, Databricks supports conversion from Iceberg format to Delta using the command below:

CONVERT TO DELTA iceberg.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`; -- uses Iceberg manifest for metadata

However...

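For reference, the conversion statement quoted above can be run from a Python notebook cell by wrapping it in spark.sql(); the path is the placeholder path from the post:

# Runs the CONVERT TO DELTA statement from the post; the path below is the
# placeholder path quoted in the question.
spark.sql("""
  CONVERT TO DELTA
  iceberg.`abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/table`
""")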
jakubk
by Contributor
  • 1539 Views
  • 0 replies
  • 0 kudos

Databricks Spark SQL custom table-valued function + struct really slow (minutes for a single row)

I'm using Azure Databricks. I have a custom table-valued function which takes a URL as a parameter and outputs a single-row table with certain elements of the URL extracted/labelled (I get search activity URLs, and when they are in a specific format I can retri...

sage5616
by Valued Contributor
  • 23757 Views
  • 3 replies
  • 2 kudos

Resolved! Choosing the optimal cluster size/specs.

Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.

2 More Replies
Ross
by New Contributor II
  • 4896 Views
  • 4 replies
  • 0 kudos

Failed to install cluster scoped SparkR library

Attempting to install SparkR on the cluster; other packages such as tidyverse installed successfully via CRAN. The error is copied below, any help you can provide is greatly appreciated! Databricks Runtime 10.4 LTS. Library installation attempted on ...

Latest Reply
Vivian_Wilfred
Databricks Employee
  • 0 kudos

Hi @Ross Hamilton, I believe SparkR comes built in with Databricks RStudio and you don't have to install it explicitly. You can import it directly with library(SparkR), and it works for you per your comment above. The error message you see could be re...

3 More Replies
ftc
by New Contributor II
  • 4589 Views
  • 3 replies
  • 0 kudos

Resolved! Multi-Hop Architecture for ingestion data via http API

I'd like to know what the design pattern is for ingesting data via HTTP API requests. The pattern needs to use a multi-hop architecture. Do we need to ingest the JSON output to cloud storage first (not the bronze layer), then use Auto Loader to process the data further? ...

Latest Reply
artsheiko
Databricks Employee
  • 0 kudos

The API -> Cloud Storage -> Delta approach is more suitable. Auto Loader helps not to lose any data (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees) and enables schema inference ev...

2 More Replies
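A hedged sketch of that API -> cloud storage -> bronze pattern with Auto Loader; the paths are placeholders and assume the API responses have already been landed as JSON files:

landing_path = "s3://my-bucket/landing/api-output/"        # placeholder locations
bronze_path = "s3://my-bucket/bronze/api_events/"
checkpoint_path = "s3://my-bucket/_checkpoints/api_events/"

# Incrementally pick up new JSON files with Auto Loader and append them
# to a bronze Delta table, with schema inference/evolution enabled.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(landing_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)   # run as an incremental batch
    .start(bronze_path))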
ASN
by New Contributor II
  • 17942 Views
  • 5 replies
  • 2 kudos

Python read CSV - don't treat a comma as a separator when it is within quotes, even if the quotes are not immediately next to the separator

I have data like the sample below, and when reading it as CSV I don't want a comma to be treated as a separator when it is within quotes, even if the quotes are not immediately next to the separator (like record #2). Records 1 and 3 are fine with the separator, but record 2 fails...

Input and expected Output
Latest Reply
Pholo
Contributor
  • 2 kudos

Hi, I think you can use these options for the CSV reader:

spark.read.options(header = True, sep = ",", unescapedQuoteHandling = "BACK_TO_DELIMITER").csv("your_file.csv")

especially unescapedQuoteHandling. You can search for the other options at this l...

4 More Replies
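The same suggestion as a self-contained snippet (the file path is a placeholder); unescapedQuoteHandling="BACK_TO_DELIMITER" keeps commas that appear inside quotes even when the quotes do not start right after the delimiter:

# Read the CSV while handling unescaped quotes leniently, so embedded commas
# inside quoted values are not split into extra columns.
df = (spark.read
      .options(header=True,
               sep=",",
               unescapedQuoteHandling="BACK_TO_DELIMITER")
      .csv("/path/to/your_file.csv"))   # placeholder path

df.show(truncate=False)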
Rahul_Samant
by Contributor
  • 7285 Views
  • 4 replies
  • 1 kudos

Resolved! Spark SQL Connector

I am trying to read data from an Azure SQL database from Databricks. The Azure SQL database is created with a private link endpoint. Using a DBR 10.4 LTS cluster, and the expectation is that the connector is pre-installed as per the documentation. Using the code below to fetch...

Latest Reply
artsheiko
Databricks Employee
  • 1 kudos

It seems that .option("databaseName", "test") is redundant here, as you need to include the database name in the URL. Please verify that you use a connector compatible with your cluster's Spark version: Apache Spark connector: SQL Server & Azure SQL

3 More Replies
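A minimal sketch along the lines of that reply, assuming the Apache Spark connector for SQL Server is available on the cluster; the server, table, and secret names are placeholders, and the database name is carried in the JDBC URL rather than a separate databaseName option:

# Database name goes in the URL itself, per the reply above.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=test"

df = (spark.read
      .format("com.microsoft.sqlserver.jdbc.spark")   # connector's data source name
      .option("url", jdbc_url)
      .option("dbtable", "dbo.my_table")              # placeholder table
      .option("user", dbutils.secrets.get("my-scope", "sql-user"))        # placeholder secrets
      .option("password", dbutils.secrets.get("my-scope", "sql-password"))
      .load())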
mick042
by New Contributor III
  • 2167 Views
  • 1 reply
  • 0 kudos

Does Spark utilise a temporary stage when writing to Snowflake? How does that work?

Folks, when I want to push data to Snowflake I need to use a stage for the files before copying the data over. However, when I use the net.snowflake.spark.snowflake.Utils library and do a spark.write, as in... spark.read.format("csv") .option("header", ...

Latest Reply
mick042
New Contributor III
  • 0 kudos

Yes, it uses a temporary stage. I should have just looked in the Snowflake history.

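A rough sketch of the write path being discussed, with placeholder connection options and paths; the connector stages the files in a temporary internal stage behind the scenes, which is why no explicit stage appears in the code:

sf_options = {                       # placeholder connection settings
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Read a CSV (as in the question) and push it to Snowflake; the connector
# manages the temporary stage and copy internally.
(spark.read.format("csv")
    .option("header", "true")
    .load("/path/to/input.csv")          # placeholder input path
    .write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "my_table")       # placeholder target table
    .mode("append")
    .save())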
