Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Matt_L
by New Contributor III
  • 7664 Views
  • 3 replies
  • 3 kudos

Resolved! Slow performance loading checkpoint file?

Using OSS Delta, hopefully this is the right forum for this question: Hey all, I could use some help as I feel like I'm doing something wrong here. I'm streaming from Kafka -> Delta on EMR/S3FS, and am seeing increasingly slow batches. When looking...

Latest Reply
Matt_L
New Contributor III
  • 3 kudos

Found the answer through the Slack user group, courtesy of Adam Binford. I had set `delta.logRetentionDuration='24 HOURS'` but did not set `delta.deletedFileRetentionDuration`, and so the checkpoint file still had all the accumulated tombstones sin...
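For reference, a minimal sketch of setting both retention properties together so tombstones age out of the checkpoint along with the log entries (the table name is hypothetical):

```python
# Hypothetical table name; both properties are shortened together so removed-file
# tombstones age out on roughly the same schedule as the transaction log entries.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.logRetentionDuration' = '24 hours',
        'delta.deletedFileRetentionDuration' = '24 hours'
    )
""")
```

Note that lowering `delta.deletedFileRetentionDuration` also limits how far back time travel can go, so this trades history for smaller checkpoints.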

2 More Replies
UmaMahesh1
by Honored Contributor III
  • 9485 Views
  • 7 replies
  • 17 kudos

Spark Structured Streaming: data write into ADLS is too slow

I'm a bit new to Spark Structured Streaming, so do ask any relevant questions if I missed something. I have a notebook which consumes events from a Kafka topic and writes those records into ADLS. The topic is JSON serialized, so I'm just writing...
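For context, a minimal sketch of the kind of pipeline being described; the broker, topic, and ADLS paths are hypothetical, and `maxOffsetsPerTrigger` is one common knob for keeping batch sizes (and hence latency) steady:

```python
from pyspark.sql.functions import col

# Hypothetical broker, topic, and ADLS paths -- illustrative only.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .option("maxOffsetsPerTrigger", 10000)  # bound each micro-batch
       .load())

# Kafka delivers the value as binary; the topic is JSON-serialized, so cast to string.
events = raw.select(col("value").cast("string").alias("json"))

(events.writeStream
 .format("delta")
 .option("checkpointLocation", "abfss://data@account.dfs.core.windows.net/_checkpoints/events")
 .start("abfss://data@account.dfs.core.windows.net/raw/events"))
```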

Latest Reply
Miletto
New Contributor II
  • 17 kudos

 

6 More Replies
564824
by New Contributor II
  • 1361 Views
  • 1 replies
  • 1 kudos

Will enabling Unity Catalog affect existing user access and jobs in production?

Hi, at my company we are using Databricks with AWS IAM Identity Center for single sign-on. I was looking into Unity Catalog, which seems to offer centralized access, but I wanted to know if there will be any downside like loss of existing user profile ...

Latest Reply
Atanu
Databricks Employee
  • 1 kudos

You can look into this doc, https://docs.databricks.com/en/data-governance/unity-catalog/migrate.html, which has some details about your question.

SaraCorralLou
by New Contributor III
  • 8553 Views
  • 7 replies
  • 2 kudos

Poor performance with UDFs

Hello, I am contacting you because I am having a problem with the performance of my notebooks on Databricks. My notebook is written in Python (PySpark); in it I read a Delta table that I copy to a dataframe, do several transformations, and create sever...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

Looping over records is a performance killer, to be avoided at all costs. See "Beware the for-loop" (databricks.com).
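To illustrate the point, a small sketch contrasting a Python UDF with the equivalent built-in column expression (table and column names are hypothetical):

```python
from pyspark.sql import functions as F

df = spark.table("sales")  # hypothetical Delta table

# Slow: a Python UDF is opaque to the optimizer and pays per-row serialization
# between the JVM and the Python worker.
add_tax_udf = F.udf(lambda amount: amount * 1.21, "double")
slow = df.withColumn("amount_with_tax", add_tax_udf(F.col("amount")))

# Fast: the same logic as a built-in expression stays inside the JVM and
# benefits from whole-stage code generation.
fast = df.withColumn("amount_with_tax", F.col("amount") * 1.21)
```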

6 More Replies
Chris_Shehu
by Valued Contributor III
  • 3807 Views
  • 2 replies
  • 1 kudos

Resolved! Custom Libraries (Unity Catalog Enabled Clusters)

I'm trying to use a custom library that I created from a .whl file in the workspace/shared location. The library attaches to the cluster without any issues and I can see it when I list the modules using pip. When I try to call the module I get an error t...
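One workaround worth trying on Unity Catalog shared clusters is a notebook-scoped install of the wheel straight from the workspace path (the path and wheel name below are hypothetical):

```python
# Hypothetical workspace path and wheel name; this installs into the notebook's
# own Python environment rather than relying on cluster-level library attachment.
%pip install /Workspace/Shared/libs/my_custom_lib-0.1.0-py3-none-any.whl
```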

Latest Reply
Szpila
New Contributor III
  • 1 kudos

Hello guys, I am working on a project where we need to use the spark-excel library (Maven) in order to ingest data from Excel files. As those 3rd-party libraries are not allowed on shared clusters, do you have any workaround other than using pandas, for exa...

1 More Replies
User15986662700
by New Contributor III
  • 5400 Views
  • 4 replies
  • 1 kudos
Latest Reply
User15986662700
New Contributor III
  • 1 kudos

Yes, it is possible to connect Databricks to a Kerberized HBase cluster. The attached article explains the steps: setting up a Kerberos client using a keytab on the cluster nodes, installing the hbase-spark integration library, and set...

3 More Replies
naga_databricks
by Contributor
  • 4553 Views
  • 1 replies
  • 0 kudos

Reading BigQuery data using a query

To read BigQuery data using spark.read, I'm using a query. This query executes and creates a table on the materializationDataset. df = spark.read.format("bigquery").option("query", query).option("materializationProject", materializationProject)...
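A fuller sketch of this read path, assuming the open-source spark-bigquery connector; the project and dataset names are hypothetical, and `viewsEnabled` is typically required for query-based reads:

```python
# Hypothetical project/dataset names -- adjust to your environment.
query = "SELECT id, amount FROM `my-project.sales.orders` WHERE amount > 100"

df = (spark.read.format("bigquery")
      .option("viewsEnabled", "true")                  # needed when reading via a query
      .option("query", query)
      .option("materializationProject", "my-project")
      .option("materializationDataset", "spark_tmp")   # temp result tables land here
      .load())
```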

EDDatabricks
by Contributor
  • 1518 Views
  • 2 replies
  • 2 kudos

Appropriate storage account type for reference data (Azure)

Hello, we are using a reference dataset for our production applications. We would like to create a Delta table for this dataset to be used from our applications. Currently, manual updates occur on this dataset through a script on a weekly basis. ...

Labels: Data Engineering, Delta Live Table, Storage account
Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

+1 for ADLS: hierarchical namespace and hot/cold/premium storage tiers, things not possible in plain Blob Storage.

1 More Replies
irispan
by New Contributor II
  • 4793 Views
  • 4 replies
  • 1 kudos

Recommended Hive metastore pattern for Trino integration

Hi, I have several questions regarding Trino integration: Is it recommended to use an external Hive metastore, or to leverage the Databricks-maintained Hive metastore, when it comes to enabling external query engines such as Trino? When I tried to use ex...

Latest Reply
JunlinZeng
Databricks Employee
  • 1 kudos

> Is it recommended to use an external Hive metastore or leverage the Databricks-maintained Hive metastore when it comes to enabling external query engines such as Trino?

The Databricks-maintained Hive metastore is not suggested to be used externally. ...
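If you do go the external-metastore route, a hedged sketch of the cluster-level Spark config Databricks documents for pointing at an external Hive metastore; the JDBC endpoint, driver, and secret reference below are hypothetical:

```
# Hypothetical JDBC endpoint and credentials -- set in the cluster's Spark config,
# with the password pulled from a secret scope rather than stored in plain text.
spark.sql.hive.metastore.version 3.1.0
spark.sql.hive.metastore.jars maven
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://metastore-db.example.com:3306/metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
spark.hadoop.javax.jdo.option.ConnectionUserName hive
spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/hive/metastore-password}}
```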

3 More Replies
Agus1
by New Contributor III
  • 6372 Views
  • 3 replies
  • 3 kudos

Update destination table when using Spark Structured Streaming and Delta tables

I'm trying to implement a streaming pipeline that will run hourly using Spark Structured Streaming, Scala, and Delta tables. The pipeline will process different items with their details. The sources are Delta tables that already exist, written hourly u...

Latest Reply
Tharun-Kumar
Databricks Employee
  • 3 kudos

@Agus1 Could you try using CDC in Delta? You could use readChangeFeed to read only the changes that got applied on the source table. This is also explained here: https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed
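A minimal sketch of that approach (table names are hypothetical, and change data feed must first be enabled on the source):

```python
# Change data feed must be enabled on the source table first, e.g.:
#   ALTER TABLE source_items SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

changes = (spark.readStream
           .format("delta")
           .option("readChangeFeed", "true")
           .table("source_items"))  # hypothetical source table

# Each row carries _change_type (insert / update_preimage / update_postimage /
# delete), so downstream logic can apply only the relevant changes.
(changes.writeStream
 .option("checkpointLocation", "/tmp/_checkpoints/items_cdf")
 .toTable("destination_items"))  # hypothetical destination table
```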

2 More Replies
Eric_Kieft
by New Contributor III
  • 3233 Views
  • 2 replies
  • 1 kudos

Unity Catalog Table/View Column Data Type Changes

When changing a Delta table column data type in Unity Catalog, we noticed a view that references that table did not automatically update to reflect the new data type. Is there a way to update the Delta table column data type so that it also update...

Latest Reply
Lakshay
Databricks Employee
  • 1 kudos

Can you try refreshing the view by running the command `REFRESH TABLE <viewname>`?

1 More Replies
Vibhor
by Contributor
  • 5187 Views
  • 5 replies
  • 4 kudos

Resolved! Cluster Performance

Facing an issue with cluster performance; in the event log we can see "cluster is not responsive likely due to GC". The number of pipelines (Databricks notebooks) running and the cluster configuration are the same as before, but we started seeing this issue sin...

Latest Reply
jose_gonzalez
Databricks Employee
  • 4 kudos

Hi @Vibhor Sethi, do you see any other error messages? Did your data volume increase? What kind of job are you running?

4 More Replies
ajain80
by New Contributor III
  • 23414 Views
  • 5 replies
  • 10 kudos

Resolved! SFTP Connect

How can I connect to an SFTP server from Databricks so I can write files into tables directly?

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 10 kudos

The classic solution is to copy data from SFTP to ADLS storage using Azure Data Factory, and after the copy is done in the ADF pipeline, trigger the Databricks notebook.
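If you'd rather stay inside Databricks, a hedged sketch using the paramiko library to pull a file over SFTP and land it in a table; the host, credentials, paths, and target table are all hypothetical:

```python
# Hypothetical host, credentials, and paths -- keep real secrets in a secret scope.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("sftp.example.com", port=22, username="user", password="secret")

sftp = client.open_sftp()
sftp.get("/outbound/orders.csv", "/tmp/orders.csv")  # download to local disk
sftp.close()
client.close()

# Read the downloaded file with Spark and append it to a (hypothetical) table.
df = spark.read.option("header", "true").csv("file:/tmp/orders.csv")
df.write.mode("append").saveAsTable("orders")
```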

4 More Replies
