Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

erigaud
by Honored Contributor
  • 1739 Views
  • 1 reply
  • 0 kudos

Cannot update Databricks Repos in DevOps pipeline

Hello, I am creating a DevOps pipeline to run unit tests in my notebooks using the Nutter library. When a commit is pushed to a branch, I have a pipeline that triggers, and it should update my repo in a Staging folder (/Repos/Staging/MyRepo). For that I...

Latest Reply
User16539034020
Contributor II
  • 0 kudos

Hello, thanks for contacting Databricks Support. The error message indicates that there is an issue with the URL or endpoint you are using with the Databricks repos update command. It appears that one or more required parameters are not being set corr...

ywaihong6123
by New Contributor
  • 4318 Views
  • 1 reply
  • 0 kudos

Libraries Not Working on Shared Cluster 13.3 LTS

I am facing this error while installing the spark-excel library into the cluster. Does anyone know how to add a library to the artifact allowlist? Jars and Maven Libraries on Shared Clusters must be on the allowlist. Failed Libraries: com.crealytics:spark...

Latest Reply
User16752239289
Valued Contributor
  • 0 kudos

You can add the jar by following the steps below. How to add items to the allowlist: you can add items to the allowlist with Data Explorer or the REST API. To open the dialog for adding items to the allowlist in Data Explorer, do the following: in your Databricks...
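For the REST API route, a minimal Python sketch of building the request body is below. Note the endpoint path (`/api/2.1/unity-catalog/artifact-allowlists/LIBRARY_MAVEN`) and the payload field names are assumptions about the Artifact Allowlists API, not details confirmed in this thread:

```python
import json

def allowlist_payload(coordinates):
    """Build the JSON body for allow-listing Maven coordinates by prefix.

    Field names (artifact_matchers, artifact, match_type) are assumptions
    about the Unity Catalog artifact-allowlists API shape.
    """
    return json.dumps({
        "artifact_matchers": [
            {"artifact": c, "match_type": "PREFIX_MATCH"} for c in coordinates
        ]
    })

payload = allowlist_payload(["com.crealytics:spark-excel"])
# Hypothetically sent with:
#   PUT /api/2.1/unity-catalog/artifact-allowlists/LIBRARY_MAVEN
```

Check the workspace's REST API reference before relying on the exact path or field names.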

Hubert-Dudek
by Esteemed Contributor III
  • 10750 Views
  • 1 reply
  • 2 kudos

Handling GDPR requests in Databricks

When dealing with GDPR requests in Databricks, there are some essential things to keep in mind:
- Use a low retention period to ensure you don't keep Delta table version history for tables with personal information.
- Use APPLY CHANGES to handle Slowly...

Latest Reply
jose_gonzalez
Moderator
  • 2 kudos

Thank you for sharing this information @Hubert-Dudek!!!!

Hubert-Dudek
by Esteemed Contributor III
  • 20741 Views
  • 1 reply
  • 2 kudos

Checking whether a Spark DataFrame is empty

#databricks #spark Spark 3.3 has introduced a simple yet powerful isEmpty() function for DataFrames. Gone are the days of using count() to check for empty DataFrames; now it's as easy as calling df.isEmpty().
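The advantage over count() is that an emptiness check can stop after seeing at most one row instead of scanning everything. A plain-Python sketch of that short-circuit idea (not Spark code; the helper name is made up):

```python
def is_empty(iterable):
    """Return True if the iterable yields nothing, consuming at most one item."""
    _sentinel = object()
    return next(iter(iterable), _sentinel) is _sentinel

# A source that would be expensive to count in full:
def rows():
    yield from range(10**6)

print(is_empty([]))      # True
print(is_empty(rows()))  # False: stops after the first element
```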

Latest Reply
jose_gonzalez
Moderator
  • 2 kudos

Thank you for sharing this @Hubert-Dudek !!!

Eduard
by New Contributor II
  • 29957 Views
  • 2 replies
  • 1 kudos

Cluster xxxxxxx was terminated during the run.

Hello, I have a problem with the autoscaling of a cluster. Every time autoscaling is activated I get this error. Does anyone have any idea why this could be? "Cluster xxxxxxx was terminated during the run (cluster state message: Lost communication ...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @Eduard, the errors you're experiencing could be due to a few reasons: 1. **Lost communication with the driver node**: this error usually occurs due to networking errors or malfunctioning instances. It might be that the driver node is losing its c...

1 More Replies
rt-slowth
by Contributor
  • 1860 Views
  • 1 reply
  • 1 kudos

Resolved! how to build data warehouses and data marts with Python

I don't know how to build data warehouses and data marts with Python. My current development environment stores data in AWS Redshift, and I can run queries from Databricks against the stacked tables in Redshift. Can you show me some simple code?

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @rt-slowth, to interact with AWS Redshift and perform operations such as creating tables, loading data, and querying data, you can use the psycopg2 library in Python. Here is a simple example to get you started: first, install the necessary ...
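As a stand-in sketch of the warehouse/mart pattern, the example below uses sqlite3 from the standard library instead of Redshift (with psycopg2 the connection call differs, but the SQL flow is the same). All table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a Redshift connection
cur = conn.cursor()

# A minimal dimension/fact pair for a toy data mart.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL)""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "widget"), (2, "gadget")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 9.5), (1, 3.0), (2, 7.25)])

# A mart-style aggregate: revenue per product.
cur.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.name ORDER BY d.name""")
revenue = cur.fetchall()
print(revenue)  # [('gadget', 7.25), ('widget', 12.5)]
```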

NathanSundarara
by Contributor
  • 5848 Views
  • 7 replies
  • 2 kudos

Delta Live Tables: generate a unique integer value (a kind of surrogate key) for a combination of columns

Hi, we are in the process of moving our data warehouse from SQL Server to Databricks. We are testing our Dimension Product table, which has an identity column used as a surrogate key referenced in the fact table. In Databricks APPLY CHANGES SCD Type 2 ...

Latest Reply
ilarsen
Contributor
  • 2 kudos

Hey. Yep, xxhash64 (or even just hash) generates numerical values for you. Combine with the abs function to ensure the value is positive. In our team we used abs(hash()) ourselves... for maybe a day. Very quickly I observed a collision, and the data s...
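The quick collision above is what the birthday bound predicts: Spark's hash produces a 32-bit value, so collisions become likely at only tens of thousands of distinct rows, while the 64-bit xxhash64 holds up much longer. A rough estimate using the standard approximation p ≈ 1 − exp(−n(n−1)/2N):

```python
import math

def collision_probability(n_rows, hash_bits):
    """Birthday-bound approximation of the chance of at least one collision
    among n_rows values drawn uniformly from a hash_bits-wide space."""
    space = 2.0 ** hash_bits
    return 1.0 - math.exp(-n_rows * (n_rows - 1) / (2.0 * space))

# 32-bit hash (like Spark's hash): collisions likely at modest row counts.
p32 = collision_probability(100_000, 32)
# 64-bit hash (like xxhash64): far safer at the same scale.
p64 = collision_probability(100_000, 64)
print(f"{p32:.3f}")  # ~0.688
print(f"{p64:.2e}")
```

Note also that wrapping in abs() folds the sign bit away, roughly halving the effective space and making collisions even more likely.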

6 More Replies
LukaszJ
by Contributor III
  • 18297 Views
  • 6 replies
  • 2 kudos

Resolved! Install ODBC driver by init script

Hello, I want to install an ODBC driver (for pyodbc). I have tried to do it using Terraform, however I think it is impossible. So I want to do it with an init script in my cluster. I have the code from the internet and it works when it is at the beginning of ...

Latest Reply
MayaBakh_80151
New Contributor II
  • 2 kudos

Actually found this article and am using it to migrate my shell script to the workspace: Cluster-named and cluster-scoped init script migration notebook - Databricks

5 More Replies
srDataEngineer
by New Contributor II
  • 4645 Views
  • 4 replies
  • 0 kudos

Resolved! UDF fails for non-admin user

java.lang.SecurityException: User does not have permission SELECT on anonymous function.

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @data engineer, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers...

3 More Replies
Retko
by Contributor
  • 3465 Views
  • 0 replies
  • 0 kudos

Custom logging using Log4J to a file

Hello, I would like to ask for help setting up log4j. I want to use log4j (log4j2) to generate custom log messages in my notebook when running. This message would be generated like this: logger.info("some info message"), but using log4j, not Python lo...

Frank
by New Contributor III
  • 882 Views
  • 1 reply
  • 1 kudos

Design Question

We have an application that takes in raw metrics data as key-value pairs. Then we split them into four different tables like below: `key1, min, max, average`. Those four tables are later used for dashboards. What are the design recommendations for this? S...

Latest Reply
stefnhuy
New Contributor III
  • 1 kudos

Hey, I can totally relate to the challenges Frank is facing with this application's data processing. It's frustrating to deal with delays, especially when dealing with real-time metrics. I've had a similar experience where optimizing d...

564824
by New Contributor II
  • 993 Views
  • 2 replies
  • 1 kudos

Will enabling Unity Catalog affect existing user access and jobs in production?

Hi, at my company we are using Databricks with AWS IAM Identity Center as single sign-on. I was looking into Unity Catalog, which seems to offer centralized access, but I wanted to know if there will be any downside, like loss of existing user profile ...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @564824, using Databricks Unity Catalog with AWS IAM Identity Center as single sign-on will not result in losing existing user profile data, workspace notebooks, or denial of access.
- When using SCIM provisioning, removing a user or group from ...

1 More Replies
UmaMahesh1
by Honored Contributor III
  • 6201 Views
  • 8 replies
  • 17 kudos

Spark Structured Streaming: data writes into ADLS are too slow

I'm a bit new to Spark Structured Streaming, so do ask all the relevant questions if I missed any. I have a notebook which consumes events from a Kafka topic and writes those records into ADLS. The topic is JSON serialized, so I'm just writing...

Latest Reply
Kaniz_Fatma
Community Manager
  • 17 kudos

Hi @UmaMahesh1,
• Spark Structured Streaming interacts with Kafka in a certain way, leading to the observed behaviour.
• The parameter maxOffsetsPerTrigger in Spark Structured Streaming determines the maximum rate of data read from Kafka.
• However, ...

7 More Replies
Matt_L
by New Contributor III
  • 5842 Views
  • 3 replies
  • 3 kudos

Resolved! Slow performance loading checkpoint file?

Using OSS Delta, hopefully this is the right forum for this question. Hey all, I could use some help as I feel like I'm doing something wrong here. I'm streaming from Kafka -> Delta on EMR/S3FS, and am seeing increasingly slow batches. When looking...

Latest Reply
Matt_L
New Contributor III
  • 3 kudos

Found the answer through the Slack user group, courtesy of Adam Binford. I had set `delta.logRetentionDuration='24 HOURS'` but did not set `delta.deletedFileRetentionDuration`, and so the checkpoint file still had all the accumulated tombstones sin...
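A sketch of the fix this reply describes, keeping both retention windows aligned so tombstones age out of the checkpoint too (the table name and interval values are illustrative; short deleted-file retention also limits time travel, so choose values deliberately):

```sql
-- Align the Delta log and deleted-file retention windows.
ALTER TABLE events SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 24 hours',
  'delta.deletedFileRetentionDuration' = 'interval 24 hours'
);
```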

2 More Replies
r0nald
by New Contributor II
  • 7099 Views
  • 3 replies
  • 1 kudos

UDF not working inside transform() & lambda (SQL)

Below is a toy example of what I'm trying to achieve, but I don't understand why it fails. Can anyone explain why, and suggest a fix or a not overly bloated workaround?
%sql
create or replace function status_map(status int)
returns string
return map(10, "STATU...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

The transform function in SQL is not the same as the Scala/PySpark counterpart; it is in fact a map(). Here is some interesting info. I agree that functions are essential for code modularity, hence I prefer not to use SQL but Scala/PySpark instead.
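One workaround consistent with this reply is to inline the mapping expression inside the transform() lambda rather than referencing the UDF. A Spark SQL sketch (the status codes and labels are assumed, since the original post is truncated):

```sql
-- Inline the map lookup in the lambda instead of calling the UDF.
SELECT transform(
  array(10, 20),
  s -> element_at(map(10, 'STATUS_A', 20, 'STATUS_B'), s)
) AS statuses;
```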

2 More Replies
