Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

pinaki1
by New Contributor III
  • 112 Views
  • 1 reply
  • 0 kudos

Performance improvement of a Databricks Spark job

Hi, I need to improve the performance of a Databricks job in my project. Here are the steps being done in the project: 1. Read CSV/JSON files of small size (100 MB, 50 MB) from multiple locations in S3. 2. Write the data to the bronze layer in Delta/Parquet form...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

In case of performance issues, always look for 'expensive' operations, mainly wide operations (shuffles) and collecting data to the driver. Start by checking how long the bronze part takes, then silver, etc. Pinpoint where it starts to get slow, then d...

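A quick way to pinpoint those 'expensive' operations is to inspect the physical plan for Exchange (shuffle) operators. A minimal sketch, in which the path and column names are placeholders:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative bronze read; the path is a placeholder.
    df = spark.read.format("delta").load("/mnt/bronze/events")

    # A groupBy triggers a shuffle; wide operations show up as
    # "Exchange hashpartitioning(...)" nodes in the printed plan.
    agg = df.groupBy("event_type").agg(F.count("*").alias("n"))
    agg.explain()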
BricksGuy
by New Contributor II
  • 472 Views
  • 7 replies
  • 0 kudos

Watermark error while joining multiple stream tables

I am creating an ETL pipeline where I am reading multiple stream tables into temp tables, and at the end I am trying to join those tables to feed the output into another live table. For that I am using the method below, where I am giving a list of tables as...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

It is necessary for the join, so if the dataframe has a watermark, that's enough. No need to define it multiple times.

6 More Replies
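For reference, a minimal sketch of the pattern under discussion, assuming illustrative table and column names: each streaming DataFrame declares its watermark once with withWatermark, and the join then uses those watermarks.

    from pyspark.sql import functions as F

    # Placeholder streaming sources.
    orders = (spark.readStream.table("orders_stream")
              .withWatermark("order_ts", "10 minutes")
              .alias("o"))
    payments = (spark.readStream.table("payments_stream")
                .withWatermark("pay_ts", "10 minutes")
                .alias("p"))

    # The watermark declared on each DataFrame carries into the join;
    # there is no need to re-declare it at join time.
    joined = orders.join(
        payments,
        F.expr("o.order_id = p.order_id AND "
               "p.pay_ts BETWEEN o.order_ts AND o.order_ts + INTERVAL 15 MINUTES"),
    )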
SrinuM
by New Contributor III
  • 112 Views
  • 0 replies
  • 0 kudos

WorkspaceClient dbutils issue

We are using Databricks Connect:

host = "https://adb-xxxxxx.xx.azuredatabricks.net"
token = "dapxxxxxxx"
from databricks.sdk import WorkspaceClient
dbutil = WorkspaceClient(host=host, token=token).dbutils
files = dbutil.fs.ls("abfss://container-name@storag...

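For context, a minimal sketch of the pattern described in the post, with placeholder host and token values. Note that WorkspaceClient(...).dbutils proxies dbutils calls through workspace APIs, so listing a direct abfss:// URI may fail where a DBFS or Unity Catalog Volume path would work:

    from databricks.sdk import WorkspaceClient

    # Placeholder credentials; prefer environment variables or a config profile.
    w = WorkspaceClient(host="https://adb-xxxxxx.xx.azuredatabricks.net",
                        token="dapxxxxxxx")

    # List files through the SDK's dbutils; the path is illustrative.
    for f in w.dbutils.fs.ls("/Volumes/main/default/raw"):
        print(f.path)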
emorgoch
by New Contributor II
  • 368 Views
  • 2 replies
  • 0 kudos

Resolved! Passing variables from Python to SQL in a notebook using serverless compute

I've written a notebook that executes some Python code to parse the workspace ID, figure out which of my environments I'm in, and set a value for it. I then want to take that value and pass it through to a code block of...

Latest Reply
emorgoch
New Contributor II
  • 0 kudos

Thanks Kaniz, this is a great suggestion. I'll look into it and see how it can work for my projects.

1 More Replies
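One hedged illustration of the general approach (not necessarily the suggestion referenced above): compute the value in Python, then hand it to SQL through named parameter markers, which spark.sql supports on recent runtimes including serverless. The workspace-parsing logic here is purely illustrative:

    # Derive an environment name in Python; the matching rule is a placeholder.
    workspace_url = spark.conf.get("spark.databricks.workspaceUrl", "")
    env = "prod" if workspace_url.startswith("adb-111") else "dev"

    # Pass the value into SQL via a named parameter marker (:env).
    df = spark.sql("SELECT :env AS environment", args={"env": env})
    df.show()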
tariq
by New Contributor III
  • 2828 Views
  • 5 replies
  • 1 kudos

SqlContext in DBR 14.3

I have a Databricks workspace in GCP and I am using a cluster with Runtime 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12). I am trying to set the checkpoint directory location using the following command in a notebook: spark.sparkContext.set...

Latest Reply
Dave1967
New Contributor II
  • 1 kudos

Has this been resolved? I am encountering the same issue with df.rdd.getNumPartitions().

4 More Replies
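For readers hitting the same wall: SparkContext and RDD APIs (including df.rdd) are restricted on shared access mode clusters in recent runtimes. A minimal sketch of what works where SparkContext is available (e.g., single-user access mode), plus an alternative that avoids it entirely; the path is a placeholder:

    df = spark.range(10)  # any DataFrame

    # Classic approach where SparkContext is accessible.
    spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")
    df_cp = df.checkpoint()

    # localCheckpoint stores data on executors and needs neither a
    # checkpoint directory nor a SparkContext call.
    df_lcp = df.localCheckpoint()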
MichaelO
by New Contributor III
  • 2073 Views
  • 2 replies
  • 1 kudos

Resolved! Terminating cluster programmatically

Is there any Python script that allows me to terminate (not delete) a cluster in a notebook, similar to this R equivalent: terminate_cluster(cluster_id, workspace, token = NULL, verbose = T, ...)

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @MichaelO, yes, you can use the Databricks Python API for that. Here is an example Python function that terminates a cluster given a cluster ID:

import requests
import json
import time

def terminate_cluster(cluster_id):
    # Define the endpoint ...

1 More Replies
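The function above is truncated; here is a minimal self-contained sketch of the same idea, assuming the standard Clusters API, whose clusters/delete endpoint terminates a cluster without permanently deleting it:

    import requests

    def terminate_cluster(host: str, token: str, cluster_id: str) -> None:
        # POST /api/2.0/clusters/delete stops the cluster; permanent removal
        # is a separate endpoint (clusters/permanent-delete).
        resp = requests.post(
            f"{host}/api/2.0/clusters/delete",
            headers={"Authorization": f"Bearer {token}"},
            json={"cluster_id": cluster_id},
        )
        resp.raise_for_status()

    # Illustrative usage with placeholder values:
    # terminate_cluster("https://adb-xxxx.azuredatabricks.net", "dapi...", "0123-456789-abcdef")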
lbdatauser
by New Contributor II
  • 89 Views
  • 0 replies
  • 0 kudos

dbx with serverless clusters

With dbx, is it impossible to create tasks that run on serverless clusters? Is it necessary to use Databricks Asset Bundles for that?
https://dbx.readthedocs.io/en/latest/reference/deployment/
https://learn.microsoft.com/en-us/azure/databricks/jobs/run-serverl...

Filip
by New Contributor II
  • 600 Views
  • 3 replies
  • 0 kudos

How to assign a user-assigned managed identity to a DBR cluster so I can use it for querying ADLS Gen2?

Hi, I'm trying to figure out if we can switch from Entra ID SPNs to user-assigned managed identities, and everything works except I can't figure out how to access the lake files from a Python notebook. I've tried the code below and was running it on a ...

Latest Reply
Slash
Contributor
  • 0 kudos

Hi @Filip, that's an obsolete way of configuring access to a storage account. Nowadays you should use Unity Catalog storage credentials and external locations to configure access to a storage account. A storage credential is a securable object representing an Azure m...

2 More Replies
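To make the suggested approach concrete, a hedged sketch in which every object name is hypothetical and a storage credential backed by the managed identity is assumed to already exist in Unity Catalog:

    # One-time setup by a privileged user: bind the path to the
    # managed-identity credential via an external location.
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS lake_raw
        URL 'abfss://container-name@storageaccount.dfs.core.windows.net/raw'
        WITH (STORAGE CREDENTIAL my_mi_credential)
    """)

    # Afterwards, notebooks read the path directly; access is governed by UC grants.
    df = spark.read.format("parquet").load(
        "abfss://container-name@storageaccount.dfs.core.windows.net/raw/events"
    )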
semsim
by Contributor
  • 1333 Views
  • 5 replies
  • 0 kudos

Resolved! Installing LibreOffice on Databricks

Hi, I need to install LibreOffice to do a document conversion from .docx to .pdf. The requirement is no use of containers. Any idea on how I should go about this? Environment: Databricks 13.3 LTS. Thanks, Sem

Latest Reply
furkan
New Contributor II
  • 0 kudos

Hi @semsim, I'm attempting to install LibreOffice for converting DOCX files to PDF and tried running your shell commands from a notebook. However, I encountered the 404 errors shown below. Do you have any suggestions on how to resolve this issue? I real...

4 More Replies
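For anyone landing here, a hedged sketch of the conversion step itself, assuming LibreOffice is already installed on the driver (for example through a cluster init script running apt-get install -y libreoffice) and that the file paths are placeholders:

    import subprocess

    # soffice in headless mode converts .docx to .pdf; --outdir sets the target folder.
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf",
         "/dbfs/tmp/input.docx", "--outdir", "/dbfs/tmp/pdf"],
        check=True,
    )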
PraveenReddy21
by New Contributor III
  • 391 Views
  • 7 replies
  • 2 kudos

Resolved! I created an external database but am unable to transfer a table to the storage account (blob container: Gold)

Hi, I have done the Bronze and Silver activities; after that I am trying to save a table to the Gold container but am unable to store it. I created an external database. I want to store the data as PARQUET, but it is not supported, only DELTA. Only MANAGED LOCATION is supported, but unabl...

Latest Reply
PraveenReddy21
New Contributor III
  • 2 kudos

Thank you, Rishabh.

6 More Replies
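A minimal sketch of one way to land a gold table in an external container, consistent with the Delta-only limitation described above; paths and names are placeholders:

    gold_path = "abfss://gold@storageaccount.dfs.core.windows.net/sales_summary"

    # df is the silver-layer DataFrame being published.
    df.write.format("delta").mode("overwrite").save(gold_path)

    # Register an external table over that location.
    spark.sql(f"CREATE TABLE IF NOT EXISTS gold_db.sales_summary USING DELTA LOCATION '{gold_path}'")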
Filippo
by New Contributor
  • 95 Views
  • 0 replies
  • 0 kudos

Issue with View Ownership Reassignment in Unity Catalog

Hello, it appears that the ownership rules for views and functions in Unity Catalog do not align with the guidelines provided in the "Manage Unity Catalog object ownership" documentation on Microsoft Learn. When attempting to reassign the ownership of ...

KosmaS
by New Contributor III
  • 194 Views
  • 2 replies
  • 0 kudos

Skewness / Salting with countDistinct

Hey everyone, I experience data skewness for:

df = (source_df
    .unionByName(source_df.withColumn("region", lit("Country")))
    .groupBy("zip_code", "region", "device_type")
    .agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").a...

Screenshot 2024-08-05 at 17.24.08.png
Latest Reply
KosmaS
New Contributor III
  • 0 kudos

Hey @Kaniz_Fatma, thanks for the reply. I spent some time on your response. You're suggesting 'double aggregation', and my guess is it should look more or less like this: df = (source_df .unionByName(source_df.withColumn("region", lit("Cou...

1 More Replies
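For readers following along, one common shape of the 'double aggregation' idea (the exact suggestion above is truncated, so this is an assumption): pre-aggregate per device first, so the skewed distinct count in the second pass becomes a plain row count. Column names follow the post:

    from pyspark.sql import functions as F

    unioned = source_df.unionByName(source_df.withColumn("region", F.lit("Country")))

    # Pass 1: collapse to one row per (keys, device_id), spreading work
    # before the skewed distinct count.
    per_device = (unioned
                  .groupBy("zip_code", "region", "device_type", "device_id")
                  .agg(F.count("device_id").alias("events")))

    # Pass 2: each row is now a distinct device, so a plain count gives
    # the distinct-device total.
    result = (per_device
              .groupBy("zip_code", "region", "device_type")
              .agg(F.count("device_id").alias("total_active_unique"),
                   F.sum("events").alias("total_events")))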
