Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Rani
by New Contributor
  • 9622 Views
  • 2 replies
  • 0 kudos

Divide a dataframe into multiple smaller dataframes based on values in multiple columns in Scala

I have to divide a dataframe into multiple smaller dataframes based on the values in columns such as gender and state. The end goal is to pick random samples from each dataframe. I am trying to implement a sample as explained below, I am quite new to th...

Latest Reply
subham0611
New Contributor II
  • 0 kudos

@raela I also have a similar use case. I am writing data to different Databricks tables based on a column value, but I am getting an insufficient disk space error and the driver is getting killed. I suspect the df.select(colName).distinct().collect() step is taki...

1 More Replies
Leszek
by Contributor
  • 7739 Views
  • 1 replies
  • 2 kudos

IDENTITY columns generating every other number when merging

Hi, I'm doing a merge into my Delta table, which has an IDENTITY column: `Id BIGINT GENERATED ALWAYS AS IDENTITY`. The inserted data has every other number in the Id column (see the attached image). Is this expected behavior? Is there any workaround to make the numbers increase by 1?

Latest Reply
Dataspeaksss
New Contributor II
  • 2 kudos

Were you able to resolve it? I'm facing the same issue.

Mohammad_Younus
by New Contributor
  • 5149 Views
  • 0 replies
  • 0 kudos

Merge delta tables with data more than 200 million

Hi everyone, I'm trying to merge two Delta tables that each have more than 200 million rows. These tables are properly optimized. But upon running the job, it takes a long time to execute and the memory spills are huge (1 TB to 3 TB) rec...

Joe1912
by New Contributor III
  • 1301 Views
  • 0 replies
  • 0 kudos

Issue with MERGE INTO for first batch

I have source data with multiple rows and columns, one of which is city. I want to get the unique cities into another table by streaming data from the source table, so I am trying to use MERGE INTO and foreachBatch with my merge function. My merge condition is: On so...

JD2
by Contributor
  • 1495 Views
  • 0 replies
  • 0 kudos

cursor type\loop question

Hello: In my Hive metastore, I have 35 tables in a database that I want to export to Excel. I need help with a query that can loop over the tables and export them to Excel one at a time. Any help is appreciated. Thanks in advance for your kind help.

Sahha_Krishna
by New Contributor
  • 9022 Views
  • 1 replies
  • 0 kudos

Unable to start Cluster in Databricks because of `BOOTSTRAP_TIMEOUT`

Unable to start the cluster in AWS-hosted Databricks because of the reason below: { "reason": { "code": "BOOTSTRAP_TIMEOUT", "parameters": { "databricks_error_message": "[id: InstanceId(i-0634ee9c2d420edc8), status: INSTANCE_INITIALIZIN...

Data Engineering
AWS
EC2
VPC
Latest Reply
User16539034020
Databricks Employee
  • 0 kudos

Hi, Sahha: Thanks for contacting Databricks Support. This is a common type of error, which indicates that the bootstrap failed due to a misconfigured data plane network. Databricks requested EC2 instances for a new cluster, but encountered a long ...

feng_2014
by New Contributor
  • 1324 Views
  • 0 replies
  • 0 kudos

GeoParquet support with "Use Photon Acceleration" enabled

Hi experts, Recently our team noticed that when we use Apache Sedona to create a Parquet file in GeoParquet format, the geo metadata is not created inside the Parquet file. But if we turn off the Photon setting, everything works as ex...

Hubert-Dudek
by Esteemed Contributor III
  • 8231 Views
  • 1 replies
  • 1 kudos

The perfect table

Unlock the Power of #Databricks: The Perfect Table in 8 Simple Steps! 

perfec_table8.png perfec_table7.png perfec_table6.png perfec_table5.png
Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi @Hubert-Dudek, thank you for sharing this great post.

Madhur
by New Contributor
  • 1365 Views
  • 1 replies
  • 0 kudos
Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Hi @Madhur, The difference between Auto Optimize set on Spark Session and the one set on Delta Table lies in their scope and precedence. Auto Optimize on Spark Session will apply to all Delta tables in the current session. It is a global configuratio...

krishnaarige
by New Contributor
  • 2233 Views
  • 1 replies
  • 0 kudos

OperationalError: 250003: Failed to get the response. Hanging? method: get

OperationalError: 250003: Failed to get the response. Hanging? method: get, url: https://cdodataplatform.east-us-2.privatelink.snowflakecomputing.com:443/queries/01ae7ab6-0c04-e4bd-011c-e60552f6cf63/result?request_guid=315c25b7-f17d-4123-a2e5-6d82605...

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Could you please share the full error stack trace?

igorgatis
by New Contributor II
  • 3828 Views
  • 1 replies
  • 1 kudos

How to improve Spark UI Job Description for pyspark?

I find it quite hard to understand the Spark UI for my PySpark pipelines. For example, when one writes `spark.read.table("sometable").show()`, it shows what appears in the attached screenshot. I learned that the `DataFrame` API may actually spawn jobs before running the actual job. In the example ab...

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi @igorgatis, A polite reminder. Have you had a chance to review my colleague's reply? Please inform us if it contributes to resolving your query.

pygreg
by New Contributor
  • 1945 Views
  • 0 replies
  • 0 kudos

Workflows "Run now with different parameters" UI proposal

Hello everyone! I've been working with the Databricks platform for a few months now and I have a suggestion/proposal regarding the Workflows UI. First, let me explain what I find not so ideal. Let's say we have a job with three Notebook Tas...

Rafal9
by New Contributor II
  • 4804 Views
  • 1 replies
  • 1 kudos

DAB: NameError: name '__file__' is not defined

Hi everyone, I am running a job task using an Asset Bundle. The bundle has been validated and deployed according to: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/bundles/work-tasks Part of the databricks.yml bundle: name: etldatabricks resourc...

Akshay9
by New Contributor
  • 885 Views
  • 0 replies
  • 0 kudos

Databricks Optimization

I am trying to read 30 XML files and create a dataframe from the data of each node, but it takes a lot of time, approximately 8 minutes, to process those files. What can I do to optimize the Databricks notebook? I append the data to a Databricks Delta table.

