Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by KosmaS, New Contributor III
  • 953 Views
  • 3 replies
  • 1 kudos

Skewness / Salting with countDistinct

Hey everyone, I experience data skewness for: df = (source_df.unionByName(source_df.withColumn("region", lit("Country"))).groupBy("zip_code", "region", "device_type").agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").a...

Latest Reply
Avinash_Narala
Valued Contributor II
  • 1 kudos

You can make use of the Databricks native feature Liquid Clustering: cluster by the columns you use in your grouping statements, and it will handle the performance issue caused by data skew. For more information, please visit: https://docs.dat...
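
For reference, enabling Liquid Clustering on an existing Delta table is a short operation from a notebook; a minimal sketch (the table name is hypothetical, CLUSTER BY and OPTIMIZE are the real Databricks SQL commands):

# Cluster an existing Delta table by the grouping columns from the question.
spark.sql("ALTER TABLE device_activity CLUSTER BY (zip_code, region, device_type)")
spark.sql("OPTIMIZE device_activity")  # rewrites files according to the clustering keys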

2 More Replies
by garciargs, New Contributor III
  • 228 Views
  • 2 replies
  • 2 kudos

Resolved! Incremental load from two tables

Hi, I am looking to build an ETL process for an incrementally loaded silver table. This silver table, let's say "contracts_silver", is built by joining two bronze tables, "contracts_raw" and "customer". contracts_silver (CONTRACT_ID, STATUS, CUSTOMER_NAME): 1, SIGNED, Pet...

Latest Reply
garciargs
New Contributor III
  • 2 kudos

Hi @hari-prasad, thank you! Will give it a try. Regards!
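
For readers landing here, a minimal sketch of one common approach to the setup described in the question (all table and column names are hypothetical; assumes Delta bronze tables with a last_updated timestamp and a _load_ts column on the silver table): read the new bronze rows, join, and MERGE into silver.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# High-water mark from the silver table; None on the first run.
cutoff = spark.table("contracts_silver").agg(F.max("_load_ts")).first()[0]

incremental = (
    spark.table("contracts_raw")
    .filter((F.lit(cutoff).isNull()) | (F.col("last_updated") > F.lit(cutoff)))
    .join(spark.table("customer"), "CUSTOMER_ID")
    .select("CONTRACT_ID", "STATUS", "CUSTOMER_NAME",
            F.current_timestamp().alias("_load_ts"))
)

(DeltaTable.forName(spark, "contracts_silver").alias("t")
    .merge(incremental.alias("s"), "t.CONTRACT_ID = s.CONTRACT_ID")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())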

1 More Replies
by ashraf1395, Valued Contributor
  • 85 Views
  • 1 reply
  • 1 kudos

Solution Design for an ingestion workflow with 1000s of tables for each source

Working on an ingestion workflow in Databricks which extracts data from on-prem sources, following all standard practices of incremental load, idempotency, upsert, schema evolution, etc., and storing data properly. Now we want to optimize t...

Latest Reply
Avinash_Narala
Valued Contributor II
  • 1 kudos

I did a similar kind of work in my recent project, where I needed to run many SQL DDLs, so I automated the process using Databricks Jobs: capturing the dependencies in a metadata table and creating tasks in the job through the Jobs APIs, doing...
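
A hedged sketch of that pattern using the databricks-sdk Python package (the metadata rows, notebook paths, and job name are hypothetical; cluster configuration is omitted for brevity):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

# Hypothetical metadata: (task_key, notebook_path, upstream task or None),
# as you might read it from a metadata table.
metadata = [
    ("ddl_dims",  "/Repos/etl/ddl_dims",  None),
    ("ddl_facts", "/Repos/etl/ddl_facts", "ddl_dims"),
]

tasks = [
    jobs.Task(
        task_key=key,
        notebook_task=jobs.NotebookTask(notebook_path=path),
        depends_on=[jobs.TaskDependency(task_key=dep)] if dep else None,
    )
    for key, path, dep in metadata
]

w = WorkspaceClient()  # picks up authentication from the environment
created = w.jobs.create(name="metadata-driven-ddl", tasks=tasks)
print(created.job_id)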

by analytics_eng, New Contributor II
  • 406 Views
  • 2 replies
  • 1 kudos

Connection reset by peer logging when importing custom package

Hi! I'm trying to import a custom package I published to Azure Artifacts, but I keep seeing the INFO logging below, which I don't want to display. The package was installed correctly on the cluster, and it imports successfully, but the log still appe...

Latest Reply
analytics_eng
New Contributor II
  • 1 kudos

Thanks for the suggestions. I investigated all of the above, but they didn't provide a solution. What did work was using another logging package within my custom package: Loguru. I'm not sure why this helped.
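
For anyone trying the same fix, a minimal sketch of a Loguru setup inside a package (the calls are real Loguru API; choosing WARNING as the threshold is an assumption): drop the default sink and only surface warnings and above.

import sys
from loguru import logger

logger.remove()                          # remove Loguru's default sink
logger.add(sys.stderr, level="WARNING")  # emit only WARNING and above

logger.info("connection retry details")  # suppressed
logger.warning("something notable")      # emitted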

1 More Replies
by adityarai316, New Contributor III
  • 1158 Views
  • 6 replies
  • 2 kudos

Mount points in Unity Catalog

Hi everyone, in my existing notebooks we have used mount point URLs like /mnt/, and we have more than 200 notebooks where we have used the above URLs to fetch data/files from the container. Now, as we are upgrading to Unity Catalog, these URLs will no lon...

Latest Reply
NaveenBedadala
New Contributor II
  • 2 kudos

@adityarai316, did you get a solution? I am facing the same issue.
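
For anyone facing the same migration, one mechanical approach is a small path-rewriting helper so the 200+ notebooks can be updated uniformly; a sketch (the mount-to-volume mapping below is entirely hypothetical):

# Hypothetical mapping from legacy mounts to Unity Catalog volumes.
MOUNT_TO_VOLUME = {
    "/mnt/landing": "/Volumes/main/raw/landing",
    "/mnt/curated": "/Volumes/main/curated/files",
}

def to_volume_path(path: str) -> str:
    # Rewrite a /mnt/... path to its Unity Catalog volume equivalent.
    for mount, volume in MOUNT_TO_VOLUME.items():
        if path.startswith(mount):
            return volume + path[len(mount):]
    return path

df = spark.read.parquet(to_volume_path("/mnt/landing/2024/01/"))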

5 More Replies
by michaelh, New Contributor III
  • 4071 Views
  • 5 replies
  • 4 kudos

Resolved! AWS Databricks cluster terminated. Reason: Container launch failure

We're developing a custom runtime for Databricks clusters. We need to version and archive our clusters for a client. We made it run successfully in our own environment, but we're not able to make it work in the client's environment. It's a large corporation with...

Latest Reply
NandiniN
Databricks Employee
  • 4 kudos

This appears to be an issue with the security group. Kindly review security group inbound/outbound rules.

4 More Replies
by franc_bomb, New Contributor II
  • 208 Views
  • 7 replies
  • 0 kudos

Cluster creation issue

Hello, I just started using the Databricks Community Edition for learning purposes. I have been trying to create a cluster, but the first time it failed, asking me to retry or contact support, and now it's just running forever. What could be the problem?

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Can you please perform one test: check on the cloud provider whether you are able to start a node?

6 More Replies
by leymariv, New Contributor
  • 167 Views
  • 1 reply
  • 0 kudos

Performance issue writing an extract of a huge unpartitioned single-column DataFrame

I have a huge df (40 billion rows) shared via Delta Sharing that has only one column, 'payload', which contains JSON, and that is not partitioned. Even if all those payloads are not the same, they have a common col sessionId that I need to extract to be a...

Latest Reply
hari-prasad
Valued Contributor II
  • 0 kudos

Hi @leymariv, you can check the schema of the data in the Delta Sharing table using df.printSchema() to better understand the JSON structure. Use the from_json function to flatten or normalize the data into respective columns. Additionally, you can understand how dat...
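
A minimal sketch of that suggestion (the 'payload' column is from the question; the JSON field path and the DDL schema fields are assumptions):

from pyspark.sql import functions as F

# Cheap option: pull just sessionId out of the JSON string.
with_session = df.withColumn(
    "sessionId", F.get_json_object(F.col("payload"), "$.sessionId")
)

# Fuller option: parse the payload against an explicit (assumed) DDL schema.
parsed = df.withColumn(
    "data", F.from_json("payload", "sessionId STRING, eventType STRING")
)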

by Maksym, New Contributor III
  • 8782 Views
  • 5 replies
  • 7 kudos

Resolved! Databricks Autoloader is getting stuck and does not pass to the next batch

I have a simple job scheduled every 5 min. Basically it listens to cloud files on a storage account and writes them into a delta table; extremely simple. The code is something like this: df = (spark.readStream.format("cloudFiles").option('cloudFil...

Latest Reply
lassebe
New Contributor II
  • 7 kudos

I had the same issue: files would randomly not be loaded. Setting `.option("cloudFiles.useIncrementalListing", False)` seemed to do the trick!
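
For context, that option slots into the Auto Loader reader like this (a sketch; the paths are hypothetical, the cloudFiles options are real Auto Loader settings):

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useIncrementalListing", "false")  # force full directory listing
    .option("cloudFiles.schemaLocation", "/Volumes/main/etl/_schemas/events")
    .load("abfss://landing@myaccount.dfs.core.windows.net/events/")
)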

4 More Replies
by kasiviss42, New Contributor III
  • 207 Views
  • 3 replies
  • 0 kudos

Predicate pushdown query

Does predicate pushdown work when we provide a filter on a DataFrame reading a Delta table with 2 lakh values, i.e. filter condition: column isin(list), where the list contains 2 lakh elements? I need to get n number of columns from a table; I am currently using joi...

Latest Reply
hari-prasad
Valued Contributor II
  • 0 kudos

Hi @kasiviss42, this might sound like a rhetorical question, but let's delve into the complexity of joins and filters and examine how generating a list of 2 lakh values affects it. Let's assume we have a fact table with 1 billion records and a dimension tab...
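
A sketch of the join-based alternative (table and column names are hypothetical): materialize the 2 lakh keys as a DataFrame and use a broadcast semi-join instead of an isin() with 200,000 literals, which bloats the query plan.

from pyspark.sql import functions as F

# key_list is the hypothetical Python list of ~200,000 ids.
keys_df = spark.createDataFrame([(k,) for k in key_list], ["id"])

result = (
    spark.table("fact_events")                      # hypothetical fact table
    .join(F.broadcast(keys_df), "id", "left_semi")  # keep only matching ids
    .select("id", "col_a", "col_b")
)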

2 More Replies
by NavyaSinghvi, New Contributor III
  • 2260 Views
  • 6 replies
  • 2 kudos

Resolved! File_arrival trigger in Workflow

I am using "job.trigger.file_arrival.location" in job parameters to get the triggered file location, but I am getting the error "job.trigger.file_arrival.location is not allowed". How can I get the triggered file location in a workflow?

Latest Reply
raghu2
New Contributor III
  • 2 kudos

The parameters are passed as widgets to the job. After defining the parameters in the job definition, I was able to access the data associated with each parameter with the following code: widget_names = ["loc1", "loc2", "loc3"]  # Add all expected paramete...
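
Completing the idea in that (truncated) snippet, a minimal sketch (the widget names come from the reply; the defensive try/except is an assumption):

widget_names = ["loc1", "loc2", "loc3"]

params = {}
for name in widget_names:
    try:
        params[name] = dbutils.widgets.get(name)  # job parameters surface as widgets
    except Exception:
        params[name] = None  # parameter not supplied on this run

print(params)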

5 More Replies
by adam_mich, New Contributor II
  • 604 Views
  • 10 replies
  • 0 kudos

How to Pass Data to a Databricks App?

I am developing a Databricks application using the Streamlit package. I was able to get a "hello world" app deployed successfully, but now I am trying to pass data that exists in DBFS on the same instance. I try to read a CSV saved to the DBFS bu...

Latest Reply
txti
New Contributor III
  • 0 kudos

I have the identical problem in Databricks Apps. I have tried: reading from a DBFS path using the mount form `/dbfs/myfolder/myfile` and the protocol form `dbfs:/myfolder/myfile`, and reading from Unity Volumes `/Volumes/mycatalog/mydatabase/myfolder/myfile`. Also mad...
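
One workaround sometimes suggested for Apps is to read the data through a SQL warehouse instead of file paths, using the databricks-sql-connector package; a hedged sketch (the environment variable names and the table are assumptions you would configure for your own app):

import os
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM main.app_data.my_table LIMIT 100")
        rows = cursor.fetchall()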

9 More Replies
by Harish2122, Contributor
  • 15980 Views
  • 10 replies
  • 13 kudos

Databricks SQL string_agg

Migrating some on-premise SQL views to Databricks and struggling to find conversions for some functions; the main one is the string_agg function: string_agg(field_name, ', '). Anyone know how to convert that to Databricks SQL? Thanks in advance.

Latest Reply
smueller
New Contributor II
  • 13 kudos

If not grouping by something else: SELECT array_join(collect_set(field_name), ',') field_list    FROM table
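
And for the grouped case, a sketch (table and column names are hypothetical; note that collect_list keeps duplicates, which matches string_agg more closely, while collect_set dedupes):

spark.sql("""
    SELECT customer_id,
           array_join(collect_list(field_name), ', ') AS field_list
    FROM my_table
    GROUP BY customer_id
""")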

9 More Replies
by Abdul-Mannan, New Contributor III
  • 176 Views
  • 1 reply
  • 0 kudos

Notifications have file information but the DataFrame is empty using Auto Loader file notification mode

Using DBR 13.3, I'm ingesting data from one ADLS storage account using Auto Loader with file notification mode enabled, and writing to a container in another ADLS storage account. This is older code which uses a foreachBatch sink to process the data...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Here are some potential steps and considerations to troubleshoot and resolve the issue. Permissions and configuration: ensure that the necessary permissions are correctly set up for file notification mode. This includes having the appropriate roles...
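
For reference, a sketch of an Azure file-notification Auto Loader reader showing where those credentials plug in (the cloudFiles option names are real Auto Loader settings; all IDs and paths are placeholders):

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<service-principal-client-id>")
    .option("cloudFiles.clientSecret", "<service-principal-secret>")
    .load("abfss://landing@sourceaccount.dfs.core.windows.net/events/")
)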

by thecodecache, New Contributor II
  • 1845 Views
  • 2 replies
  • 0 kudos

Transpile a SQL Script into PySpark DataFrame API equivalent code

Input SQL script (assume any dialect): SELECT b.se10, b.se3, b.se_aggrtr_indctr, b.key_swipe_ind FROM (SELECT se10, se3, se_aggrtr_indctr, ROW_NUMBER() OVER (PARTITION BY SE10 ...

Latest Reply
MathieuDB
Databricks Employee
  • 0 kudos

Hello @thecodecache, have a look at the SQLGlot project: https://github.com/tobymao/sqlglot?tab=readme-ov-file#faq It can easily transpile SQL to Spark SQL, like this: import sqlglot from pyspark.sql import SparkSession # Initialize Spark session spar...
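
A self-contained version of that (truncated) example; sqlglot.transpile and the "tsql"/"spark" dialect names are real, while the source dialect choice and the sample query are assumptions:

import sqlglot

src = """
SELECT b.se10, b.se3
FROM (SELECT se10, se3,
             ROW_NUMBER() OVER (PARTITION BY se10 ORDER BY se3) AS rn
      FROM some_table) b
WHERE b.rn = 1
"""

# Transpile from T-SQL (assumed source dialect) to Spark SQL.
spark_sql = sqlglot.transpile(src, read="tsql", write="spark")[0]
print(spark_sql)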

1 More Replies
