Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ChrisHunt
by Visitor
  • 71 Views
  • 7 replies
  • 0 kudos

Databricks external table lagging behind source files

I have a Databricks external table pointed at an S3 bucket that contains an ever-growing number of parquet files (currently around 2000 of them). Each row in the file is timestamped to indicate when it was written. A new parquet file is add...
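A minimal sketch of one common fix for this symptom, assuming the lag comes from Spark's cached file listing; the table name my_catalog.my_schema.events is hypothetical:

# Force Spark to drop its cached file listing so newly arrived
# parquet files in the S3 prefix become visible to queries.
spark.sql("REFRESH TABLE my_catalog.my_schema.events")

# If the table is partitioned and new files land in new partition
# directories, the partitions also need registering in the metastore.
spark.sql("MSCK REPAIR TABLE my_catalog.my_schema.events")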

Latest Reply
Prajapathy_NKR
New Contributor III
  • 0 kudos

@Raman_Unifeye and @Coffee77 there was a situation where a parquet file was deleted, because it was obsolete, when a VACUUM was executed. After that, the job started to fail, saying it was unable to find the parquet file, even though it was reading an...

6 More Replies
dkhodyriev1208
by Visitor
  • 30 Views
  • 3 replies
  • 2 kudos

Spark SQL INITCAP not capitalizing letters after periods in abbreviations

When using SELECT INITCAP("text (e.g., text, text, etc.)"), abbreviations with periods like "e.g." are not fully capitalized.
Current behavior:
Input: "text (e.g., text, text, etc.)"
Output: "Text (e.g., Text, Text, Etc.)"
Expected behavior:
Output: "Text ...

Latest Reply
Coffee77
Contributor III
  • 2 kudos

My solution is indeed a workaround; INITCAP is behaving as you describe. You can include another regular expression at the beginning to remove non-original "spaces", but I agree that makes it a little complex. However, no other solution so far I'm awar...
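A minimal sketch of a post-processing workaround along these lines; since the expected output in the post is truncated, it assumes the goal is to uppercase any letter that directly follows a period:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, initcap, udf
from pyspark.sql.types import StringType
import re

spark = SparkSession.builder.getOrCreate()

# Apply INITCAP first, then re-uppercase letters that follow a period,
# which INITCAP leaves lowercase inside abbreviations like "e.g.".
@udf(StringType())
def capitalize_after_periods(s):
    if s is None:
        return None
    return re.sub(r"\.([a-z])", lambda m: "." + m.group(1).upper(), s)

df = spark.createDataFrame([("text (e.g., text, text, etc.)",)], ["t"])
df.select(capitalize_after_periods(initcap(col("t"))).alias("fixed")).show(truncate=False)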

2 More Replies
Suheb
by New Contributor III
  • 17 Views
  • 1 reply
  • 0 kudos

What are common pitfalls when migrating large on-premise ETL workflows to Databricks, and how did you fix them?

When moving your big data pipelines from local servers to Databricks, what problems usually happen, and how did you fix them?

Latest Reply
Raman_Unifeye
Contributor III
  • 0 kudos

Very broad question; it depends on several factors. There were a few community discussions in the past, see if any are useful for you:
https://community.databricks.com/t5/technical-blog/6-migration-mistakes-you-don-t-want-to-make-part-1/ba-p/89199
https://co...

Dimitry
by Contributor III
  • 50 Views
  • 4 replies
  • 0 kudos

Dataframe from SQL query glitches when grouping - what is going on !?!

I have a query with some grouping, and I'm using spark.sql to run it:
skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')
It displays as the expected table. I want to split this table into batches for processing, ...

Latest Reply
Coffee77
Contributor III
  • 0 kudos

Try this code, customized the way you need: instead of using the monotonically_increasing_id function directly, use row_number over the previous result. This ensures sequential "small" numbers. This was indeed the exact solution I used to sol...
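A minimal sketch of that approach; the skus data and batch size are hypothetical stand-ins for the post's grouped query result:

from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Stand-in for the grouped query result from the post.
skus = spark.createDataFrame([("A",), ("B",), ("C",), ("D",)], ["sku"])

# monotonically_increasing_id() yields large, non-contiguous ids, so
# batch arithmetic on it misbehaves; row_number() gives sequential ones.
# (A window with no partitionBy pulls data to one partition, which is
# acceptable for assigning batch ids to a modest result set.)
w = Window.orderBy("sku")
numbered = skus.withColumn("rn", row_number().over(w))

batch_size = 2  # hypothetical
batches = numbered.withColumn("batch", ((numbered.rn - 1) / batch_size).cast("int"))
batches.show()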

3 More Replies
Aviraldb
by New Contributor
  • 44 Views
  • 3 replies
  • 0 kudos

Moving files from Volume to Workspace

Hello Team,
I am trying to move some files from a volume to the workspace:
%sh
databricks fs cp dbfs:/Volumes/workspace/default/delc/generated_scripts/*.py Workspace/Shared/Delc_Project/scripts/
I have tried every way I can think of. Please help me move them. @DataBricks @Louis_Frolio ...

Latest Reply
Prajapathy_NKR
New Contributor III
  • 0 kudos

@Aviraldb please try the below way:
%sh
cp /dbfs/Volumes/workspace/default/delc/generated_scripts/*.py /Workspace/Shared/Delc_Project/scripts/
Hope it helps.
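A Python alternative sketch, since volumes and workspace files are both exposed as local paths on the driver; the source and destination paths are copied from the thread (note volumes usually mount at /Volumes rather than /dbfs/Volumes):

import glob
import os
import shutil

src = "/Volumes/workspace/default/delc/generated_scripts"  # from the post
dst = "/Workspace/Shared/Delc_Project/scripts"             # from the post

os.makedirs(dst, exist_ok=True)
for path in glob.glob(os.path.join(src, "*.py")):
    shutil.copy(path, dst)  # copy each generated script into the workspace
    print("copied", path)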

2 More Replies
Suheb
by New Contributor III
  • 68 Views
  • 2 replies
  • 4 kudos

What strategies have you found most effective for optimizing ETL pipelines built on the Databricks Lakehouse?

If you are building data pipelines in Databricks (where data is Extracted, Transformed, and Loaded), what tips, methods, or best practices do you use to make those pipelines run faster, cheaper, and more efficiently?

Latest Reply
bianca_unifeye
New Contributor III
  • 4 kudos

When I think about optimising ETL on the Databricks Lakehouse, I split it into four layers: data layout, Spark/SQL design, platform configuration, and operational excellence. And above all: you are not building pipelines for yourself, you are building...

1 More Replies
vr
by Contributor III
  • 16 Views
  • 0 replies
  • 0 kudos

remote_query() is not working

I am trying to experiment with the remote_query() function according to the documentation. The feature is in public preview, so I assume it should be available to everyone now.
select * from remote_query(
  'my_connection',
  database => 'mydb',
  dbtable...
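For reference, a guess at the completed call shape; the dbtable value is hypothetical, and since the feature is in public preview it may also need a supported runtime and a working Lakehouse Federation connection named my_connection:

# Hypothetical completion of the truncated query from the post.
df = spark.sql("""
    SELECT *
    FROM remote_query(
        'my_connection',
        database => 'mydb',
        dbtable  => 'mytable'  -- hypothetical table name
    )
""")
df.show()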

Naveenkumar1811
by New Contributor II
  • 131 Views
  • 4 replies
  • 2 kudos

SkipChangeCommit to True Scenario on Data Loss Possibility

Hi Team,
I have the below scenario: I have a Spark Streaming job with a processing-time trigger of 3 secs, running continuously 365 days. We are performing a weekly delete job on the source of this streaming job, based on a custom retention policy. It is a D...
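A minimal sketch of a stream with skipChangeCommits enabled, assuming the source and target are Delta tables (table names and checkpoint path are hypothetical); the option makes the reader ignore commits that only delete or update existing rows, such as the weekly retention delete:

# Read the Delta source, skipping delete/update-only commits
# (e.g. the weekly retention DELETE) instead of failing on them.
stream = (
    spark.readStream
    .option("skipChangeCommits", "true")
    .table("src.events")  # hypothetical source table
)

query = (
    stream.writeStream
    .option("checkpointLocation", "/Volumes/main/chk/events")  # hypothetical
    .trigger(processingTime="3 seconds")  # matches the post's trigger
    .toTable("tgt.events")  # hypothetical target table
)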

Latest Reply
Naveenkumar1811
New Contributor II
  • 2 kudos

Hi szymon/Raman,
My question was about the commit performed by the insert/append via my streaming job versus the delete operation by the weekly maintenance job... Is there a way that both transactions could fall into the same commit? I need to understand that por...

3 More Replies
Shivaprasad
by Contributor
  • 83 Views
  • 1 reply
  • 0 kudos

How can I retrieve Genie parameters and use them in a Databricks custom app

I have created a Databricks custom app and it is working. I need to pass parameters from Genie to the custom app. Can someone suggest how I can achieve this?

Latest Reply
stbjelcevic
Databricks Employee
  • 0 kudos

You can pass values between a Genie space and your Databricks App using the Genie Conversation API and by adding the Genie space as an app resource: https://docs.databricks.com/aws/en/dev-tools/databricks-apps/genie
Do you want the parameters to orig...
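A rough sketch of calling the Genie Conversation API from an app; the endpoint path, space id, and prompt below are assumptions based on the linked docs, so verify them there:

import os
import requests

host = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]
space_id = "your-genie-space-id"      # hypothetical

# Start a conversation in the Genie space (endpoint shape assumed
# from the Genie Conversation API docs linked above).
resp = requests.post(
    f"{host}/api/2.0/genie/spaces/{space_id}/start-conversation",
    headers={"Authorization": f"Bearer {token}"},
    json={"content": "total sales last quarter"},  # hypothetical prompt
)
resp.raise_for_status()
print(resp.json())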

dbernstein_tp
by New Contributor III
  • 77 Views
  • 2 replies
  • 1 kudos

Lakeflow Connect CDC error, broken links

I get this error regarding database validation when setting up a Lakeflow Connect CDC pipeline (see screenshot). The two links mentioned in the message are broken; they give me a "404 - Content Not Found" when I try to open them.

Latest Reply
Advika
Databricks Employee
  • 1 kudos

Sharing a likely doc that should help: https://learn.microsoft.com/en-us/azure/databricks/ingestion/lakeflow-connect/sql-server-utility

  • 1 kudos
1 More Replies
hobrob
by New Contributor
  • 60 Views
  • 2 replies
  • 0 kudos

UDFs for working with date ranges

Hi bricklayers,
Originally from a Teradata background and relatively new to Databricks, I needed to brush up on my Python and GitHub CI/CD skills, so I've spun up a repo for a project I'm calling Terabricks. The aim is to provide a space for mak...
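As one illustration of the kind of date-range helper such a project might collect (not taken from the repo itself), expanding a start/end pair into one row per day, similar in spirit to Teradata's EXPAND ON:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, sequence, to_date

spark = SparkSession.builder.getOrCreate()

# Expand each start/end date pair into one row per covered day.
df = spark.createDataFrame([("2024-01-01", "2024-01-05")], ["start_dt", "end_dt"])
days = df.select(
    explode(sequence(to_date("start_dt"), to_date("end_dt"))).alias("day")
)
days.show()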

Latest Reply
Raman_Unifeye
Contributor III
  • 0 kudos

Fantastic initiative @hobrob. I used Teradata for a good 5+ years, though pre-2014/15, so I will be following this closely and am very happy to contribute to it. Thanks.

1 More Replies
oye
by New Contributor II
  • 120 Views
  • 4 replies
  • 3 kudos

Resolved! Using a cluster of type SINGLE_USER to run parallel python tasks in one job

Hi, I have set up a job of multiple Spark Python tasks running in parallel. I have only set up one job cluster: single node, data security mode SINGLE_USER, using Databricks Runtime version 14.3.x-scala2.12. These parallel Spark Python tasks share so...

Latest Reply
Raman_Unifeye
Contributor III
  • 3 kudos

@oye - Variable scope is local to the individual task and does not interfere with other tasks, even if the underlying cluster is the same. In fact, the issue is normally the other way round: if we have to share a variable across tasks, then the solu...
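For that sharing case, a minimal sketch using Databricks job task values (dbutils is the built-in notebook utility; the task key and names are hypothetical, and note that task values are read by downstream tasks, not by tasks running in parallel with the producer):

# In the producing task (task key "prepare"), publish a value:
dbutils.jobs.taskValues.set(key="run_date", value="2024-01-01")

# In a downstream task, read it back by the producing task's key:
run_date = dbutils.jobs.taskValues.get(
    taskKey="prepare",
    key="run_date",
    default=None,
    debugValue="2024-01-01",  # used only when running outside a job
)
print(run_date)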

3 More Replies
