Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

TinaDouglass
by Visitor
  • 35 Views
  • 1 reply
  • 0 kudos

Summarized Data from Source system into Bronze

Hello, We are just starting with Databricks. Quick question: we have a table in our legacy source system that summarizes values used on legacy reports and for payments in our legacy system. The business wants a dashboard on our new plat...

Latest Reply
Khaja_Zaffer
Contributor
  • 0 kudos

Hello @TinaDouglass, good day! First of all, welcome to Databricks, which is a unified platform. I took some time to answer your query in detail. Use Databricks' medallion architecture (also called bronze-silver-gold layers) to structure your data pip...
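To make the medallion idea above a bit more concrete, here is a minimal, untested sketch of the bronze to silver to gold flow; the catalog, table, and column names are placeholders, not anything from the original thread.

```python
from pyspark.sql import functions as F

# Bronze: land the legacy summary extract as-is, plus load metadata.
# Paths and table names are placeholders; `spark` is the session provided
# by a Databricks notebook.
bronze_df = (
    spark.read.format("csv").option("header", "true")
    .load("/Volumes/main/raw/legacy_summary/")
    .withColumn("_ingested_at", F.current_timestamp())
)
bronze_df.write.mode("append").saveAsTable("main.bronze.legacy_summary")

# Silver: conform types and de-duplicate on the business key.
silver_df = (
    spark.table("main.bronze.legacy_summary")
    .dropDuplicates(["summary_id"])
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
)
silver_df.write.mode("overwrite").saveAsTable("main.silver.legacy_summary")

# Gold: aggregate to the shape the payment dashboard needs.
gold_df = (
    spark.table("main.silver.legacy_summary")
    .groupBy("payment_period")
    .agg(F.sum("amount").alias("total_paid"))
)
gold_df.write.mode("overwrite").saveAsTable("main.gold.payment_summary")
```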

kmodelew
by New Contributor III
  • 82 Views
  • 7 replies
  • 12 kudos

Unable to read excel file from Volume

Hi, I'm trying to read an Excel file directly from a Volume (not the workspace or FileStore); all the examples on the internet use the workspace or FileStore. The Volume is an external location, so I can read from there, but I would like to read directly from the Volume. I hav...
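For what it's worth, here is a minimal sketch of one common approach (untested here, and the Volume path is a placeholder): Unity Catalog Volumes are exposed as paths under /Volumes/, so pandas can read the file directly and the result can be converted to a Spark DataFrame.

```python
import pandas as pd

# Placeholder Volume path: /Volumes/<catalog>/<schema>/<volume>/<file>.
# Requires openpyxl on the cluster (bundled with recent Databricks runtimes,
# otherwise install it as a library).
path = "/Volumes/my_catalog/my_schema/my_volume/report.xlsx"

pdf = pd.read_excel(path, sheet_name=0, engine="openpyxl")

# `spark` is the session provided by the Databricks notebook.
df = spark.createDataFrame(pdf)
df.show(5)
```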

Latest Reply
TheOC
Contributor
  • 12 kudos

@szymon_dybczak I think we also need to be conscious of the damage blind LLM usage can do. I'd hope it'd get caught early in a Community message, and in this case the hallucination was relatively harmless. However, there are plenty of instances on red...

6 More Replies
Michał
by Visitor
  • 48 Views
  • 2 replies
  • 2 kudos

How to process a streaming Lakeflow declarative pipeline in batches

Hi, I've got a problem and I have run out of ideas as to what else I can try. Maybe you can help? I've got a Delta table with hundreds of millions of records on which I have to perform relatively expensive operations. I'd like to be able to process some...

Latest Reply
Michał
Visitor
  • 2 kudos

Thanks @szymon_dybczak. From my experiments so far, you can set `maxFilesPerTrigger`, `maxBytesPerTrigger`, and other settings in both Python and SQL code when you declare streaming tables in declarative pipelines. However, I don't see any evidence th...
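For anyone landing here later, this is roughly what those options look like on a Python streaming table declaration; a sketch only, with a placeholder Delta source table and column, not verified against the original pipeline.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(name="big_table_processed")
def big_table_processed():
    return (
        spark.readStream
        # Cap how much each micro-batch pulls from the Delta source.
        .option("maxFilesPerTrigger", "100")
        .option("maxBytesPerTrigger", "10g")
        .table("main.bronze.big_source")  # placeholder source table
        # Stand-in for the "relatively expensive operation" from the question.
        .withColumn("payload_length", F.length(F.col("payload")))
    )
```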

1 More Reply
yinan
by New Contributor
  • 53 Views
  • 4 replies
  • 4 kudos

Resolved! Does the free version of Databricks not support external storage data sources?

1. Can the data I use with the free version of Databricks on Azure only be stored on Azure, AWS, and Google Cloud Storage? 2. Assuming the network is connected, can the paid version be used to access other publicly stored data (i.e., custom storage spac...

Latest Reply
BS_THE_ANALYST
Honored Contributor III
  • 4 kudos

Not sure if this is a cheeky way to get around bringing files in: https://community.databricks.com/t5/data-engineering/connect-to-azure-data-lake-storage-using-databricks-free-edition/m-p/127900#M48116 but I answered a similar thing on a different po...

3 More Replies
ManoramTaparia
by Visitor
  • 37 Views
  • 1 reply
  • 1 kudos

Identify updated rows during incremental refresh in DLT Materialized Views

Hello, every time that I run a Delta Live Tables materialized view in serverless, I get a log of "COMPLETE RECOMPUTE". I realised I was using current_timestamp as a column in my MV to identify rows which got updated in the last refresh. But that make...

Latest Reply
ck7007
New Contributor II
  • 1 kudos

@ManoramTaparia The issue is that current_timestamp() makes your MV non-deterministic, forcing complete recomputes. Here's how to fix it. Solution: use the source table's change tracking. Option 1: leverage the source table's timestamp column. @dlt.table(name...
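To sketch what that first option might look like in code (the source table and its `last_modified` column are assumptions on my part, not the poster's schema):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(name="orders_mv")
def orders_mv():
    # Deterministic: the "updated" marker comes from the source row itself,
    # not from current_timestamp() evaluated at refresh time, so the
    # materialized view stays eligible for incremental refresh.
    return (
        spark.read.table("main.silver.orders")  # placeholder source table
        .withColumn("row_updated_at", F.col("last_modified"))
    )
```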

yinan
by New Contributor
  • 84 Views
  • 5 replies
  • 2 kudos
Latest Reply
Khaja_Zaffer
Contributor
  • 2 kudos

Hello @yinan, good day! Databricks, being a cloud-based platform, does not have direct built-in support for reading data from a truly air-gapped (completely offline, no network connectivity) Cloudera Distribution for Hadoop (CDH) environment. In such...

4 More Replies
Kurgod
by New Contributor II
  • 150 Views
  • 2 replies
  • 0 kudos

Using Databricks to transform a Cloudera lakehouse on-prem without bringing the data to the cloud

I am looking for a solution to connect Databricks to a Cloudera lakehouse hosted on-prem and transform the data using Databricks without bringing the data into Databricks Delta tables or cloud storage. Once the transformation is done, the data needs to be ...

Latest Reply
BR_DatabricksAI
Contributor III
  • 0 kudos

Hello, what is your data volume? You can connect using JDBC/ODBC, but this process will be slower if the data volume is too high. Another option: if your Cloudera storage is in HDFS, you can also connect through the HDFS API.
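A rough illustration of the JDBC route; the URL, credentials, secret scope, and driver class are placeholders, the Hive JDBC driver would need to be installed on the cluster, and whether write-back over JDBC works depends on the driver and dialect.

```python
# Read an on-prem Hive table over JDBC without first copying it to cloud storage.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:hive2://onprem-host:10000/default")  # placeholder
    .option("dbtable", "sales.transactions")                  # placeholder
    .option("user", "svc_databricks")
    .option("password", dbutils.secrets.get("onprem", "hive_password"))
    .option("driver", "org.apache.hive.jdbc.HiveDriver")
    .load()
)

# Transform in Databricks, then push the result back over JDBC
# instead of landing it in Delta or cloud storage.
result = df.groupBy("region").count()
(
    result.write.format("jdbc")
    .option("url", "jdbc:hive2://onprem-host:10000/default")
    .option("dbtable", "sales.transactions_summary")
    .option("driver", "org.apache.hive.jdbc.HiveDriver")
    .mode("overwrite")
    .save()
)
```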

1 More Reply
Nabbott
by Visitor
  • 22 Views
  • 0 replies
  • 0 kudos

Databricks Genie

I have curated silver and gold tables in Advana that feed downstream applications. Other organizations also create tables for their own use. Can Databricks Genie query across tables from different pipelines within the same organization and across mul...

azam-io
by New Contributor II
  • 593 Views
  • 4 replies
  • 2 kudos

How can I structure pipeline-specific job params separately in a Databricks Asset Bundle?

Hi all, I am working with Databricks Asset Bundles and want to separate environment-specific job params (for example, for "env" and "dev") for each pipeline within my bundle. I need each pipeline to have its own job param values for different environ...

Latest Reply
Michał
Visitor
  • 2 kudos

Hi azam-io, were you able to solve your problem? Are you trying to have different parameters depending on the environment, or a different parameter value? I think the targets would allow you to specify different parameters per environment/target. As fo...

3 More Replies
seefoods
by Contributor II
  • 1667 Views
  • 2 replies
  • 1 kudos

Resolved! Asset bundles

Hello guys, I am working on asset bundles. I want to make them generic for all teams (analytics, data engineering). Could someone share a best practice for this purpose? Cordially,

Latest Reply
Michał
Visitor
  • 1 kudos

Hi seefoods, were you able to achieve that generic asset bundle setup? I've been working on something potentially similar, and I'd be happy to discuss it, hoping to share experiences. While what I have works for a few teams, it is focused on declar...

1 More Reply
SharathE
by New Contributor III
  • 1886 Views
  • 3 replies
  • 1 kudos

Incremental refresh of materialized view in serverless DLT

Hello, every time that I run a Delta Live Tables materialized view in serverless, I get a log of "COMPLETE RECOMPUTE". How can I achieve incremental refresh in serverless DLT pipelines?

Latest Reply
drewipson
New Contributor III
  • 1 kudos

Make sure you are using the aggregates and SQL restrictions outlined in this article: https://docs.databricks.com/en/optimizations/incremental-refresh.html If a SQL function is non-deterministic (current_timestamp() is a common one), you will have a CO...

2 More Replies
korijn
by New Contributor II
  • 647 Views
  • 4 replies
  • 0 kudos

Git integration inconsistencies between git folders and job git

It's a little confusing and limiting that the Git integration support is inconsistent between the two options available. Sparse checkout is only supported when using a workspace Git folder, and checking out by commit hash is only supported when using ...

Latest Reply
_J
New Contributor II
  • 0 kudos

Same here, could be a good improvement for the jobs layer guys!

3 More Replies
ck7007
by New Contributor II
  • 44 Views
  • 2 replies
  • 1 kudos

Streaming Solution

Maintain Zonemaps with Streaming Writes. Challenge: streaming breaks zonemaps due to constant micro-batches. Solution: incremental updates. def write_streaming_with_zonemap(stream_df, table_path): def update_zonemap(batch_df, batch_id): # Write data batch_d...
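Reading between the truncated lines, the approach looks like a foreachBatch sink that writes each micro-batch and then appends that batch's min/max key range to a small stats table. The sketch below is my own reconstruction under those assumptions; paths, column names, and the zonemap layout are placeholders, not the author's actual code.

```python
from pyspark.sql import functions as F

def write_streaming_with_zonemap(stream_df, table_path, zonemap_path, key_col="id"):
    def update_zonemap(batch_df, batch_id):
        # 1. Write the micro-batch to the main Delta table.
        batch_df.write.format("delta").mode("append").save(table_path)

        # 2. Append this batch's min/max range for the key column to a small
        #    "zonemap" table that queries can consult to prune data.
        stats = batch_df.agg(
            F.min(key_col).alias("min_key"),
            F.max(key_col).alias("max_key"),
        ).withColumn("batch_id", F.lit(batch_id))
        stats.write.format("delta").mode("append").save(zonemap_path)

    return (
        stream_df.writeStream
        .foreachBatch(update_zonemap)
        .option("checkpointLocation", "/tmp/checkpoints/zonemap_stream")  # placeholder
        .start()
    )
```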

Latest Reply
ManojkMohan
Contributor III
  • 1 kudos

@ck7007 Yes, I am interested in collaborating. I am structuring the problem like below. The challenge is: how can we leverage the query performance benefits of zonemaps without sacrificing the ingestion performance of a streaming pipeline? Problem Statem...

1 More Reply
stucas
by New Contributor
  • 24 Views
  • 0 replies
  • 0 kudos

DLT Pipeline and Pivot tables

TLDR: Can DLT determine a dynamic schema, one which is generated from the results of a pivot? Issue: I know you can't use Spark `.pivot` in a DLT pipeline, and that if you wish to pivot data you need to do that outside of the DLT-decorated functions. I have...
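Not an answer to the dynamic-schema question itself, but the workaround I'm aware of (an assumption, not something confirmed in this thread) is to fix the pivot values up front so the output schema is static, and replace `.pivot` with conditional aggregation:

```python
import dlt
from pyspark.sql import functions as F

# Pivot values fixed up front (e.g. from pipeline configuration) so the
# output schema is known before the pipeline graph is resolved.
PIVOT_VALUES = ["electronics", "clothing", "grocery"]  # placeholder categories

@dlt.table(name="sales_by_category")
def sales_by_category():
    src = spark.read.table("main.silver.sales")  # placeholder source table
    # Conditional aggregation produces one column per known pivot value,
    # avoiding .pivot() and keeping the schema static.
    aggs = [
        F.sum(F.when(F.col("category") == v, F.col("amount"))).alias(f"amount_{v}")
        for v in PIVOT_VALUES
    ]
    return src.groupBy("order_date").agg(*aggs)
```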

