by sanjay • Valued Contributor II
- 7621 Views
- 0 replies
- 0 kudos
Hi, I have a data pipeline that runs continuously, processes micro-batch data, and stores it in Delta Lake. This takes care of any new data. But at times I need to process historical data without disturbing the real-time processing. Is th...
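A common pattern for this kind of backfill (a sketch only, not from the thread; all paths are hypothetical) is to run the historical load as a separate batch job that appends to the same Delta table, relying on Delta Lake's optimistic concurrency control so the streaming writer keeps running undisturbed:

```python
# Hypothetical one-off backfill job, run alongside the streaming pipeline.
# Append-only batch writes do not conflict with a concurrent streaming
# append to the same Delta table.
historical = (spark.read
              .format("parquet")
              .load("/mnt/archive/2022/"))   # hypothetical historical source

(historical.write
 .format("delta")
 .mode("append")
 .save("/mnt/delta/events"))                 # same table the stream writes to
```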
- 1010 Views
- 1 reply
- 1 kudos
Hi all, when I design a streaming data pipeline with incoming moving files and use the apply changes function on the silver table, comparing changes between bronze and silver to remove duplicates based on key columns, do you know why I got ignore change to tr...
Latest Reply
@Raymond Huang: The error message "ignore changes to true" typically occurs when you are trying to apply changes to a table using Delta Lake's change data capture (CDC) feature, but you have set the option ignoreChanges to true. This option tells De...
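For reference, a minimal sketch of where that option is set (the table path is hypothetical); with ignoreChanges enabled, the stream does not fail on upstream updates or deletes, but rewritten files are re-emitted, so duplicates can reach downstream consumers:

```python
# Stream from a Delta table that receives in-place changes (e.g. via MERGE).
df = (spark.readStream
      .format("delta")
      .option("ignoreChanges", "true")   # tolerate updates/deletes upstream
      .load("/mnt/silver/customers"))    # hypothetical path
```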
- 3407 Views
- 3 replies
- 1 kudos
Delta Lake provides optimizations that can help you accelerate your data lake operations. Here’s how you can improve query speed by optimizing the layout of data in storage. There are two ways you can optimize your data pipeline: 1) Notebook Optimizat...
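As a concrete example of layout optimization (a sketch; the table and column names are hypothetical), Delta Lake's OPTIMIZE command compacts small files, and ZORDER BY co-locates rows that are commonly filtered together:

```python
# Compact small files and cluster data by a frequently filtered column.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```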
Latest Reply
Some tips from me: look for data skew; some partitions can be huge and some small because of incorrect partitioning. You can use the Spark UI for that, but also debug your code a bit (call getNumPartitions()); especially SQL can divide it unequally to parti...
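A quick way to check for the skew mentioned above (a sketch; df is any DataFrame in scope):

```python
from pyspark.sql import functions as F

# How many partitions the DataFrame currently has.
print(df.rdd.getNumPartitions())

# Row counts per partition reveal skew (fine for small/medium data).
df.groupBy(F.spark_partition_id().alias("pid")).count().show()
```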
2 More Replies
- 3981 Views
- 7 replies
- 3 kudos
I've just learned Delta Live Tables on Databricks Academy and have no environment to try it out. I'm wondering what happens to the pipeline if the notebook consists of both normal tables and DLTs. For example: Table A; DLT A that reads and cleans Table A; T...
Latest Reply
Hey @S L, as you describe it, you have a normal table, Table A, and a DLT table, Table B, so it will throw an error that your upstream table is not a streaming live table; you need to create a streaming live table, Table A, if you want to use the ou...
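To make the dependency rules concrete, here is a minimal sketch of a DLT pipeline notebook (table names and the source path are hypothetical); reading the upstream table with dlt.read() works for a non-streaming live table, whereas dlt.read_stream() would require the upstream table to be a streaming live table, as the reply notes:

```python
import dlt

@dlt.table
def table_a():
    # Hypothetical raw source.
    return spark.read.format("json").load("/mnt/raw/events")

@dlt.table
def dlt_a():
    # dlt.read() registers the dependency on table_a in the pipeline graph.
    return dlt.read("table_a").dropDuplicates(["id"])
```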
6 More Replies
- 4194 Views
- 4 replies
- 11 kudos
Tip: These steps are built out for AWS accounts and workspaces that are using Delta Lake. If you would like to learn more, watch this video and reach out to your Databricks sales representative for more information. Step 1: Create your own notebook or ...
- 3746 Views
- 5 replies
- 8 kudos
Hi everybody, Trigger.AvailableNow was released in the Databricks 10.1 runtime, and we would like to use this new feature with Auto Loader. We write all our data pipelines in Scala, and our projects import Spark as a provided dependency. If we try to sw...
Latest Reply
You can switch to Python. Depending on what you're doing and whether you're using UDFs, there shouldn't be any difference at all in terms of performance.
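For illustration, a minimal Python sketch of Auto Loader with the new trigger (all paths are hypothetical; assumes a runtime where availableNow is exposed in the Python API):

```python
df = (spark.readStream
      .format("cloudFiles")                  # Auto Loader source
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .load("/mnt/landing/events"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .trigger(availableNow=True)   # drain all available data, then stop
   .start("/mnt/delta/events"))
```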
4 More Replies
- 4374 Views
- 1 reply
- 0 kudos
I am in the process of building my data pipeline, but I am unsure of how to choose which fields in my data I should use for partitioning. What should I be considering when choosing a partitioning strategy?
Latest Reply
The important factors when choosing partition columns are:
- Even distribution of data.
- A column that is commonly or widely accessed or queried.
- Not creating multiple levels of partitioning, as you can end up with a large number of small files.
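A minimal sketch applying those factors (the column and path are hypothetical); event_date stands in for an evenly distributed, commonly queried column, kept to a single partition level:

```python
(df.write
 .format("delta")
 .partitionBy("event_date")   # low cardinality, frequently filtered, one level
 .mode("overwrite")
 .save("/mnt/delta/events"))
```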