by sanjay • Valued Contributor II
- 7621 Views
- 0 replies
- 0 kudos
Hi, I have a data pipeline that runs continuously, processes micro-batch data, and stores it in Delta Lake. This takes care of any new data. But at times I need to process historical data without disturbing the real-time processing. Is th...
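A common pattern for this kind of backfill (a sketch only, not from the thread; all paths are hypothetical) is to run the historical load as a separate batch job that appends to the same Delta table, relying on Delta Lake's optimistic concurrency control so the streaming writer keeps running undisturbed:

```python
# Hypothetical one-off backfill job, run alongside the streaming pipeline.
# Append-only batch writes do not conflict with a concurrent streaming
# append to the same Delta table.
historical = (spark.read
              .format("parquet")
              .load("/mnt/archive/2022/"))   # hypothetical historical source

(historical.write
 .format("delta")
 .mode("append")
 .save("/mnt/delta/events"))                 # same table the stream writes to
```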
- 1010 Views
- 1 reply
- 1 kudos
Hi all, when I design a streaming data pipeline with incoming moving files and use the apply changes function on the silver table, comparing changes between bronze and silver to remove duplicates based on key columns, do you know why I got ignore change to tr...
Latest Reply
@Raymond Huang: The error message "ignore changes to true" typically occurs when you are trying to apply changes to a table using Delta Lake's change data capture (CDC) feature, but you have set the option ignoreChanges to true. This option tells De...
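For reference, a minimal sketch of where that option is set (the table path is hypothetical); with ignoreChanges enabled, the stream does not fail on upstream updates or deletes, but rewritten files are re-emitted, so duplicates can reach downstream consumers:

```python
# Stream from a Delta table that receives in-place changes (e.g. via MERGE).
df = (spark.readStream
      .format("delta")
      .option("ignoreChanges", "true")   # tolerate updates/deletes upstream
      .load("/mnt/silver/customers"))    # hypothetical path
```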
- 3407 Views
- 3 replies
- 1 kudos
Delta Lake provides optimizations that can help you accelerate your data lake operations. Here’s how you can improve query speed by optimizing the layout of data in storage. There are two ways you can optimize your data pipeline: 1) Notebook Optimizat...
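As a concrete example of layout optimization (a sketch; the table and column names are hypothetical), Delta Lake's OPTIMIZE command compacts small files, and ZORDER BY co-locates rows that are commonly filtered together:

```python
# Compact small files and cluster data by a frequently filtered column.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```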
Latest Reply
Some tips from me: look for data skew; some partitions can be huge and some small because of incorrect partitioning. You can use the Spark UI for that, but also debug your code a bit (call getNumPartitions()); especially SQL can divide it unequally to parti...
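A quick way to check for the skew mentioned above (a sketch; df is any DataFrame in scope):

```python
from pyspark.sql import functions as F

# How many partitions the DataFrame currently has.
print(df.rdd.getNumPartitions())

# Row counts per partition reveal skew (fine for small/medium data).
df.groupBy(F.spark_partition_id().alias("pid")).count().show()
```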
2 More Replies
- 3981 Views
- 7 replies
- 3 kudos
I've just learned Delta Live Tables on Databricks Academy and have no environment to try it out. I'm wondering what happens to the pipeline if the notebook consists of both normal tables and DLTs. For example: Table A; DLT A that reads and cleans Table A; T...
Latest Reply
Hey @S L, as you describe it, you have a normal table, Table A, and a DLT table, Table B, so it will throw an error that your upstream table is not a streaming live table; you need to create a streaming live table, Table A, if you want to use the ou...
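To make the dependency rules concrete, here is a minimal sketch of a DLT pipeline notebook (table names and the source path are hypothetical); reading the upstream table with dlt.read() works for a non-streaming live table, whereas dlt.read_stream() would require the upstream table to be a streaming live table, as the reply notes:

```python
import dlt

@dlt.table
def table_a():
    # Hypothetical raw source.
    return spark.read.format("json").load("/mnt/raw/events")

@dlt.table
def dlt_a():
    # dlt.read() registers the dependency on table_a in the pipeline graph.
    return dlt.read("table_a").dropDuplicates(["id"])
```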
6 More Replies
- 4194 Views
- 4 replies
- 11 kudos
Tip: These steps are built out for AWS accounts and workspaces that are using Delta Lake. If you would like to learn more, watch this video and reach out to your Databricks sales representative for more information. Step 1: Create your own notebook or ...
- 3746 Views
- 5 replies
- 8 kudos
Hi everybody, Trigger.AvailableNow was released in the Databricks 10.1 runtime, and we would like to use this new feature with Auto Loader. We write all our data pipelines in Scala, and our projects import Spark as a provided dependency. If we try to sw...
Latest Reply
You can switch to Python. Depending on what you're doing and whether you're using UDFs, there shouldn't be any difference at all in terms of performance.
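For illustration, a minimal Python sketch of Auto Loader with the new trigger (all paths are hypothetical; assumes a runtime where availableNow is exposed in the Python API):

```python
df = (spark.readStream
      .format("cloudFiles")                  # Auto Loader source
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .load("/mnt/landing/events"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .trigger(availableNow=True)   # drain all available data, then stop
   .start("/mnt/delta/events"))
```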
4 More Replies
- 4374 Views
- 1 reply
- 0 kudos
I am in the process of building my data pipeline, but I am unsure of how to choose which fields in my data I should use for partitioning. What should I be considering when choosing a partitioning strategy?
Latest Reply
The important factors when choosing partition columns are:
- Even distribution of data.
- A column that is commonly or widely accessed or queried.
- Not creating multiple levels of partitioning, as you can end up with a large number of small files.
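A minimal sketch applying those factors (the column and path are hypothetical); event_date stands in for an evenly distributed, commonly queried column, kept to a single partition level:

```python
(df.write
 .format("delta")
 .partitionBy("event_date")   # low cardinality, frequently filtered, one level
 .mode("overwrite")
 .save("/mnt/delta/events"))
```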