Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by CarterM (New Contributor III)
  • 5882 Views
  • 4 replies
  • 2 kudos

Resolved! Why is Spark Streaming from S3 returning thousands of files when there are only 9?

I am attempting to stream JSON endpoint responses from an S3 bucket into a Spark DLT. I have been very successful with this previously, but the difference this time is that I am storing the responses from multiple endpoints in the same S3 buck...

Attachments: 8_9 endpoint response structure · Soccer endpoint 9 · 9 endpoint responses in same s3 bucket
Latest Reply
williamyoung (New Contributor II)

Hello everyone, it seems like the issue you're encountering could be related to how Spark Streaming interprets the S3 file structure, especially when dealing with multiple sources. When files from multiple endpoints are stored in the same bucket, Spar...

3 More Replies
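For the thread above, a minimal DLT sketch of one way to scope a stream to a single endpoint's files, assuming each endpoint writes under its own S3 prefix; the bucket layout, table name, and glob pattern are hypothetical:

```python
import dlt

@dlt.table(name="soccer_responses_bronze")
def soccer_responses_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("pathGlobFilter", "*.json")          # skip objects that belong to other endpoints
        .load("s3://my-bucket/soccer_endpoint/")     # hypothetical: one prefix per endpoint
    )
```

Keeping one prefix (or one glob filter) per table stops Auto Loader from listing every object the other endpoints drop into the same bucket.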
by Ajay-Pandey (Esteemed Contributor III)
  • 1797 Views
  • 2 replies
  • 7 kudos

Rename and drop columns with Delta Lake column mapping

Hi all, Databricks now supports column rename and drop. Column mapping requires the following Delta protocols: Reader version 2 or above, Writer version 5 or above. Blog URL##Available in D...

Latest Reply
Poovarasan (New Contributor III)

The above-mentioned feature is not working in the DLT pipeline if the script has more than 4 columns.

1 More Replies
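For the column-mapping thread above, a minimal sketch of the documented rename/drop workflow outside DLT; the table and column names are hypothetical:

```python
# Enable column mapping on an existing Delta table, then rename and drop columns.
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")
spark.sql("ALTER TABLE my_table RENAME COLUMN old_name TO new_name")
spark.sql("ALTER TABLE my_table DROP COLUMN obsolete_col")
```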
by geetha_venkates (New Contributor II)
  • 9853 Views
  • 7 replies
  • 2 kudos

Resolved! How do we add a certificate file in Databricks for a spark-submit type of job?

How do we add a certificate file in Databricks for a spark-submit type of job?

Latest Reply
nicozambelli (New Contributor II)

I have the same problem... When I worked with the hive_metastore in the past, I was able to use the file system and also use API certs. Now I'm using Unity Catalog and I can't upload a certificate. Can somebody help me?

6 More Replies
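For the certificate thread above, one hedged workaround that works from a notebook context: copy the certificate from DBFS (or a volume) to the driver's local disk and point the client library at it. The paths and endpoint below are hypothetical; for a spark-submit task the file can instead be distributed with Spark's standard --files option.

```python
# Hedged sketch: stage a CA bundle in DBFS (hypothetical path), copy it to the driver's
# local filesystem, then use it for TLS verification. dbutils is available in
# notebook/driver context only.
dbutils.fs.cp("dbfs:/FileStore/certs/my_ca.pem", "file:/tmp/my_ca.pem")

import requests
resp = requests.get("https://internal-endpoint.example.com/data", verify="/tmp/my_ca.pem")
print(resp.status_code)
```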
by dispersion (New Contributor)
  • 1517 Views
  • 2 replies
  • 1 kudos

Running a large volume of SQL queries in Python notebooks. How to minimise overheads/maintenance?

I have around 200 SQL queries I'd like to run in Databricks Python notebooks. I'd like to avoid creating an ETL process for each of the 200 SQL processes. Any suggestions on how to run the queries in a way that loops through them so I have minimum am...

Latest Reply
Anonymous (Not applicable)

Hi @Chris French, hope all is well! Just wanted to check in to see if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thank...

1 More Replies
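For the thread above, a minimal sketch of looping over query files from a single notebook rather than building 200 separate ETL processes; the folder path is hypothetical and each .sql file is assumed to hold exactly one statement:

```python
from pathlib import Path

# Keep the queries as .sql files in one folder and run them in a loop from one notebook.
query_dir = Path("/Workspace/Shared/etl_queries")   # hypothetical location

for sql_file in sorted(query_dir.glob("*.sql")):
    query = sql_file.read_text()
    print(f"Running {sql_file.name}")
    spark.sql(query)                                 # one statement per file
```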
by KVNARK (Honored Contributor II)
  • 4509 Views
  • 4 replies
  • 11 kudos

Resolved! Pyspark learning path

Can anyone suggest the best series of courses offered by Databricks to learn PySpark for ETL purposes, either in the Databricks partner learning portal or the Databricks learning portal?

Latest Reply
Hubert-Dudek (Esteemed Contributor III)

To learn Databricks ETL, I highly recommend the videos made by Simon on this channel: https://www.youtube.com/@AdvancingAnalytics

3 More Replies
by BkP (Contributor)
  • 7040 Views
  • 14 replies
  • 9 kudos

Suggestion needed for an orchestrator/scheduler to schedule and execute jobs in an automated way

Hello Friends, we have an application which extracts data from various tables in Azure Databricks and we extract it to Postgres tables (Postgres installed on top of Azure VMs). After extraction we apply transformations on those datasets in Postgres tabl...
Latest Reply
VaibB (Contributor)

You can leverage Airflow, which provides a connector for the Databricks Jobs API, or you can use Databricks Workflows to orchestrate your jobs, where you can define several tasks and set dependencies accordingly.

13 More Replies
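For the orchestration thread above, a hedged sketch of the Airflow route using the Databricks provider's DatabricksRunNowOperator; the job ids, connection name, and schedule are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Trigger two pre-existing Databricks jobs with a dependency between them.
with DAG(
    dag_id="extract_then_transform",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = DatabricksRunNowOperator(
        task_id="extract_to_postgres",
        databricks_conn_id="databricks_default",
        job_id=111,   # hypothetical job id
    )
    transform = DatabricksRunNowOperator(
        task_id="transform_in_postgres",
        databricks_conn_id="databricks_default",
        job_id=222,   # hypothetical job id
    )
    extract >> transform
```

The same dependency chain can also be expressed directly as multiple tasks inside one Databricks Workflows job, without Airflow.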
by User16835756816 (Valued Contributor)
  • 3803 Views
  • 4 replies
  • 11 kudos

How can I extract data from different sources and transform it into a fresh, reliable data pipeline?

Tip: These steps are built out for AWS accounts and workspaces that are using Delta Lake. If you would like to learn more, watch this video and reach out to your Databricks sales representative for more information. Step 1: Create your own notebook or ...

Latest Reply
Ajay-Pandey (Esteemed Contributor III)

Thanks @Nithya Thangaraj​ 

3 More Replies
by KVNARK (Honored Contributor II)
  • 5155 Views
  • 8 replies
  • 28 kudos

Resolved! Can we use Databricks, or write code in Databricks, for ETL and data engineering without learning PySpark in depth?

Can we use Databricks, or write code in Databricks, without learning PySpark in depth, from an ETL and data engineering perspective? Can someone throw some light on this? Currently learning PySpark (basics of Python in handling the data) a...

Latest Reply
KVNARK (Honored Contributor II)

Thanks All for your valuable suggestions!

7 More Replies
by vjraitila (New Contributor III)
  • 1895 Views
  • 3 replies
  • 5 kudos

Strategy for streaming ETL and Delta Lake before Delta Live Tables existed

What was the established architectural pattern for doing streaming ETL with Delta Lake before DLT was a thing? And incidentally, what approach would you take in the context of delta-oss today? The pipeline definitions would not have had to be declara...

Latest Reply
Vidula (Honored Contributor)

Hi @Veli-Jussi Raitila, does @Shanmugavel Chandrakasu's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!

2 More Replies
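For the thread above, a hedged sketch of the common pre-DLT pattern: plain Structured Streaming jobs writing Delta bronze and silver tables, each with its own checkpoint. Paths, table names, and columns are hypothetical; on OSS Spark/Delta the cloudFiles (Auto Loader) source would be replaced with the regular file source.

```python
from pyspark.sql import functions as F

# Bronze: land raw JSON into a Delta table.
bronze_src = (
    spark.readStream.format("cloudFiles")                 # Auto Loader on Databricks;
    .option("cloudFiles.format", "json")                  # use .format("json") on OSS
    .load("s3://my-bucket/raw/events/")
)
(bronze_src.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/bronze_events")
    .trigger(availableNow=True)                           # batch-style incremental run
    .toTable("bronze.events"))

# Silver: read bronze incrementally and apply transformations inline.
silver_src = (
    spark.readStream.table("bronze.events")
    .withColumn("event_date", F.to_date("event_ts"))      # hypothetical column
    .dropDuplicates(["event_id"])                         # hypothetical key; unbounded state without a watermark
)
(silver_src.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/silver_events")
    .trigger(availableNow=True)
    .toTable("silver.events"))
```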
by Zair (New Contributor III)
  • 1768 Views
  • 2 replies
  • 2 kudos

How to handle 100+ tables ETL through spark structured streaming?

I am writing a streaming job which will perform ETL for more than 130 tables. I would like to know if there is a better way to do this. Another solution I am considering is to write a separate streaming job for each table. Source data is coming...

Latest Reply
artsheiko (Databricks Employee)

Hi, to answer your question it might be helpful to get more details on what you're trying to achieve and the bottleneck you're encountering now. Indeed, handling the processing of 130 tables in one monolith could be challenging, as the business rul...

1 More Replies
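For the thread above, a minimal sketch of driving many similar streams from one parameterised function instead of hand-writing 130 jobs; the table list, source/target names, and checkpoint paths are hypothetical:

```python
# Each table keeps its own checkpoint so individual streams can be restarted independently.
tables = ["orders", "customers", "payments"]   # ... extend to all 130 tables

def start_stream(table_name: str):
    src = spark.readStream.table(f"raw.{table_name}")
    return (src.writeStream
            .format("delta")
            .option("checkpointLocation", f"s3://my-bucket/_checkpoints/{table_name}")
            .trigger(availableNow=True)
            .toTable(f"bronze.{table_name}"))

queries = [start_stream(t) for t in tables]
for q in queries:
    q.awaitTermination()
```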
by sage5616 (Valued Contributor)
  • 16492 Views
  • 3 replies
  • 2 kudos

Resolved! Choosing the optimal cluster size/specs.

Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...

Latest Reply
Anonymous (Not applicable)

If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.

2 More Replies
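Following the reply above, a hedged sketch of what the new_cluster block of a job definition might look like for a single-node jobs cluster; the runtime version, node type, and tags are placeholders and should be checked against current documentation:

```python
# Hypothetical single-node jobs-cluster spec for a Jobs API payload.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "m5.large",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```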
by Kash (Contributor III)
  • 11867 Views
  • 18 replies
  • 13 kudos

Resolved! HELP! Converting GZ JSON to Delta causes massive CPU spikes and ETLs take days!

Hi there, I was wondering if I could get your advice. We would like to create a bronze Delta table using GZ JSON data stored in S3, but each time we attempt to read and write it our cluster's CPU spikes to 100%. We are not doing any transformations but s...

Latest Reply
Kash (Contributor III)

Hi Kaniz, thanks for the note and thank you everyone for the suggestions and help. @Joseph Kambourakis I added your suggestion to our load but I did not see any change in how our data loads or the time it takes to load data. I've done some additional ...

17 More Replies
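For the thread above, a hedged sketch of the usual mitigation: gzip files are not splittable, so each .json.gz is decompressed by a single core. Supplying an explicit schema (to skip the inference pass) and repartitioning after the read spreads the write across the cluster. The schema, paths, and partition count below are hypothetical:

```python
from pyspark.sql.types import StructField, StructType, StringType, TimestampType

# Hypothetical schema for the raw events; supplying it avoids a full inference scan
# over the compressed files.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.read
       .schema(schema)
       .json("s3://my-bucket/raw/*.json.gz"))            # hypothetical path

# Repartition before writing so the Delta write uses the whole cluster instead of
# one core per gzip file.
(raw.repartition(64)                                     # tune to cluster size
    .write.format("delta")
    .mode("append")
    .saveAsTable("bronze.events_from_gz"))
```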
by BeginnerBob (New Contributor III)
  • 2421 Views
  • 3 replies
  • 1 kudos

Loading dimensions including SCD Type 2

I have a customer dimension, and for every incremental load I am applying Type 2 or Type 1 changes to the dimension. This dimension is based off a silver table in my Delta Lake where I am applying a merge statement. What happens if I need to go back and track ad...

Latest Reply
BeginnerBob (New Contributor III)

Thanks werners, I was informed you could essentially recreate a Type 2 dimension from scratch, without reading the files one by one, using Delta Lake time travel. However, this doesn't seem to be the case, and the only way to create this is to incremen...

2 More Replies
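For the SCD Type 2 thread above, a hedged two-step sketch with the Delta Lake Python API: close out the current rows for changed keys, then append the new versions. Table and column names are hypothetical, and it assumes every row in the updates feed represents a real change:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

updates = spark.table("silver.customer_updates")                   # hypothetical change feed
dim = DeltaTable.forName(spark, "gold.dim_customer")               # hypothetical dimension table

# Step 1: close out the currently-active version of every customer present in the updates.
(dim.alias("d")
    .merge(updates.alias("u"),
           "d.customer_id = u.customer_id AND d.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false",
                            "end_date": "current_date()"})
    .execute())

# Step 2: append the incoming rows as the new active versions.
(updates
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").saveAsTable("gold.dim_customer"))
```

A single MERGE with a staged "null merge key" union is the other common variant of this pattern.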