Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by CarterM (New Contributor III)
  • 5882 Views
  • 4 replies
  • 2 kudos

Resolved! Why is Spark Streaming from S3 returning thousands of files when there are only 9?

I am attempting to stream JSON endpoint responses from an S3 bucket into a Spark DLT. I have been very successful with this previously, but the difference this time is that I am storing the responses from multiple endpoints in the same S3 buck...

Attachments: 8_9 endpoint response structure · Soccer endpoint 9 · 9 endpoint responses in same s3 bucket
Latest Reply
williamyoung (New Contributor II)

Hello everyone, it seems like the issue you're encountering could be related to how Spark Streaming interprets the S3 file structure, especially when dealing with multiple sources. When files from multiple endpoints are stored in the same bucket, Spar...

3 More Replies
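For the thread above, a minimal DLT sketch of one way to scope a stream to a single endpoint's files, assuming each endpoint writes under its own S3 prefix; the bucket layout, table name, and glob pattern are hypothetical:

```python
import dlt

@dlt.table(name="soccer_responses_bronze")
def soccer_responses_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("pathGlobFilter", "*.json")          # skip objects that belong to other endpoints
        .load("s3://my-bucket/soccer_endpoint/")     # hypothetical: one prefix per endpoint
    )
```

Keeping one prefix (or one glob filter) per table stops Auto Loader from listing every object the other endpoints drop into the same bucket.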
by Ajay-Pandey (Esteemed Contributor III)
  • 1797 Views
  • 2 replies
  • 7 kudos

Rename and drop columns with Delta Lake column mapping

Hi all, Databricks now supports column rename and drop. Column mapping requires the following Delta protocols: Reader version 2 or above, Writer version 5 or above. Blog URL##Available in D...

Latest Reply
Poovarasan (New Contributor III)

The above-mentioned feature is not working in the DLT pipeline if the script has more than 4 columns.

1 More Replies
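For the column-mapping thread above, a minimal sketch of the documented rename/drop workflow outside DLT; the table and column names are hypothetical:

```python
# Enable column mapping on an existing Delta table, then rename and drop columns.
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")
spark.sql("ALTER TABLE my_table RENAME COLUMN old_name TO new_name")
spark.sql("ALTER TABLE my_table DROP COLUMN obsolete_col")
```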
by geetha_venkates (New Contributor II)
  • 9853 Views
  • 7 replies
  • 2 kudos

Resolved! How do we add a certificate file in Databricks for a spark-submit type of job?

How do we add a certificate file in Databricks for a spark-submit type of job?

Latest Reply
nicozambelli (New Contributor II)

I have the same problem... When I worked with the hive_metastore in the past, I was able to use the file system and also use API certs. Now I'm using Unity Catalog and I can't upload a certificate. Can somebody help me?

6 More Replies
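For the certificate thread above, one hedged workaround that works from a notebook context: copy the certificate from DBFS (or a volume) to the driver's local disk and point the client library at it. The paths and endpoint below are hypothetical; for a spark-submit task the file can instead be distributed with Spark's standard --files option.

```python
# Hedged sketch: stage a CA bundle in DBFS (hypothetical path), copy it to the driver's
# local filesystem, then use it for TLS verification. dbutils is available in
# notebook/driver context only.
dbutils.fs.cp("dbfs:/FileStore/certs/my_ca.pem", "file:/tmp/my_ca.pem")

import requests
resp = requests.get("https://internal-endpoint.example.com/data", verify="/tmp/my_ca.pem")
print(resp.status_code)
```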
by dispersion (New Contributor)
  • 1517 Views
  • 2 replies
  • 1 kudos

Running a large volume of SQL queries in Python notebooks. How to minimise overheads/maintenance?

I have around 200 SQL queries I'd like to run in Databricks Python notebooks. I'd like to avoid creating an ETL process for each of the 200 SQL processes. Any suggestions on how to run the queries in a way that loops through them so I have minimum am...

Latest Reply
Anonymous (Not applicable)

Hi @Chris French, hope all is well! Just wanted to check in to see if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thank...

1 More Replies
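For the thread above, a minimal sketch of looping over query files from a single notebook rather than building 200 separate ETL processes; the folder path is hypothetical and each .sql file is assumed to hold exactly one statement:

```python
from pathlib import Path

# Keep the queries as .sql files in one folder and run them in a loop from one notebook.
query_dir = Path("/Workspace/Shared/etl_queries")   # hypothetical location

for sql_file in sorted(query_dir.glob("*.sql")):
    query = sql_file.read_text()
    print(f"Running {sql_file.name}")
    spark.sql(query)                                 # one statement per file
```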
by KVNARK (Honored Contributor II)
  • 4509 Views
  • 4 replies
  • 11 kudos

Resolved! Pyspark learning path

Can anyone suggest the best series of courses offered by Databricks to learn PySpark for ETL purposes, either in the Databricks partner learning portal or the Databricks learning portal?

Latest Reply
Hubert-Dudek (Esteemed Contributor III)

To learn Databricks ETL, I highly recommend the videos made by Simon on this channel: https://www.youtube.com/@AdvancingAnalytics

3 More Replies
by BkP (Contributor)
  • 7040 Views
  • 14 replies
  • 9 kudos

Suggestion needed for an orchestrator/scheduler to schedule and execute jobs in an automated way

Hello Friends, we have an application which extracts data from various tables in Azure Databricks and we extract it to Postgres tables (Postgres installed on top of Azure VMs). After extraction we apply transformations on those datasets in Postgres tabl...
Latest Reply
VaibB (Contributor)

You can leverage Airflow, which provides a connector for the Databricks Jobs API, or you can use Databricks Workflows to orchestrate your jobs, where you can define several tasks and set dependencies accordingly.

13 More Replies
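For the orchestration thread above, a hedged sketch of the Airflow route using the Databricks provider's DatabricksRunNowOperator; the job ids, connection name, and schedule are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Trigger two pre-existing Databricks jobs with a dependency between them.
with DAG(
    dag_id="extract_then_transform",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = DatabricksRunNowOperator(
        task_id="extract_to_postgres",
        databricks_conn_id="databricks_default",
        job_id=111,   # hypothetical job id
    )
    transform = DatabricksRunNowOperator(
        task_id="transform_in_postgres",
        databricks_conn_id="databricks_default",
        job_id=222,   # hypothetical job id
    )
    extract >> transform
```

The same dependency chain can also be expressed directly as multiple tasks inside one Databricks Workflows job, without Airflow.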
by User16835756816 (Valued Contributor)
  • 3803 Views
  • 4 replies
  • 11 kudos

How can I extract data from different sources and transform it into a fresh, reliable data pipeline?

Tip: These steps are built out for AWS accounts and workspaces that are using Delta Lake. If you would like to learn more, watch this video and reach out to your Databricks sales representative for more information. Step 1: Create your own notebook or ...

Latest Reply
Ajay-Pandey (Esteemed Contributor III)

Thanks @Nithya Thangaraj​ 

3 More Replies
by KVNARK (Honored Contributor II)
  • 5155 Views
  • 8 replies
  • 28 kudos

Resolved! Can we use Databricks, or write code in Databricks, for ETL and data engineering without learning PySpark in depth?

Can we use Databricks, or write code in Databricks, without learning PySpark in depth, from an ETL and data engineering perspective? Can someone throw some light on this? Currently learning PySpark (basics of Python in handling the data) a...

Latest Reply
KVNARK (Honored Contributor II)

Thanks All for your valuable suggestions!

7 More Replies
by vjraitila (New Contributor III)
  • 1895 Views
  • 3 replies
  • 5 kudos

Strategy for streaming ETL and Delta Lake before Delta Live Tables existed

What was the established architectural pattern for doing streaming ETL with Delta Lake before DLT was a thing? And incidentally, what approach would you take in the context of delta-oss today? The pipeline definitions would not have had to be declara...

Latest Reply
Vidula (Honored Contributor)

Hi @Veli-Jussi Raitila, does @Shanmugavel Chandrakasu's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!

2 More Replies
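For the thread above, a hedged sketch of the common pre-DLT pattern: plain Structured Streaming jobs writing Delta bronze and silver tables, each with its own checkpoint. Paths, table names, and columns are hypothetical; on OSS Spark/Delta the cloudFiles (Auto Loader) source would be replaced with the regular file source.

```python
from pyspark.sql import functions as F

# Bronze: land raw JSON into a Delta table.
bronze_src = (
    spark.readStream.format("cloudFiles")                 # Auto Loader on Databricks;
    .option("cloudFiles.format", "json")                  # use .format("json") on OSS
    .load("s3://my-bucket/raw/events/")
)
(bronze_src.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/bronze_events")
    .trigger(availableNow=True)                           # batch-style incremental run
    .toTable("bronze.events"))

# Silver: read bronze incrementally and apply transformations inline.
silver_src = (
    spark.readStream.table("bronze.events")
    .withColumn("event_date", F.to_date("event_ts"))      # hypothetical column
    .dropDuplicates(["event_id"])                         # hypothetical key; unbounded state without a watermark
)
(silver_src.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/silver_events")
    .trigger(availableNow=True)
    .toTable("silver.events"))
```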
by Zair (New Contributor III)
  • 1768 Views
  • 2 replies
  • 2 kudos

How to handle 100+ tables ETL through spark structured streaming?

I am writing a streaming job which will perform ETL for more than 130 tables. I would like to know if there is a better way to do this. Another solution I am considering is to write a separate streaming job for each table. Source data is coming...

Latest Reply
artsheiko (Databricks Employee)

Hi, to answer your question it might be helpful to get more details on what you're trying to achieve and the bottleneck you're encountering now. Indeed, handling the processing of 130 tables in one monolith could be challenging, as the business rul...

1 More Replies
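For the thread above, a minimal sketch of driving many similar streams from one parameterised function instead of hand-writing 130 jobs; the table list, source/target names, and checkpoint paths are hypothetical:

```python
# Each table keeps its own checkpoint so individual streams can be restarted independently.
tables = ["orders", "customers", "payments"]   # ... extend to all 130 tables

def start_stream(table_name: str):
    src = spark.readStream.table(f"raw.{table_name}")
    return (src.writeStream
            .format("delta")
            .option("checkpointLocation", f"s3://my-bucket/_checkpoints/{table_name}")
            .trigger(availableNow=True)
            .toTable(f"bronze.{table_name}"))

queries = [start_stream(t) for t in tables]
for q in queries:
    q.awaitTermination()
```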
by sage5616 (Valued Contributor)
  • 16492 Views
  • 3 replies
  • 2 kudos

Resolved! Choosing the optimal cluster size/specs.

Hello everyone, I am trying to determine the appropriate cluster specifications/sizing for my workload: run a PySpark task to transform a batch of input Avro files to Parquet files and create or re-create persistent views on these Parquet files. This t...

Latest Reply
Anonymous (Not applicable)

If the data is 100MB, then I'd try a single node cluster, which will be the smallest and least expensive. You'll have more than enough memory to store it all. You can automate this and use a jobs cluster.

2 More Replies
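Following the reply above, a hedged sketch of what the new_cluster block of a job definition might look like for a single-node jobs cluster; the runtime version, node type, and tags are placeholders and should be checked against current documentation:

```python
# Hypothetical single-node jobs-cluster spec for a Jobs API payload.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "m5.large",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```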
by Kash (Contributor III)
  • 11867 Views
  • 18 replies
  • 13 kudos

Resolved! HELP! Converting GZ JSON to Delta causes massive CPU spikes and ETLs take days!

Hi there, I was wondering if I could get your advice. We would like to create a bronze Delta table using GZ JSON data stored in S3, but each time we attempt to read and write it our cluster's CPU spikes to 100%. We are not doing any transformations but s...

Latest Reply
Kash (Contributor III)

Hi Kaniz, thanks for the note and thank you everyone for the suggestions and help. @Joseph Kambourakis I added your suggestion to our load but I did not see any change in how our data loads or the time it takes to load data. I've done some additional ...

17 More Replies
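For the thread above, a hedged sketch of the usual mitigation: gzip files are not splittable, so each .json.gz is decompressed by a single core. Supplying an explicit schema (to skip the inference pass) and repartitioning after the read spreads the write across the cluster. The schema, paths, and partition count below are hypothetical:

```python
from pyspark.sql.types import StructField, StructType, StringType, TimestampType

# Hypothetical schema for the raw events; supplying it avoids a full inference scan
# over the compressed files.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.read
       .schema(schema)
       .json("s3://my-bucket/raw/*.json.gz"))            # hypothetical path

# Repartition before writing so the Delta write uses the whole cluster instead of
# one core per gzip file.
(raw.repartition(64)                                     # tune to cluster size
    .write.format("delta")
    .mode("append")
    .saveAsTable("bronze.events_from_gz"))
```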
by BeginnerBob (New Contributor III)
  • 2421 Views
  • 3 replies
  • 1 kudos

Loading dimensions including SCD Type 2

I have a customer dimension, and for every incremental load I am applying Type 2 or Type 1 changes to the dimension. This dimension is based off a silver table in my Delta Lake where I am applying a merge statement. What happens if I need to go back and track ad...

Latest Reply
BeginnerBob (New Contributor III)

Thanks werners, I was informed you could essentially recreate a Type 2 dimension from scratch, without reading the files one by one, using Delta Lake time travel. However, this doesn't seem to be the case, and the only way to create this is to incremen...

2 More Replies
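For the SCD Type 2 thread above, a hedged two-step sketch with the Delta Lake Python API: close out the current rows for changed keys, then append the new versions. Table and column names are hypothetical, and it assumes every row in the updates feed represents a real change:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

updates = spark.table("silver.customer_updates")                   # hypothetical change feed
dim = DeltaTable.forName(spark, "gold.dim_customer")               # hypothetical dimension table

# Step 1: close out the currently-active version of every customer present in the updates.
(dim.alias("d")
    .merge(updates.alias("u"),
           "d.customer_id = u.customer_id AND d.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false",
                            "end_date": "current_date()"})
    .execute())

# Step 2: append the incoming rows as the new active versions.
(updates
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").saveAsTable("gold.dim_customer"))
```

A single MERGE with a staged "null merge key" union is the other common variant of this pattern.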