Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Mr__E (Contributor II) • 1363 Views • 1 reply • 3 kudos

Sync prod WS DBs to dev WS DBs

We have a couple of sources we'd already set up to stream to prod using a 3p system. Is there a way to sync these directly to our dev workspace to build pipelines? E.g., directly connecting to a cluster in prod and pulling with a job cluster, dumping to S3 and u...

Latest Reply: Debayan (Databricks Employee) • 3 kudos

DBFS can be used in many ways. Please refer below:
  • Allows you to interact with object storage using directory and file semantics instead of cloud-specific API commands.
  • Allows you to mount cloud object storage locations so that you can map storage cre...

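For the mounting option the reply mentions, a minimal sketch using dbutils.fs.mount in a Databricks notebook; the bucket name and mount point are hypothetical, and the cluster is assumed to already have an instance profile with access to the bucket:

    # Hypothetical S3 bucket and mount point; assumes the cluster's instance
    # profile grants access to the bucket.
    dbutils.fs.mount(
        source="s3a://my-prod-export-bucket",
        mount_point="/mnt/prod-export",
    )

    # Files under the mount can now be read with directory/file semantics:
    display(dbutils.fs.ls("/mnt/prod-export"))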
ftc (New Contributor II) • 2808 Views • 3 replies • 0 kudos

Resolved! Multi-hop architecture for ingesting data via HTTP API

I'd like to know what the design pattern is for ingesting data via an HTTP API request. The pattern needs to use a multi-hop architecture. Do we need to ingest the JSON output to cloud storage first (not the bronze layer), then use Auto Loader to process the data further? ...

Latest Reply: artsheiko (Databricks Employee) • 0 kudos

The API -> Cloud Storage -> Delta route is the more suitable approach. Auto Loader helps ensure you don't lose any data (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees), enables schema inference ev...

2 More Replies
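As a rough illustration of the API -> Cloud Storage -> Delta flow the reply describes, a minimal Auto Loader sketch; the storage paths and table name below are hypothetical placeholders:

    # Incrementally pick up JSON files the API job has dumped to cloud storage.
    bronze = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "s3://landing/_schemas/api_dumps")
        .load("s3://landing/api-dumps/")
    )

    (bronze.writeStream
        .option("checkpointLocation", "s3://landing/_checkpoints/api_dumps")
        .trigger(availableNow=True)  # process all available files, then stop
        .toTable("bronze_api_dumps"))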
Constantine (Contributor III) • 2391 Views • 1 reply • 4 kudos

Resolved! How to process a large Delta table with a UDF?

I have a Delta table with about 300 billion rows. Now I am performing some operations on a column using a UDF and creating another column. My code is something like this:

    def my_udf(data):
        pass

    udf_func = udf(my_udf, StringType())
    data...

Latest Reply: Hubert-Dudek (Esteemed Contributor III) • 4 kudos

A plain Python UDF like that processes rows one at a time through the Python interpreter, so better not to use it for such a big dataset. What you need is a vectorized pandas UDF: https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html

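A minimal sketch of the vectorized pandas UDF the reply suggests; the table name, column names, and the transformation itself are hypothetical placeholders:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    @pandas_udf(StringType())
    def my_vec_udf(data: pd.Series) -> pd.Series:
        # Receives whole Arrow batches instead of one row at a time.
        return data.str.upper()  # placeholder transformation

    df = spark.table("my_big_delta_table")  # hypothetical table name
    df = df.withColumn("new_col", my_vec_udf("existing_col"))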
alphaRomeo (New Contributor) • 3718 Views • 2 replies • 0 kudos

Resolved! Databricks with MySQL data source?

I have an existing data pipeline which looks like this: a small MySQL data source (around 250 GB); data passes through Debezium / Kafka / a custom data redactor -> Glue ETL jobs and finally lands on Redshift, but the scale of the data is too sm...

Latest Reply: Dan_Z (Databricks Employee) • 0 kudos

There is a lot in this question, so generally speaking I suggest you reach out to the sales team at Databricks. You can talk to a solutions architect who can get into more detail. Here are my general thoughts, having seen a lot of customer architectures: Generally,...

1 More Reply
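Not from the reply itself, but for the connectivity piece of the question, a minimal sketch of reading the MySQL source into Spark over JDBC; the host, database, table, and secret scope names are hypothetical:

    # Hypothetical connection details; credentials pulled from a secret scope.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://mysql-host:3306/mydb")
        .option("dbtable", "orders")
        .option("user", dbutils.secrets.get("jdbc", "user"))
        .option("password", dbutils.secrets.get("jdbc", "password"))
        .load()
    )

    # Land the data as a Delta table for downstream processing.
    df.write.format("delta").mode("overwrite").saveAsTable("bronze_orders")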