Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by Mr__E, Contributor II
  • 1167 Views
  • 2 replies
  • 3 kudos

Sync prod WS DBs to dev WS DBs

We have a couple of sources we'd already set up to stream to prod using a third-party system. Is there a way to sync these directly to our dev workspace to build pipelines? E.g., directly connecting to a cluster in prod and pulling with a job cluster, dumping to S3 and u...
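
For reference, the "dump to S3" option mentioned in the question could be sketched roughly as below. This is only an illustrative sketch; the table names and S3 bucket are hypothetical placeholders, not anything from the thread.

# Illustrative sketch only (run in the prod workspace): export a table to a shared
# S3 location as Delta. "prod_db.events" and the bucket path are placeholders.
prod_df = spark.read.table("prod_db.events")
(prod_df.write
    .format("delta")
    .mode("overwrite")
    .save("s3://shared-sync-bucket/dev-mirror/events"))

# In the dev workspace, point a table at the synced location:
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev_db.events
    USING DELTA
    LOCATION 's3://shared-sync-bucket/dev-mirror/events'
""")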

Latest Reply
Kaniz_Fatma
Community Manager
  • 3 kudos

Hi @Erik Louie, we haven't heard from you since the last response from @Debayan Mukherjee, and I was checking back to see if his suggestions helped you. Otherwise, if you have a solution, please share it with the community, as it can be helpful to oth...

1 More Reply
by ftc, New Contributor II
  • 2326 Views
  • 3 replies
  • 0 kudos

Resolved! Multi-hop architecture for ingesting data via HTTP API

I'd like to know what the design pattern is for ingesting data via an HTTP API request. The pattern needs to use the multi-hop architecture. Do we need to ingest the JSON output to cloud storage first (not the bronze layer), then use Auto Loader to process the data further? ...

Latest Reply
artsheiko
Honored Contributor
  • 0 kudos

The API -> Cloud Storage -> Delta approach is more suitable. Auto Loader helps make sure no data is lost (it keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees), enables schema inference ev...
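
To make the API -> Cloud Storage -> Delta pattern above concrete, here is a minimal sketch of landing an API response as JSON and picking it up with Auto Loader. The endpoint, paths, and table name are assumptions for illustration only.

import json
import uuid
import requests

# 1. Land the raw API response as JSON in a cloud-storage landing zone
#    (before the bronze layer). Endpoint and paths below are hypothetical.
resp = requests.get("https://api.example.com/orders")
resp.raise_for_status()
dbutils.fs.put(f"/mnt/landing/orders/{uuid.uuid4()}.json", json.dumps(resp.json()), True)

# 2. Incrementally ingest the landing zone with Auto Loader into a bronze Delta table.
bronze = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders/schema")
    .load("/mnt/landing/orders/"))

(bronze.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders/bronze")
    .trigger(availableNow=True)
    .toTable("bronze_orders"))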

2 More Replies
by Constantine, Contributor III
  • 1932 Views
  • 1 reply
  • 4 kudos

Resolved! How to process a large Delta table with a UDF?

I have a Delta table with about 300 billion rows. Now I am performing some operations on a column using a UDF and creating another column. My code is something like this:
def my_udf(data):
    return ...
udf_func = udf(my_udf, StringType())
data...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

That UDF will be applied row by row through the Python interpreter, so it's better not to use it for such a big dataset. What you need is a vectorized pandas UDF: https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
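
For reference, a vectorized pandas UDF version of the pattern in the question might look like the sketch below. The table and column names and the transformation itself are placeholders, not the poster's actual logic.

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# A pandas UDF receives whole pandas Series (Arrow batches) on the executors,
# which avoids row-at-a-time Python overhead on very large tables.
@pandas_udf(StringType())
def my_vectorized_udf(data: pd.Series) -> pd.Series:
    return data.str.upper()  # placeholder transformation

df = spark.read.table("my_schema.big_delta_table")  # hypothetical table name
result = df.withColumn("new_column", my_vectorized_udf("existing_column"))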

by alphaRomeo, New Contributor
  • 2736 Views
  • 2 replies
  • 0 kudos

Resolved! Databricks with MySQL data source?

I have an existing data pipeline that looks like this: a small MySQL data source (around 250 GB), with data passing through Debezium / Kafka / a custom data redactor -> Glue ETL jobs, finally landing in Redshift, but the scale of the data is too sm...
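
For reference, reading a MySQL source into Databricks is commonly done with Spark's built-in JDBC reader. In the sketch below, the host, database, table, and secret scope are placeholders, not details from this thread.

# Illustrative sketch: read a MySQL table over JDBC into a Spark DataFrame.
jdbc_url = "jdbc:mysql://mysql-host.example.com:3306/shop"

df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", dbutils.secrets.get("jdbc-scope", "mysql-user"))
    .option("password", dbutils.secrets.get("jdbc-scope", "mysql-password"))
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load())

# Persist as a Delta table for downstream processing in Databricks.
df.write.format("delta").mode("overwrite").saveAsTable("bronze_orders_snapshot")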

Latest Reply
Dan_Z
Honored Contributor
  • 0 kudos

There is a lot in this question, so generally speaking I suggest you reach out to the sales team at Databricks. You can talk to a solutions architect who can get into more detail. Here are my general thoughts, having seen a lot of customer architectures: Generally,...

1 More Reply