Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

yit
by Databricks Partner
  • 1118 Views
  • 3 replies
  • 2 kudos

Resolved! Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)

Hey everyone, I'm trying to clear up some confusion in Auto Loader regarding trigger batches and micro-batches when using .forEachBatch. Here's what I understand so far: Trigger batch – controlled by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTr...

Data Engineering
autoloader
batch
micro-batch
spark
Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @yit, 1. They are not quite the same. Trigger batch defines how many new files Auto Loader lists for ingestion per streaming trigger (this is controlled, as you correctly pointed out, by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigge...

2 More Replies
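The distinction discussed in this thread can be sketched in PySpark. This is a minimal sketch, not the poster's actual pipeline: the paths, option values, and table name are hypothetical, and it assumes a Databricks runtime where a `SparkSession` is available. The two trigger-sizing options cap how much Auto Loader lists per streaming trigger, while `foreachBatch` receives each resulting micro-batch as a DataFrame.

```python
# Sketch of Auto Loader trigger sizing vs. foreachBatch.
# Option values, paths, and the target table are hypothetical.
AUTOLOADER_OPTIONS = {
    "cloudFiles.format": "json",
    # Cap how many new files / bytes are listed per streaming trigger.
    "cloudFiles.maxFilesPerTrigger": "1000",
    "cloudFiles.maxBytesPerTrigger": "10g",
}

def process_batch(batch_df, batch_id):
    """Called once per micro-batch; batch_df holds that trigger's files."""
    batch_df.write.mode("append").saveAsTable("raw.events")

def start_stream(spark, source_path, checkpoint_path):
    """Wire the options and the per-batch handler onto one streaming query."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in AUTOLOADER_OPTIONS.items():
        reader = reader.option(key, value)
    return (
        reader.load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .foreachBatch(process_batch)
        .start()
    )
```
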
xavier_db
by Databricks Partner
  • 482 Views
  • 1 reply
  • 0 kudos

Postgres Lakeflow Connect

I want to get data from Postgres using Lakeflow Connect every 10 minutes. How do I set up Lakeflow Connect? Can you give a step-by-step process for creating a Lakeflow Connect pipeline?

Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @xavier_db, the Postgres Lakeflow connector is currently in private preview according to the thread below: Solved: Lakeflow Connect - Postgres connector - Databricks Community - 127633. But the thing is, I cannot see it in Workspace Preview and Account Previe...

ck7007
by Contributor II
  • 704 Views
  • 3 replies
  • 3 kudos

Advanced Technique

Reduced Monthly Databricks Bill from $47K to $12.7K. The Problem: We were scanning 2.3TB for queries needing only 8GB of data. Three Quick Wins: 1. Multi-dimensional Partitioning (30% savings) # Before: df.write.partitionBy("date").parquet(path) # After: parti...

Latest Reply
BS_THE_ANALYST
Databricks Partner

@ck7007 no worries. I asked a question on the other thread: https://community.databricks.com/t5/data-engineering/cost/td-p/130078. I'm not sure if you're classing this thread as the duplicate or the other one, so I'll repost. I didn't see you mention ...

2 More Replies
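The "multi-dimensional partitioning" win in the excerpt can be sketched as follows. The column names are hypothetical (the excerpt is truncated before the full "after" code); the idea is simply to partition on the columns queries actually filter by, so the engine can prune files instead of scanning everything.

```python
# Sketch of the before/after partitioning change from the excerpt.
# Column names and the path are hypothetical.

def partition_columns(multi_dimensional: bool) -> list:
    """Single-column layout vs. partitioning on the columns queries filter by."""
    if multi_dimensional:
        # More pruning dimensions -> fewer files scanned per filtered query.
        return ["date", "region", "event_type"]
    return ["date"]

def write_events(df, path, multi_dimensional=True):
    """Write the DataFrame partitioned by the chosen columns."""
    df.write.partitionBy(*partition_columns(multi_dimensional)).parquet(path)
```

Whether this saves money depends on the filters real queries use; partitioning on a high-cardinality column can backfire by creating many small files.
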
Pratikmsbsvm
by Contributor
  • 856 Views
  • 2 replies
  • 2 kudos

Resolved! Read Files from Adobe and Push to Delta table ADLS Gen2

The upstream is sending 2 files of different schemas. The Storage Account has Private Endpoints; there is no public access. No public IP (NPIP) = yes. How to design using only Databricks: 1. Databricks API to read data file from Adobe and push it to AD...

Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @Pratikmsbsvm, okay, since you're going to use Databricks compute for data extraction and you wrote that your workspace is deployed with the secure cluster connectivity (NPIP) option enabled, you first need to make sure that you have a stable egre...

1 More Replies
brian999
by Contributor
  • 4646 Views
  • 5 replies
  • 2 kudos

Resolved! Managing libraries in workflows with multiple tasks - need to configure a list of libs for all tasks

I have workflows with multiple tasks, each of which needs 5 different libraries to run. When I have to update those libraries, I have to go in and make the update in each and every task. So for one workflow I have 20 different places where I have to g...

Latest Reply
brian999
Contributor

Actually, I think I found most of a solution here in one of the replies: https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters/m-p/37365/highlight/true#M245. It seems like I only have to define libs for the...

4 More Replies
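One way to avoid updating the same library list in 20 places, if the workflow is managed as a Databricks Asset Bundle, is a plain YAML anchor: define the list once and reuse it per task. This is a hedged sketch, not the solution from the linked reply; the job name, notebook paths, and package versions are hypothetical.

```yaml
# Sketch: one YAML anchor (&shared_libs) defines the library list once,
# and every task reuses it (*shared_libs). Names/versions are hypothetical.
resources:
  jobs:
    my_workflow:
      name: my_workflow
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest.py
          libraries: &shared_libs
            - pypi:
                package: requests==2.32.3
            - pypi:
                package: pandas==2.2.2
        - task_key: transform
          notebook_task:
            notebook_path: ./notebooks/transform.py
          libraries: *shared_libs
```

Updating the anchor's list then updates every task on the next `databricks bundle deploy`.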
guilhermecs001
by New Contributor II
  • 461 Views
  • 1 reply
  • 2 kudos

How to work with 300 billion rows and 5 columns?

Hi guys! I'm having a problem at work where I need to process a customer dataset with 300 billion rows and 5 columns. The transformations I need to perform are "simple," like joins to assign characteristics to customers. And at the end of the pro...

Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @guilhermecs001, wow, that's a massive amount of rows. Can you preprocess this huge CSV file first? For example, read the CSV, partition by some columns that make sense (maybe the country the customer is coming from) and save that data as de...

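The preprocessing idea in the reply can be sketched as a one-time conversion: read the raw CSV once, write it out as a Delta table partitioned by a low-cardinality column, then run the joins against the partitioned copy. The paths and the `country` column are hypothetical, and this assumes a `SparkSession` is available.

```python
# Sketch of the reply's suggestion: convert the huge CSV into partitioned
# Delta once, so later joins read only relevant partitions.
# Paths and the partition column are hypothetical.

def preprocess(spark, csv_path, delta_path):
    """One-time conversion from raw CSV to country-partitioned Delta."""
    (
        spark.read.option("header", "true").csv(csv_path)
        .repartition("country")            # co-locate rows per partition value
        .write.partitionBy("country")
        .format("delta")
        .mode("overwrite")
        .save(delta_path)
    )
```

At 300 billion rows, the partition column should have modest cardinality (hundreds of values, not millions) to avoid a small-files problem.
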
Sainath368
by Contributor
  • 434 Views
  • 1 reply
  • 2 kudos

Is Photon Acceleration Helpful for All Maintenance Tasks (OPTIMIZE, VACUUM, ANALYZE_COMPUTE_STATS)?

Hi everyone, we're currently reviewing the performance impact of enabling Photon acceleration on our Databricks jobs, particularly those involving table maintenance tasks. Our job includes three main operations: OPTIMIZE, VACUUM, and ANALYZE_COMPUTE_S...

Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @Sainath368, I wouldn't use Photon for this kind of task. You should use it primarily for ETL transformations, where it shines. VACUUM and OPTIMIZE are more maintenance tasks, and using Photon would be pricey overkill here. According to documentatio...

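For reference, the three maintenance operations the thread discusses are plain SQL statements, so per the reply they can be scheduled on a cheaper non-Photon jobs cluster. The table name below is hypothetical.

```sql
-- The three maintenance operations from the thread, runnable on a
-- non-Photon jobs cluster. The table name is hypothetical.
OPTIMIZE main.sales.orders;
VACUUM main.sales.orders RETAIN 168 HOURS;   -- 168 hours = the default 7-day retention
ANALYZE TABLE main.sales.orders COMPUTE STATISTICS;
```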
merca
by Valued Contributor II
  • 14634 Views
  • 13 replies
  • 7 kudos

Value array {{QUERY_RESULT_ROWS}} in Databricks SQL alerts custom template

Please include in documentation an example how to incorporate the `QUERY_RESULT_ROWS` variable in the custom template.

Latest Reply
CJK053000
New Contributor III

Databricks confirmed this was an issue on their end and it should be resolved now. It is working for me.

12 More Replies
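Since the original request was for a documentation example, here is a hedged sketch of a custom alert template iterating `QUERY_RESULT_ROWS` mustache-style. The `city` column name is hypothetical, and the exact section syntax should be verified against the current Databricks SQL alerts documentation rather than taken from this sketch.

```html
<!-- Hedged sketch: iterate QUERY_RESULT_ROWS mustache-style and emit one
     list item per row. The column name `city` is hypothetical. -->
<ul>
  {{#QUERY_RESULT_ROWS}}
    <li>{{city}}</li>
  {{/QUERY_RESULT_ROWS}}
</ul>
```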
Phani1
by Databricks MVP
  • 826 Views
  • 2 replies
  • 1 kudos

Resolved! cosmosdb metadata integration with unity catalog

Hi Team, how can we integrate Cosmos DB metadata with Unity Catalog? Can you please provide some insights on this? Regards, Phani

Latest Reply
Khaja_Zaffer
Esteemed Contributor

Hello @Phani1, good day. I have found a whole document on your requirements: https://community.databricks.com/t5/technical-blog/optimising-data-integration-and-serving-patterns-with-cosmos-db/ba-p/91977 It has a project with it as well.

1 More Replies
Datalight
by Contributor
  • 550 Views
  • 1 reply
  • 0 kudos

Resolved! How to build Data Pipeline to consume data from Adobe Campaign to Azure Databricks

Could someone please help me design the pipeline with Databricks? I don't have any control over Adobe. How to set up a data pipeline that moves CSV files from Adobe to ADLS Gen2 via a cron job, using Databricks? Where will this cron job execute? How ADLS ...

Latest Reply
Khaja_Zaffer
Esteemed Contributor

Hello @Datalight, good day! Can I please ask what you mean by "you don't have any control over Adobe"? I found a similar case study over here: https://learn.microsoft.com/en-us/answers/questions/5533633/data-pipeline-to-push-files-from-external-system...

SangNguyen
by New Contributor III
  • 1793 Views
  • 8 replies
  • 5 kudos

Resolved! Cannot deploy DAB with the Job branch using a feature branch in Workspace UI

Hi, I tried to deploy DAB on Workspace UI with a feature branch (sf-trans-seq) targeted to Dev. After deploying successfully, the Job branch is, however, using the master branch (see the screenshot below).Is there any option to force the Job branch t...

Screenshot: Issue - DAB Deployment on Workspace UI.png
Latest Reply
-werners-
Esteemed Contributor III

I agree. Can you mark your (or someone else's) answer as solved? Because I think you won't be the only one with this issue/feature.

7 More Replies
xavier_db
by Databricks Partner
  • 955 Views
  • 1 reply
  • 1 kudos

Resolved! Mongodb connection in GCP Databricks

I am trying to connect to MongoDB from Databricks which is UC enabled, and both MongoDB and Databricks are in the same VPC. I am using the below code: df = ( spark.read.format("mongodb") .option( "connection.uri", f'''mongodb://{username}:{password...

Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @xavier_db, standard access mode has more limitations compared to dedicated access mode. For example, look at the limitations list of standard access mode: Standard compute requirements and limitations | Databricks on AWS. Now, compare it to dedicated...

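The read pattern from the post can be sketched with the connection string assembled separately, which also makes the URI easy to test outside a cluster. The host, database, and collection names are hypothetical, and this assumes the Spark MongoDB connector is installed on the cluster and a `SparkSession` exists.

```python
# Sketch of the Spark MongoDB connector read from the post.
# Host/database/collection names are hypothetical.

def mongo_uri(username, password, host="mongo.internal:27017"):
    """Assemble the connection string used by the connector."""
    return f"mongodb://{username}:{password}@{host}"

def read_collection(spark, username, password, database, collection):
    """Read one MongoDB collection into a DataFrame."""
    return (
        spark.read.format("mongodb")
        .option("connection.uri", mongo_uri(username, password))
        .option("database", database)
        .option("collection", collection)
        .load()
    )
```
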
fix_databricks
by New Contributor II
  • 4602 Views
  • 3 replies
  • 0 kudos

Cannot run another notebook from same directory

Hello, I am having a similar problem to this thread, which was never resolved: https://community.databricks.com/t5/data-engineering/unexpected-error-while-calling-notebook-string-matching-regex-w/td-p/18691 I renamed a notebook (utility_data_wrangli...

Latest Reply
ddundovic
New Contributor III

I am running into the same issue. It seems like the `%run` magic command is trying to parse the entire cell content as its arguments. So if you have %run "my_notebook" print("hello") in the same cell, you will get the following error: `Failed to parse...

2 More Replies
Raj_DB
by Contributor
  • 3630 Views
  • 9 replies
  • 12 kudos

Resolved! Pass Notebook parameters dynamically in Job task.

Hi Everyone, I'm working on scheduling a job and would like to pass parameters that I've defined in my notebook. Ideally, I'd like these parameters to be dynamic, meaning that if I update their values in the notebook, the scheduled job should automati...

Latest Reply
ck7007
Contributor II

I see you're using dbutils.widgets.text and dropdown, which is perfect. You're already on the right track. Quick solution: your widgets are already dynamic! Just pass parameters in your job configuration. In your notebook (slight refactor of your code): # Define w...

8 More Replies
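The widget pattern the reply describes can be sketched as below. This assumes a Databricks notebook where `dbutils` exists; the widget names and default values are hypothetical. Notebook widgets double as job parameters: a scheduled run that passes a task parameter with the same name overrides the widget's default.

```python
# Sketch of notebook widgets that also serve as job parameters.
# Widget names and defaults are hypothetical.
WIDGET_DEFAULTS = {"env": "dev", "run_date": "2025-01-01"}

def define_widgets(dbutils):
    """Create one text widget per parameter, seeded with its default."""
    for name, default in WIDGET_DEFAULTS.items():
        dbutils.widgets.text(name, default)

def get_params(dbutils):
    """Read current values; job task parameters with the same names win."""
    return {name: dbutils.widgets.get(name) for name in WIDGET_DEFAULTS}
```
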
Erik
by Valued Contributor III
  • 20437 Views
  • 13 replies
  • 8 kudos

Grafana + databricks = True?

We have some timeseries in Databricks, and we are reading them into Power BI through SQL compute endpoints. For timeseries Power BI is ... not optimal. Earlier I have used Grafana with various backends, and quite like it, but I can't find any way to con...

Latest Reply
frugson
New Contributor II

@Erik wrote:We have some timeseries in databricks, and we are reading them into powerbi through sql compute endpoints. For timeseries powerbi is ... not optimal. Earlier I have used grafana with various backends, and quite like it, but I cant find an...

12 More Replies