Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

mjedy7
by New Contributor II
  • 1394 Views
  • 1 replies
  • 0 kudos

Reading two big tables within each forEachBatch processing method

I am reading changes from the CDF with availableOnce=True, processing data from checkpoint to checkpoint. During each batch, I perform transformations, but I also need to read two large tables and one small table. Does Spark read these tables from sc...

Latest Reply
radothede
Valued Contributor II
  • 0 kudos

Hi @mjedy7, for caching in this scenario you could try to leverage persist() and unpersist() for the big table/Spark DataFrame, see here: https://medium.com/@eloutmadiabderrahim/persist-vs-unpersist-in-spark-485694f72452 Try to reduce the amount of da...
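A minimal sketch of the persist()/unpersist() suggestion, assuming a PySpark foreachBatch handler. The table names, the `id` join key, and the split into matched and unmatched rows are hypothetical placeholders, not details from the original post. Caching only pays off when the big table is used by more than one action in the same batch, which is why the sketch reads it twice.

```python
def make_batch_handler(spark, big_table_name, target_table, rejects_table, join_key="id"):
    """Build a foreachBatch handler that caches one large table per micro-batch.

    All table names and the join key are hypothetical placeholders.
    """
    def handle_batch(batch_df, batch_id):
        # persist() only marks the table for caching; the first action materializes it.
        big_df = spark.read.table(big_table_name).persist()
        try:
            # Two separate actions reuse big_df, so the cached copy is read twice.
            matched = batch_df.join(big_df, on=join_key, how="inner")
            unmatched = batch_df.join(big_df, on=join_key, how="left_anti")
            matched.write.format("delta").mode("append").saveAsTable(target_table)
            unmatched.write.format("delta").mode("append").saveAsTable(rejects_table)
        finally:
            # Release the cached blocks even if one of the writes fails.
            big_df.unpersist()

    return handle_batch
```

The returned function would be passed to `writeStream.foreachBatch(...)`. If the big table is used by only one action per batch, the persist() call is likely to add overhead rather than save time.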

hk-modi
by New Contributor
  • 1168 Views
  • 1 replies
  • 0 kudos

Switching to autoloader

I have an S3 bucket that has continuous data being written into it. My script reads these files, parses them and then appends into a delta table. The data dates back to 2022, with millions of files stored in partitions based on year/month/dayO...

Latest Reply
radothede
Valued Contributor II
  • 0 kudos

Hi @hk-modi, as I understand it, you have an existing Delta table with tons of data already processed. You want to switch to Auto Loader, read files, parse them, and process data incrementally into that Delta table as a sink. The task is to start p...

lauraxyz
by Contributor
  • 2248 Views
  • 3 replies
  • 1 kudos

Resolved! Rendering Volumes file content programmatically

Hi there! I have some files stored in a Volume, and my use case requires showing the file content in a UI. Say I have a REST API that already knows the Volume path to the file; is there any built-in feature from Databricks that I can use to he...

Latest Reply
cgrant
Databricks Employee
  • 1 kudos

Hi @lauraxyz, the files API should be helpful, particularly the upload endpoint.

2 More Replies
NehaR
by New Contributor III
  • 1028 Views
  • 1 replies
  • 3 kudos

Cost estimation before query execution, similar to Google Cloud BigQuery's --dry_run

Hi, in Databricks do we have an option to estimate the cost of a query before execution, similar to BigQuery's --dry_run? Our use case is to estimate cost before execution and get alerted. Regards, Neha

Latest Reply
Alberto_Umana
Databricks Employee
  • 3 kudos

Hello @NehaR, Currently, Databricks does not have a direct equivalent to BigQuery's --dry_run feature for estimating the cost of a query before execution. However, there are some mechanisms and ongoing projects that aim to provide similar functionali...

Shivam_Pawar
by New Contributor III
  • 21718 Views
  • 15 replies
  • 5 kudos

Databricks Lakehouse Fundamentals Badge

I have successfully passed the test after completing the course, with 95%. But I haven't received any badge from your side as promised. I have been provided with a certificate which looks fake by itself. I need to post my credentials on LinkedIn wi...

Latest Reply
heybeckerj
New Contributor II
  • 5 kudos

Any feedback on this please? 

14 More Replies
deecee
by Databricks Partner
  • 1734 Views
  • 2 replies
  • 0 kudos

SAS token issue for long running micro-batches

Hi everyone, I'm having an issue with some of our Databricks workloads. We're processing these workloads using the forEachBatch stream processing method. Whenever we perform a full reload on some of our data sources, we get the following error. ...

Data Engineering
azure
Unity Catalog
Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@deecee Can you please confirm there are no external locations or volumes which could lead to this overlap of locations? What do you actually have in "some_catalog.some_schema.some_table" and the "abfss://some-container@somestorageaccount.dfs.core.window...

1 More Replies
kalebkemp
by New Contributor
  • 1487 Views
  • 2 replies
  • 0 kudos

FileReadException error when creating materialized view reading two schemas

Hi all. I'm getting an error `com.databricks.sql.io.FileReadException` when attempting to create a materialized view which reads tables from two different schemas in the same catalog. Is this just a limitation in databricks or do I potentially have s...

Latest Reply
agallard
Contributor
  • 0 kudos

Hi @kalebkemp, the error you're encountering (com.databricks.sql.io.FileReadException) when creating a materialized view that reads from two different schemas in the same catalog might not necessarily be a Databricks limitation. It is more likely rel...

1 More Replies
zmsoft
by Contributor
  • 9042 Views
  • 6 replies
  • 6 kudos

Azure Synapse vs Databricks

Hi there, I would like to know the difference between Azure Databricks and Azure Synapse: for which use cases is Databricks appropriate, and for which is Synapse appropriate? What are the differences in their functions? What are the differences in thei...

Latest Reply
thelogicplus
Contributor II
  • 6 kudos

Share your use case and I will advise you on the technology differences and which option could be beneficial for you. I love Databricks for its many awesome features that help everyone from SQL developers to programmers (Python/Scala) solve their use cases on Databricks. But if you ...

5 More Replies
Pradeep_Namani
by New Contributor III
  • 970 Views
  • 1 replies
  • 0 kudos

Getting different results when running the Global Temporary Table (Transform Query)

We are running a logic in an Azure Databricks notebook using Python with Spark. Initially, we read data from ADLS and load it into a global temporary table to perform data quality checks. We then recreate the same temporary table. Afterward, we use t...

Latest Reply
Pradeep_Namani
New Contributor III
  • 0 kudos

The script is too large to paste in here, so please get in touch with me to obtain it.

sanjay
by Valued Contributor II
  • 36567 Views
  • 21 replies
  • 18 kudos

Resolved! How to limit number of files in each batch in streaming batch processing

Hi, I am running a batch job which processes incoming files. I am trying to limit the number of files in each batch, so I added the maxFilesPerTrigger option. But it's not working; it processes all incoming files at once. (spark.readStream.format("delta").lo...

Latest Reply
mjedy7
New Contributor II
  • 18 kudos

Hi @Sandeep, can we use spark.readStream.format("delta").option("maxBytesPerTrigger", "50G").load(silver_path).writeStream.option("checkpointLocation", gold_checkpoint_path).trigger(availableNow=True).foreachBatch(foreachBatchFunction).start()?
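The pipeline in this reply can be laid out as a small helper, shown here as a sketch rather than a confirmed fix for the thread. It assumes a Delta source, a PySpark version that supports `trigger(availableNow=True)`, and hypothetical parameter names. As far as I know, the availableNow trigger honors rate-limit options such as maxFilesPerTrigger and maxBytesPerTrigger, whereas the older once trigger ignores them and processes everything in a single batch.

```python
def start_rate_limited_stream(spark, source_path, checkpoint_path, process_batch,
                              max_bytes="50g"):
    """Start a Delta stream that drains the backlog in size-limited micro-batches.

    source_path, checkpoint_path and process_batch are hypothetical placeholders
    supplied by the caller.
    """
    return (
        spark.readStream.format("delta")
        # Soft cap on how much data each micro-batch reads.
        .option("maxBytesPerTrigger", max_bytes)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        # Process all data available now in several batches, then stop.
        .trigger(availableNow=True)
        .foreachBatch(process_batch)
        .start()
    )
```

Calling `.awaitTermination()` on the returned query blocks until the backlog has been processed.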

20 More Replies
Jefke
by New Contributor III
  • 4277 Views
  • 5 replies
  • 2 kudos

Resolved! Cloud_files function

Hi, I'm fairly new to Databricks, and in some examples, blogs, etc. I see the cloud_files() function being used. But I'm always unable to find any documentation on it. Is there any reason for this? And what is the exact use case for the function? Most...

Latest Reply
JissMathew
Valued Contributor
  • 2 kudos

Hi @Jefke, the cloud_files() function in Databricks is part of Databricks Auto Loader, a tool used for incremental data ingestion from cloud storage like Azure Blob Storage, Amazon S3, or Google Cloud Storage. This function is specifically optimi...
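For comparison, Auto Loader is usually reached from Python through the `cloudFiles` streaming source rather than the SQL-style cloud_files() function. The sketch below assumes a Databricks runtime where that source is available; the function name and all parameters are hypothetical placeholders.

```python
def read_with_auto_loader(spark, source_path, file_format, schema_location):
    """Return a streaming DataFrame that incrementally picks up new files.

    source_path, file_format and schema_location are hypothetical placeholders.
    """
    return (
        spark.readStream.format("cloudFiles")
        # Format of the incoming files, for example "json", "csv" or "parquet".
        .option("cloudFiles.format", file_format)
        # Where Auto Loader keeps the inferred schema between runs.
        .option("cloudFiles.schemaLocation", schema_location)
        .load(source_path)
    )
```

The result is an ordinary streaming DataFrame, so it can be written out with `writeStream` like any other source.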

4 More Replies
Skully
by New Contributor
  • 1221 Views
  • 1 replies
  • 0 kudos

Workflow Fail safe query

I have a large SQL query that includes multiple Common Table Expressions (CTEs) and joins across various tables, totaling approximately 2,500 lines. I want to ensure that if any part of the query or a specific CTE fails—due to a missing table or colu...

Latest Reply
LingeshK
Databricks Employee
  • 0 kudos

There are a few options you can try. Based on the information shared, I am assuming a skeleton for your complicated query as follows:
WITH cte_one AS (SELECT * FROM view_one),
-- Other CTEs...
-- Your main query logic
SELECT
FROM cte_one
-- Joins and other cl...

Krizofe
by New Contributor II
  • 9699 Views
  • 7 replies
  • 5 kudos

Resolved! Migrating data from synapse to databricks

Hello team, I have a requirement to move all the tables from Azure Synapse (dedicated SQL pool) to Databricks. We have data coming from the source to Azure Data Lake frequently. We have Azure Data Factory to load data (data flow does the basic transfo...

Latest Reply
thelogicplus
Contributor II
  • 5 kudos

Hi @Krizofe, I just went through your details and thought of our similar experience with an Azure Synapse to Databricks migration. We faced a similar situation and were initially hesitant. One of my colleagues recommended using Travinto Technologies acc...

6 More Replies
somedeveloper
by New Contributor III
  • 1996 Views
  • 3 replies
  • 0 kudos

Databricks Setting Dynamic Local Configuration Properties

It seems that Databricks is somehow setting the properties of local spark configurations for each notebook. Can someone point me to exactly how and where this is being done? I would like to set the scheduler to utilize a certain pool by default, but ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

You will need to leverage cluster-level Spark configurations or global init scripts. This will allow you to set the "spark.scheduler.pool" property automatically for all workloads on the cluster. You can try navigating to "Compute", select the cluster y...

2 More Replies
Sega2
by New Contributor III
  • 3078 Views
  • 1 replies
  • 0 kudos

cannot import name 'Buffer' from 'typing_extensions' (/databricks/python/lib/python3.10/site-package

I am trying to add messages to an Azure Service Bus from a notebook, but I get the error from the title. Any suggestions on how to solve this?
import asyncio
from azure.servicebus.aio import ServiceBusClient
from azure.servicebus import ServiceBusMessage
from azure...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@Sega2 it sounds like the error occurs because the typing_extensions library version in your Databricks environment is outdated and does not include the Buffer class, which is being imported by one of the Azure libraries. Can you first try: %pip inst...
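Before upgrading anything, the diagnosis above can be checked with a short standard-library snippet. This is only a sketch: the helper name is hypothetical, and it simply reports which typing_extensions version is installed and whether `Buffer` can be imported from it.

```python
from importlib import metadata


def typing_extensions_status():
    """Return (installed_version_or_None, buffer_is_importable)."""
    try:
        version = metadata.version("typing_extensions")
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment at all.
        return None, False
    try:
        from typing_extensions import Buffer  # noqa: F401
    except ImportError:
        # Installed, but too old to provide Buffer.
        return version, False
    return version, True


version, has_buffer = typing_extensions_status()
print(f"typing_extensions version: {version}, Buffer available: {has_buffer}")
```

If `Buffer available` prints `False`, that supports the explanation that the installed typing_extensions predates the `Buffer` class.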
