Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

abhijitnag
by New Contributor II
  • 864 Views
  • 2 replies
  • 0 kudos

Materialized view creation not supported from DLT pipeline

Hi Team, I have a very basic scenario where I am using my custom catalog and want a materialized view to be created from a DLT table at the end of the pipeline. The SQL used for this is as below, where "loom_data_transform" is a streaming table. But the pipeline...

Data Engineering
Delta Live Table
dlt
Unity Catalog
Latest Reply
warsamebashir
New Contributor II
  • 0 kudos

Hey @abhijitnag, are you sure your loom_data_transform was created as a STREAMING table? Docs: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-streaming-table.html

  • 0 kudos
1 More Replies
naveenprasanth
by New Contributor
  • 1259 Views
  • 1 reply
  • 1 kudos

Issue with Reading MongoDB Data in Unity Catalog Cluster

I am encountering an issue while trying to read data from MongoDB in a Unity Catalog Cluster using PySpark. I have shared my code below: from pyspark.sql import SparkSession database = "cloud" collection = "data" Scope = "XXXXXXXX" Key = "XXXXXX-YYY...

Data Engineering
mongodb
spark config
Spark Connector package
Unity Catalog
Latest Reply
Wojciech_BUK
Valued Contributor III
  • 1 kudos

A few points: 1. Check whether you installed exactly the same driver version that you reference in code (2.12:3.2.0); it has to match 100%: org.mongodb.spark:mongo-spark-connector_2.12:3.2.0. 2. I have seen people configuring the connection to Atlas in two way...
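The version-match point above can be illustrated with a small, purely illustrative helper that splits a Maven coordinate into its parts and compares the installed and referenced connector, Scala suffix included:

```python
# Sketch: verify that the connector coordinate installed on the cluster
# matches the one referenced in code, including the Scala version suffix.
# This is plain string handling, not a Databricks API.

def parse_connector(coord):
    """Split 'group:artifact_scalaVer:version' into its parts."""
    group, artifact, version = coord.split(":")
    name, scala_ver = artifact.rsplit("_", 1)
    return {"group": group, "artifact": name,
            "scala": scala_ver, "version": version}

installed = parse_connector("org.mongodb.spark:mongo-spark-connector_2.12:3.2.0")
referenced = parse_connector("org.mongodb.spark:mongo-spark-connector_2.12:3.2.0")
matches = installed == referenced
```

A mismatch in either the Scala suffix (`2.12` vs `2.13`) or the version (`3.2.0` vs anything else) would make `matches` false, which is the failure mode the reply warns about.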

  • 1 kudos
dzmitry_tt
by New Contributor
  • 1160 Views
  • 1 reply
  • 0 kudos

DeltaRuntimeException: Keeping the source of the MERGE statement materialized has failed repeatedly.

I'm using Autoloader (in Azure Databricks) to read parquet files and write their data into the Delta table. schemaEvolutionMode is set to 'rescue'. In foreach_batch I do: 1) transform the read dataframe; 2) create a temp view based on the read dataframe and merg...

Data Engineering
autoloader
MERGE
streaming
Latest Reply
Wojciech_BUK
Valued Contributor III
  • 0 kudos

Hmm, you can't have duplicated data in the source dataframe/batch, but that should error out with a different error, like "Cannot perform Merge as multiple source rows matched and attempted to modify the same target row...". Also, this behaviour after a rerun is str...
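When duplicate source rows do break a MERGE, the usual fix is to deduplicate each micro-batch on the merge key first, keeping the newest row per key. A minimal plain-Python sketch of that logic (in PySpark this would typically be a Window plus row_number(); the field names here are hypothetical):

```python
# Sketch: deduplicate a micro-batch on the merge key, keeping the newest
# row per key, before handing it to a MERGE. Shown in plain Python for
# clarity; "id" and "event_ts" are illustrative column names.

def dedupe_latest(rows, key="id", ts="event_ts"):
    """Keep only the most recent row per merge key."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

batch = [
    {"id": 1, "event_ts": 10, "value": "a"},
    {"id": 1, "event_ts": 20, "value": "b"},  # newer duplicate of id=1
    {"id": 2, "event_ts": 5,  "value": "c"},
]
deduped = sorted(dedupe_latest(batch), key=lambda r: r["id"])
```

In PySpark the same idea is usually expressed as `row_number().over(Window.partitionBy("id").orderBy(col("event_ts").desc())) == 1` inside the foreachBatch function, before the MERGE.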

  • 0 kudos
EDDatabricks
by Contributor
  • 1203 Views
  • 1 reply
  • 0 kudos

Slow stream static join in Spark Structured Streaming

Situation: records are streamed from an input Delta table via a Spark Structured Streaming job. The streaming job performs the following: read from the input Delta table (readStream); static join on a small JSON; static join on a big Delta table; write to three Delta...

Data Engineering
Azure Databricks
optimization
Spark Structured Streaming
Stream static join
Latest Reply
Wojciech_BUK
Valued Contributor III
  • 0 kudos

The machines you are using are quite small; please take into consideration that a lot of a machine's memory is occupied by other processes: https://kb.databricks.com/clusters/spark-shows-less-memory. It is not a good idea to broadcast a huge data fra...

  • 0 kudos
Erik
by Valued Contributor II
  • 8378 Views
  • 6 replies
  • 3 kudos

Resolved! How to run code-formatting on the notebooks

Has anyone found a nice way to run code-formatting (like black) on the notebooks **in the workspace**? My current workflow is to commit the file, pull it locally, format, re-push and pull. It would be nice if there were some relatively easy way to run blac...

Latest Reply
MartinPlay01
New Contributor II
  • 3 kudos

Hi Erik, I don't know if you are aware of this feature: currently there is an option to format the code in your Databricks notebooks using the black code style formatter. You just need to have a DBR version equal to or greater than 11.2 ...

  • 3 kudos
5 More Replies
XClar_40456
by New Contributor
  • 1343 Views
  • 2 replies
  • 1 kudos

Resolved! Are there system tables that are customer accessible for setting up job run health monitoring in GCP Databricks?

Is Overwatch still an active project? Is there anything equivalent for GCP Databricks, or any plans for Overwatch to be available in GCP?

Latest Reply
SriramMohanty
New Contributor III
  • 1 kudos

Yes, Overwatch supports GCP.

  • 1 kudos
1 More Replies
rt-slowth
by Contributor
  • 371 Views
  • 0 replies
  • 0 kudos

Help design my streaming pipeline

### Data source: AWS RDS; database migration tasks have been created using AWS DMS; the relevant CDC information is stored in a specific bucket in S3. ### Data frequency: once a day (but not sure when; sometime after 6 PM). ### Development environment: d...

RabahO
by New Contributor III
  • 1292 Views
  • 1 reply
  • 0 kudos

Handling data close to SCD2 with Delta tables

Hello, the stack used is PySpark and Delta tables. I'm working with some data that looks a bit like SCD2 data. Basically, the data has columns that represent an id, a rank column and other information; here's an example: login, email, business_timestamp => the...

Latest Reply
Wojciech_BUK
Valued Contributor III
  • 0 kudos

Your problem is exactly SCD2. You just add one more column with a valid-to date (optionally, you can add an is-current flag to tag current records). You can use the DLT APPLY CHANGES syntax, or alternatively a MERGE statement. On top of that table you can bu...
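The close-out-and-append step described above can be sketched in plain Python; the column names `valid_from`/`valid_to`/`is_current` are illustrative, not a fixed schema:

```python
from datetime import date

# Sketch of SCD2 maintenance on a list of dict "rows": close out the
# current version of a key (set valid_to, clear the current flag) and
# append the new version as the open record.

def scd2_apply(history, incoming, key="login", ts="business_timestamp"):
    for row in history:
        if row[key] == incoming[key] and row["is_current"]:
            row["valid_to"] = incoming[ts]   # close the old version
            row["is_current"] = False
    history.append({**incoming,
                    "valid_from": incoming[ts],
                    "valid_to": None,        # open-ended current record
                    "is_current": True})
    return history

hist = [{"login": "jdoe", "email": "old@x.com",
         "business_timestamp": date(2023, 1, 1),
         "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True}]
hist = scd2_apply(hist, {"login": "jdoe", "email": "new@x.com",
                         "business_timestamp": date(2023, 6, 1)})
```

On Databricks the same effect comes from DLT `APPLY CHANGES ... STORED AS SCD TYPE 2` or a hand-written MERGE, as the reply notes.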

  • 0 kudos
lorenz
by New Contributor III
  • 6140 Views
  • 3 replies
  • 1 kudos

Resolved! Databricks approaches to CDC

I'm interested in learning more about Change Data Capture (CDC) approaches with Databricks. Can anyone provide insights on the best practices and recommendations for utilizing CDC effectively in Databricks? Are there any specific connectors or tools ...

Latest Reply
jcozar
Contributor
  • 1 kudos

Hi, first of all, thank you all in advance! I am very interested in this topic! My question goes beyond what is described here. Like @Pektas, I am using Debezium to send data from Postgres to a Kafka topic (in fact, Azure EventHub). My question...
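For context on consuming such a feed: a Debezium change event wraps each row change in an envelope with `op` ("c"=create, "u"=update, "d"=delete), `before`, and `after`. A small sketch of routing one event; the envelope shape is standard Debezium, but the field names inside `after` are made up for illustration:

```python
import json

# A minimal Debezium-style change event and a router that turns it into
# an upsert/delete action, the typical first step before a MERGE.

event = json.loads("""
{"payload": {"op": "u",
             "before": {"id": 7, "email": "old@x.com"},
             "after":  {"id": 7, "email": "new@x.com"},
             "ts_ms": 1703760391974}}
""")

def route(evt):
    p = evt["payload"]
    if p["op"] == "d":
        return ("delete", p["before"])   # deletes carry the old image
    return ("upsert", p["after"])        # creates/updates carry the new one

action, row = route(event)
```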

  • 1 kudos
2 More Replies
Aidin
by New Contributor II
  • 3829 Views
  • 4 replies
  • 0 kudos

BINARY data type

Hello everyone. I'm trying to understand how the BINARY data type works in Spark SQL. According to the examples in the documentation, using cast or the literal 'X' should return the HEX representation of the binary data type, but when I try the same code, I see base6...

Latest Reply
Wojciech_BUK
Valued Contributor III
  • 0 kudos

If you are confused, please look at this thread; it explains that Databricks uses base64 as the default binary representation. This is not documented but can be traced at the source-code level: https://stackoverflow.com/questions/75753311/not-getting-binary-value-in-datab...
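The difference is easy to reproduce outside Spark: the same three bytes rendered as HEX (what the Spark SQL docs show) versus base64 (what Databricks displays, per the thread above):

```python
import base64

# One binary value, two textual renderings.
raw = b"ABC"

hex_repr = raw.hex().upper()                      # HEX rendering
b64_repr = base64.b64encode(raw).decode("ascii")  # base64 rendering
```

So seeing `QUJD` where you expected `414243` does not mean the stored bytes differ, only the display encoding.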

  • 0 kudos
3 More Replies
sahesh1320
by New Contributor
  • 467 Views
  • 1 reply
  • 0 kudos

Shutdown Cluster in script if there is any failure

I am working on an incremental load from SQL Server to Delta Lake tables stored in ADLS Gen2. During the script I need to write logic to shut down the DB cluster on failure (there needs to be logging added to ensure that the shutdown happens promptly to pr...

Latest Reply
Wojciech_BUK
Valued Contributor III
  • 0 kudos

If you run your notebook via a workflow and an error happens and there are no retries on the job, then the job cluster will be terminated immediately after the failure. You can add a Python try/except block; if an error occurs, you catch the error and log it somewhere bef...
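The try/except pattern described above can be sketched like this; `log_error` and `shutdown_cluster` are hypothetical stand-ins (on Databricks the shutdown would go through the Clusters API, or you simply rely on a job cluster terminating on failure):

```python
# Sketch: run the load, and on failure persist the error first, then
# request shutdown, then re-raise so the job still reports as failed.

def run_with_shutdown(task, log_error, shutdown_cluster):
    try:
        return task()
    except Exception as exc:
        log_error(str(exc))    # log before terminating anything
        shutdown_cluster()     # then shut down promptly
        raise

# Demonstration with stub callbacks recording what happened, in order.
events = []

def failing_task():
    raise RuntimeError("incremental load failed")

try:
    run_with_shutdown(failing_task,
                      log_error=lambda m: events.append(("log", m)),
                      shutdown_cluster=lambda: events.append(("shutdown",)))
except RuntimeError:
    pass  # the job framework would see this failure
```

Ordering matters: logging before shutdown guarantees the failure reason survives even if termination is immediate.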

  • 0 kudos
dbx-user7354
by New Contributor III
  • 666 Views
  • 1 reply
  • 0 kudos

Remove description from job

How do I remove a description from a job completely? When I try to just remove the text in the edit window, the same text shows up afterwards, even though it says "Successfully updated job". Also I had to write this twice, because on the first try I ...

Latest Reply
Wojciech_BUK
Valued Contributor III
  • 0 kudos

Hi, this is not possible from the UI; you have to replace the content with e.g. whitespace. I think this is a bug. But you can do it using the Jobs API! Below is an example in PowerShell; just replace job_id, token, and workspaceURL: $body = @' { "job_id": 123456789, "new_setti...
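A Python equivalent of the same approach, sketched against the Jobs API 2.1 request shape; the job id, token, and workspace URL are placeholders, and whether a blank description "sticks" is exactly the bug under discussion:

```python
import json

# Sketch: build the body for POST /api/2.1/jobs/update, setting the
# description to a single space (the whitespace workaround above).
# job_id is a placeholder; the HTTP call itself is left commented out.

job_id = 123456789
body = {"job_id": job_id, "new_settings": {"description": " "}}
payload = json.dumps(body)

# requests.post(f"{workspace_url}/api/2.1/jobs/update",
#               headers={"Authorization": f"Bearer {token}"},
#               data=payload)
```

Using `jobs/update` (partial update) rather than `jobs/reset` avoids having to re-send the whole job specification just to touch one field.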

  • 0 kudos
ksenija
by Contributor
  • 2937 Views
  • 5 replies
  • 5 kudos

How to change cluster size using a script

I want to change the instance type or the number of max workers via a Python script. Does anyone know how to do it, or whether it is possible? I have a lot of background jobs when I want to scale down my workers, so autoscaling is not an option. I was getting an error t...

Latest Reply
Wojciech_BUK
Valued Contributor III
  • 5 kudos

Hi ksenija, this is just my guess, but maybe you are using a cluster policy on your cluster that only allows specific cluster sizes? E.g. a cluster policy like the one below that limits you to certain cluster sizes only.
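Separately, resizing from a script normally goes through the Clusters edit API. A sketch of building the request body (the cluster id, node type, and spark_version values are placeholders; note that `clusters/edit` expects the full cluster spec, restarts the cluster, and will be rejected if a cluster policy forbids the requested size):

```python
import json

# Sketch: construct the body for POST /api/2.0/clusters/edit to change
# node type and autoscale bounds. All identifiers below are placeholders.

def edit_cluster_body(cluster_id, node_type, min_workers, max_workers):
    return {
        "cluster_id": cluster_id,
        "spark_version": "13.3.x-scala2.12",   # edit requires the full spec
        "node_type_id": node_type,
        "autoscale": {"min_workers": min_workers,
                      "max_workers": max_workers},
    }

body = edit_cluster_body("0101-123456-abcd123", "Standard_DS4_v2", 2, 4)
payload = json.dumps(body)

# requests.post(f"{host}/api/2.0/clusters/edit", headers=..., data=payload)
```

For a fixed-size cluster you would send `"num_workers": n` instead of the `autoscale` block.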

  • 5 kudos
4 More Replies
SamGreene
by Contributor
  • 3161 Views
  • 6 replies
  • 3 kudos

Change DLT table type from streaming to 'normal'

I have a DLT streaming live table, and after watching a QA session, I saw that it is advised to only use streaming tables for your raw landing.  I attempted to modify my pipeline to have my silver table be a regular LIVE TABLE, but an error was throw...

Latest Reply
quakenbush
Contributor
  • 3 kudos

Just curious, could you point me to said QA session if it's a video or something? I'm not aware of such a limitation. You can use DLT's live streaming tables anywhere in the Medallion architecture, just make sure not to break stream composability by ...

  • 3 kudos
5 More Replies