Why is the DLT pipeline processing streaming data so slow?
Running a single table is fast, but running 80 tables at the same time takes a long time. Is it serial, queued execution? Isn't it concurrent?
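Within one DLT pipeline, all table definitions are declared up front and the framework schedules independent flows concurrently, but only as far as the pipeline cluster's capacity allows, so 80 tables can still appear to queue. A minimal sketch of declaring many tables in a single pipeline, with made-up source and table names, might look like this:

```python
# Hypothetical sketch: declaring many DLT tables in one pipeline definition.
# Source/table names are placeholders; `spark` is the ambient notebook session.
# Independent tables can refresh in parallel, bounded by cluster capacity.
import dlt

source_tables = [f"source_{i}" for i in range(80)]  # placeholder source names

def make_table(name):
    @dlt.table(name=f"bronze_{name}")
    def _bronze():
        # each table streams from its own source; DLT decides the schedule
        return spark.readStream.table(f"landing.{name}")
    return _bronze

for name in source_tables:
    make_table(name)
```

If the tables really do run one after another, it is usually worth checking the pipeline cluster size and autoscaling settings rather than the table definitions.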
Hi, I am trying to use a UDF to get the last day of the month and use the boolean result of the function in an insert command. Please find herewith the function and my query. Function: import calendar; from datetime import datetime, date, timedelta; def...
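The post is cut off, but a minimal sketch of that kind of check, assuming the goal is a boolean "is this the last day of the month" usable from SQL, could look like the following (names are illustrative, not from the original post):

```python
# Hedged sketch: a last-day-of-month check registered as a Spark UDF.
import calendar
from datetime import date

from pyspark.sql.types import BooleanType

def is_last_day_of_month(d: date) -> bool:
    # calendar.monthrange returns (weekday_of_first_day, days_in_month)
    return d is not None and d.day == calendar.monthrange(d.year, d.month)[1]

spark.udf.register("is_last_day_of_month", is_last_day_of_month, BooleanType())

# Example use inside an INSERT, filtering to month-end rows (hypothetical tables):
# spark.sql("INSERT INTO tgt SELECT * FROM src WHERE is_last_day_of_month(dt)")
```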
Can anyone explain in layman's terms what the difference is between streaming and a streaming live table?
Streaming, in a broad sense, refers to the continuous flow of data over a network. It allows you to watch or listen to content in real-time without having to download the entire file first. A "Streaming Live Table" might refer to a specific type of ...
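To make the second half concrete: in Delta Live Tables, a streaming (live) table is simply a table whose definition reads from a streaming source, so the pipeline keeps it incrementally up to date. A hedged sketch with placeholder names:

```python
# Hedged sketch of a DLT streaming table; source and table names are placeholders.
import dlt

@dlt.table(name="events_live")
def events_live():
    # readStream makes this an incremental (streaming) table: DLT appends new
    # source rows on each update instead of recomputing the whole table.
    return spark.readStream.table("bronze.events")
```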
I am creating a data frame by reading a table's data residing in an Azure-backed Unity Catalog. I need to write the df or file to a GCS bucket. I have configured the Spark cluster config using the GCP service account JSON values. On running: df1.write.for...
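A hedged sketch of the usual shape of this, with placeholder bucket, table, and key names; the service-account settings normally live in the cluster's Spark config (ideally as secret references) rather than in the notebook:

```python
# Hedged sketch: write a DataFrame read from a UC table out to a GCS bucket.
# Cluster Spark config typically carries the GCS service-account settings, e.g.:
#   spark.hadoop.google.cloud.auth.service.account.enable true
#   spark.hadoop.fs.gs.auth.service.account.email <client_email>
#   spark.hadoop.fs.gs.project.id <project_id>
#   spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
#   spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
df1 = spark.read.table("my_catalog.my_schema.my_table")  # hypothetical UC table
df1.write.format("parquet").mode("overwrite").save("gs://my-bucket/export/")
```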
Hi, is there any Terraform resource to apply this GRANT, or does this always have to be done manually?
My job was not able to start because I got this problem in the job cluster. This job is running on an Azure Databricks workspace that has been deployed for almost a year, and I have not had this error before. It is deployed in North Europe. After getting...
We have had the same problem occurring randomly in two workspaces since yesterday. The cluster started fine this morning at 08:00, but has been failing again since around 09:00.
Hi, absolute Databricks noob here, but I'm trying to set up a DLT pipeline that processes CDC records from an external SQL Server instance to create a mirrored table in my Databricks Delta lakehouse. For this, I need to do some initial one-time backfi...
So since nobody responded, I decided to try my own suggestion and hack the snapshot data into the table that gathers the change data capture. After some detours I ended up with the notebook as attached. The notebook first creates 2 DLT tables (lookup...
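Both posts are truncated, but the general pattern they describe, an apply_changes target fed by a view that unions the one-time snapshot into the CDC feed, could be sketched roughly like this; the table names, keys, and sequence column are assumptions:

```python
# Hedged sketch of CDC into a mirrored table with a one-time snapshot backfill.
# All names (tables, keys, columns) are placeholders, not from the original post.
import dlt
from pyspark.sql import functions as F

@dlt.view(name="cdc_with_backfill")
def cdc_with_backfill():
    changes = spark.readStream.table("landing.sqlserver_cdc")        # ongoing CDC rows
    snapshot = (spark.readStream.table("landing.initial_snapshot")   # one-time snapshot
                .withColumn("operation", F.lit("INSERT"))
                .withColumn("sequence_num", F.lit(0)))               # sorts before real changes
    return changes.unionByName(snapshot, allowMissingColumns=True)

dlt.create_streaming_table("mirrored_orders")

dlt.apply_changes(
    target="mirrored_orders",
    source="cdc_with_backfill",
    keys=["order_id"],
    sequence_by="sequence_num",
    apply_as_deletes=F.expr("operation = 'DELETE'"),
)
```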
Yes it does! https://databricks.com/session/secured-kerberos-based-spark-notebook-for-data-science
I have the code captured in the screenshot below. When I run it individually it works just fine, but when a job runs it, it fails with 'ResourceNotFound'. Not sure what the issue is... I checked the 'main' branch, which is where this job is pulling f...
Figured it out: ecw_staging_nb_List = ['nb_UPSERT_stg_ecw_insurance', 'nb_UPSERT_stg_ecw_facilitygroups'] works just fine.
I have been able to perform a selective overwrite using replaceWhere to a hive_metastore table, but when I use the same code for the same table in a Unity Catalog, no data is written. Has anyone else had this issue, or are there common mistakes that ar...
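For reference, a hedged sketch of the selective-overwrite pattern being described, with a placeholder table name and predicate; the predicate has to cover every row in the incoming DataFrame or the write is rejected:

```python
# Hedged sketch of a selective overwrite with replaceWhere against a UC table.
# `df`, the table name, and the predicate are placeholders.
(df.write.format("delta")
   .mode("overwrite")
   .option("replaceWhere", "event_date >= '2024-01-01' AND event_date < '2024-02-01'")
   .saveAsTable("my_catalog.my_schema.events"))
```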
I have two GitHub repos configured in the Databricks Repos folder. repo_1 is run using a job, and repo_2 is run/called from repo_1 using the dbutils.notebook.run command: dbutils.notebook.run("/Repos/repo_2/notebooks/notebook", 0, args). I am getting the follo...
I am having a similar issue... ecw_staging_nb_List = ['/Workspace/Repos/PRIMARY/UVVC_DATABRICKS_EDW/silver/nb_UPSERT_stg_ecw_insurance', '/Repos/PRIMARY/UVVC_DATABRICKS_EDW/silver/nb_UPSERT_stg_ecw_facilitygroups'] Adding workspace d...
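Based on the follow-ups above, the resolution seems to have been calling the notebooks by absolute workspace path. A hedged sketch of that loop, reusing the paths quoted above (normalized to the /Workspace prefix) and a made-up arguments dict:

```python
# Hedged sketch: run each Repos notebook by absolute path from the caller notebook.
# Paths come from the post above; the arguments dict is a made-up example.
ecw_staging_nb_List = [
    "/Workspace/Repos/PRIMARY/UVVC_DATABRICKS_EDW/silver/nb_UPSERT_stg_ecw_insurance",
    "/Workspace/Repos/PRIMARY/UVVC_DATABRICKS_EDW/silver/nb_UPSERT_stg_ecw_facilitygroups",
]

for nb in ecw_staging_nb_List:
    # timeout of 0 means no timeout; the dict is passed to the notebook's widgets
    dbutils.notebook.run(nb, 0, {"run_date": "2024-01-01"})
```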
We have a table using the timestampNtz type for a timestamp column, which is also a cluster key for this table using liquid clustering. I ran OPTIMIZE <table-name>, and it failed with the error: Unsupported datatype 'TimestampNTZType'. But the failed optimization also broke ...
Hi, I am trying to export the list of users and groups from Unity Catalog through the Databricks workspace, but I am seeing only the users/groups created inside the workspace instead of the groups and users coming through SCIM in Unity Catalog. How can I ge...
Hello, when you refer to the users and groups in Unity Catalog, do you mean the ones created at the account level? If that is the case, you need to run the API call at the account level and not the workspace level; you can see the API doc for account le...
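A hedged sketch of what that account-level call can look like against the SCIM Users endpoint (Azure accounts host shown; the account ID and token are placeholders and pagination is omitted):

```python
# Hedged sketch: list account-level users via the account SCIM API.
# Replace the host with accounts.cloud.databricks.com on AWS; values are placeholders.
import requests

ACCOUNT_ID = "<databricks-account-id>"
TOKEN = "<account-admin-token>"

resp = requests.get(
    f"https://accounts.azuredatabricks.net/api/2.0/accounts/{ACCOUNT_ID}/scim/v2/Users",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for user in resp.json().get("Resources", []):
    print(user.get("userName"))
```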
I'm using Auto Loader to process any new file or update that arrives in my landing area, and I schedule the job using Databricks Workflows to trigger on file arrival. The issue is that the trigger only executes when new files arrive, not when an existing ...
I don't think you can effectively achieve your goal. While it's theoretically somewhat possible, Databricks documentation says there is no guarantee for correctness - Auto Loader FAQ | Databricks on AWS
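For completeness, Auto Loader does have a cloudFiles.allowOverwrites option that can pick up re-uploaded files, with the correctness caveat from the FAQ linked above. A hedged sketch with placeholder paths, format, and table names:

```python
# Hedged sketch: Auto Loader with allowOverwrites so modified files can be re-read.
# Paths, format, and table names are placeholders; re-processing of overwritten
# files is not guaranteed to be exactly-once, per the Auto Loader FAQ.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.allowOverwrites", "true")
      .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/events")
      .load("/mnt/landing/events/"))

(df.writeStream
   .option("checkpointLocation", "/mnt/landing/_checkpoints/events")
   .trigger(availableNow=True)   # fits a file-arrival-triggered workflow job
   .toTable("bronze.events"))
```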
I am trying to read a CSV file from a storage location using the spark.read function. I am also explicitly passing the schema to the function. However, the data is not loading into the proper columns of the dataframe. Following are the code details: from pyspark....
Hi, I would go with the approach suggested by Thomaz Rossito, but maybe you can give it a try by swapping the struct field order, like the following: schema = StructType([StructField('DA_RATE', DateType(), True), StructField('CURNCY_F', StringTy...
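The reason the field order matters is that, for CSV, an explicit schema is applied by position rather than by name. A hedged sketch of the full read, where the columns after CURNCY_F, the date format, and the path are assumptions:

```python
# Hedged sketch: CSV read with an explicit schema. StructFields must be declared
# in the same order as the columns appear in the file; columns after CURNCY_F
# and the path are made up for illustration.
from pyspark.sql.types import StructType, StructField, StringType, DateType, DecimalType

schema = StructType([
    StructField("DA_RATE", DateType(), True),
    StructField("CURNCY_F", StringType(), True),
    StructField("CURNCY_T", StringType(), True),
    StructField("RATE", DecimalType(18, 6), True),
])

df = (spark.read
      .schema(schema)
      .option("header", "true")            # skip the header row so it is not parsed as data
      .option("dateFormat", "yyyy-MM-dd")  # assumed date format
      .csv("/mnt/landing/fx_rates/"))
```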
We run `OPTIMIZE` on our tables every 24 hours as follows: spark.sql(f'OPTIMIZE {catalog_name}.{schema_name}.`{table_name}`;') This morning one of our hourly jobs started failing on the call to `OPTIMIZE` with the error: org.apache.spark.SparkException...