Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

LauJohansson
by Contributor
  • 2179 Views
  • 3 replies
  • 3 kudos

Resolved! Delta live table: Retrieve CDF columns

I want to use the apply_changes feature from a bronze table to a silver table. The bronze table has no "natural" sequence_by column. Therefore, I want to use the CDF column "_commit_timestamp" as the sequence_by. How do I retrieve the columns in ...

Latest Reply
LauJohansson
Contributor
  • 3 kudos

Thank you @raphaelblg! I chose to write an article on the subject after this discussion: https://www.linkedin.com/pulse/databricks-delta-live-tables-merging-lau-johansson-cdtce/?trackingId=L872gj0yQouXgJudM75gdw%3D%3D

2 More Replies
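For readers landing on this thread: a minimal sketch of using the CDF `_commit_timestamp` as the `sequence_by` column in `apply_changes` might look like the following. This is a sketch built on assumptions, not the linked article's exact code; the table names, key column, and view name are placeholders, and change data feed is assumed to be enabled on the bronze table.

```python
# Hedged sketch: feed the bronze table's change feed into apply_changes,
# ordering by _commit_timestamp. Names below are placeholders.
import dlt
from pyspark.sql.functions import col

@dlt.view
def bronze_changes():
    return (
        spark.readStream
        .option("readChangeFeed", "true")  # exposes _commit_timestamp, _change_type, ...
        .table("catalog.schema.bronze")    # placeholder bronze table
    )

dlt.create_streaming_table("silver")

dlt.apply_changes(
    target="silver",
    source="bronze_changes",
    keys=["id"],                           # placeholder primary key
    sequence_by=col("_commit_timestamp"),  # CDF commit time as the ordering column
    except_column_list=["_change_type", "_commit_version", "_commit_timestamp"],
)
```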
BillMarshall
by New Contributor
  • 3725 Views
  • 2 replies
  • 0 kudos

workflow permissions errors

I have a notebook that outputs an Excel file. Through trial and error, and after consulting various forums, I discovered that the .xlsx file needed to be written to a temp file and then copied to the volume in Unity Catalog. When I run the notebook by...

Latest Reply
emora
New Contributor III
  • 0 kudos

Hello, yes, you need to write the Excel file to the tmp folder first, but then you can move it wherever you want without a problem. In my current project we implemented this method to create the file in the tmp folder, and then move it to one spe...

1 More Replies
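The write-to-tmp-then-copy pattern discussed in this thread can be sketched as below. This is a generic illustration, not the poster's notebook: the payload and paths are placeholders, `shutil.copy` stands in for the copy into the Unity Catalog volume (on Databricks you would typically use `dbutils.fs.cp` with a `/Volumes/...` destination), and the binary write stands in for `df.to_excel(...)`.

```python
# Sketch of the pattern: write the file to local tmp storage first, then
# copy it to the destination directory (a UC volume path on Databricks).
import os
import shutil
import tempfile

def write_then_copy(data: bytes, volume_dir: str, filename: str) -> str:
    """Write bytes to a local tmp file, then copy the file into volume_dir."""
    tmp_path = os.path.join(tempfile.gettempdir(), filename)
    with open(tmp_path, "wb") as f:
        f.write(data)  # in the real notebook: df.to_excel(tmp_path)
    dest = os.path.join(volume_dir, filename)
    shutil.copy(tmp_path, dest)  # on Databricks: dbutils.fs.cp(f"file:{tmp_path}", dest)
    return dest

# Example with a stand-in "volume" directory:
target_dir = tempfile.mkdtemp()
result = write_then_copy(b"fake xlsx bytes", target_dir, "report.xlsx")
print(os.path.exists(result))
```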
Subhasis
by New Contributor III
  • 2319 Views
  • 5 replies
  • 0 kudos

Autoloader checkpoint fails, and after changing the checkpoint path all data has to be reloaded

The Autoloader checkpoint fails, and after changing the checkpoint path I need to reload all data. I want to load only the data that has not been processed; I don't want to reload all the data.

Latest Reply
Subhasis
New Contributor III
  • 0 kudos

Does the checkpoint have some benchmark capacity, after which it stops writing data?

4 More Replies
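One option sometimes used in this situation (a sketch under assumptions, not a confirmed fix for the poster's pipeline): when a checkpoint is lost and a new checkpoint path is used, Auto Loader's `modifiedAfter` option can limit the initial backfill to files modified after a cutoff, so already-processed files are skipped. The paths, format, and timestamp below are placeholders.

```python
# Hedged sketch: fresh checkpoint, but restrict discovery to recent files
# so the stream does not reprocess everything from the landing path.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                  # placeholder format
    .option("modifiedAfter", "2024-09-01 00:00:00")       # cutoff: skip older files
    .load("/mnt/landing/")                                # placeholder source path
)

(
    df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/new_path/")  # new checkpoint
    .toTable("catalog.schema.bronze")                             # placeholder target
)
```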
SowmyaDesai
by New Contributor II
  • 2340 Views
  • 3 replies
  • 2 kudos

Run pyspark queries from outside databricks

I have written a notebook that executes a PySpark query. I then trigger it remotely from outside the Databricks environment using /api/2.1/jobs/run-now, which runs the notebook. I also want to retrieve the results from this job execution. H...

Latest Reply
SowmyaDesai
New Contributor II
  • 2 kudos

Thanks for responding. I did go through this link. It talks about executing on a SQL warehouse, though. Is there a way we can execute queries on Databricks clusters instead? Databricks has this connector for SQL: https://docs.databricks.com/en/dev-tools/p...

2 More Replies
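For the retrieval part of the question, one common pattern (a sketch; the host, token, and IDs are placeholders) is to trigger the run with `/api/2.1/jobs/run-now` and then read the notebook's output from `/api/2.1/jobs/runs/get-output`, which surfaces whatever the notebook passed to `dbutils.notebook.exit(...)`. Only the request construction is shown here; no HTTP call is made.

```python
# Sketch of the two Jobs API calls involved: run-now to start the job,
# runs/get-output to fetch the notebook's exit value once the run finishes.
import json
from urllib import request

HOST = "https://<workspace>.cloud.databricks.com"  # placeholder workspace URL

def run_now_request(job_id: int, params: dict) -> request.Request:
    """Build the POST request for /api/2.1/jobs/run-now."""
    body = json.dumps({"job_id": job_id, "notebook_params": params}).encode()
    return request.Request(
        f"{HOST}/api/2.1/jobs/run-now",
        data=body,
        headers={"Authorization": "Bearer <token>",  # placeholder PAT
                 "Content-Type": "application/json"},
    )

def get_output_url(run_id: int) -> str:
    """runs/get-output returns the value passed to dbutils.notebook.exit()."""
    return f"{HOST}/api/2.1/jobs/runs/get-output?run_id={run_id}"

req = run_now_request(123, {"date": "2024-09-01"})
print(req.full_url, get_output_url(456))
```

In practice you would poll `/api/2.1/jobs/runs/get` until the run reaches a terminal state before calling runs/get-output.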
FrancisApel
by New Contributor II
  • 9594 Views
  • 4 replies
  • 0 kudos

[TASK_WRITE_FAILED] Task failed while writing rows to abfss

I am trying to insert into an already created Delta table in Unity Catalog. I am getting the error: [TASK_WRITE_FAILED] Task failed while writing rows to abfss://xxxx@xxxxxxxxxxxxxxxx.dfs.core.windows.net/__unitystorage/catalogs/xxxxxxxx-c6c8-45d8-ac3...

Latest Reply
NikunjKakadiya
New Contributor II
  • 0 kudos

Any chance this issue got resolved? I am also seeing the same error when I am trying to incrementally read the system tables using the readStream method and write them using the writeStream method. This generally happens for the audit table, but other t...

3 More Replies
Gilg
by Contributor II
  • 5010 Views
  • 1 replies
  • 0 kudos

DLT: Waiting for resources took a long time

Hi Team, I have a DLT pipeline that has been running in production for quite some time now. When I check the pipeline, a couple of jobs took longer than expected. Usually, one job only takes 10-15 minutes to complete, with 2 to 3 minutes to provision a resource. Then I ha...

Latest Reply
speaker_city
New Contributor II
  • 0 kudos

I am currently trying projects from dbdemos [Full Delta Live Tables Pipeline - Loan]. I keep running into this error. How do I resolve this?

Saf4Databricks
by New Contributor III
  • 1541 Views
  • 2 replies
  • 1 kudos

Resolved! Testing PySpark - Document links broken

The top paragraph of this Testing PySpark page from the Apache Spark team states the following, where it points to some links titled 'see here', but no link is provided to click on. Can someone please provide those links the document is referring to...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @Saf4Databricks, sure, here they are:
- To view the docs for the PySpark test utils, see here: spark.apache.org
- To see the code for the PySpark built-in test utils, check out the Spark repository: pyspark.testing.utils — PySpark 3.5.2 documentation (apache....

1 More Replies
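For context on what those utilities do: a minimal usage example of `assertDataFrameEqual` (available from PySpark 3.5) might look like the following. This assumes a local SparkSession and made-up data.

```python
# Minimal example of the PySpark testing utility the links above point to.
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

spark = SparkSession.builder.master("local[1]").getOrCreate()

expected = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
actual = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

assertDataFrameEqual(actual, expected)  # raises AssertionError on mismatch
```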
hanish
by New Contributor II
  • 5372 Views
  • 5 replies
  • 2 kudos

Job cluster support in jobs/runs/submit API

We are using the jobs/runs/submit API of Databricks to create and trigger a one-time run with new_cluster and existing_cluster configurations. We would like to check if there is a provision to pass "job_clusters" in this API to reuse the same cluster across...

Latest Reply
Nagrjuna
New Contributor II
  • 2 kudos

Hi, any update on the above-mentioned issue? We are unable to submit a one-time job run (api/2.0 or 2.1 jobs/runs/submit) with a shared job cluster; either that, or one new cluster has to be used for all tasks in the job.

4 More Replies
sakuraDev
by New Contributor II
  • 1960 Views
  • 1 replies
  • 1 kudos

Resolved! schema is not enforced when using autoloader

Hi everyone, I am currently trying to enforce the following schema:
StructType([
    StructField("site", StringType(), True),
    StructField("meter", StringType(), True),
    StructField("device_time", StringType(), True),
    StructField("data", St...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @sakuraDev, I'm afraid your assumption is wrong. Here you define the data field as a struct type, and the result is as expected. Once you have this column as a struct type, you can refer to nested objects using dot notation. So if you would like to get e...

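The dot-notation access described in the reply can be sketched as follows; this assumes a local SparkSession, and the field names are placeholders echoing the truncated schema above.

```python
# Sketch: "data" is declared as a struct, so nested fields are reached
# with data.<field> rather than appearing as top-level columns.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[1]").getOrCreate()

schema = StructType([
    StructField("site", StringType(), True),
    StructField("data", StructType([
        StructField("energy", StringType(), True),  # hypothetical nested field
    ]), True),
])

df = spark.createDataFrame([("site-1", ("42.0",))], schema)
df.select("site", "data.energy").show()  # dot notation into the struct
```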
anirudh286
by New Contributor
  • 2079 Views
  • 2 replies
  • 0 kudos

Info on Databricks AWS High Availability during zone selection

Hi Team, during zone selection in the Databricks environment, there is an option for High Availability (HA), which selects instances from other zones to ensure prolonged uptime. My question is: does the HA option only select instances from other a...

Latest Reply
fredy-herrera
New Contributor II
  • 0 kudos

No, it is not.

1 More Replies
delson
by New Contributor II
  • 1702 Views
  • 4 replies
  • 0 kudos

Data Ingestion from GCP

Hi, I'm ingesting data from GCP into Databricks, and I think I've noticed a bug: any data tables whose names start with a numerical character are not ingested at all. Has anyone else experienced this? Please let me know if there is a way around this apa...

Latest Reply
delson
New Contributor II
  • 0 kudos

Hi Slash, thanks for getting back to me. For instance, I have data tables such as "20240901_demographics_data_v1" that I'm trying to move from BQ. Other data tables that don't include a date (or other numerical characters) at the front are being ing...

3 More Replies
drag7ter
by Contributor
  • 6118 Views
  • 4 replies
  • 1 kudos

Resolved! Not able to set run_as service_principal_name

I'm trying to run: databricks bundle deploy -t prod --profile PROD_Service_Principal
My bundle looks like:
bundle:
  name: myproject
include:
  - resources/jobs/bundles/*.yml
targets:
  # The 'dev' target, for development purposes. This target is the de...

Latest Reply
reidwil
New Contributor II
  • 1 kudos

Building on this situation, I am seeing that if I deploy a job using a service principal this way, something gets prepended to the job name, like `[dev f46583c2_8c9e_499f_8d41_823332bfd4473]`. Is there a different way for me via bundling to change this...

3 More Replies
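For reference, a target-level `run_as` block along the lines the original question is after might look like this. This is a sketch of one way to structure it in `databricks.yml`; the application ID is a placeholder.

```yaml
# databricks.yml (sketch): run the prod target's resources as a service principal.
targets:
  prod:
    mode: production
    run_as:
      service_principal_name: "00000000-0000-0000-0000-000000000000"  # SP application ID
```

Note that deploying a target with `mode: development` is what prepends the `[dev ...]` prefix to resource names; production-mode targets do not get that prefix.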
anshi_t_k
by New Contributor III
  • 2842 Views
  • 3 replies
  • 1 kudos

Data engineering professional exam

Each configuration below is identical in that each cluster has 400 GB total of RAM, 160 total cores, and only one executor per VM. Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to...

Latest Reply
filipniziol
Esteemed Contributor
  • 1 kudos

Hi @anshi_t_k, the key consideration here is fault tolerance. How do you protect against a VM failure? By having more VMs, since the impact of a single VM failure is then the lowest. For example, with answer C, the crash of one VM loses 1/1, so 100% of capa...

2 More Replies
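The reasoning in the reply can be made concrete with some arithmetic. The VM counts below are hypothetical, since the original answer options are truncated; the point is only that with total capacity fixed, the fraction lost to a single VM crash shrinks as the VM count grows.

```python
# Illustrative arithmetic: fraction of a fixed-size cluster (e.g. 400 GB RAM /
# 160 cores split across identical VMs) lost when a single VM crashes.
def capacity_lost_fraction(num_vms: int) -> float:
    """With identical VMs, one crash removes 1/num_vms of total capacity."""
    return 1 / num_vms

for vms in (1, 2, 4, 8):
    print(f"{vms} VM(s): one crash loses {capacity_lost_fraction(vms):.0%} of capacity")
```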
Stellar
by New Contributor II
  • 6030 Views
  • 1 replies
  • 0 kudos

Databricks CI/CD Azure Devops

Hi all, I am looking for advice on what would be the best approach to CI/CD in Databricks and repos in general. Would it be best to have a main branch and branch off of it, or something else? How will changes be propagated from dev to QA an...

Harsha777
by New Contributor III
  • 3034 Views
  • 3 replies
  • 2 kudos

Resolved! Sub-Query behavior in sql statements

Hi Team, I have a query with the construct below in my project:
SELECT count(*) FROM `catalog`.`schema`.`t_table`
WHERE _col_check IN (SELECT DISTINCT _col_check FROM `catalog`.`schema`.`t_check_table`)
Actually, there is no column "_col_check" in the sub-que...

Latest Reply
filipniziol
Esteemed Contributor
  • 2 kudos

Hi @Harsha777, what occurs here is called column shadowing. The column names in the main query and the sub-query are identical, and the Databricks engine, after not finding the column in the sub-query, searches for it in the main query. The simplest way to avoid the...

2 More Replies
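The shadowing behavior described in the reply is standard SQL scoping rather than anything Databricks-specific, so it can be reproduced with sqlite3 from the Python standard library. The table and column names below are made up for the demo; the shape mirrors the thread's query.

```python
# Demo of column shadowing: the sub-query's table has no column named
# "check_col", so the engine resolves it against the OUTER table, turning
# the filter into "check_col IN (check_col)" -- true for every non-null row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_table (check_col INTEGER)")
conn.execute("CREATE TABLE t_check_table (other_col INTEGER)")
conn.executemany("INSERT INTO t_table VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO t_check_table VALUES (99)")

# Runs without error even though t_check_table has no check_col:
shadowed = conn.execute(
    "SELECT count(*) FROM t_table "
    "WHERE check_col IN (SELECT DISTINCT check_col FROM t_check_table)"
).fetchone()[0]

# Qualifying the column with the sub-query's alias surfaces the mistake:
try:
    conn.execute(
        "SELECT count(*) FROM t_table WHERE check_col IN "
        "(SELECT DISTINCT c.check_col FROM t_check_table AS c)"
    )
    qualified_errors = False
except sqlite3.OperationalError:
    qualified_errors = True

print(shadowed, qualified_errors)  # 3 True
```

Qualifying every column in the sub-query with a table alias is the usual defense, as the reply suggests.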