Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Gilg
by Contributor II
  • 3941 Views
  • 1 replies
  • 0 kudos

DLT: Waiting for resources took a long time

Hi Team, I have a DLT pipeline that has been running in production for quite some time now. When I check the pipeline, a couple of jobs took longer than expected. Usually, a job takes only 10-15 minutes to complete, with 2 to 3 minutes to provision a resource. Then I ha...

(screenshot attached: Gilg_0-1696540251644.png)
Latest Reply
speaker_city
New Contributor II
  • 0 kudos

I am currently trying projects from dbdemos [Full Delta Live Tables Pipeline - Loan]. I keep running into this error. How do I resolve it?
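On the original provisioning question, one common way to shorten the "Waiting for resources" phase is to point the pipeline clusters at a pre-warmed instance pool. A minimal sketch of the relevant pipeline settings fragment — the pool ID and worker counts below are hypothetical placeholders, not values from this thread:

```json
{
  "clusters": [
    {
      "label": "default",
      "instance_pool_id": "1234-567890-pool123",
      "autoscale": { "min_workers": 1, "max_workers": 3 }
    }
  ]
}
```

Because pool instances are kept idle and ready, attaching the DLT cluster to a pool typically cuts the provisioning step from minutes to seconds, at the cost of paying for the idle pool capacity.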

Saf4Databricks
by New Contributor III
  • 901 Views
  • 2 replies
  • 1 kudos

Resolved! Testing PySpark - Document links broken

The top paragraph of the Testing PySpark page from the Apache Spark team points to some links titled 'see here', but no link is provided to click on. Can someone please provide the links the document is referring to...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @Saf4Databricks, sure, here they are:
- To view the docs for PySpark test utils, see here (spark.apache.org).
- To see the code for PySpark built-in test utils, check out the Spark repository: pyspark.testing.utils — PySpark 3.5.2 documentation (apache....

1 More Replies
hanish
by New Contributor II
  • 3952 Views
  • 5 replies
  • 2 kudos

Job cluster support in jobs/runs/submit API

We are using the jobs/runs/submit API of Databricks to create and trigger a one-time run with new_cluster and existing_cluster configuration. We would like to check if there is a provision to pass "job_clusters" in this API to reuse the same cluster across...

Latest Reply
Nagrjuna
New Contributor II
  • 2 kudos

Hi, any update on the above-mentioned issue? We are unable to submit a one-time job run (api/2.0 or 2.1 jobs/runs/submit) with a shared job cluster; a new cluster has to be used for each task in the job.
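As of Jobs API 2.1, runs/submit does not accept a "job_clusters" block, so one workaround is to create a regular job that declares a shared job cluster and then trigger it with run-now. A hedged sketch of what such a jobs/create payload could look like — the job name, cluster sizing, and notebook paths are illustrative placeholders, not values from this thread:

```python
import json

# Sketch of a jobs/create payload where two tasks reuse one shared job cluster
# via "job_cluster_key"; runs/submit lacks this field, so the job is created
# once and then triggered with run-now.
payload = {
    "name": "one-time-run-with-shared-cluster",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "task_a",
            "job_cluster_key": "shared_cluster",  # first task on the shared cluster
            "notebook_task": {"notebook_path": "/Jobs/task_a"},
        },
        {
            "task_key": "task_b",
            "depends_on": [{"task_key": "task_a"}],
            "job_cluster_key": "shared_cluster",  # second task reuses the same cluster
            "notebook_task": {"notebook_path": "/Jobs/task_b"},
        },
    ],
}

# The actual calls would be POSTs to {host}/api/2.1/jobs/create and then
# {host}/api/2.1/jobs/run-now with the returned job_id (omitted here).
print(json.dumps(payload)[:60])
```

The trade-off is that this leaves a job definition behind rather than a pure one-time run, so a cleanup step (jobs/delete) may be wanted afterwards.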

4 More Replies
sakuraDev
by New Contributor II
  • 680 Views
  • 1 replies
  • 1 kudos

Resolved! schema is not enforced when using autoloader

Hi everyone, I am currently trying to enforce the following schema: StructType([ StructField("site", StringType(), True), StructField("meter", StringType(), True), StructField("device_time", StringType(), True), StructField("data", St...

(screenshot attached: sakuraDev_0-1725389159389.png)
Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @sakuraDev, I'm afraid your assumption is wrong. Here you define the data field as a struct type, and the result is as expected. Once you have this column as a struct type, you can refer to nested objects using dot notation. So if you would like to get e...

anirudh286
by New Contributor
  • 1260 Views
  • 2 replies
  • 0 kudos

Info on Databricks AWS High Availability during zone selection

Hi Team, during zone selection in the Databricks environment, there is an option for High Availability (HA), which selects instances from other zones to ensure prolonged uptimes. My question is: does the HA option only select instances from other a...

Latest Reply
fredy-herrera
New Contributor II
  • 0 kudos

No, it is not.

1 More Replies
delson
by New Contributor II
  • 858 Views
  • 4 replies
  • 0 kudos

Data Ingestion from GCP

Hi, I'm ingesting data from GCP to Databricks, and I think I've noticed a bug: any data tables that have a numerical starting character are not ingested at all. Has anyone else experienced this? Please let me know if there is a way around this apa...

Latest Reply
delson
New Contributor II
  • 0 kudos

Hi Slash, thanks for getting back to me. For instance, I have data tables such as "20240901_demographics_data_v1" which I'm trying to move from BQ. Other data tables that don't include a date (or other numerical characters) at the front are being ing...
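A likely culprit here is identifier quoting: names that start with a digit, like 20240901_demographics_data_v1, are not valid unquoted SQL identifiers, so any ingestion step that interpolates the table name unquoted will fail. A small sketch of the quoting rule using stdlib sqlite3 (Databricks SQL would use backticks, e.g. SELECT * FROM `20240901_demographics_data_v1`, rather than the double quotes shown here; the table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unquoted, a digit-leading name is a syntax error; quoted, it works fine.
conn.execute('CREATE TABLE "20240901_demographics_data_v1" (id INTEGER, region TEXT)')
conn.execute('INSERT INTO "20240901_demographics_data_v1" VALUES (1, \'emea\')')

rows = conn.execute('SELECT * FROM "20240901_demographics_data_v1"').fetchall()
print(rows)  # [(1, 'emea')]
```

So it is worth checking whether the ingestion tooling quotes table names before passing them to the target engine; if it does not, that would explain why only the date-prefixed tables fail.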

3 More Replies
drag7ter
by Contributor
  • 4264 Views
  • 4 replies
  • 1 kudos

Resolved! Not able to set run_as service_principal_name

I'm trying to run: databricks bundle deploy -t prod --profile PROD_Service_Principal
My bundle looks like:
bundle:
  name: myproject
include:
  - resources/jobs/bundles/*.yml
targets:
  # The 'dev' target, for development purposes. This target is the de...

Latest Reply
reidwil
New Contributor II
  • 1 kudos

Building on this situation, I am seeing that if I deploy a job using a service principal this way, something is prepended to the job name like `[dev f46583c2_8c9e_499f_8d41_823332bfd4473]`. Is there a different way for me via bundling to change this...
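That `[dev ...]` prefix comes from the target's mode: in a bundle target with mode: development, Databricks Asset Bundles prepend "[dev <user>]" to deployed resource names, while mode: production deploys them under their plain names. A hedged sketch of a prod target combining production mode with run_as — the host and application ID below are placeholders:

```yaml
# Sketch of a databricks.yml prod target (placeholder values):
targets:
  prod:
    # 'production' mode disables the "[dev <user>]" name prefix
    # that 'development' mode prepends to deployed jobs.
    mode: production
    workspace:
      host: https://<your-workspace>.cloud.databricks.com
    run_as:
      # Application ID (client ID) of the service principal the jobs run as.
      service_principal_name: "<service-principal-application-id>"
```

Deploying with `databricks bundle deploy -t prod` against a target shaped like this should produce jobs without the dev prefix.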

3 More Replies
anshi_t_k
by New Contributor III
  • 1555 Views
  • 3 replies
  • 1 kudos

Data engineering professional exam

Each configuration below is identical in that each cluster has 400 GB total of RAM, 160 total cores, and only one executor per VM. Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to...

Latest Reply
filipniziol
Esteemed Contributor
  • 1 kudos

Hi @anshi_t_k, the key consideration here is fault tolerance. How do you protect against a VM failure? By having more VMs, as the impact of a single VM failure will then be the lowest. For example, in answer C the crash of the VM loses 1/1, so 100% capa...
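The arithmetic behind that reasoning can be sketched as follows; the three VM layouts are hypothetical examples of splitting the same 400 GB / 160-core total, not the actual answer options from the exam question:

```python
# Same total capacity (400 GB RAM, 160 cores) split across different VM counts;
# losing one VM loses 1/n of total capacity, so more, smaller VMs fail softer.
configs = {
    "A: 2 VMs x 200 GB / 80 cores": 2,
    "B: 8 VMs x 50 GB / 20 cores": 8,
    "C: 1 VM x 400 GB / 160 cores": 1,
}

losses = {name: 1 / n_vms for name, n_vms in configs.items()}
for name, lost in losses.items():
    print(f"{name}: one VM failure loses {lost:.0%} of capacity")
```

With a single VM, one crash loses 100% of capacity and the job cannot complete; with eight VMs, a crash costs only 12.5% and the job can still finish, which is why the many-small-VMs layout best guarantees completion.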

2 More Replies
Stellar
by New Contributor II
  • 5267 Views
  • 1 replies
  • 0 kudos

Databricks CI/CD Azure Devops

Hi all, I am looking for advice on what would be the best approach when it comes to CI/CD in Databricks and the repo in general. Would it be best to have a main branch and branch off of it, or something else? How will changes be propagated from dev to qa an...

Harsha777
by New Contributor III
  • 1718 Views
  • 3 replies
  • 2 kudos

Resolved! Sub-Query behavior in sql statements

Hi Team, I have a query with the below construct in my project:
SELECT count(*) FROM `catalog`.`schema`.`t_table`
WHERE _col_check IN (SELECT DISTINCT _col_check FROM `catalog`.`schema`.`t_check_table`)
Actually, there is no column "_col_check" in the sub-que...

Latest Reply
filipniziol
Esteemed Contributor
  • 2 kudos

Hi @Harsha777, what occurs is called column shadowing. The column names in the main query and sub-query are identical, and after not finding the column in the sub-query, the Databricks engine searches for it in the main query. The simplest way to avoid the...
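This resolution rule is standard SQL, so it can be reproduced with stdlib sqlite3; the tables and column names below are made-up stand-ins for the thread's t_table/t_check_table, and Databricks SQL resolves the reference the same way:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_main (checked TEXT)")
conn.execute("CREATE TABLE t_check (other_col TEXT)")
conn.executemany("INSERT INTO t_main VALUES (?)", [("a",), ("b",)])
conn.execute("INSERT INTO t_check VALUES ('zzz')")

# 'checked' does not exist in t_check, so it silently resolves to the OUTER
# table's column, turning the subquery into a correlated one where every
# non-NULL outer row matches itself.
shadowed = conn.execute(
    "SELECT count(*) FROM t_main WHERE checked IN "
    "(SELECT DISTINCT checked FROM t_check)"
).fetchone()[0]
print(shadowed)  # 2 -- every row of t_main matches

# Qualifying the column with its table name turns the typo into a hard error.
try:
    conn.execute(
        "SELECT count(*) FROM t_main WHERE checked IN "
        "(SELECT DISTINCT t_check.checked FROM t_check)"
    )
except sqlite3.OperationalError as e:
    print("error:", e)
```

Always qualifying subquery columns (t_check_table._col_check) is the cheap insurance here: a misspelled or missing column then fails loudly instead of silently matching everything.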

2 More Replies
m_weirath
by New Contributor II
  • 971 Views
  • 2 replies
  • 0 kudos

DLT-META requires ddl when using cdc_apply_changes

We are setting up new DLT pipelines using the DLT-Meta package. Everything goes well bringing our data from Landing into our Bronze layer when we keep the onboarding JSON fairly vanilla. However, we are hitting an issue when using the cdc_app...

Latest Reply
dbuser17
New Contributor II
  • 0 kudos

Please check these details: https://github.com/databrickslabs/dlt-meta/issues/90

1 More Replies
Vasu_Kumar_T
by New Contributor II
  • 625 Views
  • 1 replies
  • 0 kudos

Unity Catalog: Metastore 3 level Hierarchy

I have data files categorized by application and region. I want to know the best way to load them into the Bronze and Silver layers while maintaining proper segregation. For example, in our landing zone, we have a structure of raw files to be loaded usi...

Latest Reply
Shazaamzaa
New Contributor III
  • 0 kudos

If I understand it correctly, you have source files partitioned by application and region in cloud storage that you want to load and would like some suggestions on the Unity Catalog structure. It will definitely depend on how you want the data to be ...

Sudheer89
by New Contributor
  • 1892 Views
  • 1 replies
  • 0 kudos

Where is Data tab and DBFS in Premium Databricks workspace

Currently I can see a Catalog tab instead of a Data tab in the left-side navigation. I am unable to find the Data tab -> File browser where I would like to upload a sample orders CSV file. Later I want to refer to that path in Databricks notebooks as /FileStore/t...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @Sudheer89, by default the DBFS tab is disabled. As an admin user, you can manage your users' ability to browse data in the Databricks File System (DBFS) using the visual browser interface:
- Go to the admin console.
- Click the Workspace Settings tab.
- In ...

valesexp
by New Contributor II
  • 1787 Views
  • 1 replies
  • 1 kudos

Enforce tags to Jobs

Does anyone know how I can enforce job tags (not the custom tags for clusters)? I want to enforce that jobs have certain tags so we can filter our jobs. We are not using Unity Catalog yet.

Latest Reply
Walter_C
Databricks Employee
  • 1 kudos

Currently, enforcing job tags is not a built-in feature in Databricks. However, you can add tags to your jobs when creating or updating them and filter jobs by these tags on the jobs list page.
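Since there is no server-side enforcement, one pragmatic pattern is to validate job specs in CI before they are sent to jobs/create. A small sketch — the required tag keys, job name, and tag values are illustrative, not an established convention:

```python
# Reject job specs in CI unless they carry the organisation's required tags.
# The Jobs API accepts a "tags" map on the job, which the jobs list UI can
# filter on; this check just refuses to deploy specs that omit it.
REQUIRED_TAGS = {"team", "cost_center"}

job_spec = {
    "name": "nightly-etl",
    "tags": {"team": "data-eng", "cost_center": "1234"},
    "tasks": [],
}

def has_required_tags(spec: dict) -> bool:
    """Return True only if every required tag key is present on the job spec."""
    return REQUIRED_TAGS.issubset(spec.get("tags", {}))

print(has_required_tags(job_spec))  # True
print(has_required_tags({"name": "untagged-job"}))  # False
```

Running this as a pre-deploy gate gives most of the benefit of enforcement without waiting for a platform feature.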

Nathant93
by New Contributor III
  • 1612 Views
  • 1 replies
  • 0 kudos

Constructor public org.apache.spark.ml.feature.Bucketizer(java.lang.String) is not whitelisted.

Hi, I am getting the error "Constructor public org.apache.spark.ml.feature.Bucketizer(java.lang.String) is not whitelisted" when using a serverless compute cluster. I have seen in some other articles that this is due to high concurrency - does anyone k...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

The error you're encountering, "Constructor public org.apache.spark.ml.feature.Bucketizer(java.lang.String) is not whitelisted", typically arises when using a shared mode cluster. This is because Spark ML is not supported in shared clusters due to se...

