Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

alesventus
by Contributor
  • 1153 Views
  • 2 replies
  • 0 kudos

Resolved! How to handle loading 300 tables into Delta Lake

My task is to sync 300 tables from an on-prem SQL Server to Delta Lake. I will load CDC from Raw. The first step is to move the CDC data to bronze with Auto Loader, then use a Delta stream to get changes from bronze, make simple datatype changes, and merge this data...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @alesventus, you can apply a metadata/config-driven approach. Create a control table (or a JSON/YAML file) with all the information required for processing, such as: source table name, target table, table primary keys, transformations to apply. And then ...

1 More Replies
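As a hedged illustration of that config-driven pattern, the sketch below loops over a hypothetical control table (ops.load_control with bronze_table, target_table and primary_keys columns, none of which come from the thread) and runs one streaming MERGE per entry in a Databricks notebook where `spark` is available:

```python
def upsert_builder(target, keys):
    view = "updates_" + target.replace(".", "_")   # unique view name per target

    def upsert(batch_df, batch_id):
        cond = " AND ".join(f"t.{k} = s.{k}" for k in keys)
        batch_df.createOrReplaceTempView(view)
        batch_df.sparkSession.sql(
            f"""MERGE INTO {target} t
                USING {view} s ON {cond}
                WHEN MATCHED THEN UPDATE SET *
                WHEN NOT MATCHED THEN INSERT *"""
        )

    return upsert

# Hypothetical control table driving all 300 tables.
for row in spark.table("ops.load_control").collect():
    target = row["target_table"]              # e.g. "silver.customers"
    keys = row["primary_keys"].split(",")     # e.g. "customer_id"

    (spark.readStream.table(row["bronze_table"])   # stream changes out of bronze
        .writeStream
        .foreachBatch(upsert_builder(target, keys))
        .option("checkpointLocation", f"/Volumes/ops/chk/{target}")  # placeholder path
        .trigger(availableNow=True)
        .start())
```

With roughly 300 entries you would typically run the loop with availableNow triggers on a schedule rather than keeping 300 continuous streams alive.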
ahmed_zarar
by New Contributor III
  • 1753 Views
  • 2 replies
  • 3 kudos

Resolved! Process single data set with different JSON schema rows using PySpark in Databricks

Hi, I am getting data from Event Hub and storing it in a Delta table as raw rows. The data I receive is JSON, and the problem is that each row has a different schema, but the code I use takes the first row's schema. I am stuck on how to do this, please can anyone gui...

Latest Reply
ahmed_zarar
New Contributor III
  • 3 kudos

Thank you , I got it.

1 More Replies
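One hedged way to handle rows whose JSON documents have differing schemas is to let Spark infer the union schema across all rows instead of sampling only the first one; the table and column names below are placeholders, not from the thread:

```python
from pyspark.sql import functions as F

# Raw Delta table with one JSON document per row in a string column `body` (hypothetical names).
raw = spark.table("bronze.eventhub_raw").select("body")

# Infer a schema that is the union of every row's fields,
# rather than the schema of the first row only.
parsed = spark.read.json(raw.rdd.map(lambda r: r["body"]))
parsed.printSchema()

# Alternative that avoids the RDD API (e.g. on shared/serverless compute):
# keep the payload as a map and pick fields out per event type.
as_map = raw.select(F.from_json("body", "map<string,string>").alias("kv"))
```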
hpant
by New Contributor III
  • 3316 Views
  • 9 replies
  • 7 kudos

Resolved! Where exactly should I create a Volume in a catalog?

Currently my Databricks looks like this: I want to create a volume to access an external location. Where exactly should I create it? Should I create a new schema in the "poe" catalog and create a volume inside it, or create it in an existing schema? What is the b...

Latest Reply
hpant1
New Contributor III
  • 7 kudos

No, I don't have.  

8 More Replies
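For reference, a minimal sketch of one common layout, a dedicated schema in the existing catalog just for volumes; the schema name, volume name, and storage URL are made up, and an external location covering that path must already exist:

```python
# Run in a Databricks notebook where `spark` is available.
spark.sql("CREATE SCHEMA IF NOT EXISTS poe.landing")
spark.sql("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS poe.landing.raw_files
    LOCATION 'abfss://raw@mystorageaccount.dfs.core.windows.net/landing'
""")
# Files then resolve under /Volumes/poe/landing/raw_files/...
```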
juanicobsider
by New Contributor
  • 1012 Views
  • 2 replies
  • 3 kudos

How to parse a VARIANT type column using PySpark syntax?

I'm trying to parse a VARIANT data type column. What is the correct syntax to parse sub-columns using PySpark, and is it possible? I'd like to know how to do it this way (I know how to do it using SQL syntax).

Latest Reply
Witold
Honored Contributor
  • 3 kudos

Adding to what @szymon_dybczak already said correctly: it's actually not a workaround, it's designed and documented that way. Make sure you understand the difference between `:` and `.`. Regarding PySpark, the API has other variant-relat...

1 More Replies
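A hedged PySpark sketch of the variant functions mentioned in the reply, run through expr() so it works even where they are exposed only as SQL functions; the table name and paths are placeholders:

```python
from pyspark.sql import functions as F

df = spark.table("main.demo.events")   # has a VARIANT column `payload` (hypothetical)

parsed = df.select(
    # Path extraction with an explicit target type.
    F.expr("variant_get(payload, '$.device.id', 'string')").alias("device_id"),
    # try_variant_get returns NULL instead of failing on missing or mistyped paths.
    F.expr("try_variant_get(payload, '$.metrics.temp', 'double')").alias("temp"),
    # The `:` shorthand path syntax from the SQL examples can also be used inside expr().
    F.expr("payload:device.id::string").alias("device_id_shorthand"),
)
```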
tramtran
by Contributor
  • 3563 Views
  • 6 replies
  • 7 kudos

Make the job fail if a task fails

Hi everyone, I have a job with 2 tasks running independently. If one of them fails, the remaining task continues to run. I would like the job to fail if any task fails. Is there any way to do that? Thank you!

Latest Reply
Edthehead
Contributor III
  • 7 kudos

Extending what @mhiltner has suggested, let's say you have 2 streaming tasks, streamA and streamB. Create 2 separate tasks, taskA and taskB. Each of these tasks should execute the same notebook, which makes an API call to the CANCEL RUN or CANCEL AL...

5 More Replies
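A minimal sketch of that watchdog notebook, assuming the parent run id is wired in as a task parameter set to {{job.run_id}} and that the workspace URL and a token are stored in a secret scope (all names below are illustrative):

```python
import requests

host = dbutils.secrets.get("ops", "workspace_host")   # e.g. https://adb-123.azuredatabricks.net
token = dbutils.secrets.get("ops", "api_token")
run_id = dbutils.widgets.get("parent_run_id")          # task parameter set to {{job.run_id}}

def run_stream():
    """Placeholder for this task's actual streaming logic."""
    ...

try:
    run_stream()
except Exception:
    # Cancel the whole job run so the sibling streaming task stops as well.
    requests.post(
        f"{host}/api/2.1/jobs/runs/cancel",
        headers={"Authorization": f"Bearer {token}"},
        json={"run_id": int(run_id)},
        timeout=30,
    ).raise_for_status()
    raise
```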
DanR
by New Contributor III
  • 18452 Views
  • 4 replies
  • 3 kudos

PermissionError: [Errno 1] Operation not permitted: '/Volumes/mycatalog'

We are having intermittent errors where a Job Task cannot access a Catalog through a Volume, with the error: `PermissionError: [Errno 1] Operation not permitted: '/Volumes/mycatalog'`. The Job has 40 tasks running in parallel and every few runs we exp...

Data Engineering
Unity Catalog
Volumes
Latest Reply
NandiniN
Databricks Employee
  • 3 kudos

It appears to be a concurrency limitation; there were fixes in the past, but there is a possibility it may be a new code flow. Adding a retry to the operation can mitigate the issue and work as a workaround, but you can report the issue with Datab...

3 More Replies
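A small retry wrapper along the lines of that workaround, with the volume path and retry limits chosen arbitrarily:

```python
import os
import time

def list_volume_with_retry(path, attempts=5, delay=2.0):
    """Retry os.listdir when the Volumes mount throws a transient PermissionError."""
    for attempt in range(1, attempts + 1):
        try:
            return os.listdir(path)
        except PermissionError:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)   # simple linear backoff

files = list_volume_with_retry("/Volumes/mycatalog/myschema/myvolume")
```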
delta_bravo
by New Contributor
  • 7284 Views
  • 2 replies
  • 0 kudos

Cluster termination issue

I am using Databricks as a Community Edition user with a limited cluster (just 1 Driver: 15.3 GB Memory, 2 Cores, 1 DBU). I am trying to run some custom algorithms for continuous calculations and writing results to the delta table every 15 minutes al...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

If you set the "Terminate after" setting to 0 minutes during the creation of an all-purpose compute, it means that the auto-termination feature will be turned off. This is because the "Terminate after" setting is used to specify an inactivity period ...

1 More Replies
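For context, the same setting is exposed as autotermination_minutes in the Clusters API (0 disables auto-termination). The sketch below is illustrative only; the host, token, cluster id, and spec are placeholders, and clusters/edit expects the full desired cluster spec:

```python
import requests

host = "https://<workspace-host>"
token = "<api-token>"

requests.post(
    f"{host}/api/2.1/clusters/edit",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "0801-123456-abcdefgh",
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
        "autotermination_minutes": 120,   # 0 would turn auto-termination off
    },
    timeout=30,
).raise_for_status()
```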
curiousoctopus
by New Contributor III
  • 4550 Views
  • 4 replies
  • 4 kudos

Run multiple jobs with different source code at the same time with Databricks asset bundles

Hi, I am migrating from dbx to Databricks asset bundles. Previously with dbx I could work on different features in separate branches and launch jobs without the issue of one job overwriting the other. Now with Databricks asset bundles it seems like I can...

Latest Reply
mo_moattar
New Contributor III
  • 4 kudos

We have the same issue. We might have multiple open PRs on the bundles that are deploying the code, pipelines, jobs, etc. to the same workspace before the merge, and they keep overwriting each other in the workspace. The jobs already have a separate ID ...

3 More Replies
narenderkumar53
by New Contributor II
  • 1062 Views
  • 3 replies
  • 2 kudos

Can we parameterize the tags in the job compute?

I want to monitor cost better for Databricks job compute. I am using tags on the cluster to monitor cost. The tag values are static as of now. Can we parameterize the job cluster compute so that I can pass the tag values at runtime a...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi @narenderkumar53, if you're using ADF you can look at the article below: Applying Dynamic Tags To Databricks Job Clusters in Azure Data Factory | by Kyle Hale | Medium. If not, I think you can try to write some code that will use the endpoint below. The idea is, before exec...

2 More Replies
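A hedged sketch of that idea: update the job cluster's custom_tags via the Jobs API and then trigger the run. The job id, cluster key, node type, and tag values are placeholders, and note that jobs/update replaces the top-level fields passed in new_settings:

```python
import requests

host = "https://<workspace-host>"
token = "<api-token>"
headers = {"Authorization": f"Bearer {token}"}

# Rewrite the job cluster definition with the tags for this run.
requests.post(
    f"{host}/api/2.1/jobs/update",
    headers=headers,
    json={
        "job_id": 123456,
        "new_settings": {
            "job_clusters": [{
                "job_cluster_key": "main_cluster",
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 2,
                    "custom_tags": {"cost_center": "team-a", "run_env": "prod"},
                },
            }]
        },
    },
    timeout=30,
).raise_for_status()

# Then trigger the run on the freshly tagged cluster definition.
requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers=headers,
    json={"job_id": 123456},
    timeout=30,
).raise_for_status()
```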
Jeewan
by New Contributor
  • 555 Views
  • 0 replies
  • 0 kudos

Partition in Spark with a subquery which includes a Union

I have a SQL query like this: select ... from table1 where id in (select id from table1 where (some condition) UNION select id from table2 where (some condition)) table1. I have made a partition of 200 where the upper bound is 200 and the lower bound is 0 and p...

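If this refers to a partitioned JDBC read, a hedged sketch is below. The connection details and bounds are placeholders, the subquery is pushed down through dbtable, and lowerBound/upperBound only decide how the id range is split across partitions, they do not filter rows:

```python
# Subquery pushed down to the source database; the alias `src` is required by the JDBC source.
query = """
    (SELECT t1.* FROM table1 t1
     WHERE t1.id IN (
         SELECT id FROM table1 WHERE /* some condition */ 1=1
         UNION
         SELECT id FROM table2 WHERE /* some condition */ 1=1
     )) AS src
"""

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("dbtable", query)
    .option("user", "<user>")
    .option("password", "<password>")
    .option("partitionColumn", "id")
    .option("lowerBound", 0)
    .option("upperBound", 200)
    .option("numPartitions", 200)
    .load()
)
```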
Prashanth24
by New Contributor III
  • 1672 Views
  • 3 replies
  • 3 kudos

Resolved! Databricks workflow: cost of each task

Suppose we have 4 tasks (3 notebooks and 1 plain Python task) in a workflow; I would like to know the cost incurred for each task in the Databricks workflow. Please let me know if there is any way to find out these details.

Latest Reply
Edthehead
Contributor III
  • 3 kudos

If the tasks are sharing the same cluster then no, you cannot differentiate the costs between the tasks. However, if you set up each task to have its own job cluster and pass some custom tags, you can then differentiate/report the costs ...

2 More Replies
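Once each task has its own job cluster and tags, usage can be broken out from the billing system table; a hedged sketch, where the tag key task_name is a placeholder:

```python
per_task = spark.sql("""
    SELECT
        usage_date,
        custom_tags['task_name']  AS task_name,
        usage_metadata.job_id     AS job_id,
        SUM(usage_quantity)       AS dbus
    FROM system.billing.usage
    WHERE usage_metadata.job_id IS NOT NULL
      AND custom_tags['task_name'] IS NOT NULL
    GROUP BY usage_date, custom_tags['task_name'], usage_metadata.job_id
    ORDER BY usage_date DESC
""")
per_task.show()
```

If a dollar estimate is needed, the DBU figures can be joined against system.billing.list_prices.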
guangyi
by Contributor III
  • 593 Views
  • 0 replies
  • 0 kudos

Confused about large memory usage of cluster

We set up a demo DLT pipeline with no data involved: @dlt.table(name="demo") def sample(): df = spark.sql("SELECT 'silver' as Layer"); return df. However, when we check the metrics of the cluster, it looks like 10 GB of memory has already be...

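For reference, the snippet from the post reformatted as runnable DLT code (it only executes inside a DLT pipeline); the roughly 10 GB baseline reported on the cluster is typically driver and executor JVM overhead rather than data from this table:

```python
import dlt

@dlt.table(name="demo")
def sample():
    # Single constant row, no external data involved.
    return spark.sql("SELECT 'silver' AS Layer")
```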
DBMIVEN
by New Contributor II
  • 682 Views
  • 0 replies
  • 0 kudos

Ingesting data from SQL Server foreign tables

I have created a connection to a SQL Server DB and set up a catalog for it. I can now view all the tables and query them. I want to ingest some of the tables into our ADLS Gen2 that we set up with Unity Catalog. What is the best approach here? Lak...

Data Engineering
Data ingestion
Foreign catalogs
Incremental Data Ingestion
LakeFlow
SQL Server
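One simple approach, sketched under assumptions: read the foreign (Lakehouse Federation) table and materialize it as a Unity Catalog Delta table backed by the ADLS storage. All names are placeholders, and an incremental load would swap the overwrite for a watermark filter or a LakeFlow/Auto Loader based pipeline:

```python
# Foreign catalog table exposed through the SQL Server connection (hypothetical names).
src = spark.table("sqlserver_cat.dbo.customers")

# Materialize as a Unity Catalog Delta table in the ADLS-backed schema (hypothetical names).
(src.write
    .mode("overwrite")
    .saveAsTable("main.bronze.customers"))
```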
ayush19
by New Contributor III
  • 1043 Views
  • 1 reply
  • 0 kudos

Running jar on Databricks cluster from Airflow

Hello, I have a jar file which is installed on a cluster. I need to run this jar from Airflow using DatabricksSubmitRunOperator. I followed the standard instructions available in the Airflow docs: https://airflow.apache.org/docs/apache-airflow-providers-...

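For comparison, a minimal DatabricksSubmitRunOperator sketch for a spark_jar_task; the cluster id, main class, connection id, and jar path are placeholders, and the jar is listed under libraries so the submitted run has it on the classpath:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG("run_databricks_jar", start_date=datetime(2024, 8, 1), schedule=None) as dag:
    run_jar = DatabricksSubmitRunOperator(
        task_id="run_jar",
        databricks_conn_id="databricks_default",
        existing_cluster_id="0801-123456-abcdefgh",
        spark_jar_task={
            "main_class_name": "com.example.Main",
            "parameters": ["--env", "dev"],
        },
        libraries=[{"jar": "dbfs:/FileStore/jars/my_app.jar"}],
    )
```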
ruoyuqian
by New Contributor II
  • 1546 Views
  • 0 replies
  • 0 kudos

dbt writing into a different schema

I have a Unity Catalog and it goes like `catalogname.schemaname1` & `catalogname.schemaname2`, and I am trying to write tables into schemaname2 with dbt. The current setup in the dbt profiles.yml is: prj_dbt_databricks: outputs: dev: cata...

