Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

copper-carrot
by New Contributor II
  • 2179 Views
  • 1 replies
  • 1 kudos

spark.sql() is suddenly giving an error "Unable to instantiate org.apache.hadoop.hive.metastore.Hive

spark.sql() is suddenly giving an error "Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient" on Databricks jobs and Python scripts that worked last month. No local changes on my end. What could be the cause of this and what sh...

neointab
by New Contributor
  • 1024 Views
  • 1 replies
  • 0 kudos

How to restrict a group/user from creating an unrestricted cluster?

We have set up the entitlement, but it doesn't work. I checked the blogs, and it also needs to be set up in a cluster policy, but I can't find how to set that up in a cluster policy. Could you give some suggestions?

Latest Reply
antonuzzo96
New Contributor III
  • 0 kudos

Hi, have you checked whether the users are admins inside the workspace? This can greatly change how policies and restrictions apply to the clusters.

hpant1
by New Contributor III
  • 1227 Views
  • 1 replies
  • 0 kudos

Does it make sense to create a volume at an external location in a dev environment?

I have created a dev resource group for Databricks which includes a "storage account", an "access connector" and a "Databricks workspace". In the storage account I have created a container which is linked to the metastore. This container also contains raw dat...

Latest Reply
antonuzzo96
New Contributor III
  • 0 kudos

Hi, for some use cases we have created external volumes in Databricks because the files needed to be accessed outside of Databricks, directly on the storage account, as they had to interact with other tools.

hpant1
by New Contributor III
  • 1029 Views
  • 1 replies
  • 2 kudos

What is the more optimized way of writing a Delta table in a workflow, "append" or "overwrite"?

What is the more optimized way of writing a Delta table in a workflow that runs every hour, "append" or "overwrite"?

Latest Reply
Witold
Databricks Partner
  • 2 kudos

There's no "optimized way", as these are two different concepts and the choice depends on your use case: overwrite removes existing data, i.e. replaces it with new data, while append adds new data to your existing table.
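
To make the difference concrete, here is a minimal PySpark sketch, assuming `df` is a DataFrame holding the batch produced by one workflow run. The helper name `write_batch` and the table name in the usage line are hypothetical; `DataFrameWriter.format`, `mode` and `saveAsTable` are the standard PySpark calls.

```python
VALID_MODES = ("append", "overwrite")


def write_batch(df, table_name, mode="append"):
    """Write one batch to a Delta table.

    mode="append" adds the batch's rows to the rows already in the table.
    mode="overwrite" replaces the table's current contents with the batch.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    # Hypothetical helper: the writer chain itself is plain PySpark.
    df.write.format("delta").mode(mode).saveAsTable(table_name)
```

For example, `write_batch(df, "dev.demo.events", mode="append")` would fit an hourly run that only brings in new rows, while `mode="overwrite"` would fit a run that recomputes the whole table each time. Which one applies depends on the use case, not on speed.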

ahmed_zarar
by New Contributor III
  • 3097 Views
  • 2 replies
  • 3 kudos

Resolved! Process a single data set with different JSON schema rows using PySpark in Databricks

Hi, I am getting data from Event Hub and storing it in a Delta table as raw rows. The data I receive is JSON. The problem is that each row has a different schema, but the code I use takes the first row's JSON schema. I am stuck on how to do this. Please, can anyone gui...

Latest Reply
ahmed_zarar
New Contributor III
  • 3 kudos

Thank you, I got it.

1 More Replies
hpant
by New Contributor III
  • 6457 Views
  • 9 replies
  • 7 kudos

Resolved! Where exactly should I create a volume in a catalog?

Currently my Databricks looks like this: I want to create a volume to access an external location. Where exactly should I create it? Should I create a new schema in the "poe" catalog and create a volume inside it, or create it in an existing schema? What is the b...

Latest Reply
hpant1
New Contributor III
  • 7 kudos

No, I don't have.  

8 More Replies
juanicobsider
by Databricks Partner
  • 2225 Views
  • 2 replies
  • 3 kudos

How to parse a VARIANT type column using PySpark syntax?

I am trying to parse a VARIANT data type column. What is the correct syntax to parse sub-columns using PySpark? Is it possible? I'd like to know how to do it this way (I know how to do it using SQL syntax).

Latest Reply
Witold
Databricks Partner
  • 3 kudos

As an addition to what @szymon_dybczak already said correctly: it's actually not a workaround, it's designed and documented that way. Make sure that you understand the difference between `:` and `.`. Regarding PySpark, the API has other variant relat...

1 More Replies
tramtran
by Contributor
  • 7911 Views
  • 6 replies
  • 7 kudos

Make the job fail if a task fails

Hi everyone, I have a job with 2 tasks running independently. If one of them fails, the remaining task continues to run. I would like the job to fail if any task fails. Is there any way to do that? Thank you!

Latest Reply
Edthehead
Contributor III
  • 7 kudos

Extending what @mhiltner has suggested, let's say you have 2 streaming tasks, streamA and streamB. Create 2 separate tasks, taskA and taskB. Each of these tasks should execute the same notebook which makes an API call to the CANCEL RUN or CANCEL AL...
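
The reply is cut off in this listing, so the following is only a sketch of what a cancel call from such a notebook might look like, not the poster's actual code. It assumes the Jobs API 2.1 `runs/cancel` endpoint, a workspace URL in `host` and a personal access token in `token`; the function names are hypothetical, and how the notebook obtains the `run_id` is left out.

```python
import json
import urllib.request


def build_cancel_run_request(host, token, run_id):
    """Build (but do not send) a POST to the Jobs API runs/cancel endpoint."""
    url = f"{host.rstrip('/')}/api/2.1/jobs/runs/cancel"
    body = json.dumps({"run_id": run_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def cancel_run(host, token, run_id):
    """Send the cancel request and return the HTTP status code."""
    request = build_cancel_run_request(host, token, run_id)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the request construction separate from the network call makes the payload easy to inspect before wiring it into a task's failure handling.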

5 More Replies
DanR
by New Contributor III
  • 21276 Views
  • 4 replies
  • 3 kudos

PermissionError: [Errno 1] Operation not permitted: '/Volumes/mycatalog'

We are having intermittent errors where a Job Task cannot access a Catalog through a Volume, with the error: `PermissionError: [Errno 1] Operation not permitted: '/Volumes/mycatalog'`. The Job has 40 tasks running in parallel and every few runs we exp...

Data Engineering
Unity Catalog
Volumes
Latest Reply
NandiniN
Databricks Employee
  • 3 kudos

It appears to be a concurrency limitation. There were fixes in the past, but there is a possibility it may be a new code flow. Adding a retry to the operation can mitigate the issue and serve as a workaround. But you can report the issue with Datab...
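
As a rough illustration of the retry workaround, here is a small sketch in plain Python. The helper name `with_retry`, the number of attempts and the backoff delays are all assumptions made for the example, not recommended values from Databricks.

```python
import time


def with_retry(operation, attempts=5, base_delay=1.0,
               retry_on=(PermissionError,), sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on retry_on errors.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            # Wait 1x, 2x, 4x, ... base_delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

A task could then wrap the failing access, for example `with_retry(lambda: os.listdir("/Volumes/mycatalog"))`, so that an intermittent `PermissionError` is retried instead of failing the task on the first occurrence.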

3 More Replies
delta_bravo
by New Contributor
  • 10143 Views
  • 2 replies
  • 0 kudos

Cluster termination issue

I am using Databricks as a Community Edition user with a limited cluster (just 1 Driver: 15.3 GB Memory, 2 Cores, 1 DBU). I am trying to run some custom algorithms for continuous calculations and writing results to the delta table every 15 minutes al...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

If you set the "Terminate after" setting to 0 minutes during the creation of an all-purpose compute, it means that the auto-termination feature will be turned off. This is because the "Terminate after" setting is used to specify an inactivity period ...

1 More Replies
curiousoctopus
by New Contributor III
  • 6653 Views
  • 4 replies
  • 4 kudos

Run multiple jobs with different source code at the same time with Databricks asset bundles

Hi, I am migrating from dbx to Databricks Asset Bundles. Previously with dbx I could work on different features in separate branches and launch jobs without one job overwriting the other. Now with Databricks Asset Bundles it seems like I can...

Latest Reply
mo_moattar
New Contributor III
  • 4 kudos

We have the same issue. We might have multiple open PRs on the bundles that deploy the code, pipelines, jobs, etc. to the same workspace before the merge, and they keep overwriting each other in the workspace. The jobs already have a separate ID ...

3 More Replies
narenderkumar53
by Databricks Partner
  • 2072 Views
  • 3 replies
  • 2 kudos

Can we parameterize the tags in the job compute?

I want to better monitor the cost of Databricks job computes. I am using tags on the cluster to monitor cost. The tag values are static as of now. Can we parameterize the job cluster so that I can pass the tag values during the runtime a...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi, if you're using ADF, you can look at the article below: Applying Dynamic Tags To Databricks Job Clusters in Azure Data Factory | by Kyle Hale | Medium. If not, I think you can try to write some code that uses the endpoint below. The idea is, before exec...

2 More Replies
Jeewan
by New Contributor
  • 1047 Views
  • 0 replies
  • 0 kudos

Partition in Spark with a subquery that includes UNION

I have a SQL query like this: select ... from table1 where id in (select id from table1 where (some condition) UNION select id from table2 where (some condition)) table1. I have made a partition of 200, where the upper bound is 200 and the lower bound is 0 and p...

Prashanth24
by New Contributor III
  • 3160 Views
  • 3 replies
  • 3 kudos

Resolved! Databricks workflow each task cost

Suppose we have 4 tasks (3 notebooks and 1 normal Python script) in a workflow. I would like to know the cost incurred by each task in the Databricks workflow. Please let me know if there is any way to find out these details.

Latest Reply
Edthehead
Contributor III
  • 3 kudos

If the tasks share the same cluster, then no, you cannot differentiate the costs between the tasks. However, if you set up each task to have its own job cluster and pass some custom tags, you can then differentiate/report the costs ...
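
One way to picture the "own job cluster plus custom tags" setup is the sketch below, which builds one `job_clusters` entry per task in the shape the Jobs API expects (`job_cluster_key`, `new_cluster`, `custom_tags`). The helper name, the `task_key` tag name and the cluster values in the usage example are hypothetical placeholders, not settings taken from the thread.

```python
def job_cluster_for_task(task_key, base_cluster, extra_tags=None):
    """Return a job_clusters entry whose cluster carries a per-task tag.

    base_cluster is a new_cluster spec shared by all tasks. It is copied,
    not modified, so every task gets its own tag set.
    """
    tags = dict(base_cluster.get("custom_tags", {}))
    tags["task_key"] = task_key  # hypothetical tag name used for cost reports
    tags.update(extra_tags or {})
    new_cluster = {**base_cluster, "custom_tags": tags}
    return {"job_cluster_key": f"{task_key}_cluster", "new_cluster": new_cluster}
```

For example, `job_cluster_for_task("ingest", {"spark_version": "15.4.x-scala2.12", "node_type_id": "Standard_DS3_v2", "num_workers": 1})` yields a cluster spec tagged with `task_key=ingest`, and cost reports can then be grouped by that tag.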

2 More Replies
guangyi
by Contributor III
  • 880 Views
  • 0 replies
  • 0 kudos

Confused about large memory usage of cluster

We set up a demo DLT pipeline with no data involved:

    @dlt.table(name="demo")
    def sample():
        df = spark.sql("SELECT 'silver' as Layer")
        return df

However, when we check the cluster metrics, it looks like 10 GB of memory has already be...
