Data Engineering

Forum Posts

jenshumrich
by New Contributor III
  • 108 Views
  • 2 replies
  • 0 kudos

Filter not using partition

I have the following code:
spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation...

Latest Reply
jenshumrich
New Contributor III
  • 0 kudos

Thanks a lot for your response. It seems the Filter is not pushed down, no?
station_df.explain()
== Physical Plan ==
*(1) Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))
+- *(1) Scan ExistingRDD[Date#2718,WindSpeed#2675,Tower_Accele...

1 More Reply
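For the partition-pruning question above, a minimal sketch under assumptions (the table name is a placeholder, not the poster's data): repartitionByRange returns a new DataFrame and so must be assigned, and a checkpointed DataFrame is scanned as an in-memory RDD, where no filter pushdown or partition pruning can happen, which matches the Scan ExistingRDD in the plan shown in the reply.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source: a Delta table partitioned by IdStation lets Spark prune
# files for an IdStation filter; a checkpointed DataFrame cannot.
result_df = spark.read.table("mydb.station_results")  # assumed table name

# repartitionByRange returns a new DataFrame, so the result must be assigned.
result_df = result_df.repartitionByRange(200, "IdStation")

# Filter first, then checkpoint/cache only the subset that is actually needed.
station_df = result_df.filter("IdStation = 1119844")
station_df.explain()  # verify the filter sits next to the file scan
```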
israelst
by New Contributor II
  • 268 Views
  • 2 replies
  • 0 kudos

DLT can't authenticate with kinesis using instance profile

When running my notebook on personal compute with an instance profile, I am indeed able to readStream from Kinesis. But adding it as a DLT pipeline with UC, while specifying the same instance profile in the DLT pipeline settings, causes a "MissingAuthenticatio...

Data Engineering
Delta Live Tables
Unity Catalog
Latest Reply
Mathias_Peters
New Contributor II
  • 0 kudos

Hi, were you able to solve this problem? If so, what was the solution?

1 More Reply
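For the DLT/Kinesis question above, a minimal sketch, not a confirmed fix: it assumes the Databricks Kinesis source and passes the role to assume explicitly via the roleArn option rather than relying on the pipeline's instance profile. Stream name, region, and ARN are placeholders.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(name="kinesis_raw")
def kinesis_raw():
    # Explicit roleArn instead of the pipeline instance profile (placeholders below).
    return (
        spark.readStream.format("kinesis")
        .option("streamName", "my-stream")                                   # placeholder
        .option("region", "eu-central-1")                                    # placeholder
        .option("roleArn", "arn:aws:iam::123456789012:role/kinesis-reader")  # placeholder
        .load()
        .select(F.col("data").cast("string").alias("payload"))
    )
```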
angel_ba
by New Contributor II
  • 19 Views
  • 0 replies
  • 0 kudos

unity catalog system.access.audit lag

Hello, we have a Unity Catalog enabled workspace. To get the completion time of a pipeline that runs multiple times a day, I am checking the system.access.audit table. Comparing the completion time of the pipeline with the other pipelines' times, I am creat...

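For the audit-lag question above, a minimal query sketch with assumptions: the 'deltaPipelines' service name and the system.access.audit columns used here are taken from the standard audit-log schema and may differ for your events, and system tables are populated asynchronously, so very recent events can lag behind the actual run completion.

```python
# Pull recent DLT-related audit events; adjust service/action filters as needed.
completions = spark.sql("""
    SELECT event_time, action_name, request_params
    FROM system.access.audit
    WHERE service_name = 'deltaPipelines'
      AND event_date >= current_date() - INTERVAL 1 DAY
    ORDER BY event_time DESC
""")
display(completions)
```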
nikhilkumawat
by New Contributor III
  • 4851 Views
  • 6 replies
  • 3 kudos

Resolved! Get file information while using "Trigger jobs when new files arrive" https://docs.databricks.com/workflows/jobs/file-arrival-triggers.html

I am currently trying to use the "Trigger jobs when new files arrive" feature in one of my projects. I have an S3 bucket in which files arrive on random days, so I created a job and set the trigger to the "file arrival" type. And within the no...

Latest Reply
adriennn
New Contributor III
  • 3 kudos

Looks like a major oversight not to be able to get the information on what file(s) have triggered the job. Anyway, the above explanations given by Anon read like the replies of ChatGPT, especially the scenario where a dataframe is passed to a trigger...

5 More Replies
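For the file-arrival-trigger thread above, a minimal workaround sketch: the trigger does not pass the triggering file(s) to the job, so one option is to list the monitored path inside the job and keep objects modified since the last processed timestamp. The bucket path and stored timestamp are placeholders, and FileInfo.modificationTime requires a reasonably recent DBR.

```python
# Placeholder: persist/load this watermark from your own state (table, widget, etc.).
last_processed_ms = 1714000000000

new_files = [
    f.path
    for f in dbutils.fs.ls("s3://my-bucket/landing/")  # placeholder path
    if f.modificationTime > last_processed_ms           # keep only newly arrived objects
]
print(new_files)
```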
BerkerKozan
by New Contributor III
  • 28 Views
  • 0 replies
  • 0 kudos

Using AAD Spn on AWS Databricks

I use AWS Databricks, which has an SSO & SCIM integration with AAD. I generated an SPN in AAD, synced it to Databricks, and want to use this SPN with AAD client secrets to use the Databricks SDK. But it doesn't work. I don't want to generate another tok...

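For the SPN question above, a minimal sketch under assumptions: it uses a Databricks-managed OAuth secret for the synced service principal (AWS workspaces generally do not accept AAD client secrets directly), and the host and credentials below are placeholders.

```python
from databricks.sdk import WorkspaceClient

# OAuth M2M auth with a Databricks-managed secret for the service principal.
w = WorkspaceClient(
    host="https://<your-workspace>.cloud.databricks.com",       # placeholder
    client_id="<service-principal-application-id>",             # placeholder
    client_secret="<databricks-oauth-secret>",                   # placeholder
)
print([c.cluster_name for c in w.clusters.list()])  # simple smoke test
```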
Anske
by New Contributor II
  • 26 Views
  • 0 replies
  • 0 kudos

DLT apply_changes applies only deletes and inserts not updates

Hi, I have a DLT pipeline that applies changes from a source table (cdctest_cdc_enriched) to a target table (cdctest), with the following code:
dlt.apply_changes(
    target = "cdctest",
    source = "cdctest_cdc_enriched",
    keys = ["ID"],
    sequence_by...

Data Engineering
Delta Live Tables
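For the apply_changes question above, a minimal sketch with assumed column names ('operation', 'tran_begin_time'), since the original snippet is truncated. A null or non-increasing sequence_by value for a key is a common reason updates appear to be skipped while inserts and deletes still come through.

```python
import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("cdctest")

dlt.apply_changes(
    target="cdctest",
    source="cdctest_cdc_enriched",
    keys=["ID"],
    sequence_by=col("tran_begin_time"),              # assumed ordering column
    apply_as_deletes=expr("operation = 'DELETE'"),   # assumed operation column
    except_column_list=["operation", "tran_begin_time"],
    stored_as_scd_type=1,
)
```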
zahra_Khedri
by Visitor
  • 58 Views
  • 1 reply
  • 0 kudos

An error occurred when loading Jobs and Workflows App.

Hi, I was trying to open Workflows, but there is an error: "An error occurred when loading Jobs and Workflows App." We need help understanding why it happened and how we can resolve it, please.

Latest Reply
GeoPer
Visitor
  • 0 kudos

Same... and the weirdest part is that all of the services look healthy on https://status.databricks.com/
Region: eu-central-1
Provider: AWS
Could anyone provide some info here?

stepysamud
by Visitor
  • 66 Views
  • 0 replies
  • 0 kudos

Workflow UI broken after creating job via the api

Hi all, I'm in the process of migrating from Databricks on Azure to Databricks on AWS. One part of this is migrating all our workflows, which I wanted to do via the /api/2.1/jobs/create API with the workflow passed in the JSON body. I have successfully created...

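For the migration question above, a minimal sketch of the jobs/create call described, with a placeholder host, token, cluster id, and a deliberately small job spec; a field combination the Workflows UI does not expect is one possible reason a job created this way renders a broken page, so starting from a minimal spec helps isolate the offending field.

```python
import json
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = "<personal-access-token>"                        # placeholder

job_spec = {
    "name": "migrated-job",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Workspace/Users/me/notebook"},  # placeholder
            "existing_cluster_id": "<cluster-id>",                               # placeholder
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    data=json.dumps(job_spec),
)
print(resp.status_code, resp.json())
```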
madrhr
by New Contributor
  • 70 Views
  • 2 replies
  • 1 kudos

SparkContext lost when running %sh script.py

I need to execute a .py file in Databricks from a notebook (with arguments, which for simplicity I exclude here). For this I am using:
%sh script.py
script.py:
from pyspark import SparkContext

def main():
    sc = SparkContext.getOrCreate()
    print(sc...

Data Engineering
%sh
.py
bash shell
SparkContext
SparkShell
Latest Reply
Yeshwanth
Contributor III
  • 1 kudos

@madrhr I think this occurs because one session is initiated within the Python script (.py file), while in the Databricks notebook, we have a pre-configured Spark session. It is important to note that we cannot use more than one Spark session per not...

1 More Reply
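For the %sh question above, a minimal in-process alternative sketch: %sh starts a separate OS process with no Spark gateway, so the script cannot reach the notebook's Spark session. Importing the script keeps it in the same Python process, where SparkContext.getOrCreate() returns the notebook's context. The path and module name below are placeholders.

```python
import sys

# Placeholder location of script.py in Repos/Workspace files.
sys.path.append("/Workspace/Repos/team/project")
import script

# Runs in the same driver process, so getOrCreate() inside main() finds the
# notebook's SparkContext instead of trying to start a new one.
script.main()
```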
deng_dev
by New Contributor III
  • 132 Views
  • 1 reply
  • 0 kudos

Cached Views in MERGE INTO operation

Hi everyone! I want to use in-memory cached views in a MERGE INTO operation, but I am not entirely sure whether the cached in-memory view is actually used in this operation or not. So, suppose I have a table named table_1 and a cached view named cached_view_1...

Latest Reply
shan_chandra
Honored Contributor III
  • 0 kudos

@deng_dev - Are you using an external metastore by any chance? From the physical plan, we can see that `catalog`.`db`.`table_1` is not cached. If it is a Glue catalog, then caching can be enabled with the configs in the article below: https://do...

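For the cached-view question above, a minimal sketch of the scenario with placeholder table and column names: cache a temp view and reference it as the MERGE source. Whether the in-memory cache is actually reused depends on the physical plan (see the reply above), so verify with explain() or the Spark UI rather than assuming it.

```python
src = spark.read.table("catalog.db.staging_table")   # placeholder source table
src.createOrReplaceTempView("cached_view_1")
spark.sql("CACHE TABLE cached_view_1")                # materialize the view in memory

spark.sql("""
    MERGE INTO catalog.db.table_1 AS t
    USING cached_view_1 AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```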
Anonymous
by Not applicable
  • 4841 Views
  • 15 replies
  • 8 kudos

Resolved! What are some best practices for CICD?

A number of people have questions on using Databricks in a productionized environment. What are the best practices to enable CI/CD automation?

Latest Reply
BaivabMohanty
New Contributor II
  • 8 kudos

Any leads/posts for Databricks CI/CD integration with a Bitbucket pipeline? I am facing the below error while creating my CI/CD pipeline:
pipelines:
  branches:
    master:
      - step:
          name: Deploy Databricks Changes
          image: docker:19.03.12
          services:
            - docker
          script:
            # U...

14 More Replies
Sambit_S
by New Contributor II
  • 93 Views
  • 0 replies
  • 0 kudos

Error during deserializing protobuf data

I am receiving protobuf data in a JSON attribute, and along with it I receive a descriptor file. I am using from_protobuf to deserialize the data as below. It works most of the time, but it gives an error when there are some recursive fields within the protob...

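For the protobuf question above, a minimal sketch with placeholder names (source table, binary 'payload' column, message name, descriptor path): by default Spark's protobuf deserializer rejects recursive message definitions, and the recursive.fields.max.depth option lets it unroll them to a bounded depth instead of failing.

```python
from pyspark.sql.functions import col
from pyspark.sql.protobuf.functions import from_protobuf

# Placeholder input: a table holding the raw protobuf bytes in a 'payload' column.
raw_df = spark.read.table("catalog.db.raw_events")

# Allow recursive fields to be unrolled to a fixed depth (e.g. 3).
options = {"recursive.fields.max.depth": "3"}

decoded = raw_df.select(
    from_protobuf(
        col("payload"),
        "MyMessage",                              # placeholder message name
        descFilePath="/dbfs/schemas/events.desc", # placeholder descriptor file
        options=options,
    ).alias("event")
)
```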
drag7ter
by New Contributor II
  • 521 Views
  • 2 replies
  • 0 kudos

Resolved! Not able to set run_as service_principal_name

I'm trying to run: databricks bundle deploy -t prod --profile PROD_Service_Principal
My bundle looks like:
bundle:
  name: myproject
include:
  - resources/jobs/bundles/*.yml
targets:
  # The 'dev' target, for development purposes. This target is the de...

Latest Reply
drag7ter
New Contributor II
  • 0 kudos

In my case I replaced the alias PROD_Service_Principal with the id c250831b-5a2a-4461-a855-83b9102f797e and it works. Not intuitive; this is probably a bug in the CLI or in bundles. service_principal_name: c250831b-5a2a-4461-a855-83b9102f797e

1 More Reply