I'm receiving this error from Auto Loader. It seems to be stuck on this one file. I don't care when it was read or last modified, I just want to ingest it. Any ideas?

java.io.IOException: Read old version of file s3a://<file-path>.json. Read modificat...
Hi @stevenayers-bge, The error message indicates that the file you’re trying to read is an old version, and there’s a discrepancy between the read modification time and the latest modification time.
Let’s explore some potential solutions based on ...
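Since the answer above is cut off, here is one commonly suggested workaround as a hedged sketch: Auto Loader's documented cloudFiles.allowOverwrites option lets the stream pick up files whose modification time changed after they were discovered, which is the check this IOException comes from. The paths and schema location below are placeholders:

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Allow re-reading files whose modification time changed after listing;
    # note this can cause the same file to be processed more than once.
    .option("cloudFiles.allowOverwrites", "true")
    .option("cloudFiles.schemaLocation", "s3a://<bucket>/_schemas/")  # placeholder
    .load("s3a://<path-prefix>/")  # placeholder
)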
Hi, I have configured 20 different workflows in Databricks, each with a job cluster with a different name. All 20 workflows are scheduled to run at the same time, but even with a different job cluster configured in each of them, they run sequentially w...
Hi @jainshasha,
Running multiple workflows in parallel, each with its own job cluster, can be achieved in Databricks with the right configuration.
Let’s explore some options:
Shared Job Clusters:
To optimize resource usage with jobs that orch...
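As an illustration of the shared job cluster idea (a sketch, not taken from the truncated answer above), a job cluster is declared once under job_clusters in a Jobs API 2.1 job definition and referenced by tasks via job_cluster_key; separate jobs each get their own cluster and can run in parallel as long as the workspace and cloud account have capacity. All names below are hypothetical:

# Hypothetical Jobs API 2.1 payload fragment, expressed as a Python dict.
job_definition = {
    "name": "workflow_01",
    "max_concurrent_runs": 1,
    "job_clusters": [
        {
            "job_cluster_key": "workflow_01_cluster",  # unique within this job
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "main",
            "job_cluster_key": "workflow_01_cluster",  # reuses the cluster above
            "notebook_task": {"notebook_path": "/Workspace/path/to/notebook"},
        }
    ],
}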
Hey guys, I'm trying to find out what options we can pass to spark_conf.spark.databricks.cluster.profile. I know from looking around that some of the available configs are singleNode and serverless, but are there others? Where is the documentation for it?...
Hi @LeoGaller , The spark_conf.spark.databricks.cluster.profile configuration in Databricks allows you to specify the profile for a cluster.
Let’s explore the available options and where you can find the documentation.
Available Profiles:
Sing...
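For illustration, a hedged sketch of how the documented singleNode profile is set on a cluster; the profile keys and the ResourceClass tag follow the Databricks single-node cluster docs, everything else is a placeholder:

# Cluster spec fragment for a single-node cluster.
single_node_cluster = {
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",  # driver-only, no separate executors
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    "num_workers": 0,
}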
Hi, it seems that when databricks-connect is installed, pyspark is modified at the same time so that it no longer works with a local master node. Running pyspark locally has been especially useful in testing, for unit tests of spark-related code without any remot...
I am having an issue where, when I do a shallow clone using:

create or replace table `catalog_a_test`.`schema_a`.`table_a` shallow clone `catalog_a`.`schema_a`.`table_a`

I get: [TABLE_OR_VIEW_NOT_FOUND] The table or view catalog_a_test.schema_a.table_a...
Hi Steven, this is really a strange issue. First, let's exclude some possible causes. We need to check the following:
- The permissions on table A and catalog B; take a look at the following link to check which permissions are needed: https://docs.d...
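As a quick way to run the permission check mentioned above from a notebook (a sketch; the object names come from the question, and SHOW GRANTS is standard Unity Catalog SQL):

# Inspect grants on the source table and the target catalog/schema.
display(spark.sql("SHOW GRANTS ON TABLE `catalog_a`.`schema_a`.`table_a`"))
display(spark.sql("SHOW GRANTS ON CATALOG `catalog_a_test`"))
display(spark.sql("SHOW GRANTS ON SCHEMA `catalog_a_test`.`schema_a`"))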
Hey everyone, I've built a very simple pipeline with a single DLT using auto ingest, and it works, provided I don't specify the output location. When I build the same pipeline but set UC as the output location, it fails when setting up S3 notification...
Hey @Babu_Krishnan, I was! I had to reach out to my Databricks support engineer directly, and the resolution was to add "cloudFiles.awsAccessKey" and "cloudFiles.awsSecretKey" to the params, as in the screenshot below (apologies, I don't know why the sc...
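For anyone who can't see the screenshot, a minimal sketch of what that resolution might look like in a DLT Auto Loader definition; the path is a placeholder, and in practice the key values should come from a secret scope rather than being hard-coded:

import dlt

@dlt.table
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Credentials Auto Loader uses to set up the S3 notification resources.
        .option("cloudFiles.awsAccessKey", "<access-key>")  # placeholder; use a secret scope
        .option("cloudFiles.awsSecretKey", "<secret-key>")  # placeholder; use a secret scope
        .option("cloudFiles.useNotifications", "true")
        .load("s3://<bucket>/<prefix>/")  # placeholder path
    )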
I am trying to unpivot a PySpark DataFrame, but I don't get the correct results. Sample dataset:

# Prepare Data
data = [("Spain", 101, 201, 301), \
("Taiwan", 102, 202, 302), \
("Italy", 103, 203, 303), \
("China", 104, 204, 304...
You can also use backticks around the column names that would otherwise be recognised as numbers.

from pyspark.sql import functions as F
unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("C...
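Since the snippet above is cut off, here is a complete, runnable version of the same approach; the "Country" column name and the year headers are assumptions based on the sample data in the question:

from pyspark.sql import functions as F

# Assumed column names based on the sample rows in the question.
columns = ["Country", "2018", "2019", "2020"]
data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304)]
df = spark.createDataFrame(data, columns)

# Backticks stop the numeric column names from being parsed as literals.
unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("Country", F.expr(unpivotExpr))
display(unPivotDF)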
Maybe it's because I am new to Databricks that I have this confusion. Suppose I have worker memory of 64 GB in a Databricks job with max 12 nodes, and my job is failing due to Executor Lost with exit code 137 (OOM, as I found on the internet). So, to fix this, do I need to increase execut...
Hello @amitkmaurya ,
Increasing compute resources may not always be the best strategy. To gain more insights into each executor's memory usage, check the cluster metrics tab and Spark UI for your cluster. If one executor has a much higher memory usag...
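If the metrics do point to one executor doing most of the work (data skew is a common cause of exit code 137), here is a hedged sketch of two standard remedies; the key column is hypothetical:

# Spread rows more evenly across partitions before the wide operation.
df = df.repartition(400, "customer_id")  # "customer_id" is a hypothetical skewed key

# Or let Adaptive Query Execution rebalance skewed joins (on by default
# in recent Databricks runtimes):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")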
Hi Team, we recently created a new Databricks project/solution (based on the Medallion architecture) with Bronze-Silver-Gold layer tables. We have created a Delta Live Tables pipeline for the Bronze layer implementation. Source files are Parqu...
Hello @Devsql ,
It appears that you are creating DLT bronze tables using a standard spark.read operation. This may explain why the DLT table doesn't include "new files" during a REFRESH operation.
For incremental ingestion of bronze layer data into y...
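For reference, a minimal sketch of the incremental pattern described above, using Auto Loader inside DLT for Parquet sources; the table name and path are placeholders:

import dlt

@dlt.table(name="bronze_events")  # placeholder table name
def bronze_events():
    # Auto Loader tracks which files it has already ingested, so a pipeline
    # refresh picks up only new files instead of re-reading everything.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("s3://<bucket>/landing/events/")  # placeholder source path
    )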
I deleted some records from a streaming table by mistake, and of course, the streaming job stopped working. So I restored the table to the version before the delete was done, and attempted to restart the job using startingVersion set to the new vers...
Hello @6502,
It appears you've used the `startingVersion` parameter in your streaming query, which causes the stream to begin processing data from the version prior to the DELETE operation version. However, the DELETE operation will still be processe...
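A hedged sketch of the usual way around this, per the Delta streaming options: restart the stream from the restored version and skip the change commits in between. The version number and path are placeholders:

# Resume a Delta stream past a DELETE/RESTORE without reprocessing it.
df = (
    spark.readStream.format("delta")
    .option("startingVersion", "42")      # placeholder: the restored version
    .option("skipChangeCommits", "true")  # ignore commits that update or delete rows
    .load("/path/to/streaming_table")     # placeholder path
)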
We have UDFs in a few locations and today we noticed their performance collapsed. This seems to be caused by Unity Catalog.
Test environment 1:
Databricks Runtime Environment: 14.3 / 15.1
Compute: 1 master, 4 nodes
Policy: Unrestricted
Access Mode: Shared
Tes...
Hi, I have one table that changes its name every 60 days. The name simply increments the version number, for example:
* First 60 days: table_name_v1. After 60 days: table_name_v2, and so on.
What I want is to query the table whose name is returned by the que...
The simplest way would probably be using spark.sql:

%py
tbl_name = 'table_v1'
df = spark.sql(f'select * from {tbl_name}')
display(df)

From there, you can simply create a temporary view:

%py
df.createOrReplaceTempView('table_act')

and query it using SQL st...
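To avoid hard-coding the current version, one hedged sketch is to look the latest name up from SHOW TABLES first; the naming pattern comes from the question, and the schema name is a placeholder:

# Find the highest table_name_vN in the schema, then query it.
tables = spark.sql("SHOW TABLES IN my_schema").select("tableName").collect()  # placeholder schema
versions = [t.tableName for t in tables if t.tableName.startswith("table_name_v")]
latest = max(versions, key=lambda name: int(name.rsplit("_v", 1)[1]))
df = spark.sql(f"select * from my_schema.{latest}")
display(df)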
I have a cluster pool with max capacity, and I run multiple jobs against that cluster pool. Can on-demand clusters, created within this cluster pool, be shared across multiple different jobs at the same time? The reason I'm asking is I can see a downgrade...
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import DataFrame, Column
from pyspark.sql.types import Row
import dlt
S3_PATH = 's3://datalake-lab/XXXXX/'  # root path of the source data (redacted)
S3_SCHEMA = 's3://datalake-lab/XXXXX/schemas/'  # likely the Auto Loader schema location (redacted)
...
I am going to use the newly released DLT with UC, but it keeps getting access denied. As I keep tracking down the reasons, it seems that an account ID other than my account ID or the Databricks account ID is being requested. I cannot use '*' in the principal attri...
Every service on AWS is configured with minimal permissions by default, so an SQS queue and all the other services in your stack using that queue can run into access issues. So, make sure you get your IAM policies set up correctly before deploying to producti...