Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

MadelynM
by Databricks Employee
  • 8397 Views
  • 2 replies
  • 0 kudos

Delta Live Tables + S3 | 5 tips for cloud storage with DLT

You've gotten familiar with Delta Live Tables (DLT) via the quickstart and getting started guide. Now it's time to tackle creating a DLT data pipeline for your cloud storage, with one line of code. Here's how it'll look when you're starting: CREATE OR ...

Latest Reply
waynelxb
New Contributor II

Hi MadelynM, how should we handle Source File Archival and Data Retention with DLT? Source File Archival: once the data from a source file is loaded with DLT Auto Loader, we want to move the source file from the source folder to an archival folder. How can we ...

1 More Replies
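As a companion to this thread, here is a minimal sketch of the kind of one-statement Auto Loader ingestion the post describes, written as a Python DLT table; the bucket path and file format are placeholder assumptions, not taken from the original post. Note that Auto Loader checkpoints which files it has already processed, but by default it does not move source files, which is what the archival question in the reply is getting at.

```python
import dlt

SOURCE_PATH = "s3://my-bucket/raw/events/"  # hypothetical S3 prefix

@dlt.table(comment="Incrementally ingests raw files from S3 via Auto Loader")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader source
        .option("cloudFiles.format", "json")    # format of the landed files
        .load(SOURCE_PATH)
    )
```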
CarterM
by New Contributor III
  • 5965 Views
  • 4 replies
  • 2 kudos

Resolved! Why Spark Streaming from S3 is returning thousands of files when there are only 9?

I am attempting to stream JSON endpoint responses from an s3 bucket into a spark DLT. I have been very successful in this practice previously, but the difference this time is that I am storing the responses from multiple endpoints in the same s3 buck...

Latest Reply
williamyoung
New Contributor II

Hello everyone, it seems like the issue you're encountering could be related to how Spark Streaming interprets the S3 file structure, especially when dealing with multiple sources. When files from multiple endpoints are stored in the same bucket, Spar...

3 More Replies
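Following the reply above, a hedged sketch of the usual fix: scope the stream to one endpoint's prefix (or add a glob filter) so Spark doesn't enumerate every object under the bucket root. The path and filter are illustrative assumptions:

```python
# Read only the objects for a single endpoint rather than the whole bucket.
soccer_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("pathGlobFilter", "*.json")        # ignore non-JSON objects
    .load("s3://my-bucket/endpoints/soccer/")  # hypothetical per-endpoint prefix
)
```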
JasonN
by New Contributor II
  • 2073 Views
  • 2 replies
  • 2 kudos

Resolved! DLT Cluster accessing S3 bucket without Instance Profile attached

Hi Team, can anyone please help me figure out how to configure a Delta Live Tables cluster to access an AWS S3 bucket without an instance profile defined in the cluster's JSON? The idea is, the user who is running the DLT cluster has Storage Credentials and Extern...

Latest Reply
Vivian_Wilfred
Databricks Employee

Hi @Jason Nam​, DLT and Unity Catalog are not integrated yet. The cluster-notebook setup uses UC and can access S3, but the DLT jobs cannot. Please check the limitations in this document (7th point): https://docs.databricks.com/release-notes/unity-catalo...

1 More Replies
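For contrast with the limitation the reply points out, a small sketch of the notebook-side path that does work through Unity Catalog, assuming a storage credential and external location are already defined (the bucket name is hypothetical):

```python
# On a UC-enabled notebook cluster, access flows through the external
# location's storage credential; no instance profile on the cluster is needed.
display(dbutils.fs.ls("s3://my-uc-external-bucket/landing/"))
df = spark.read.json("s3://my-uc-external-bucket/landing/")

# At the time of the reply, the same read inside a DLT pipeline still required
# an instance profile, because DLT was not yet integrated with Unity Catalog.
```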
77796
by New Contributor II
  • 4220 Views
  • 4 replies
  • 0 kudos

Databricks S3A error - java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not found

We are getting the below error for runtime 10.x and 11.x when writing to S3 via the saveAsNewAPIHadoopFile function. The same jobs run fine on runtime 9.x and 7.x. The difference between 9.x and 10.x is that the former has Hadoop 2.7 bindings with sp...

Latest Reply
77796
New Contributor II

We have resolved this issue by using the s3 scheme instead of s3a, i.e. pairRDD.saveAsNewAPIHadoopFile("s3://bucket/testout.dat",

3 More Replies
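For completeness, a hedged sketch of what the full call might look like; the RDD contents, output format, and key/value classes are illustrative assumptions, since the thread only shows the truncated path argument:

```python
# A toy pair RDD written with the s3:// scheme, which sidesteps the missing
# org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory class on DBR 10.x/11.x.
pairRDD = sc.parallelize([("key1", "value1"), ("key2", "value2")])

pairRDD.saveAsNewAPIHadoopFile(
    "s3://bucket/testout.dat",
    "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",  # assumed format
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.Text",
)
```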
tej1
by New Contributor III
  • 3727 Views
  • 5 replies
  • 7 kudos

Resolved! Trouble accessing `_metadata` column using cloudFiles in Delta Live Tables

We are building a delta live pipeline where we ingest csv files in AWS S3 using cloudFiles. And it is necessary to access the file modification timestamp of the file. As documented here, we tried selecting `_metadata` column in a task in delta live p...

Latest Reply
tej1
New Contributor III

Update: we were able to test the `_metadata` column feature in DLT "preview" mode (which is DBR 11.0). Databricks doesn't recommend running production workloads in "preview" mode, but nevertheless, we're glad to be using this feature in DLT.

4 More Replies
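A minimal sketch of the feature the thread ends on: selecting the hidden `_metadata` column from an Auto Loader stream to capture file modification timestamps. The path and CSV options are placeholder assumptions:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="CSV ingest that keeps each row's source-file timestamp")
def ingested_with_file_times():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("s3://my-bucket/csv-landing/")  # hypothetical path
        # _metadata is hidden; it only materializes when selected explicitly.
        .select("*", col("_metadata.file_modification_time").alias("file_mtime"))
    )
```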
Megan05
by New Contributor III
  • 3125 Views
  • 4 replies
  • 1 kudos

Trying to write to S3 bucket but executed code not showing any progress

I am trying to write data from databricks to an S3 bucket but when I submit the code, it runs and runs and does not make any progress. I am not getting any errors and the logs don't seem to recognize I've submitted anything. The cluster also looks un...

Latest Reply
User16753725469
Contributor II

Can you please check the driver log4j logs to see what is happening?

3 More Replies
vivek_sinha
by Contributor
  • 21505 Views
  • 3 replies
  • 4 kudos

Resolved! PySpark on Jupyterhub K8s || Unable to query data || Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

PySpark version: 2.4.5, Hive version: 1.2, Hadoop version: 2.7, AWS-SDK jar: 1.7.4, Hadoop-AWS: 2.7.3. When I try to show data I get Class org.apache.hadoop.fs.s3a.S3AFileSystem not found, while I am passing all the information which all are re...

Latest Reply
vivek_sinha
Contributor

Hi @Arvind Ravish​, thanks for the response; I have now fixed the issue. The image I was using to launch the Spark executors didn't have the AWS jars. After making the necessary changes it started working. Many thanks for your response.

2 More Replies
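For anyone hitting the same ClassNotFoundException outside Databricks, a hedged sketch of pulling the matching S3A bindings in at session start; the versions mirror those listed in the post, and baking the jars into the executor image (as the poster did) is the more robust fix on Kubernetes:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-smoke-test")
    # hadoop-aws must match the cluster's Hadoop build; aws-java-sdk 1.7.4
    # is the SDK that pairs with hadoop-aws 2.7.x.
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4",
    )
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

spark.read.text("s3a://my-bucket/some-file.txt").show()  # hypothetical path
```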
Vee
by New Contributor
  • 3322 Views
  • 1 replies
  • 0 kudos

Tips for resolving the following errors related to AWS S3 reads / writes

Job aborted due to stage failure: Task 0 in stage 3084.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3084.0 (TID...., ip..., executor 0): org.apache.spark.SparkException: Task failed while writing rows. Job aborted due to stage failure:...

Constantine
by Contributor III
  • 3255 Views
  • 1 replies
  • 5 kudos

Resolved! Unable to create a partitioned table on s3 data

I write data to S3 like data.write.format("delta").mode("append").option("mergeSchema", "true").save(s3_location) and create a partitioned table like CREATE TABLE IF NOT EXISTS demo_table USING DELTA PARTITIONED BY (column_a) LOCATION {s3_location}; whi...

Latest Reply
Hubert-Dudek
Esteemed Contributor III

@John Constantine​, in CREATE TABLE you need to specify the fields: CREATE TABLE IF NOT EXISTS demo_table (column_a STRING, number INT) USING DELTA PARTITIONED BY (column_a) LOCATION {s3_location}; and when you save data before creating ...

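Putting the reply together as a runnable sketch; `data` stands for the poster's DataFrame, the location is a placeholder, and the two-column schema is the reply's own illustration:

```python
s3_location = "s3://my-bucket/tables/demo_table"  # hypothetical location

# Write the Delta files first, partitioned the same way the table will declare.
(
    data.write.format("delta")
    .mode("append")
    .partitionBy("column_a")            # match PARTITIONED BY below
    .option("mergeSchema", "true")
    .save(s3_location)
)

# Register the table with an explicit schema, as the reply recommends.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS demo_table (column_a STRING, number INT)
    USING DELTA
    PARTITIONED BY (column_a)
    LOCATION '{s3_location}'
""")
```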
bonjih
by New Contributor
  • 7091 Views
  • 3 replies
  • 3 kudos

Resolved! AttributeError: module 'dbutils' has no attribute 'fs'

Hi, I'm using Databricks in SageMaker to connect EC2 to S3. Following other examples I get 'AttributeError: module 'dbutils' has no attribute 'fs''... I guess I'm missing an import?

Latest Reply
Atanu
Databricks Employee

Agree with @Werner Stinckens​. Also, you may try importing dbutils - @ben Hamilton​

2 More Replies
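Outside a Databricks notebook, dbutils isn't a global. A hedged sketch of the usual fix when running through Databricks Connect (the bucket is a placeholder):

```python
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # ships with Databricks Connect

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)  # dbutils.fs is now available

print(dbutils.fs.ls("s3://my-bucket/"))  # hypothetical bucket
```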
hari
by Contributor
  • 2375 Views
  • 3 replies
  • 3 kudos

Resolved! Multi-cluster write for delta tables with s3 as the datastore

Does Delta currently support multi-cluster writes to a Delta table in S3? I see in the Databricks documentation that Databricks doesn't support writing to the same table from multiple Spark drivers, and thus multiple clusters. But S3Guard was also added...

2 More Replies
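On the open-source side of this question, Delta Lake 1.2+ added a DynamoDB-backed LogStore for multi-writer S3 safety. A hedged sketch of the configuration, with the version numbers and DynamoDB table name as placeholder assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # S3DynamoDBLogStore coordinates commits from multiple writers through a
    # DynamoDB table, since S3 alone lacks the mutual exclusion Delta needs.
    .config(
        "spark.jars.packages",
        "io.delta:delta-core_2.12:2.1.0,io.delta:delta-storage-s3-dynamodb:2.1.0",
    )
    .config("spark.delta.logStore.s3.impl", "io.delta.storage.S3DynamoDBLogStore")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
    .getOrCreate()
)
```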
lsoewito
by New Contributor
  • 6012 Views
  • 1 replies
  • 1 kudos

How to configure Databricks Connect to 'Assume Role' when accessing files from an AWS S3 bucket?

I have a Databricks cluster configured with an instance profile to assume a role when accessing an AWS S3 bucket. Accessing the bucket from the notebook using the cluster works properly (the instance profile can assume the role to access the bucket). However...

Latest Reply
Anonymous
Not applicable

Hello, @lsoewito​ - My name is Piper, and I'm a moderator for the Databricks community. Welcome and thank you for coming to us with your question. I'm sorry to hear that you're having trouble. Let's give your peers a chance to answer your question. W...

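Since the thread closes without a technical answer, here is a hedged sketch of one common approach using the Hadoop S3A assumed-role credential provider (Hadoop 3.1+); the role ARN and bucket are placeholders, and this is not an official Databricks Connect recipe:

```python
# Configure S3A to assume a role for this session; property names per the
# Hadoop S3A documentation.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set(
    "fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
)
hadoop_conf.set(
    "fs.s3a.assumed.role.arn",
    "arn:aws:iam::123456789012:role/my-s3-access-role",  # placeholder ARN
)

df = spark.read.json("s3a://my-bucket/path/")  # hypothetical bucket
```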
Constantine
by Contributor III
  • 1938 Views
  • 1 replies
  • 2 kudos

Do we have Delta table access logs?

I have Delta tables on Databricks with AWS S3. Are there any logs or anything else to figure out who is accessing a particular database or table?

Latest Reply
-werners-
Esteemed Contributor III

The thing that comes closest is audit logs. Here is a list of log triggers.

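As a hedged illustration of the audit-log pointer above, assuming audit log delivery to S3 is enabled for the workspace; the delivery path is a placeholder and the selected fields follow the documented audit log schema:

```python
# Read delivered audit logs and inspect who performed which actions.
audit = spark.read.json("s3://my-audit-bucket/workspace-audit-logs/")  # placeholder

(
    audit.select(
        "timestamp",
        "userIdentity.email",  # who
        "serviceName",         # which service emitted the event
        "actionName",          # what they did
        "requestParams",       # includes object names for many actions
    )
    .show(truncate=False)
)
```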
MohitAnchlia
by New Contributor II
  • 1078 Views
  • 0 replies
  • 1 kudos

Change AWS storage settings and account

I am seeing a super weird behaviour in Databricks. We initially configured the following:
1. Account X in Account Console -> AWS Account arn:aws:iam::X:role/databricks-s3
2. We set up databricks-s3 as the S3 bucket in Account Console -> AWS Storage
3. W...
