Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

MadelynM
by Databricks Employee
  • 8397 Views
  • 2 replies
  • 0 kudos

Delta Live Tables + S3 | 5 tips for cloud storage with DLT

You've gotten familiar with Delta Live Tables (DLT) via the quickstart and getting started guide. Now it's time to tackle creating a DLT data pipeline for your cloud storage, with one line of code. Here's how it'll look when you're starting: CREATE OR ...

Latest Reply
waynelxb
New Contributor II

Hi MadelynM, how should we handle Source File Archival and Data Retention with DLT? Source File Archival: once the data from a source file is loaded with DLT Auto Loader, we want to move the source file from the source folder to an archival folder. How can we ...

1 More Replies
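As a companion to this thread, here is a minimal sketch of the kind of one-statement Auto Loader ingestion the post describes, written as a Python DLT table; the bucket path and file format are placeholder assumptions, not taken from the original post. Note that Auto Loader checkpoints which files it has already processed, but by default it does not move source files, which is what the archival question in the reply is getting at.

```python
import dlt

SOURCE_PATH = "s3://my-bucket/raw/events/"  # hypothetical S3 prefix

@dlt.table(comment="Incrementally ingests raw files from S3 via Auto Loader")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader source
        .option("cloudFiles.format", "json")    # format of the landed files
        .load(SOURCE_PATH)
    )
```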
CarterM
by New Contributor III
  • 5965 Views
  • 4 replies
  • 2 kudos

Resolved! Why Spark Streaming from S3 is returning thousands of files when there are only 9?

I am attempting to stream JSON endpoint responses from an s3 bucket into a spark DLT. I have been very successful in this practice previously, but the difference this time is that I am storing the responses from multiple endpoints in the same s3 buck...

Latest Reply
williamyoung
New Contributor II

Hello everyone, it seems like the issue you're encountering could be related to how Spark Streaming interprets the S3 file structure, especially when dealing with multiple sources. When files from multiple endpoints are stored in the same bucket, Spar...

3 More Replies
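Following the reply above, a hedged sketch of the usual fix: scope the stream to one endpoint's prefix (or add a glob filter) so Spark doesn't enumerate every object under the bucket root. The path and filter are illustrative assumptions:

```python
# Read only the objects for a single endpoint rather than the whole bucket.
soccer_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("pathGlobFilter", "*.json")        # ignore non-JSON objects
    .load("s3://my-bucket/endpoints/soccer/")  # hypothetical per-endpoint prefix
)
```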
JasonN
by New Contributor II
  • 2073 Views
  • 2 replies
  • 2 kudos

Resolved! DLT Cluster accessing S3 bucket without Instance Profile attached

Hi Team, can anyone please help me figure out how to configure a Delta Live Tables cluster to access an AWS S3 bucket without an instance profile defined in the cluster's JSON? The idea is, the user who is running the DLT cluster has Storage Credentials and Extern...

Latest Reply
Vivian_Wilfred
Databricks Employee

Hi @Jason Nam​, DLT and Unity Catalog are not integrated yet. The cluster-notebook setup uses UC and can access S3, but the DLT jobs cannot. Please check the limitations in this document (7th point): https://docs.databricks.com/release-notes/unity-catalo...

1 More Replies
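For contrast with the limitation the reply points out, a small sketch of the notebook-side path that does work through Unity Catalog, assuming a storage credential and external location are already defined (the bucket name is hypothetical):

```python
# On a UC-enabled notebook cluster, access flows through the external
# location's storage credential; no instance profile on the cluster is needed.
display(dbutils.fs.ls("s3://my-uc-external-bucket/landing/"))
df = spark.read.json("s3://my-uc-external-bucket/landing/")

# At the time of the reply, the same read inside a DLT pipeline still required
# an instance profile, because DLT was not yet integrated with Unity Catalog.
```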
77796
by New Contributor II
  • 4220 Views
  • 4 replies
  • 0 kudos

Databricks S3A error - java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not found

We are getting the below error for runtime 10.x and 11.x when writing to S3 via the saveAsNewAPIHadoopFile function. The same jobs run fine on runtime 9.x and 7.x. The difference between 9.x and 10.x is that the former has Hadoop 2.7 bindings with sp...

Latest Reply
77796
New Contributor II

We have resolved this issue by using the s3 scheme instead of s3a, i.e. pairRDD.saveAsNewAPIHadoopFile("s3://bucket/testout.dat",

3 More Replies
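For completeness, a hedged sketch of what the full call might look like; the RDD contents, output format, and key/value classes are illustrative assumptions, since the thread only shows the truncated path argument:

```python
# A toy pair RDD written with the s3:// scheme, which sidesteps the missing
# org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory class on DBR 10.x/11.x.
pairRDD = sc.parallelize([("key1", "value1"), ("key2", "value2")])

pairRDD.saveAsNewAPIHadoopFile(
    "s3://bucket/testout.dat",
    "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",  # assumed format
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.Text",
)
```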
tej1
by New Contributor III
  • 3727 Views
  • 5 replies
  • 7 kudos

Resolved! Trouble accessing `_metadata` column using cloudFiles in Delta Live Tables

We are building a delta live pipeline where we ingest csv files in AWS S3 using cloudFiles. And it is necessary to access the file modification timestamp of the file. As documented here, we tried selecting `_metadata` column in a task in delta live p...

Latest Reply
tej1
New Contributor III

Update: we were able to test the `_metadata` column feature in DLT "preview" mode (which is DBR 11.0). Databricks doesn't recommend running production workloads in "preview" mode, but nevertheless, we're glad to be using this feature in DLT.

4 More Replies
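A minimal sketch of the feature the thread ends on: selecting the hidden `_metadata` column from an Auto Loader stream to capture file modification timestamps. The path and CSV options are placeholder assumptions:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="CSV ingest that keeps each row's source-file timestamp")
def ingested_with_file_times():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("s3://my-bucket/csv-landing/")  # hypothetical path
        # _metadata is hidden; it only materializes when selected explicitly.
        .select("*", col("_metadata.file_modification_time").alias("file_mtime"))
    )
```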
Megan05
by New Contributor III
  • 3125 Views
  • 4 replies
  • 1 kudos

Trying to write to S3 bucket but executed code not showing any progress

I am trying to write data from databricks to an S3 bucket but when I submit the code, it runs and runs and does not make any progress. I am not getting any errors and the logs don't seem to recognize I've submitted anything. The cluster also looks un...

Latest Reply
User16753725469
Contributor II

Can you please check the driver log4j logs to see what is happening?

3 More Replies
vivek_sinha
by Contributor
  • 21505 Views
  • 3 replies
  • 4 kudos

Resolved! PySpark on Jupyterhub K8s || Unable to query data || Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

PySpark version: 2.4.5, Hive version: 1.2, Hadoop version: 2.7, AWS-SDK jar: 1.7.4, Hadoop-AWS: 2.7.3. When I try to show data I get Class org.apache.hadoop.fs.s3a.S3AFileSystem not found, while I am passing all the information which all are re...

Latest Reply
vivek_sinha
Contributor

Hi @Arvind Ravish​, thanks for the response; I have now fixed the issue. The image I was using to launch the Spark executors didn't have the AWS jars. After making the necessary changes it started working. Many thanks for your response.

2 More Replies
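For anyone hitting the same ClassNotFoundException outside Databricks, a hedged sketch of pulling the matching S3A bindings in at session start; the versions mirror those listed in the post, and baking the jars into the executor image (as the poster did) is the more robust fix on Kubernetes:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-smoke-test")
    # hadoop-aws must match the cluster's Hadoop build; aws-java-sdk 1.7.4
    # is the SDK that pairs with hadoop-aws 2.7.x.
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4",
    )
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

spark.read.text("s3a://my-bucket/some-file.txt").show()  # hypothetical path
```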
Vee
by New Contributor
  • 3322 Views
  • 1 replies
  • 0 kudos

Tips for resolving the following errors related to AWS S3 reads / writes

Job aborted due to stage failure: Task 0 in stage 3084.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3084.0 (TID...., ip..., executor 0): org.apache.spark.SparkException: Task failed while writing rows. Job aborted due to stage failure:...

Constantine
by Contributor III
  • 3255 Views
  • 1 replies
  • 5 kudos

Resolved! Unable to create a partitioned table on s3 data

I write data to S3 like data.write.format("delta").mode("append").option("mergeSchema", "true").save(s3_location) and create a partitioned table like CREATE TABLE IF NOT EXISTS demo_table USING DELTA PARTITIONED BY (column_a) LOCATION {s3_location}; whi...

Latest Reply
Hubert-Dudek
Esteemed Contributor III

@John Constantine​, in CREATE TABLE you need to specify the fields: CREATE TABLE IF NOT EXISTS demo_table (column_a STRING, number INT) USING DELTA PARTITIONED BY (column_a) LOCATION {s3_location}; and when you save data before creating ...

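Putting the reply together as a runnable sketch; `data` stands for the poster's DataFrame, the location is a placeholder, and the two-column schema is the reply's own illustration:

```python
s3_location = "s3://my-bucket/tables/demo_table"  # hypothetical location

# Write the Delta files first, partitioned the same way the table will declare.
(
    data.write.format("delta")
    .mode("append")
    .partitionBy("column_a")            # match PARTITIONED BY below
    .option("mergeSchema", "true")
    .save(s3_location)
)

# Register the table with an explicit schema, as the reply recommends.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS demo_table (column_a STRING, number INT)
    USING DELTA
    PARTITIONED BY (column_a)
    LOCATION '{s3_location}'
""")
```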
bonjih
by New Contributor
  • 7091 Views
  • 3 replies
  • 3 kudos

Resolved! AttributeError: module 'dbutils' has no attribute 'fs'

Hi, I'm using Databricks in SageMaker to connect EC2 to S3. Following other examples I get 'AttributeError: module 'dbutils' has no attribute 'fs''... I guess I'm missing an import?

Latest Reply
Atanu
Databricks Employee

Agree with @Werner Stinckens​. Also, you may try importing dbutils - @ben Hamilton​

2 More Replies
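Outside a Databricks notebook, dbutils isn't a global. A hedged sketch of the usual fix when running through Databricks Connect (the bucket is a placeholder):

```python
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # ships with Databricks Connect

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)  # dbutils.fs is now available

print(dbutils.fs.ls("s3://my-bucket/"))  # hypothetical bucket
```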
hari
by Contributor
  • 2375 Views
  • 3 replies
  • 3 kudos

Resolved! Multi-cluster write for delta tables with s3 as the datastore

Does Delta currently support multi-cluster writes to a Delta table in S3? I see in the Databricks documentation that Databricks doesn't support writing to the same table from multiple Spark drivers, and thus multiple clusters. But S3Guard was also added...

2 More Replies
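On the open-source side of this question, Delta Lake 1.2+ added a DynamoDB-backed LogStore for multi-writer S3 safety. A hedged sketch of the configuration, with the version numbers and DynamoDB table name as placeholder assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # S3DynamoDBLogStore coordinates commits from multiple writers through a
    # DynamoDB table, since S3 alone lacks the mutual exclusion Delta needs.
    .config(
        "spark.jars.packages",
        "io.delta:delta-core_2.12:2.1.0,io.delta:delta-storage-s3-dynamodb:2.1.0",
    )
    .config("spark.delta.logStore.s3.impl", "io.delta.storage.S3DynamoDBLogStore")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
    .getOrCreate()
)
```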
lsoewito
by New Contributor
  • 6012 Views
  • 1 replies
  • 1 kudos

How to configure Databricks Connect to 'Assume Role' when accessing files from an AWS S3 bucket?

I have a Databricks cluster configured with an instance profile to assume a role when accessing an AWS S3 bucket. Accessing the bucket from the notebook using the cluster works properly (the instance profile can assume the role to access the bucket). However...

Latest Reply
Anonymous
Not applicable

Hello, @lsoewito​ - My name is Piper, and I'm a moderator for the Databricks community. Welcome and thank you for coming to us with your question. I'm sorry to hear that you're having trouble. Let's give your peers a chance to answer your question. W...

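Since the thread closes without a technical answer, here is a hedged sketch of one common approach using the Hadoop S3A assumed-role credential provider (Hadoop 3.1+); the role ARN and bucket are placeholders, and this is not an official Databricks Connect recipe:

```python
# Configure S3A to assume a role for this session; property names per the
# Hadoop S3A documentation.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set(
    "fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
)
hadoop_conf.set(
    "fs.s3a.assumed.role.arn",
    "arn:aws:iam::123456789012:role/my-s3-access-role",  # placeholder ARN
)

df = spark.read.json("s3a://my-bucket/path/")  # hypothetical bucket
```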
Constantine
by Contributor III
  • 1938 Views
  • 1 replies
  • 2 kudos

Do we have Delta table access logs?

I have Delta tables on Databricks with AWS S3. Are there any logs or anything else to figure out who is accessing a particular database or table?

Latest Reply
-werners-
Esteemed Contributor III

The thing that comes closest is audit logs. Here is a list of log triggers.

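As a hedged illustration of the audit-log pointer above, assuming audit log delivery to S3 is enabled for the workspace; the delivery path is a placeholder and the selected fields follow the documented audit log schema:

```python
# Read delivered audit logs and inspect who performed which actions.
audit = spark.read.json("s3://my-audit-bucket/workspace-audit-logs/")  # placeholder

(
    audit.select(
        "timestamp",
        "userIdentity.email",  # who
        "serviceName",         # which service emitted the event
        "actionName",          # what they did
        "requestParams",       # includes object names for many actions
    )
    .show(truncate=False)
)
```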
MohitAnchlia
by New Contributor II
  • 1078 Views
  • 0 replies
  • 1 kudos

Change AWS storage settings and account

I am seeing a super weird behaviour in Databricks. We initially configured the following:
1. Account X in Account Console -> AWS Account arn:aws:iam::X:role/databricks-s3
2. We set up databricks-s3 as the S3 bucket in Account Console -> AWS Storage
3. W...
