Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

FabriceDeseyn
by Contributor
  • 8204 Views
  • 6 replies
  • 6 kudos

Resolved! What does Auto Loader's cloudFiles.backfillInterval do?

I'm using Auto Loader directory listing mode (without incremental file listing), and sometimes new files are not picked up and found in the cloud_files-listing. I have found that using the 'cloudFiles.backfillInterval' option can resolve the detection ...

Latest Reply
822025
New Contributor II
  • 6 kudos

If we set the backfill to 1 week, will it run only once a week, or will it look for old unprocessed files on every trigger? For example: if we set it to 1 day and the job runs every hour, will it look for files from the past 24 hours on a sliding ...

logan0015
by Contributor
  • 4924 Views
  • 6 replies
  • 4 kudos

Resolved! Getting a key mismatch error with Delta Live Tables.

I am attempting to create a streaming Delta Live Table. The main issue I am experiencing is the error below: com.databricks.sql.cloudfiles.errors.CloudFilesIllegalStateException: Found mismatched event: key. I have an AWS AppFlow that is creating a fold...

Latest Reply
VijaC_97468
New Contributor II
  • 4 kudos

Hi, I am also facing the same issue, but I found nothing in the documentation to fix it.

MRTN
by New Contributor III
  • 1593 Views
  • 1 reply
  • 1 kudos

Columns archive_time and commit_time always NULL when running cloud_files_state

I am attempting to find the commit_time for a given file for a Delta table using the cloud_files_state command. However, the archive_time and commit_time columns are always NULL. I am running Databricks Runtime 11.3 and have also verified ...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Morten Stakkeland: The issue you are facing with the cloud_files_state command is a known limitation in Delta Lake as of the latest stable release (Delta Lake 1.0). The commit_time and protocol columns are always null, and the archive_time column i...

Ria
by New Contributor
  • 1423 Views
  • 1 reply
  • 1 kudos

py4j.security.Py4JSecurityException

Getting this error while loading data with Auto Loader. Although table access control is already disabled, I am still getting this error: "py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql...

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi, are you using a High Concurrency cluster? Which DBR version are you running?

Malcoln_Dandaro
by New Contributor
  • 1767 Views
  • 0 replies
  • 0 kudos

Is there any way to navigate/access cloud files using the direct abfss URI (no mount) with default python functions/libs like open() or os.listdir()?

Hello, today on our workspace we access everything via mount points; we plan to change to "abfss://" for security, governance, and performance reasons. The problem is that sometimes we interact with files using "Python only" code, and apparently ...

tej1
by New Contributor III
  • 3926 Views
  • 5 replies
  • 7 kudos

Resolved! Trouble accessing `_metadata` column using cloudFiles in Delta Live Tables

We are building a Delta Live Tables pipeline where we ingest CSV files from AWS S3 using cloudFiles, and it is necessary to access the file modification timestamp of each file. As documented here, we tried selecting the `_metadata` column in a task in the delta live p...

Latest Reply
tej1
New Contributor III
  • 7 kudos

Update: We were able to test the `_metadata` column feature in DLT "preview" mode (which is DBR 11.0). Databricks doesn't recommend "preview" mode for production workloads, but nevertheless, we are glad to be using this feature in DLT.

Michael_Galli
by Contributor III
  • 4683 Views
  • 3 replies
  • 2 kudos

Resolved! Spark Streaming - only process new files in streaming path?

In our streaming jobs, we currently run streaming (cloudFiles format) on a directory with sales transactions arriving every 5 minutes. In this directory, the transactions are organized in the following format: <streaming-checkpoint-root>/<transaction_date>...

Latest Reply
Michael_Galli
Contributor III
  • 2 kudos

Update: Seems that maxFileAge was not a good idea. The following, with the option "includeExistingFiles" = False, solved my problem: streaming_df = ( spark.readStream.format("cloudFiles") .option("cloudFiles.format", extension) .option("...
