Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

FabriceDeseyn
by Contributor
  • 8204 Views
  • 6 replies
  • 6 kudos

Resolved! What does Auto Loader's cloudFiles.backfillInterval do?

I'm using Auto Loader directory listing mode (without incremental file listing), and sometimes new files are not picked up and found in the cloud_files-listing. I have found that using the 'cloudFiles.backfillInterval' option can resolve the detection ...

Latest Reply
822025
New Contributor II
  • 6 kudos

If we set the backfill to 1 week, will it run only once a week, or will it look for old unprocessed files on every trigger? For example: if we set it to 1 day and the job runs every hour, will it look for files from the past 24 hours on a sliding ...

logan0015
by Contributor
  • 4924 Views
  • 6 replies
  • 4 kudos

Resolved! Getting a key mismatch error with Delta Live Tables.

I am attempting to create a streaming Delta Live Table. The main issue I am experiencing is the error below: com.databricks.sql.cloudfiles.errors.CloudFilesIllegalStateException: Found mismatched event: key. I have an AWS AppFlow that is creating a fold...

Latest Reply
VijaC_97468
New Contributor II
  • 4 kudos

Hi, I am also facing the same issue, but I found nothing in the documentation to fix it.

MRTN
by New Contributor III
  • 1593 Views
  • 1 reply
  • 1 kudos

Columns archive_time and commit_time always NULL when running cloud_files_state

I am attempting to find the commit_time for a given file for a Delta table using the cloud_files_state command. However, the archive_time and commit_time columns are always NULL. I am running Databricks Runtime 11.3 and have also verified ...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Morten Stakkeland: The issue you are facing with the cloud_files_state command is a known limitation in Delta Lake as of the latest stable release (Delta Lake 1.0). The commit_time and protocol columns are always null, and the archive_time column i...

Ria
by New Contributor
  • 1423 Views
  • 1 reply
  • 1 kudos

py4j.security.Py4JSecurityException

Getting this error while loading data with Auto Loader. Although table access control is already disabled, I am still getting this error: "py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql...

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi, are you using a High Concurrency cluster? Which DBR version are you running?

Malcoln_Dandaro
by New Contributor
  • 1767 Views
  • 0 replies
  • 0 kudos

Is there any way to navigate/access cloud files using the direct abfss URI (no mount) with default python functions/libs like open() or os.listdir()?

Hello, today on our workspace we access everything via mount points; we plan to change to "abfss://" for security, governance, and performance reasons. The problem is that sometimes we interact with files using "Python only" code, and apparently ...

tej1
by New Contributor III
  • 3926 Views
  • 5 replies
  • 7 kudos

Resolved! Trouble accessing `_metadata` column using cloudFiles in Delta Live Tables

We are building a Delta Live Tables pipeline where we ingest CSV files from AWS S3 using cloudFiles, and it is necessary to access the file modification timestamp of each file. As documented here, we tried selecting the `_metadata` column in a task in the delta live p...

Latest Reply
tej1
New Contributor III
  • 7 kudos

Update: We were able to test the `_metadata` column feature in DLT "preview" mode (which is DBR 11.0). Databricks doesn't recommend "preview" mode for production workloads, but nevertheless, we are glad to be using this feature in DLT.

Michael_Galli
by Contributor III
  • 4683 Views
  • 3 replies
  • 2 kudos

Resolved! Spark Streaming - only process new files in streaming path?

In our streaming jobs, we currently run streaming (cloudFiles format) on a directory with sales transactions arriving every 5 minutes. In this directory, the transactions are organized in the following format: <streaming-checkpoint-root>/<transaction_date>...

Latest Reply
Michael_Galli
Contributor III
  • 2 kudos

Update: Seems that maxFileAge was not a good idea. The following, with the option "includeExistingFiles" = False, solved my problem: streaming_df = ( spark.readStream.format("cloudFiles") .option("cloudFiles.format", extension) .option("...
