Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Dave_Nithio
by Contributor
  • 4952 Views
  • 4 replies
  • 2 kudos

Resolved! How to use autoloader with csv containing spaces in attribute names?

I am attempting to use Autoloader to add a number of CSV files to a Delta table. The underlying CSV files have spaces in the attribute names though (e.g. 'Account Number' instead of 'AccountNumber'). When I run my autoload, I get the following error ...

Latest Reply
Dave_Nithio
Contributor
  • 2 kudos

@Hubert Dudek thanks for your response! I was able to use what you proposed above to generate the schema. The issue is that the schema sets all attributes to STRING values and renames them numerically ('_c0', '_c1', etc.). Although this allows us to...

3 More Replies
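For readers hitting the same error, a minimal sketch of one common workaround (all paths hypothetical; not necessarily the approach the thread settled on): read the CSVs with Auto Loader, then strip spaces from the column names before writing, since Delta rejects spaces in column identifiers.

# Sketch only: Auto Loader CSV ingest with column renaming (hypothetical paths).
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")  # keep original headers like 'Account Number'
      .option("cloudFiles.schemaLocation", "/mnt/schemas/accounts")
      .load("/mnt/landing/accounts"))

# Rename 'Account Number' -> 'Account_Number', etc., to satisfy Delta.
df = df.toDF(*[c.replace(" ", "_") for c in df.columns])

(df.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/accounts")
 .start("/mnt/delta/accounts"))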
Soma
by Valued Contributor
  • 3148 Views
  • 6 replies
  • 3 kudos

Resolved! Dynamically supplying partitions to autoloader

We have a streaming use case and we see a lot of time spent in listing from Azure. Is it possible to supply partitions to Autoloader dynamically, on the fly?

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@somanath Sankaran - Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?

5 More Replies
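A minimal sketch of one way to narrow the listing cost, assuming date-partitioned paths (account, container, and field names are hypothetical): compute the partition glob before starting the stream.

from datetime import date

# Build a glob that only lists today's partition instead of the whole store.
today = date.today()
src = (f"abfss://events@myaccount.dfs.core.windows.net/"
       f"year={today.year}/month={today.month:02d}/day={today.day:02d}/*")

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .load(src))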
FabriceDeseyn
by Contributor
  • 6380 Views
  • 6 replies
  • 6 kudos

Resolved! What does autoloader's cloudfiles.backfillInterval do?

I'm using Autoloader directory listing mode (without incremental file listing) and sometimes new files are not picked up and found in the cloud_files-listing. I have found that using the 'cloudFiles.backfillInterval' option can resolve the detection ...

Latest Reply
822025
New Contributor II
  • 6 kudos

If we set the backfill to 1 week, will it run only once a week, or will it look for old unprocessed files on every trigger? For example, if we set it to 1 day and the job runs every hour, will it look for files from the past 24 hours on a sliding ...

5 More Replies
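For context, a minimal sketch of the option being discussed (hypothetical paths): the interval asks Auto Loader to periodically trigger a backfill listing on top of its normal file discovery.

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Re-list the source once per day to pick up files missed
      # by regular discovery.
      .option("cloudFiles.backfillInterval", "1 day")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .load("/mnt/landing/events"))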
MRTN
by New Contributor III
  • 4100 Views
  • 3 replies
  • 2 kudos

Resolved! Configure multiple source paths for auto loader

I am currently using two streams to monitor data in two different containers on an Azure storage account. Is there any way to configure an autoloader to read from two different locations? The schemas of the files are identical.

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Morten Stakkeland: Yes, it's possible to configure an autoloader to read from multiple locations. You can define multiple cloudFiles sources for the autoloader, each pointing to a different container in the same storage account. In your case, since ...

2 More Replies
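A minimal sketch of the two-streams-one-table pattern described in the reply (container names, paths, and the table name are hypothetical; identical schemas assumed):

def start_stream(source_path, checkpoint):
    # One Auto Loader stream per container; both write to the same table.
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", f"{checkpoint}/schema")
          .load(source_path))
    return (df.writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint)  # separate per stream
            .toTable("bronze.events"))

start_stream("abfss://container-a@myaccount.dfs.core.windows.net/data",
             "/mnt/checkpoints/container_a")
start_stream("abfss://container-b@myaccount.dfs.core.windows.net/data",
             "/mnt/checkpoints/container_b")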
herry
by New Contributor III
  • 2980 Views
  • 4 replies
  • 4 kudos

Resolved! Get the list of loaded files from Autoloader

Hello, we can use Autoloader to track which files have been loaded from an S3 bucket. My question about Autoloader: is there a way to read the Autoloader database to get the list of files that have been loaded? I can easily do this in AWS Glue j...

Latest Reply
Anonymous
Not applicable
  • 4 kudos

@Herry Ramli - Would you be happy to mark Hubert's answer as best so that other members can find the solution more easily? Thanks!

3 More Replies
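On recent Databricks runtimes, the stream's checkpoint can be queried directly with the cloud_files_state table-valued function; a minimal sketch (the checkpoint path is a hypothetical placeholder):

# Lists the files Auto Loader has recorded in this stream's checkpoint.
loaded_files = spark.sql(
    "SELECT * FROM cloud_files_state('/mnt/checkpoints/my_stream')"
)
loaded_files.select("path").show(truncate=False)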
Maksym
by New Contributor III
  • 7349 Views
  • 4 replies
  • 7 kudos

Resolved! Databricks Autoloader is getting stuck and does not pass to the next batch

I have a simple job scheduled every 5 min. Basically it listens to cloudfiles on a storage account and writes them into a delta table, extremely simple. The code is something like this: df = (spark.readStream.format("cloudFiles").option('cloudFil...

Latest Reply
lassebe
New Contributor II
  • 7 kudos

I had the same issue: files would randomly not be loaded. Setting `.option("cloudFiles.useIncrementalListing", False)` seemed to do the trick!

3 More Replies
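A minimal sketch of the workaround from the reply (hypothetical paths): disabling incremental listing forces a full directory listing on each batch instead of resuming from the last seen file.

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Full listing every batch; trades listing cost for reliability.
      .option("cloudFiles.useIncrementalListing", "false")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .load("/mnt/landing/events"))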
fhmessas
by New Contributor II
  • 1695 Views
  • 2 replies
  • 2 kudos

Trigger.AvailableNow getting stuck when there is no event

Hi, I have several streaming jobs, but one of them uses Trigger.AvailableNow. The issue is that it gets stuck when there are no events or after it finishes ingesting all events. The expected behavior would be for the job to shut down. I've already checked...

Latest Reply
fhmessas
New Contributor II
  • 2 kudos

Hi, the source is an S3 bucket using file notification with SQS. No errors or warnings in the logs; the AvailableNow trigger just gets stuck.

1 More Replies
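For reference, a minimal sketch of the intended AvailableNow pattern (hypothetical paths), in which the query should drain the backlog and then terminate on its own:

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .load("/mnt/landing/events"))

query = (df.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/events")
         .trigger(availableNow=True)  # process everything available, then stop
         .start("/mnt/delta/events"))

query.awaitTermination()  # should return once the backlog is drained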
sanjay
by Valued Contributor II
  • 3923 Views
  • 3 replies
  • 2 kudos

Resolved! Autoloader maxFilesPerTrigger not working correctly

Hi, I am trying to apply a batch size in Autoloader and the code is as below, but it's picking up all the changes in one go even though I have set maxFilesPerTrigger to 10. Appreciate any help. (spark.readStream.format("json").schema(streamSchema).option("cloudFiles.b...

Latest Reply
Lakshay
Esteemed Contributor
  • 2 kudos

Hi @Sanjay Jain, since you have configured the trigger as once, maxFilesPerTrigger will not take effect here. With trigger once, all the files will be read together. You need to change the trigger for this option to come into effect. Please refer ...

2 More Replies
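A minimal sketch of the fix Lakshay describes (hypothetical paths and a placeholder schema): swap Trigger.Once, which ignores rate limits, for availableNow, which honors them while still draining the backlog.

from pyspark.sql.types import StructType, StructField, StringType

streamSchema = StructType([StructField("id", StringType())])  # placeholder

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .schema(streamSchema)
      .option("cloudFiles.maxFilesPerTrigger", 10)
      .load("/mnt/landing/events"))

(df.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/events")
 # availableNow respects maxFilesPerTrigger; trigger(once=True) does not.
 .trigger(availableNow=True)
 .start("/mnt/delta/events"))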
sanjay
by Valued Contributor II
  • 1554 Views
  • 2 replies
  • 1 kudos

Resolved! How can I prioritize message in autoloader

Hi, I am using Autoloader; it picks up data from AWS S3 and stores it in a Delta table. When there is a large number of messages, I would like to process messages by priority. Is it possible to prioritize messages in Autoloader? Regards, Sanjay

Latest Reply
sanjay
Valued Contributor II
  • 1 kudos

Thank you, Sandeep. Another option is that I can keep messages in two different folders in S3. Can Autoloader read messages from multiple folders?

1 More Replies
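On the follow-up question, a single stream can usually cover sibling folders with a glob in the load path; a minimal sketch with hypothetical S3 paths. Note this merges the folders rather than prioritizing them; true prioritization would still need separate streams.

# Brace globs expand to both 'priority' and 'normal' folders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/messages")
      .load("s3://my-bucket/messages/{priority,normal}/"))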
Enzo_Bahrami
by New Contributor III
  • 2670 Views
  • 2 replies
  • 0 kudos

Resolved! Input File Path from Autoloader in Delta Live Tables

Hello everyone! I was wondering if there is any way to get the subdirectories in which a file resides while loading using Autoloader with DLT. For example: def customer(): return (spark.readStream.format('cloudfiles').option('clou...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Parsa Bahraminejad, we haven't heard from you since the last response from @Vigneshraja Palaniraj, and I was checking back to see if their suggestions helped you. Otherwise, if you have any solution, please share it with the community, as it can be...

1 More Replies
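For anyone with the same question, one commonly used approach (a sketch, not necessarily what this thread adopted) is the _metadata.file_path column available for cloudFiles sources on recent runtimes; the path here is hypothetical.

import dlt
from pyspark.sql.functions import col

@dlt.table
def customer():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/customer")
            # Expose the full source path (including subdirectories) per row.
            .withColumn("source_file", col("_metadata.file_path")))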
Veeru245
by New Contributor
  • 922 Views
  • 0 replies
  • 0 kudos

Autoloader Solution for Binary files

We have a solution implemented for ingesting binary files (.ZIP) into the delta lake. Currently we are using the below solution within our pipeline: unzip the file and extract the XML file; parse the XML using Python libraries; flatten the nested XML columns...

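A minimal sketch of ingesting the raw archives with Auto Loader's binaryFile format ahead of the unzip/parse steps described above (hypothetical paths):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "binaryFile")
      .option("pathGlobFilter", "*.zip")  # only pick up the .zip archives
      .load("/mnt/landing/archives"))
# df carries path, modificationTime, length, and the raw bytes in 'content';
# unzipping and XML parsing would happen in a downstream step (e.g. a UDF).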
pvignesh92
by Honored Contributor
  • 1444 Views
  • 1 reply
  • 3 kudos

lnkd.in

Databricks Auto Loader is an interesting feature that can be used to load data incrementally. ✳ It can process new data files as they arrive in the cloud object stores. ✳ It can be used to ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT and even Binary file ...

Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 3 kudos

Thanks for sharing

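A minimal end-to-end sketch of the pattern the post describes (paths and table name are hypothetical placeholders):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")  # also json, parquet, avro, orc, text, binaryFile
      .option("cloudFiles.schemaLocation", "/mnt/schemas/bronze")
      .load("/mnt/landing/bronze"))

(df.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/bronze")
 .toTable("bronze.events"))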
Rishitha
by New Contributor III
  • 1589 Views
  • 2 replies
  • 2 kudos

Resolved! Normalizing data from autoloader

I have data on S3 and I'm using Autoloader to load the data. My JSON docs have fields which are arrays of structures. When I don't specify any schema, the whole record is stored as strings; even the arrays of structures are just a blob of string, making it ...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Rishitha Reddy, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us s...

1 More Replies
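One way to keep schema inference while pinning the nested fields is cloudFiles.schemaHints; a minimal sketch in which the column name, struct fields, and paths are hypothetical:

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/docs")
      # Hint the array-of-struct column so it is not inferred as a string.
      .option("cloudFiles.schemaHints",
              "items ARRAY<STRUCT<id: BIGINT, name: STRING>>")
      .load("s3://my-bucket/raw-docs/"))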
sanjay
by Valued Contributor II
  • 7457 Views
  • 0 replies
  • 0 kudos

autoloader with real time and batch processing concurrently

Hi, I have a data pipeline which runs continuously, processes the micro-batch data, and stores it in the delta lake. This takes care of any new data. But at times I need to process historical data without disturbing real-time data processing. Is th...

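One common pattern (a sketch under assumed paths, not a confirmed answer to this thread) is a second, independent stream over the historical location with its own checkpoint, triggered as availableNow so it drains the backlog and stops without touching the live stream:

hist = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.includeExistingFiles", "true")  # pick up old files
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/backfill/schema")
        .load("/mnt/landing/history"))

(hist.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/backfill")  # separate from live
 .trigger(availableNow=True)  # process the backlog, then shut down
 .toTable("bronze.events"))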
jhgorse
by New Contributor III
  • 1783 Views
  • 2 replies
  • 1 kudos

Resolved! autoloader break on migration from community to trial premium with s3 mount

In Databricks Community Edition, the autoloader works using the S3 mount. S3 mount, autoloader: dbutils.fs.mount(f"s3a://{access_key}:{encoded_secret_key}@{aws_bucket_name}", f"/mnt/{mount_name}") from pyspark.sql import SparkSession from pyspark.sql.functions ...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Joe Gorse, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...

1 More Replies