Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Larrio
by New Contributor III
  • 7553 Views
  • 6 replies
  • 3 kudos

Autoloader - understanding missing file after schema update.

Hello, concerning Autoloader (based on https://docs.databricks.com/ingestion/auto-loader/schema.html), so far what I understand is that when it detects a schema update, the stream fails and I have to rerun it to make it work, which is OK. But once I rerun it, ...
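For context, the fail-and-restart behaviour described here is Auto Loader schema evolution. A minimal sketch of the options involved, assuming a JSON source; all paths and table names below are placeholders, not from the post:

```python
# Minimal Auto Loader read with schema tracking (spark is provided by the
# Databricks notebook environment). With "addNewColumns" (the default), the
# stream stops when new columns appear and picks them up after a restart.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Where Auto Loader persists the inferred schema between runs
    .option("cloudFiles.schemaLocation", "/tmp/schemas/my_source")  # placeholder path
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/mnt/raw/my_source")  # placeholder path
)

(
    df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/my_source")  # placeholder path
    .option("mergeSchema", "true")  # let the Delta sink accept the evolved schema
    .toTable("bronze.my_source")  # placeholder table
)
```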

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hi @Lucien Arrio, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Thank...

5 More Replies
Chhaya
by New Contributor III
  • 4707 Views
  • 6 replies
  • 2 kudos

Using Great Expectations with Autoloader

Hi everyone, I have implemented a data pipeline using Autoloader (bronze --> silver --> gold). Now, while I do this, I want to perform some data quality checks, and for that I'm using the Great Expectations library. However, I'm stuck with the below error when trying...
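As a general pattern (not the poster's code), data-quality checks on an Auto Loader stream are usually run per micro-batch with foreachBatch, because most validation libraries, Great Expectations included, expect a static DataFrame. A hedged sketch; run_quality_checks and all paths/tables are hypothetical placeholders:

```python
def validate_and_write(batch_df, batch_id):
    # Hypothetical validation hook: wire a Great Expectations suite (or any
    # other checks) against the static micro-batch DataFrame here.
    run_quality_checks(batch_df)  # placeholder helper, not a real library call
    batch_df.write.format("delta").mode("append").saveAsTable("silver.events")  # placeholder

(
    spark.readStream  # spark comes from the Databricks notebook environment
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")  # placeholder path
    .load("/mnt/bronze/events")  # placeholder path
    .writeStream
    .foreachBatch(validate_and_write)
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder path
    .start()
)
```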

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Chhaya Vishwakarma, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your fe...

5 More Replies
FabriceDeseyn
by Contributor
  • 7854 Views
  • 5 replies
  • 1 kudos

Resolved! Autoloader directory listing not listing all files

Hi community, I have an Autoloader pipeline running with the following configuration. Unfortunately, it does not detect all files (see the query definition below). The folder that needs to be read has 38,246 files that all have the same schema and structure:...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Fabrice Deseyn, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answer...

4 More Replies
MerelyPerfect
by New Contributor II
  • 3818 Views
  • 3 replies
  • 1 kudos

Read base64 JSON column with Autoloader and inferSchema

I have JSON files landing in our blob with two fields: 1. offset (integer), 2. value (base64). This value column is JSON with Unicode, so they sent it as base64. The challenge is that this JSON is very large with 100+ fields, so we cannot define the schema. We c...
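One common way to handle a payload like this (a hedged sketch, not the poster's solution): decode the base64 value with unbase64, infer the JSON schema once from a static sample, then apply it with from_json in the stream. All paths are placeholders:

```python
from pyspark.sql.functions import col, unbase64, from_json

# One-off: infer the schema from a static sample of already-decoded JSON documents
sample_df = spark.read.json("/mnt/raw/events_sample/")  # placeholder path, decoded sample
value_schema = sample_df.schema

stream_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")  # placeholder path
    .load("/mnt/raw/events/")  # placeholder path
    # Decode the base64 payload to a UTF-8 JSON string, then parse it with the sampled schema
    .withColumn("decoded", unbase64(col("value")).cast("string"))
    .withColumn("parsed", from_json(col("decoded"), value_schema))
)
```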

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @MerelyPerfect Per, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you....

2 More Replies
Mr__D
by New Contributor II
  • 7198 Views
  • 2 replies
  • 3 kudos

Do we really need Autoloader for batch processing?

Hi All, it seems AutoLoader is a good option for event-driven data ingestion, but if my job runs only once, do I still need Autoloader? I don't want to spend money spinning up a cluster for the whole day. I know we have the RunOnce option available while running a job, but...
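For reference, Auto Loader does not have to run all day: with an available-now (trigger-once style) trigger it processes whatever is new, checkpoints its progress, and stops, so a scheduled job cluster only lives for the duration of the run. A minimal sketch with placeholder paths and table names:

```python
(
    spark.readStream  # spark comes from the Databricks notebook environment
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/daily_load")  # placeholder
    .load("/mnt/landing/daily_load")  # placeholder
    .writeStream
    .trigger(availableNow=True)  # process everything available now, then stop
    .option("checkpointLocation", "/tmp/checkpoints/daily_load")  # placeholder
    .toTable("bronze.daily_load")  # placeholder
)
```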

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hi @Deepak Bhatt, help us build a vibrant and resourceful community by recognizing and highlighting insightful contributions. Mark the best answers and show your appreciation! Thanks and regards

1 More Replies
Tico23
by Contributor
  • 4908 Views
  • 3 replies
  • 0 kudos

Resolved! Amazon S3 with Autoloader consumes "too many" requests, or maybe not!

After successfully loading 3 small files (2 KB each) from AWS S3 using Auto Loader for learning purposes, I got, a few hours later, an "AWS Free Tier limit alert", although I haven't used the AWS account for a while. Does this streaming service on Da...

Latest Reply
Debayan
Databricks Employee
  • 0 kudos

Hi, Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. Auto Loader can load data files from AWS S3 (s3://), Azure Data Lake Storage Gen2 (ADLS Gen2, abfss://), Google Cloud Storage (GCS, gs://), Azur...
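Related to the cost question above: in the default directory-listing mode Auto Loader repeatedly issues LIST requests against the bucket, which is usually what shows up on an otherwise idle AWS account. File notification mode avoids most of that. A hedged sketch; the bucket and paths are placeholders, and notification mode requires the documented SQS/SNS permissions:

```python
df = (
    spark.readStream  # spark comes from the Databricks notebook environment
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    # Use SQS/SNS file notifications instead of repeatedly listing the bucket
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/demo")  # placeholder
    .load("s3://my-bucket/landing/demo")  # placeholder
)
```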

2 More Replies
brickster
by New Contributor II
  • 7351 Views
  • 3 replies
  • 0 kudos

How to trigger workflow job tasks from Autoloader

I have configured a File Notification Autoloader that monitors an S3 bucket for binary files. I want to integrate Autoloader with a workflow job so that whenever a file is placed in the S3 bucket, the pipeline job notebook tasks can pick up the new file and start...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Saravanan Ponnaiah, hope everything is going great. Does @odoll odoll's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!

2 More Replies
sanjay
by Valued Contributor II
  • 3192 Views
  • 4 replies
  • 1 kudos

Resolved! How can I get the date when Autoloader processes the file

Hi, I am running Autoloader continuously, checking for new files every minute. I need to store when each file was received/processed, but it's giving me the date when Autoloader started. Here is my code: df = (spark .readStream .format("clo...

Latest Reply
Lakshay
Databricks Employee
  • 1 kudos

Hi @Sanjay Jain, you can use the file metadata column functionality to collect that information. Ref doc: https://docs.databricks.com/ingestion/file-metadata-column.html
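Building on the reply above, a minimal sketch of pulling per-file timestamps from the file metadata column (paths are placeholders):

```python
df = (
    spark.readStream  # spark comes from the Databricks notebook environment
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/ingest")  # placeholder
    .load("/mnt/landing/ingest")  # placeholder
    # _metadata exposes per-file attributes such as path and modification time,
    # independent of when the stream itself was started
    .select("*", "_metadata.file_path", "_metadata.file_modification_time")
)
```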

3 More Replies
Ria
by New Contributor
  • 1533 Views
  • 1 reply
  • 1 kudos

py4j.security.Py4JSecurityException

Getting this error while loading data with Autoloader. Although table access control is already disabled, I'm still getting this error: "py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql...

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi, are you using a High Concurrency cluster? Which DBR version are you running?

chanansh
by Contributor
  • 1649 Views
  • 1 reply
  • 0 kudos

Resolved! autoloader documentation does not work

I am trying to follow the documentation here: https://learn.microsoft.com/en-us/azure/databricks/getting-started/etl-quick-start
My code looks like: (spark.readStream .format("cloudFiles") .option("header", "true") #.option("cloudFiles.partitio...

Latest Reply
Murthy1
Contributor II
  • 0 kudos

Hi, it seems like you are writing to a path that is not empty and has some non-Delta-format files. Also, can you confirm if the path mentioned in the error message "`s3://nbu-ml/projects/rca/msft/dsm09collectx/delta`" is the path you are writing t...
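For reference, the quick-start pattern the reply refers to streams into a dedicated checkpoint location and a Delta target; the error in question typically appears when the target path already holds non-Delta files. A hedged sketch with placeholder paths and names:

```python
checkpoint_path = "/tmp/checkpoints/etl_quickstart"  # placeholder, dedicated to this stream
source_path = "/mnt/landing/events"  # placeholder source directory

(
    spark.readStream  # spark comes from the Databricks notebook environment
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(source_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable("quickstart.events")  # placeholder; must not point at an existing non-Delta directory
)
```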

rakeshprasad1
by New Contributor III
  • 3389 Views
  • 3 replies
  • 4 kudos

Databricks Autoloader not updating table immediately

I have a simple Autoloader job which looks like this: df_dwu_limit = spark.readStream.format("cloudFiles") \ .option("cloudFiles.format", "JSON") \ .schema(schemaFromJson) \ .load("abfss://synapse-usage@xxxxx.dfs.core.windows.net/synapse-us...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

Can you share the whole code, including the counts you mentioned?

2 More Replies
chanansh
by Contributor
  • 6825 Views
  • 9 replies
  • 9 kudos

Copy files from Azure to S3

I am trying to copy files from Azure to S3. I've created a solution by comparing file lists, copying manually to a temp file, and uploading. However, I just found AutoLoader and I would like to use that: https://docs.databricks.com/ingestion/auto-loader/i...

Latest Reply
Falokun
New Contributor II
  • 9 kudos

Just use tools like GoodSync and GS RichCopy 360 to copy directly from Blob to S3; I think you will never face problems like that.

8 More Replies
ks1248
by New Contributor III
  • 2997 Views
  • 2 replies
  • 5 kudos

Resolved! Autoloader creates columns not present in the source

I have been exploring Autoloader to ingest gzipped JSON files from an S3 source. The notebook fails in the first run due to a schema mismatch; after re-running the notebook, the schema evolves and the ingestion runs successfully. On analysing the schema ...

Latest Reply
ks1248
New Contributor III
  • 5 kudos

Hi @Debayan Mukherjee, @Kaniz Fatma, thank you for replying to my question. I was able to figure out the issue. I was creating the schema and checkpoint folders in the same path as the source location for the Autoloader. This caused the schema to ch...
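The fix described above amounts to keeping the schema and checkpoint locations outside the directory Auto Loader reads, so its own bookkeeping files are never picked up as input. A small sketch with placeholder paths:

```python
source_path = "s3://my-bucket/raw/events/"  # placeholder: only source files live here
schema_path = "s3://my-bucket/_autoloader/schemas/events"  # placeholder, outside source_path
checkpoint_path = "s3://my-bucket/_autoloader/checkpoints/events"  # placeholder, outside source_path

df = (
    spark.readStream  # spark comes from the Databricks notebook environment
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(source_path)
)
# checkpoint_path would then be used on the corresponding writeStream
```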

1 More Replies
alxsbn
by Contributor
  • 2596 Views
  • 2 replies
  • 2 kudos

Resolved! Autoloader on CSV file didn't infer cell with JSON data well

Hello! I'm playing with Autoloader schema inference on a big S3 repo with 300+ tables and large CSV files. I'm looking at Autoloader with great attention, as it can be a great time saver for our ingestion process (data comes from a transactional DB gen...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 2 kudos

By default, PySpark uses \ as the escape character. You can change it to " (a double quote). Doc: https://docs.databricks.com/ingestion/auto-loader/options.html#csv-options
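Following that reply, a minimal sketch of overriding the escape character for an Auto Loader CSV read (paths are placeholders):

```python
df = (
    spark.readStream  # spark comes from the Databricks notebook environment
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    # Treat " as the escape character so JSON embedded in a quoted CSV cell
    # (e.g. ""key"": ""value"") is read back as a single column value
    .option("escape", '"')
    .option("cloudFiles.schemaLocation", "/tmp/schemas/csv_demo")  # placeholder
    .load("/mnt/landing/csv_demo")  # placeholder
)
```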

1 More Replies
AK032716
by New Contributor
  • 3528 Views
  • 2 replies
  • 2 kudos

Implement Autoloader to ingest data into Delta Lake; I have 100 different tables with full load, append, and merge scenarios

I want to implement Autoloader to ingest data into Delta Lake from 5 different source systems, and I have 100 different tables in each database. How do we dynamically address this by using Autoloader with the trigger once option - full load, append, merge sen...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 2 kudos

You can create a generic notebook that is parametrized with the table name / source system and then simply trigger the notebook with different parameters (for each table/source system). For parametrization you can use dbutils.widgets (https://docs...
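Expanding on that reply, a hedged sketch of a generic, widget-parameterized ingestion notebook (all names, formats, and paths are placeholders):

```python
# dbutils and spark are provided by the Databricks notebook environment.
# Widgets let a job trigger the same notebook once per table / source system.
dbutils.widgets.text("source_system", "")
dbutils.widgets.text("table_name", "")

source_system = dbutils.widgets.get("source_system")
table_name = dbutils.widgets.get("table_name")

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")  # placeholder format
    .option("cloudFiles.schemaLocation", f"/tmp/schemas/{source_system}/{table_name}")
    .load(f"/mnt/landing/{source_system}/{table_name}")  # placeholder layout
    .writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", f"/tmp/checkpoints/{source_system}/{table_name}")
    .toTable(f"bronze.{source_system}_{table_name}")
)
```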

1 More Replies