Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

shivam-singh
by New Contributor
  • 1330 Views
  • 1 reply
  • 0 kudos

Databricks-Autoloader-S3-KMS

Hi, I am working on a requirement where I am using Autoloader in a DLT pipeline to ingest new files as they arrive. This flow is working fine. However, I am facing an issue when the source bucket is an S3 location, since the bucket has SSE-...

Latest Reply
kulkpd
Contributor
  • 0 kudos

Can you please paste the exact errors and check the following if it's related to KMS:
1. The IAM role policy and the KMS key policy should have allow permissions.
2. Did you use extraConfig while mounting the source S3 bucket? If you have used an IAM role...
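
For reference, a minimal sketch of the extraConfigs approach the reply alludes to, assuming the bucket uses SSE-KMS and the cluster's instance-profile IAM role already has KMS permissions; the bucket name, mount point, and key ARN are placeholders:

```python
# A hedged sketch, not the poster's actual config. Runs in a Databricks notebook
# (dbutils is provided there); property names come from hadoop-aws (s3a).
dbutils.fs.mount(
    source="s3a://my-source-bucket",
    mount_point="/mnt/source",
    extra_configs={
        "fs.s3a.server-side-encryption-algorithm": "SSE-KMS",
        "fs.s3a.server-side-encryption.key": "arn:aws:kms:<region>:<account>:key/<key-id>",
    },
)
```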

esalohs
by New Contributor III
  • 10137 Views
  • 6 replies
  • 4 kudos

Databricks Autoloader - list only new files in an s3 bucket/directory

I have an s3 bucket with a couple of subdirectories/partitions like s3a://Bucket/dir1/ and s3a://Bucket/dir2/. There are currently millions of files sitting in the bucket in the various subdirectories/partitions. I'm getting new data in near real t...

Latest Reply
kulkpd
Contributor
  • 4 kudos

The below options were used while performing spark.readStream:
.option('cloudFiles.format', 'json')
.option('cloudFiles.inferColumnTypes', 'true')
.option('cloudFiles.schemaEvolutionMode', 'rescue')
.option('cloudFiles.useNotifications', True)
.option('skipChange...
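
A hedged reconstruction of that snippet as a runnable stream (the excerpt truncates after 'skipChange', so that option is left out); file-notification mode avoids re-listing the millions of objects already in the bucket:

```python
# Reconstructed from the options quoted above; the load path reuses the example
# path from the question and is a placeholder.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .option("cloudFiles.useNotifications", "true")  # SQS/SNS-driven file discovery
    .load("s3a://Bucket/dir1/")
)
```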

5 More Replies
Muhammed
by New Contributor III
  • 28196 Views
  • 13 replies
  • 0 kudos

Filtering files for query

Hi Team, while writing my data to a datalake table I am getting 'Filtering files for query', and the write gets stuck there. How can I resolve this issue?

Latest Reply
kulkpd
Contributor
  • 0 kudos

My bad, I saw that somewhere in the screenshot but am not able to find it now. Which source are you using to load the data: a Delta table, AWS S3, or Azure Storage?
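
One swapped-in lever worth noting here (not suggested in the thread itself): the "Filtering files for query" phase of a MERGE or DELETE scans the target table's files, so compacting and clustering on the predicate column can shorten it. Table and column names below are placeholders:

```python
# A hedged, swapped-in suggestion: compact small files and co-locate data on the
# merge/delete key so fewer files are scanned during the filtering phase.
spark.sql("OPTIMIZE main.default.target_table ZORDER BY (id)")
```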

12 More Replies
geetha_venkates
by New Contributor II
  • 12591 Views
  • 7 replies
  • 2 kudos

Resolved! How do we add a certificate file in Databricks for a spark-submit type of job?

How do we add a certificate file in Databricks for a spark-submit type of job?

Latest Reply
nicozambelli
New Contributor II
  • 2 kudos

I have the same problem... when I worked with the hive_metastore in the past, I was able to use the file system and also use API certs. Now I'm using Unity Catalog and I can't upload a certificate. Can somebody help me?
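
For the Python side of this, one workaround sketch is to store the certificate in a Unity Catalog volume and append it to the cluster's certifi bundle at job start; the volume path is a placeholder, and the JVM/spark-submit side would still need its own keytool-based truststore set up separately (e.g. in an init script):

```python
# A hedged workaround sketch affecting Python HTTP libraries (requests/urllib3) only.
import shutil
import certifi

ca_path = "/Volumes/main/default/certs/my_ca.pem"  # hypothetical UC volume location
with open(certifi.where(), "ab") as bundle, open(ca_path, "rb") as ca:
    shutil.copyfileobj(ca, bundle)  # append the CA to the trusted bundle
```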

6 More Replies
RobinK
by Contributor
  • 17998 Views
  • 5 replies
  • 6 kudos

Resolved! How to set Python rootpath when deploying with DABs

We have structured our code according to the documentation (notebooks-best-practices). We use Jupyter notebooks and have outsourced logic to Python modules. Unfortunately, the example described in the documentation only works if you have checked out ...

Latest Reply
Corbin
Databricks Employee
  • 6 kudos

Hello Robin, you'll have to either use wheel files to package your libs and use those (see docs here) to make imports work out of the box. Otherwise, your entry point file needs to add the bundle root directory (or whatever the lib directory is) to ...
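
A minimal sketch of that second option, assuming a hypothetical bundle layout with modules under src/ and a .py file (not a notebook) as the job's entry point:

```python
# Assumed layout (placeholders):
#   <bundle_root>/
#     entry_point.py   (this file, referenced by the job task)
#     src/my_module.py
import os
import sys

bundle_root = os.path.dirname(os.path.abspath(__file__))  # __file__ works in .py tasks
sys.path.insert(0, os.path.join(bundle_root, "src"))

from my_module import my_function  # placeholder module under src/
```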

4 More Replies
Kumarashokjmu
by New Contributor II
  • 5532 Views
  • 4 replies
  • 0 kudos

Need to ingest millions of CSV files from AWS S3

I need to ingest millions of CSV files from an AWS S3 bucket. I am facing an S3 throttling issue, and besides that, the notebook process runs for 8+ hours and sometimes fails. Looking at cluster performance, it is utilized at 60%. I...

Latest Reply
kulkpd
Contributor
  • 0 kudos

If you want to load all the data at once, use Autoloader or a DLT pipeline with directory listing if the files are lexically ordered. Or, if you want to perform an incremental load, divide the load into two jobs, like historic data load vs. live data load: Live data...
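
A hedged sketch of the two-job split described above; bucket paths, format, and option values are assumptions, not the poster's actual configuration:

```python
# Live job: file-notification mode, so new files are discovered via SQS/SNS events
# instead of repeated S3 listings (the usual trigger for throttling).
live = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.maxFilesPerTrigger", "10000")  # bound each micro-batch
    .load("s3://my-bucket/data/")
)

# Historic job: plain directory listing over the existing backlog, run separately.
historic = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.includeExistingFiles", "true")
    .load("s3://my-bucket/data/")
)
```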

3 More Replies
leelee3000
by Databricks Employee
  • 1220 Views
  • 0 replies
  • 0 kudos

Dynamic Filtering Criteria for Data Streaming

One of the potential uses for DLT is a scenario where I have a large input stream of data and need to create multiple smaller streams based on dynamic and adjustable filtering criteria. The challenge is to allow non-engineering individuals to adjust ...
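
One common pattern for this (an assumption on my part, not something stated in the post) is DLT metaprogramming: read the filter rules from a small config table that non-engineers can edit, then generate one table per rule. All table and column names below are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

# Config table with columns: name, predicate (a SQL boolean expression per stream).
rules = spark.table("config.stream_filters").collect()

def make_filtered_table(name: str, predicate: str):
    # Defining the table inside a function avoids Python's late-binding loop pitfall.
    @dlt.table(name=f"filtered_{name}")
    def _t():
        return dlt.read_stream("raw_events").where(F.expr(predicate))

for r in rules:
    make_filtered_table(r["name"], r["predicate"])
```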

leelee3000
by Databricks Employee
  • 1734 Views
  • 0 replies
  • 0 kudos

Parameterizing DLT Jobs

I have observed the use of advanced configuration and creating a map as a way to parameterize notebooks, but these appear to be cluster-wide settings. Is there a recommended best practice for directly passing parameters to notebooks running on a DLT ...
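
For what it's worth, DLT pipelines also support a per-pipeline configuration map in the pipeline settings, which the pipeline's notebooks can read back with spark.conf.get(); the key name and default below are illustrative:

```python
# A hedged sketch: "mypipeline.input_path" would be set in the DLT pipeline's
# "configuration" settings, so it is scoped to this pipeline rather than the cluster.
input_path = spark.conf.get("mypipeline.input_path", "s3://my-bucket/landing/")
```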

Geoff
by New Contributor II
  • 1804 Views
  • 0 replies
  • 1 kudos

Bizarre Delta Tables pipeline error: ModuleNotFound

I received the following error when trying to import a function defined in a .py file into a .ipynb file. I would add code blocks, but the message keeps getting rejected for invalid HTML.
# test_lib.py (same directory, in a subfolder)
def square(x):
    ret...
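
A hedged workaround sketch for the ModuleNotFoundError, assuming test_lib.py sits in a subfolder next to the notebook (the folder name is a placeholder): put that folder on sys.path before importing.

```python
import os
import sys

sys.path.append(os.path.join(os.getcwd(), "subfolder"))  # "subfolder" is hypothetical
from test_lib import square  # the function from the post

print(square(4))  # 16, if the import now resolves
```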

pankz-104
by New Contributor
  • 2011 Views
  • 1 reply
  • 0 kudos

how to read deleted files in adls

We have soft delete enabled in ADLS for 3 days, and we have manually deleted some checkpoint files, approx. 3 TB in total. Each file is just a couple of bytes, like 30 B or 40 B. The deleted file size is increasing day by day, even after a couple of days. Suppose ...
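
To inspect what soft delete is actually still retaining, a hedged sketch with the azure-storage-blob SDK (the connection string and container name are placeholders) might look like:

```python
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection-string>", "my-container")
deleted_bytes = 0
for blob in container.list_blobs(include=["deleted"]):
    if blob.deleted:  # only blobs currently in the soft-deleted state
        deleted_bytes += blob.size or 0
print(f"soft-deleted bytes still retained: {deleted_bytes}")
```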

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Hi @pankz-104, just a friendly follow-up. Did you have time to test Kaniz's recommendations? Do you still have issues? Please let us know.

Chris_sh
by New Contributor II
  • 1238 Views
  • 1 reply
  • 0 kudos

DLT Missing Select tables button or Enhancement Request?

Currently, when a Delta Live Table fails due to an error, the option to select specific tables to run a full refresh is removed. This seems like an error. A full refresh can fix an error that might be caused, and you should always be able to select to d...
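
As a possible workaround while the button is hidden (an assumption, not a confirmed fix), the pipeline updates REST API accepts a full_refresh_selection list; the workspace host, pipeline ID, token, and table name below are placeholders:

```python
import requests

resp = requests.post(
    "https://<workspace-host>/api/2.0/pipelines/<pipeline-id>/updates",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"full_refresh_selection": ["my_failed_table"]},  # hypothetical table name
)
resp.raise_for_status()
print(resp.json())  # contains the update_id of the triggered refresh
```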

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Hi @Chris_sh, which DLT channel are you using? 

Rajaniesh
by New Contributor III
  • 3619 Views
  • 2 replies
  • 1 kudos

URGENT HELP NEEDED: Python functions deployed in the cluster throwing the error

Hi, I have created a Python wheel with the following code, and the package name is rule_engine:
"""The entry point of the Python Wheel"""
import sys
from pyspark.sql.functions import expr, col
def get_rules(tag):
    """ loads data quality rules from a table ...

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

You can find more details and examples here https://docs.databricks.com/en/workflows/jobs/how-to/use-python-wheels-in-workflows.html#use-a-python-wheel-in-a-databricks-job
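
To complement the linked docs, a minimal setup.py sketch for packaging a wheel like rule_engine, following the pattern in that documentation; the version, module path, and entry-point name are placeholders:

```python
from setuptools import setup, find_packages

setup(
    name="rule_engine",
    version="0.0.1",
    packages=find_packages(),
    install_requires=["setuptools"],
    # The group/name pair here is what the wheel task's package_name/entry_point
    # settings refer to (values are illustrative).
    entry_points={"group_1": ["run=rule_engine.__main__:main"]},
)
```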

1 More Reply
dcc
by New Contributor
  • 8551 Views
  • 1 reply
  • 0 kudos

DBT Jobs || API call returns "Internal Error"

Hey there, I am currently using the Databricks API to trigger a specific dbt job. For this, I am calling the API from a Web Activity in Azure Data Factory, sending the token in the headers, and in the body I am sending the Job ID and the necessary vars I ...

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Could you please share the driver logs? It will help us narrow down the issue.
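
For comparison while debugging, the same trigger can be reproduced outside ADF with a direct call to the Jobs 2.1 run-now endpoint; the host, job ID, token, and parameters below are placeholders:

```python
import requests

resp = requests.post(
    "https://<workspace-host>/api/2.1/jobs/run-now",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"job_id": 123, "job_parameters": {"env": "dev"}},  # hypothetical vars
)
print(resp.status_code, resp.text)  # surfaces the full "Internal Error" payload
```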

chari
by Contributor
  • 16869 Views
  • 4 replies
  • 2 kudos

Resolved! Connect to data in one drive to Azure Databricks

Hello, a colleague of mine previously built a data pipeline connecting to data available on SharePoint (OneDrive), coded in Python in a Jupyter notebook. Now it's my job to transfer the code to Azure Databricks, and I am unable to connect/download thi...

Latest Reply
gabsylvain
Databricks Employee
  • 2 kudos

@chari You can also ingest both SharePoint and OneDrive data directly into Databricks using Partner Connect. You can refer to the documentation below for more information: Connect to Fivetran using Partner Connect, Fivetran Sharepoint Connector Documenta...
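
If a direct, Fivetran-free route is preferred, one hedged sketch is to call the Microsoft Graph API from the Databricks notebook with an Azure AD app token; the token acquisition, drive ID, and file path are placeholders/assumptions:

```python
import requests

token = "<azure-ad-access-token>"  # e.g. via an msal client-credentials flow (assumption)
url = ("https://graph.microsoft.com/v1.0/drives/<drive-id>"
       "/root:/Shared Documents/data.xlsx:/content")
resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
with open("/tmp/data.xlsx", "wb") as f:
    f.write(resp.content)  # file is now local to the Databricks driver
```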

3 More Replies
rvo1994
by New Contributor
  • 7794 Views
  • 0 replies
  • 0 kudos

Performance issue with spatial reference system conversions

Hi, I am facing a performance issue with spatial reference system conversions. My Delta table has approximately 10 GB / 46 files / 160M records and gets +/- 5M records every week. After ingestion, I need to convert points (columns GE_XY_XCOR and GE_XY_YCO...
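
One approach worth sketching (an assumption; the post does not say how the conversion is currently done) is a vectorized pandas UDF around pyproj, so the Transformer is built once per batch instead of per row; the source/target EPSG codes and the truncated Y column name are placeholders:

```python
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import DoubleType, StructField, StructType

out_schema = StructType([StructField("lon", DoubleType()), StructField("lat", DoubleType())])

@pandas_udf(out_schema)
def to_wgs84(x: pd.Series, y: pd.Series) -> pd.DataFrame:
    from pyproj import Transformer  # imported on the workers
    t = Transformer.from_crs("EPSG:31370", "EPSG:4326", always_xy=True)  # assumed CRS pair
    lon, lat = t.transform(x.to_numpy(), y.to_numpy())
    return pd.DataFrame({"lon": lon, "lat": lat})

# Usage (the Y column name is assumed, since the excerpt truncates it):
# df = df.withColumn("wgs84", to_wgs84(col("GE_XY_XCOR"), col("GE_XY_YCOR")))
```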

