Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Soumik
by New Contributor II
  • 2360 Views
  • 2 replies
  • 1 kudos

#N/A value is coming as null/NaN while using pandas.read_excel

Hi All, I am trying to read an input_file.xlsx file using pandas.read_excel. I am using the options below: import pandas as pd; df = pd.read_excel(input_file, sheetname=sheetname, dtype=str, na_filter=False, keep_default_na=False). Not sure but the va...

Latest Reply
Brahmareddy
Esteemed Contributor

Hi Soumik, how are you doing today? As per my understanding, it looks like pandas is still treating #N/A as a missing value because Excel considers it a special type of NA. Even though you've set na_filter=False and keep_default_na=False, pandas might...
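
A note for anyone hitting this later: pandas appears to convert Excel error cells (including #N/A) to NaN at the engine level, before na_filter or keep_default_na are consulted, so those flags cannot preserve the value. A minimal sketch of a workaround, assuming openpyxl is available and using placeholder file/sheet names, is to read the raw cells yourself:

    import pandas as pd
    from openpyxl import load_workbook

    # Read raw cell values so Excel error cells such as #N/A arrive as their
    # literal text instead of being coerced to NaN by the pandas Excel engine.
    # "input_file.xlsx" and "Sheet1" are placeholders.
    ws = load_workbook("input_file.xlsx", data_only=True)["Sheet1"]
    rows = list(ws.values)
    df = pd.DataFrame(rows[1:], columns=rows[0]).astype(str)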

1 More Replies
dyusuf
by New Contributor II
  • 1599 Views
  • 2 replies
  • 0 kudos

Data Skewness

I am trying to visualize data skewness through a simple aggregation example by performing a groupBy operation on a DataFrame. The data is highly skewed for one customer, yet Databricks is balancing it automatically when I check the Spark UI. Is there a...

Latest Reply
SantoshJoshi
New Contributor III

Hi @dyusuf, it could be because AQE (Adaptive Query Execution) is enabled. AQE dynamically handles skew. Please refer to the link below for more details: https://docs.databricks.com/aws/en/optimizations/aqe. Can you please disable AQE and check if this wor...
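
To make the skew visible, a minimal sketch (assuming a DataFrame df with a customer_id column) is to disable AQE for the session and rerun the aggregation while watching the Spark UI:

    # Disable Adaptive Query Execution so skewed partitions show up in the
    # Spark UI instead of being rebalanced at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "false")

    # df and customer_id are placeholders for the skewed data in question;
    # the noop sink forces execution without writing anywhere.
    df.groupBy("customer_id").count().write.format("noop").mode("overwrite").save()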

1 More Replies
Kutbuddin
by New Contributor III
  • 1234 Views
  • 1 reply
  • 0 kudos

Random failures with serverless compute running dbt jobs

We recently encountered the issue below, where a Databricks job configured to run a dbt task on serverless compute and a serverless warehouse failed due to a Python dependency failure: run failed with error message Library installation failed: Library installation ...

Latest Reply
Brahmareddy
Esteemed Contributor

Hi, how are you doing today? As per my understanding, this kind of random failure is usually due to network issues, temporary package repository problems, or how serverless compute handles dependencies. Since serverless clusters are short-lived and s...

SObiero
by New Contributor
  • 1245 Views
  • 2 replies
  • 0 kudos

Databricks App Error libodbc.so.2: cannot open shared object file: No such file or directory

How do I solve this error in my Databricks Apps when using the pyodbc library? I have used an init script to install the library in my cluster, which has resolved the issue in Notebooks. However, the problem persists in Apps. I have used the followin...

Latest Reply
mourinseoexpart
New Contributor II

Your analysis is spot on! The issue likely stems from environment differences between Notebooks and Apps. Checking cluster consistency and verifying the setup should help. Also, reviewing permissions and alternative installation methods might be necessary. L...
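
One option worth noting: pyodbc loads the unixODBC system library (libodbc.so.2), which the Apps runtime may not ship, whereas the databricks-sql-connector package is pure Python and sidesteps it. A hedged sketch with placeholder connection values:

    from databricks import sql

    # Connection details below are placeholders; in an App these would
    # typically come from environment variables or app resources.
    with sql.connect(
        server_hostname="<workspace-host>",
        http_path="<warehouse-http-path>",
        access_token="<access-token>",
    ) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            print(cur.fetchall())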

1 More Replies
NathanE
by New Contributor II
  • 9753 Views
  • 7 replies
  • 10 kudos

Java 21 support with Databricks JDBC driver

Hello, I was wondering if there is any timeline for Java 21 support with the Databricks JDBC driver (the current version is 2.34). One of the required changes is to update the Arrow dependency to version 13.0 (the current version is 9.0.0). The current worka...

Data Engineering
driver
java21
JDBC
Latest Reply
yunbodeng
Databricks Employee

You can download the latest Databricks JDBC driver here (https://www.databricks.com/spark/jdbc-drivers-download), in which the latest Arrow version is 17 (https://databricks-bi-artifacts.s3.us-east-2.amazonaws.com/simbaspark-drivers/jdbc/2.7.1/docs/release-n...

6 More Replies
jrod123
by New Contributor II
  • 1845 Views
  • 4 replies
  • 1 kudos

Simple append for a DLT

Looking for some help getting unstuck re: appending to DLTs in Databricks. I have successfully extracted data via an API endpoint, done some initial data cleaning/processing, and subsequently stored that data in a DLT. Great start. But I noticed that ea...

Latest Reply
tastefulSamurai
New Contributor II

I am likewise struggling with this. All DLT configurations that I've tried (including spark_conf={"pipelines.autoOptimize.appendOnly": "true"}) just yield overwrites of the existing data. Any luck, @jrod123?
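
For reference, append-only behavior in DLT usually comes from defining a streaming table rather than from a spark_conf flag; each update then processes only new input instead of recomputing the full result. A minimal sketch with a hypothetical source table name:

    import dlt

    # "raw_api_data" is a hypothetical source table; reading it as a stream
    # makes the pipeline append new rows on each update rather than overwrite.
    @dlt.table(name="api_events")
    def api_events():
        return spark.readStream.table("raw_api_data")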

3 More Replies
AyushModi038
by New Contributor III
  • 13243 Views
  • 8 replies
  • 10 kudos

Library installation in cluster taking a long time

I am trying to install the "pycaret" library on a cluster using a whl file, but it sometimes creates dependency conflicts (not always; sometimes it works too). My questions are: 1 - How to install libraries on a cluster only a single time (maybe from ...

Latest Reply
Spencer_Kent
New Contributor III

@Retired_mod What about question #1, which is what subsequent comments in this thread have been referring to? To recap the question: is it possible for "cluster-installed" libraries to be cached in such a way that they aren't completely reinstalled ev...
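
One stopgap some teams use while waiting for real caching: store the built wheel on a volume so each install copies a local artifact instead of re-downloading it from PyPI (the path and version below are placeholders; transitive dependencies may still be resolved):

    # Install from a wheel kept on a Unity Catalog volume; path and version
    # are placeholders for your own artifact.
    %pip install /Volumes/main/default/libs/pycaret-3.3.2-py3-none-any.whl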

7 More Replies
shervinmir
by New Contributor II
  • 5347 Views
  • 4 replies
  • 0 kudos

Using user-assigned managed identity inside notebook

Hi team, I am interested in using a user-assigned managed identity within my notebook. I've come across examples using system-assigned managed identities or leveraging the Access Connector for Azure Databricks via Unity Catalog. However, as I do not h...

Latest Reply
shervinmir
New Contributor II

Hi team, just wondering if anyone has any suggestions. We are still unable to use a user-assigned managed identity inside a notebook in Databricks to connect to an external Gen2 storage account.
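
For anyone experimenting, the pattern one would normally try with the azure-identity package looks like the sketch below (the client ID and scope are placeholders). Note it only works where the user-assigned identity is actually attached to the underlying compute, which standard Databricks clusters may not allow:

    from azure.identity import ManagedIdentityCredential

    # client_id selects the user-assigned identity; the value is a placeholder.
    cred = ManagedIdentityCredential(client_id="<uami-client-id>")

    # Request a token for ADLS Gen2; this succeeds only if the identity is
    # attached to the host VM, which may not be the case on Databricks compute.
    token = cred.get_token("https://storage.azure.com/.default")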

3 More Replies
Prashanth24
by New Contributor III
  • 11737 Views
  • 8 replies
  • 4 kudos

Resolved! Difference between Liquid clustering and Z-ordering

I am trying to understand the difference between liquid clustering and Z-ordering. As per my understanding, both store the clustered information in ZCubes, which are 100 GB in size. Liquid clustering maintains the ZCube ID in the transaction log, so when opti...

Latest Reply
canadiandataguy
New Contributor III

I have built a decision tree on how to think about it: https://www.canadiandataguy.com/p/optimizing-delta-lake-tables-liquid?triedRedirect=true
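
As a quick syntax contrast (table and column names are illustrative): liquid clustering keys are declared once on the table and OPTIMIZE then reclusters incrementally, while Z-ordering is re-specified on each OPTIMIZE run and rewrites the selected files:

    # Liquid clustering: keys are part of the table definition.
    spark.sql("""
        CREATE TABLE sales_liquid (order_id BIGINT, customer_id BIGINT, order_date DATE)
        CLUSTER BY (customer_id, order_date)
    """)
    spark.sql("OPTIMIZE sales_liquid")  # incremental reclustering

    # Z-ordering: chosen per OPTIMIZE run on an existing Delta table.
    spark.sql("OPTIMIZE sales_zorder ZORDER BY (customer_id, order_date)")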

7 More Replies
alextc77
by New Contributor II
  • 2959 Views
  • 3 replies
  • 2 kudos

Resolved! Unable to Access Unity Catalog from Cluster in Azure Databricks while Serverless Works

I can't access tables and volumes in Unity Catalog using a cluster in Azure Databricks, although it works with serverless. Why is this the case? ※ As for the cluster, the summary displayed the "UnityCatalog" UC badge, and the access mode (data_se...

Latest Reply
Brahmareddy
Esteemed Contributor

Thanks for the updates, Alex. Good day.

2 More Replies
chitrar
by New Contributor III
  • 2744 Views
  • 9 replies
  • 4 kudos

workflow/lakeflow - why does it not capture all the metadata of the jobs/tasks

Hi, I see that with Unity Catalog we have the workflow schema and now the lakeflow schema. I guess the intention is to capture audit logs of changes and monitor runs, but I wonder why we don't have all the metadata info on the jobs/tasks too for a given job =...

Latest Reply
chitrar
New Contributor III

@Sujitha So, can we expect these enhancements in the "near" future?

8 More Replies
turagittech
by Contributor
  • 1848 Views
  • 3 replies
  • 1 kudos

External Table refresh

Hi, I have a blob storage area in Azure where JSON files are being created. I can create an external table on the storage blob container, but when new files are added I don't see the extra rows when I query the data. Is there a better approach to accessing th...

Latest Reply
Nivethan_Venkat
Contributor III

Hi @turagittech, the above error indicates that your table seems to be in DELTA format. Please check the table creation statement to see whether the table format is JSON or DELTA. PS: By default, if you are not specifying any format while creating the table on to...
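
If the table does turn out to be a plain JSON external table, one thing to try (the three-level table name is a placeholder) is invalidating the cached file listing after new files land:

    # REFRESH TABLE invalidates the cached file listing so newly added
    # JSON files become visible to subsequent queries; name is a placeholder.
    spark.sql("REFRESH TABLE main.default.json_events")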

2 More Replies
Walter_N
by New Contributor II
  • 1102 Views
  • 2 replies
  • 0 kudos

Resolved! DLT pipeline task with full refresh once in a while

Hi all, I'm using Databricks Workflows with some DLT pipeline tasks. These tasks sometimes require a full refresh due to schema changes in the source. I've been doing the full refresh manually or setting the full-refresh option in the job settings, t...

Latest Reply
MariuszK
Valued Contributor III

Hi, did you check the possibility of using an if/else task? You could define some criteria and pass them from a notebook that checks whether it's time for a full refresh or just a regular refresh.
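
Building on that idea, the branch that decides it is time for a full refresh could trigger the pipeline update itself, for example via the Databricks Python SDK (the pipeline ID is a placeholder):

    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()

    # full_refresh=True mirrors the "Full refresh" option in the UI;
    # the pipeline ID is a placeholder.
    w.pipelines.start_update(pipeline_id="<pipeline-id>", full_refresh=True)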

1 More Replies
scorpusfx1
by New Contributor II
  • 1424 Views
  • 4 replies
  • 0 kudos

Delta Live Table SCD2 performance issue

Hi Community, I am working on ingestion pipelines that take data from Parquet files (200 MB per day) and integrate them into my Lakehouse. This data is used to create an SCD Type 2 using apply_changes, with the row ID as the key and the file date as t...

Data Engineering
apply_change
dlt
SCD2
Latest Reply
Stefan-Koch
Valued Contributor II

Hi @scorpusfx1, what kind of source data do you have? Are these Parquet files daily full snapshots of source tables? If so, you should use apply_changes_from_snapshot, which is built exactly for this use case: https://docs.databricks.com/aws/en/dlt/py...
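
For readers comparing the two APIs, a minimal sketch of the snapshot variant (target, source, and key names are placeholders mirroring the question's setup):

    import dlt

    dlt.create_streaming_table("customers_scd2")

    # "daily_parquet_snapshots" and "row_id" are placeholders; each new
    # snapshot is diffed against the target to build SCD2 history.
    dlt.apply_changes_from_snapshot(
        target="customers_scd2",
        source="daily_parquet_snapshots",
        keys=["row_id"],
        stored_as_scd_type=2,
    )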

3 More Replies
analytics_eng
by New Contributor III
  • 5594 Views
  • 4 replies
  • 1 kudos

Connection reset by peer logging when importing custom package

Hi! I'm trying to import a custom package I published to Azure Artifacts, but I keep seeing the INFO logging below, which I don't want to display. The package was installed correctly on the cluster, and it imports successfully, but the log still appe...

Latest Reply
siklosib
New Contributor II

What solved this problem for me was to remove the root logger configuration from the logging config and create another one within the loggers section. See below. { 'version': 1, 'disable_existing_loggers': False, 'formatters': { 'simple...
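
Filling that idea out, a minimal self-contained version of the approach (logger and formatter names are illustrative) looks like this:

    import logging.config

    # No root logger entry; a named logger with propagate=False keeps the
    # unwanted INFO records from bubbling up to the root handler.
    logging.config.dictConfig({
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "simple": {"format": "%(levelname)s %(name)s: %(message)s"},
        },
        "handlers": {
            "console": {"class": "logging.StreamHandler", "formatter": "simple"},
        },
        "loggers": {
            "my_package": {"handlers": ["console"], "level": "INFO", "propagate": False},
        },
    })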

3 More Replies
