Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

hprasad
by New Contributor III
  • 2643 Views
  • 7 replies
  • 1 kudos

Spark reads GZ file as corrupted data when the file extension is .GZ in upper case

If the file is named file_name.sv.gz (lower-case extension), it reads fine. If it is named file_name.sv.GZ (upper-case extension), the data is read as corrupted, meaning Spark simply reads the compressed file as is.

Data Engineering
gzip files
spark-csv
spark.read.csv
Latest Reply
Lakshay
Esteemed Contributor
  • 1 kudos

Agreed, but Spark infers the compression from your filename, and it cannot infer the compression from an upper-case .GZ extension. You can read more about this in the article below: https://aws.plainenglish.io/demystifying-apache-spark-quirks-2c91ba2d3978
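
A minimal sketch of one possible workaround, assuming the files can be renamed before Spark reads them: since the original post reports that the lower-case .gz extension reads correctly, the suffix can be normalized first. The helper name `normalize_gzip_suffix` is hypothetical, and the actual rename or move call for your storage layer is deliberately left out.

```python
def normalize_gzip_suffix(filename: str) -> str:
    """Return the filename with a trailing gzip suffix lower-cased.

    Names that do not end in .gz (in any letter case) are returned unchanged.
    """
    suffix = ".gz"
    if filename.lower().endswith(suffix):
        # Keep the stem exactly as it was; only the suffix is rewritten.
        return filename[: -len(suffix)] + suffix
    return filename


# Hypothetical file names, mirroring the ones in the post.
for name in ["file_name.sv.GZ", "file_name.sv.gz", "plain.csv"]:
    print(name, "->", normalize_gzip_suffix(name))
```

The function only computes the target name, so it can be paired with whichever rename mechanism is available in your environment.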

  • 1 kudos
6 More Replies
vishwanath_1
by New Contributor III
  • 1365 Views
  • 5 replies
  • 1 kudos

I am reading a 130 GB CSV file with multiLine set to true and it takes 4 hours just to read

Reading the 130 GB file without multiLine true takes 6 minutes. My file has multi-line data. How can I speed up the read time here? I am using the command below: InputDF=spark.read.option("delimiter","^").option("header",false).option("encoding","UTF-8"...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

  • 1 kudos
4 More Replies
SimDarmapuri
by New Contributor II
  • 641 Views
  • 1 replies
  • 1 kudos

Databricks Deployment using Data Thirst

Hi, I am trying to deploy Databricks notebooks to different environments with Azure DevOps, using the third-party extension Data Thirst (Databricks Script Deployment Task by Data Thirst). The pipeline is able to generate/download artifacts but not able to...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

The extension is quite old and does not know about Unity Catalog, so that is probably why it fails. But why do you use the extension for notebook propagation from dev to prd? You can do this using Repos, feature branches and pull requests...

  • 1 kudos
Michael_Appiah
by New Contributor III
  • 823 Views
  • 1 replies
  • 1 kudos

Resolved! Display Limits Catalog Explorer

It seems as if the Catalog Explorer can only display a maximum of 1000 folders within a UC Volume. I just ran into this issue when I added new folders to a volume which were not displayed in the Catalog Explorer (only folders 1-1000). I was able to r...

Latest Reply
Lakshay
Esteemed Contributor
  • 1 kudos

Hi @Michael_Appiah , This is a known limitation: https://docs.databricks.com/en/connect/unity-catalog/volumes.html#limitations

  • 1 kudos
jonathan-dufaul
by Valued Contributor
  • 2413 Views
  • 4 replies
  • 0 kudos

Resolved! Is there a command in SQL cells to ignore formatting for some lines, like `# fmt: off` in Python cells?

In Python cells I can add the comment `# fmt: off` before a block of code that I want black/autoformatter to ignore and `# fmt: on` afterwards. Is there anything similar I can put in SQL cells to accomplish the same effect? Some of the recommendation...

Data Engineering
autoformatter
formatter
sql
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

  • 0 kudos
3 More Replies
vishwanath_1
by New Contributor III
  • 787 Views
  • 1 replies
  • 0 kudos

Resolved! Need suggestion for a better caching strategy

I have the steps below to perform: 1. Read a CSV file (a considerably huge file, ~100 GB). 2. Add an index using the zipWithIndex function. 3. Repartition the DataFrame. 4. Pass it on to another function. Can you suggest the best optimized caching strategy to execute these c...

Latest Reply
Lakshay
Esteemed Contributor
  • 0 kudos

Hi @vishwanath_1, caching only comes into the picture when there are multiple references to a data source in your code. As per the flow you describe, I don't see that being the case for you. You are only reading the data from the source once and also there...

  • 0 kudos
Pratibha
by New Contributor II
  • 1969 Views
  • 4 replies
  • 1 kudos

Want to set execution termination time/timeout limit for job in job config

Hi, I want to set an execution termination time/timeout limit for a job in the job config file. Please help me understand how I can do this by passing a parameter in the job config file.
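
A minimal sketch of what such a setting might look like, assuming the job config file follows the Databricks Jobs API settings format, where the `timeout_seconds` field sets a timeout for each run of the job and 0 means no timeout. The helper `with_timeout`, the job name and the empty task list are placeholders for illustration, not taken from the post.

```python
import json


def with_timeout(job_settings: dict, timeout_seconds: int) -> dict:
    """Return a copy of the job settings with a job-level timeout applied."""
    if timeout_seconds < 0:
        raise ValueError("timeout_seconds must be zero or positive")
    updated = dict(job_settings)  # shallow copy; the input dict is not modified
    updated["timeout_seconds"] = timeout_seconds
    return updated


# Placeholder settings; a real job would define its tasks here.
job_settings = {"name": "example-job", "tasks": []}

# Limit each run to one hour and print the resulting JSON config.
print(json.dumps(with_timeout(job_settings, 3600), indent=2))
```

The resulting JSON can then be used wherever the job definition is normally submitted or stored.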

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

  • 1 kudos
3 More Replies
ElaPG
by New Contributor III
  • 2467 Views
  • 2 replies
  • 1 kudos

Notebook naming convention

I have read the information about object names, but are there any best practices regarding notebook naming conventions?

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

  • 1 kudos
1 More Replies
cyong
by New Contributor II
  • 738 Views
  • 2 replies
  • 0 kudos

Disable CDF on DLT tables

Hi, I noticed Change Data Feed (CDF) is enabled by default for the bronze and gold tables running in DLT. How can I check the size of the Delta log? Can CDF be turned off?

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

  • 0 kudos
1 More Replies
Ravikumashi
by Contributor
  • 920 Views
  • 2 replies
  • 0 kudos

Extract cluster usage tags from a Databricks cluster init script

Is it possible to extract cluster usage tags from a Databricks cluster init script? I am specifically interested in spark.databricks.clusterUsageTags.clusterAllTags. I tried to extract from /databricks/spark/conf/spark.conf and /databricks/spark/conf/sp...

Data Engineering
Azure Databricks
Latest Reply
Debayan
Esteemed Contributor III
  • 0 kudos

Hi, for reference, see https://community.databricks.com/t5/data-engineering/pull-cluster-tags/td-p/19216. Could you please confirm the key expectation here? Is it extracting the tags as such?

  • 0 kudos
1 More Replies
Vishwanath_Rao
by New Contributor II
  • 1816 Views
  • 2 replies
  • 0 kudos

Same path producing different counts on Databricks and EMR

We're in the middle of migrating to Databricks and found that the same path on S3 is producing different counts between EMR (Spark 2.4.4) and Databricks (Spark 3.4.1). It is a simple spark.read.parquet().count(); we tried multiple solutions like making t...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

  • 0 kudos
1 More Replies
sudhakargen
by New Contributor II
  • 4361 Views
  • 2 replies
  • 0 kudos

Intermittently unavailable: Maven library com.crealytics:spark-excel_2.12:3.5.0_0.20.3

The issue is that the package com.crealytics:spark-excel_2.12:3.5.0_0.20.3 is intermittently unavailable, i.e. most of the time the Excel import works and a few times it fails with an exception (org.apache.spark.SparkClassNotFoundException). I have installed m...

Latest Reply
sudhakargen
New Contributor II
  • 0 kudos

"Looks like the issue is source is not able to reach" - Can you please let me know what you mean by this? The libraries installed on the Databricks cluster are as below. I have a cluster with version 14.2 on which I have installed the Maven library (com.crealyt...

  • 0 kudos
1 More Replies
BartoszBiskupsk
by New Contributor II
  • 1558 Views
  • 2 replies
  • 0 kudos

"Last Access" information for external delta tables (no UC)

Hi, is there a way to audit all tables in hive_metastore (no UC), all of which are external, to check when each was last used (queried / updated / etc.)?

Data Engineering
access logs
Latest Reply
CharlesReily
New Contributor III
  • 0 kudos

Apache Ranger or Apache Sentry can be used for auditing Hive activities. If you have set up auditing in one of these tools, you can review the audit logs to see when tables were accessed. Audit logs are typically stored in a separate location, and yo...

  • 0 kudos
1 More Replies
drii_cavalcanti
by New Contributor III
  • 683 Views
  • 2 replies
  • 0 kudos

Shared Mode Cluster Permission Issue: Editing Folders Across Users

Hi everyone, currently I save logs to a specific folder at the root level in Databricks. However, I need to use a Shared Mode cluster, and it seems I no longer have permission to save to the folder or even open its terminal to access the underlying i...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

  • 0 kudos
1 More Replies
hbs59
by New Contributor III
  • 3489 Views
  • 5 replies
  • 2 kudos

Resolved! REST API Error 404

I am trying to export a notebook or directory using /api/2.0/workspace/export. When I run /api/2.0/workspace/list with a particular URL and path, I get the results that I expect: a list of objects (notebooks and folders) at that location. But when I ru...

Latest Reply
Debayan
Esteemed Contributor III
  • 2 kudos

Hi, could you please remove the parameters (format and direct_download) and confirm?
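
A minimal sketch of the request URL this suggestion leads to, with only the `path` query parameter and no `format` or `direct_download`. The helper name `build_export_url`, the workspace host and the notebook path are hypothetical placeholders; authentication and the actual HTTP call are left out.

```python
from urllib.parse import urlencode


def build_export_url(host: str, path: str) -> str:
    """Build a workspace export URL that carries only the `path` parameter."""
    # urlencode percent-encodes the path so slashes and '@' are safe in a query string.
    query = urlencode({"path": path})
    return f"{host.rstrip('/')}/api/2.0/workspace/export?{query}"


# Placeholder host and notebook path, for illustration only.
url = build_export_url(
    "https://example.cloud.databricks.com",
    "/Users/someone@example.com/my_notebook",
)
print(url)
```

Sending a GET request to this URL with a valid bearer token should show whether the 404 is related to the extra parameters or to the path itself.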

  • 2 kudos
4 More Replies