Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

rt-slowth
by Contributor
  • 1016 Views
  • 2 replies
  • 1 kudos

How to writeStream with Redshift

I have already checked the documentation below, but it does not describe how to write streaming output. Is there a way to write the gold table (a streaming table), which is the output of the Delta Live Tables streaming pipeline, in...

Latest Reply
jose_gonzalez
Moderator
  • 1 kudos

Only batch processing is supported.

1 More Replies
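Since only batch writes are supported by the Redshift connector, the usual workaround is to drain the streaming gold table with foreachBatch, which hands each micro-batch to a batch writer. A minimal sketch, assuming the Databricks Redshift connector (format "redshift") is available; connection details, table names, and paths are placeholders:

```python
# Hedged sketch: stream from the DLT gold table and write each micro-batch
# to Redshift with the batch connector. All connection details are placeholders.
def write_to_redshift(batch_df, batch_id):
    (batch_df.write
        .format("redshift")
        .option("url", "jdbc:redshift://host:5439/db?user=...&password=...")
        .option("dbtable", "public.gold")
        .option("tempdir", "s3a://bucket/tmp/")
        .option("forward_spark_s3_credentials", "true")
        .mode("append")
        .save())

(spark.readStream.table("catalog.schema.gold")
    .writeStream
    .foreachBatch(write_to_redshift)
    .option("checkpointLocation", "/tmp/checkpoints/gold_to_redshift")
    .start())
```

The checkpoint location lets the stream resume from the last micro-batch it successfully delivered.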
umarkhan
by New Contributor II
  • 815 Views
  • 1 reply
  • 0 kudos

Module not found when using applyInPandasWithState in Repos

I should start by saying that everything works fine if I copy and paste it all into a notebook and run it. The problem starts if we try to have any structure in our application repository. Also, so far we have only run into this problem with applyInP...

Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

Which DBR version are you using? Does it work on non-DLT jobs?

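For context on the question above: functions passed to applyInPandasWithState execute on the executors, which do not automatically import modules that only live in a Repos folder. A hedged sketch of two common workarounds, with a placeholder repo path:

```python
import sys

# Assumption: the repo keeps importable code under a `src` folder.
repo_src = "/Workspace/Repos/<user>/<repo>/src"  # placeholder path

# Option 1: put the repo on sys.path before the stateful function is defined,
# so the module can be resolved when the function is unpickled on executors.
sys.path.append(repo_src)

# Option 2 (often more robust): build the package as a wheel and install it
# as a cluster library, e.g. in a notebook cell:
#   %pip install /dbfs/FileStore/wheels/mypkg-0.1-py3-none-any.whl
# so every executor node has the module available.
```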
sher
by Valued Contributor II
  • 601 Views
  • 1 reply
  • 0 kudos

Did anyone face this issue in a Delta table while generating a manifest file?

Error message: "Manifest generation is not supported for tables that leverage column mapping, as external readers cannot read these Delta tables." Why did I get this issue? Not sure whether we need to run any additional process.

Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

Could you please share the full stack trace and the repro steps?

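The error itself is expected behavior: symlink manifest generation is incompatible with Delta column mapping. A quick, hedged way to confirm that column mapping is enabled on the affected table (the table name is a placeholder):

```python
# The manifest error appears when delta.columnMapping.mode is 'name' or 'id'.
props = spark.sql("SHOW TBLPROPERTIES my_schema.my_table").collect()
for row in props:
    if row["key"].startswith("delta.columnMapping"):
        print(row["key"], "=", row["value"])
```

If column mapping was only enabled incidentally (for example, to rename or drop columns), recreating the table without it restores manifest generation; external readers that rely on manifests cannot read column-mapped tables either way.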
VishalD
by New Contributor
  • 527 Views
  • 1 reply
  • 0 kudos

Not able to load nested XML file with struct type

Hello Experts, I am trying to load XML with a struct type and an xsi:type attribute. Below is a sample XML format: <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="htt...

Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

You can try the from_xml() function; here is the link to the docs: https://docs.databricks.com/en/sql/language-manual/functions/from_xml.html

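A minimal, hedged sketch of the from_xml() suggestion; it assumes a runtime where the XML functions are available (DBR 14.1+ per the linked docs), and the column name and schema are placeholders rather than the poster's actual SOAP structure:

```python
from pyspark.sql.functions import expr, col

# Placeholder DataFrame with one raw-XML string column.
df = spark.createDataFrame(
    [("<item><id>1</id><name>abc</name></item>",)], ["payload"]
)

# from_xml parses the string into a struct using a DDL-style schema string.
parsed = df.withColumn("parsed", expr("from_xml(payload, 'id INT, name STRING')"))
parsed.select(col("parsed.id"), col("parsed.name")).show()
```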
dbx_687_3__1b3Q
by New Contributor III
  • 2999 Views
  • 2 replies
  • 2 kudos

Resolved! Databricks Asset Bundle (DAB) from existing workspace?

Can anyone point us to some documentation that explains how to create a DAB from an EXISTING workspace? We've been building pipelines, notebooks, tables, etc. in a single workspace, and a DAB seems like a great way to deploy it all to our Test and Prod...

Latest Reply
Kaniz_Fatma
Community Manager
  • 2 kudos

Hi @dbx_687_3__1b3Q, yes, you can create a Databricks Asset Bundle (DAB) from an existing workspace that contains pipelines, notebooks, tables, and other Databricks assets. To create a DAB from an existing workspace, you can use the Databricks CL...

1 More Replies
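A hedged sketch of the CLI route the reply starts to describe: recent Databricks CLI versions (roughly v0.212+) ship `databricks bundle generate`, which scaffolds bundle configuration from existing workspace resources. The job ID below is a placeholder, and the exact subcommand and flags should be verified with `databricks bundle generate --help`:

```python
import subprocess

# Assumption: Databricks CLI is installed and authenticated, this CLI version
# supports `bundle generate`, and we are inside an initialized bundle folder.
subprocess.run(
    ["databricks", "bundle", "generate", "job", "--existing-job-id", "123"],
    check=True,
)
# The generated resource YAML can then be deployed to other workspaces,
# e.g. `databricks bundle deploy -t test`.
```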
hprasad
by New Contributor III
  • 2775 Views
  • 7 replies
  • 1 kudos

Spark reads GZ file as corrupted data when the file extension has .GZ in upper case

If the file is renamed to file_name.sv.gz (lower-case extension) it works fine; with file_name.sv.GZ (upper-case extension) the data is read as corrupted, meaning Spark simply reads the compressed file as-is.

Data Engineering
gzip files
spark-csv
spark.read.csv
Latest Reply
Lakshay
Esteemed Contributor
  • 1 kudos

Agreed, but Spark infers the compression from your filename, and it cannot infer the compression from the .GZ extension. You can read more about this in the article below: https://aws.plainenglish.io/demystifying-apache-spark-quirks-2c91ba2d3978

6 More Replies
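Since Spark picks the codec from the (case-sensitive) file extension, one hedged workaround is to normalize the extensions before reading; the directory path is a placeholder:

```python
# Rename *.GZ files to *.gz so Spark's extension-based codec inference
# recognizes them as gzip instead of reading the raw compressed bytes.
src_dir = "dbfs:/mnt/raw/incoming/"  # placeholder path

for f in dbutils.fs.ls(src_dir):
    if f.name.endswith(".GZ"):
        dbutils.fs.mv(f.path, f.path[:-3] + ".gz")

df = spark.read.option("delimiter", "^").csv(src_dir)
```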
vishwanath_1
by New Contributor III
  • 1474 Views
  • 5 replies
  • 1 kudos

I am reading a 130 GB CSV file with multiLine true and it is taking 4 hours just to read

Reading the 130 GB file without multiLine true takes 6 minutes, but my file has multi-line data. How can I speed up the reading time here? I am using the command below: InputDF=spark.read.option("delimiter","^").option("header",false).option("encoding","UTF-8"...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

4 More Replies
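Background that explains the 4-hour read: with multiLine=true, Spark cannot split a file on newline boundaries, so a single 130 GB file is effectively read by one task. The usual mitigation is to split the input into many smaller files first; a hedged sketch with placeholder paths:

```python
# With multiLine=true each file is non-splittable, so parallelism is capped
# at the number of input files. Pointing the same read at a directory of
# many smaller files (pre-split upstream) restores parallelism.
InputDF = (
    spark.read
    .option("delimiter", "^")
    .option("header", "false")
    .option("encoding", "UTF-8")
    .option("multiLine", "true")
    .csv("dbfs:/mnt/raw/split_files/")  # placeholder: directory of smaller files
)
```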
SimDarmapuri
by New Contributor II
  • 698 Views
  • 1 reply
  • 1 kudos

Databricks Deployment using Data Thirst

Hi, I am trying to deploy Databricks notebooks to different environments using Azure DevOps and the third-party extension Data Thirst (Databricks Script Deployment Task by Data Thirst). The pipeline is able to generate/download artifacts but not able to...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

The extension is quite old and does not know about Unity Catalog, so that is probably the reason why it fails. But why do you use the extension for notebook propagation from dev to prod? You can do this using Repos, feature branches, and pull requests...

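A hedged sketch of the Repos-based promotion the reply suggests: once a pull request merges to the release branch, update the repo checked out in the production workspace to that branch. The repo path and branch name are placeholders, and the exact CLI syntax varies by version, so check `databricks repos update --help`:

```python
import subprocess

# Assumption: Databricks CLI authenticated against the prod workspace,
# where /Repos/deploy/my-project is an existing repo checkout.
subprocess.run(
    ["databricks", "repos", "update", "/Repos/deploy/my-project",
     "--branch", "release"],
    check=True,
)
```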
Michael_Appiah
by New Contributor III
  • 875 Views
  • 1 reply
  • 1 kudos

Resolved! Display limits in Catalog Explorer

It seems the Catalog Explorer can only display a maximum of 1000 folders within a UC Volume. I just ran into this issue when I added new folders to a volume and they were not displayed in the Catalog Explorer (only folders 1-1000 were). I was able to r...

Latest Reply
Lakshay
Esteemed Contributor
  • 1 kudos

Hi @Michael_Appiah , This is a known limitation: https://docs.databricks.com/en/connect/unity-catalog/volumes.html#limitations

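Given that documented display limit, a hedged workaround for verifying the folders exist is to list the volume programmatically; the volume path is a placeholder:

```python
# dbutils.fs.ls is not subject to the Catalog Explorer's 1000-item display cap.
entries = dbutils.fs.ls("/Volumes/my_catalog/my_schema/my_volume/")
print(len(entries))  # full count, including folders the UI does not show
```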
jonathan-dufaul
by Valued Contributor
  • 2509 Views
  • 4 replies
  • 0 kudos

Resolved! Is there a command in SQL cells to ignore formatting for some lines, like `# fmt: off` in Python cells?

In Python cells I can add the comment `# fmt: off` before a block of code that I want black/the autoformatter to ignore and `# fmt: on` afterwards. Is there anything similar I can put in SQL cells to accomplish the same effect? Some of the recommendation...

Data Engineering
autoformatter
formatter
sql
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

3 More Replies
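For reference, the Python directive the poster describes looks like the block below. To my knowledge the SQL cell formatter offered no equivalent directive at the time, so one hedged workaround is to keep hand-formatted SQL in a Python cell via spark.sql:

```python
# fmt: off
# black / the notebook autoformatter leaves this block's alignment alone.
lookup = {
    "a":   1,
    "bb":  22,
    "ccc": 333,
}
# fmt: on

# SQL whose layout must be preserved can live in a Python cell instead:
spark.sql("""
    SELECT col_a,
           col_b
    FROM   my_table
""")
```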
vishwanath_1
by New Contributor III
  • 836 Views
  • 1 reply
  • 0 kudos

Resolved! Need suggestion for a better caching strategy

I have the below steps to perform: 1. Read a CSV file (considerably huge, ~100 GB). 2. Add an index using the zipWithIndex function. 3. Repartition the DataFrame. 4. Pass it on to another function. Can you suggest the best optimized caching strategy to execute these c...

Latest Reply
Lakshay
Esteemed Contributor
  • 0 kudos

Hi @vishwanath_1, caching only comes into the picture when there are multiple references to the data source in your code. As per the flow you mentioned, I don't see that being the case for you. You are only reading the data from the source once, and also there...

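A hedged sketch of the flow the reply is commenting on, with the cache call left commented out since it only pays off when the result is referenced more than once; the path and partition count are placeholders:

```python
# 1. Read the ~100 GB CSV (placeholder path and options).
df = spark.read.option("header", "true").csv("dbfs:/mnt/raw/huge.csv")

# 2. Add a row index; zipWithIndex is an RDD operation, so convert and rebuild.
indexed = df.rdd.zipWithIndex().map(lambda pair: (*pair[0], pair[1]))
indexed_df = spark.createDataFrame(indexed, df.columns + ["row_index"])

# 3. Repartition before handing off (200 partitions is an assumption).
result = indexed_df.repartition(200)

# 4. Cache only if `result` is used more than once downstream; in a single
#    pass it just adds serialization and memory/disk overhead.
# result.cache()
```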
Pratibha
by New Contributor II
  • 2107 Views
  • 4 replies
  • 1 kudos

Want to set execution termination time/timeout limit for job in job config

Hi, I want to set an execution termination time/timeout limit for a job in the job config file. Please help me with how I can do this by passing some parameter in the job config file.

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

3 More Replies
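For what it's worth, the Jobs API (2.1) does expose a `timeout_seconds` field at both the job and the task level, and a run that exceeds it is terminated. A hedged sketch of the relevant part of a job config payload; names and paths are placeholders:

```python
# Partial Jobs API 2.1 payload; only the timeout fields matter here.
job_config = {
    "name": "nightly-etl",            # placeholder job name
    "timeout_seconds": 3600,          # terminate the whole run after 1 hour
    "tasks": [
        {
            "task_key": "main",
            "timeout_seconds": 1800,  # optional per-task limit
            "notebook_task": {"notebook_path": "/Jobs/etl"},
        }
    ],
}
```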
ElaPG
by New Contributor III
  • 2607 Views
  • 2 replies
  • 1 kudos

Notebook naming convention

I have read the info about object names, but are there any best practices regarding a notebook naming convention?

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

1 More Replies
cyong
by New Contributor II
  • 771 Views
  • 2 replies
  • 0 kudos

Disable CDF on DLT tables

Hi, I noticed Change Data Feed (CDF) is enabled by default for the bronze and gold tables running in DLT. How can I check the size of the Delta log? Can it be turned off?

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

1 More Replies
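If turning CDF off is the goal, DLT table definitions accept Delta table properties, so a hedged sketch looks like the following; table names are placeholders, and note that DLT may keep CDF on for tables that need it (such as APPLY CHANGES targets). The second snippet shows one way to gauge the Delta log size for an existing table:

```python
import dlt

@dlt.table(
    name="gold_orders",  # placeholder table name
    table_properties={"delta.enableChangeDataFeed": "false"},
)
def gold_orders():
    return spark.read.table("LIVE.silver_orders")  # placeholder source
```

```python
# Rough _delta_log size for an existing table (placeholder path); run this
# in a regular notebook, not inside the DLT pipeline.
log_files = dbutils.fs.ls("dbfs:/mnt/tables/gold_orders/_delta_log/")
print(sum(f.size for f in log_files) / 1024**2, "MiB")
```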
Ravikumashi
by Contributor
  • 980 Views
  • 2 replies
  • 0 kudos

Extract cluster usage tags from a Databricks cluster init script

Is it possible to extract cluster usage tags from a Databricks cluster init script? I am specifically interested in spark.databricks.clusterUsageTags.clusterAllTags. I tried to extract it from /databricks/spark/conf/spark.conf and /databricks/spark/conf/sp...

Data Engineering
Azure Databricks
Latest Reply
Debayan
Esteemed Contributor III
  • 0 kudos

Hi, for reference: https://community.databricks.com/t5/data-engineering/pull-cluster-tags/td-p/19216. Could you please confirm the key expectation here? Extracting the tags as such?

1 More Replies
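For comparison, once the cluster is up the tags are readable from the Spark conf in a notebook or job; inside an init script the Spark conf files may not be fully written yet, which is likely what the poster is hitting. A hedged notebook-side sketch:

```python
import json

# clusterAllTags is a JSON array of {"key": ..., "value": ...} entries.
raw = spark.conf.get("spark.databricks.clusterUsageTags.clusterAllTags")
tags = {t["key"]: t["value"] for t in json.loads(raw)}
print(tags)
```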
