Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

etao
by New Contributor II
  • 140 Views
  • 2 replies
  • 1 kudos

Resolved! How to distribute pyspark dataframe repartition and row count on Databricks?

Trying to compare large datasets for discrepancies. The datasets come from two database tables, each with around 500 million rows. I use PySpark subtract and joins (leftanti, leftsemi) to sort out the differences. To distribute the workload, I need to repar...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @etao, To distribute the workload effectively, try repartitioning by the join key column or increasing the number of partitions. Use coalesce to reduce partitions without shuffling data. For better performance, consider broadcast joins for smaller...

1 More Replies
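The advice in the reply above can be made concrete with a small back-of-envelope helper. This is a sketch, not from the thread: the function name, the 128 MB per-partition target, and the bytes-per-row estimate are all assumptions; the resulting count would then be passed to a call like df.repartition(n, "join_key").

```python
# Rough sketch (assumptions labeled): pick a partition count aiming at
# ~128 MB per partition, with a floor so highly compressed data still
# spreads across the cluster.

def suggest_partitions(total_size_bytes,
                       target_partition_bytes=128 * 1024 * 1024,
                       min_partitions=200):
    """Return a partition count targeting ~128 MB per partition."""
    by_size = -(-total_size_bytes // target_partition_bytes)  # ceil division
    return max(by_size, min_partitions)

# e.g. ~500M rows at a guessed ~100 bytes/row is roughly 50 GB
print(suggest_partitions(50 * 1024**3))  # 400
```

Repartitioning both DataFrames by the join key with the same count (df.repartition(n, "join_key")) co-locates matching rows before the leftanti/leftsemi joins, which is usually the main lever for this kind of comparison.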
ToReSa
by New Contributor
  • 156 Views
  • 5 replies
  • 1 kudos

Read each cell containing SQL from one notebook, execute it in another notebook, and export the result

Hi, I'm new to Databricks, so excuse me if the question is a silly one. I have a requirement to read cell by cell from one notebook (say notebookA) and execute the contents of each cell in another notebook (say notebookB) using a Python script. All the...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @ToReSa, If you just want to execute the notebook, calling another notebook would be easier. You can even exchange some data between the notebooks. But if you specifically want to pick each SQL from one notebook and execute it in another notebook,...

4 More Replies
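One way to do the cell-by-cell pickup the reply hints at is to export notebookA in source format and split on the cell separator. This is a hedged sketch, not from the thread: it assumes the standard Databricks "source" export format for SQL notebooks (header line plus "-- COMMAND ----------" separators); the sample source is hand-made, so verify the marker against your own exported notebook.

```python
# Sketch: split an exported Databricks SQL notebook into per-cell SQL
# strings; notebookB could then loop over them with spark.sql(cell).

CELL_SEP = "-- COMMAND ----------"
HEADER = "-- Databricks notebook source"

def split_sql_cells(source: str):
    """Return the SQL text of each cell, header and separators stripped."""
    body = source.replace(HEADER, "", 1)
    return [cell.strip() for cell in body.split(CELL_SEP) if cell.strip()]

sample = """-- Databricks notebook source
SELECT 1 AS a

-- COMMAND ----------

SELECT 2 AS b
"""
print(split_sql_cells(sample))  # ['SELECT 1 AS a', 'SELECT 2 AS b']
```

The source text itself can be fetched with the Workspace Export API (format=SOURCE); results of each spark.sql call can then be collected or written out from notebookB.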
sharukh_lodhi
by New Contributor III
  • 39 Views
  • 0 replies
  • 0 kudos

Azure IMDS is not accessible when selecting the shared compute policy

Hi, Databricks community, I recently encountered an issue while using the 'azure.identity' Python library on a cluster set to the personal compute policy in Databricks. In this case, Databricks successfully returns the Azure Databricks managed user id...

Data Engineering
azure IMDS
DefaultAzureCredential
PiotrU
by Contributor
  • 876 Views
  • 5 replies
  • 1 kudos

Resolved! Adding extra libraries to databricks (rosbag)

Hello, I have an interesting challenge: I am required to install a few libraries that are part of the rosbag packages, to allow some data deserialization tasks. While creating the cluster I use an init_script that installs this software using apt: sudo apt upd...

Latest Reply
amandaK
New Contributor
  • 1 kudos

@PiotrU did adding the path to sys.path resolve all of your ModuleNotFoundErrors? I'm trying to do something similar, and adding the path to sys.path resolved the ModuleNotFoundError for rclpy, but I continue to see others related to ros

4 More Replies
subhas_1729
by New Contributor
  • 24 Views
  • 1 reply
  • 0 kudos

CSV file and partitions

Hi, I want to know whether CSV files can be partitioned or not. I read in a book that only certain file types, such as Parquet and Avro, can be partitioned. Regards, Subhas

Latest Reply
Witold
Contributor
  • 0 kudos

Basically every file type can be partitioned, as technically partitions are just subfolders.

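The reply's point, that partitioning is just a folder convention, can be shown without Spark at all. This sketch builds the Hive-style key=value layout by hand in a temp directory; it mirrors what a call like df.write.partitionBy("country").csv(path) produces (column name and data are made up for illustration).

```python
# Sketch: Hive-style partitioning is a directory layout, so it applies
# to CSV just as well as to Parquet or Avro. The partition column value
# lives in the folder name, not in the file contents.
import csv, os, tempfile

root = tempfile.mkdtemp()
rows = [("alice", "US"), ("bob", "DE"), ("carol", "US")]

for name, country in rows:
    part_dir = os.path.join(root, f"country={country}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-0000.csv"), "a", newline="") as f:
        csv.writer(f).writerow([name])  # partition column stays in the path

print(sorted(os.listdir(root)))  # ['country=DE', 'country=US']
```

When Spark reads such a tree back (spark.read.csv(root)), partition discovery recovers the country column from the directory names, and filters on it prune whole folders.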
Wenhui
by New Contributor II
  • 98 Views
  • 3 replies
  • 0 kudos

How to troubleshoot in a user's environment

Hi team, I want to do a POC, but I have a question that confuses me: if your team's engineers need to access our data plane environment to troubleshoot for us, how do they get permission to access our environment? Could you help me? Thank you very much.

Latest Reply
Slash
New Contributor III
  • 0 kudos

Hi @Wenhui, what's your setup? Which cloud provider? Do you use Unity Catalog?

2 More Replies
labromb
by Contributor
  • 10813 Views
  • 9 replies
  • 4 kudos

How to pass configuration values to a Delta Live Tables job through the Delta Live Tables API

Hi Community, I have successfully run a job through the API, but I need to be able to pass parameters (configuration) to the DLT workflow via the API. I have tried passing JSON in this format: { "full_refresh": "true", "configuration": [ ...

Latest Reply
Manjula_Ganesap
Contributor
  • 4 kudos

@Mo - it worked. Thank you so much.

8 More Replies
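Two details in the question's JSON are worth flagging. This is a hedged sketch based on my reading of the public Pipelines API, not the fix the thread confirmed; treat the endpoint shapes as assumptions and verify against the current API reference.

```python
# Sketch: two likely issues with the payload in the question.
import json

# 1) In the start-update call (POST /api/2.0/pipelines/{id}/updates),
#    full_refresh is a JSON boolean, not the string "true".
start_update = {"full_refresh": True}

# 2) "configuration" appears to be a string-to-string map on the pipeline
#    settings (edited via the pipeline-settings endpoint), not a list
#    passed to the update call. Key/value here are placeholders.
settings_patch = {"configuration": {"my_param": "my_value"}}

payload = json.dumps(start_update)
print(json.loads(payload))  # {'full_refresh': True}
```

Inside the pipeline's notebooks, such configuration values are then read back with spark.conf.get("my_param") (the key name here is a placeholder).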
seeker
by Visitor
  • 38 Views
  • 1 reply
  • 0 kudos

Get metadata of files present in a zip

I have a .zip file present on an ADLS path which contains multiple files of different formats. I want to get metadata of the files like file name, modification time present in it without unzipping it. I have a code which works for smaller zip but run...

Latest Reply
seeker
Visitor
  • 0 kudos

Here is the code which I am using:

def register_udf():
    def extract_file_metadata_from_zip(binary_content):
        metadata_list = []
        with io.BytesIO(binary_content) as bio:
            with zipfile.ZipFile(bio, "r") as zip_ref:
                ...

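The core of the approach in the question can be shown self-contained: a zip's central directory already holds each member's name, size, and modification time, so ZipFile.infolist() never decompresses anything. This sketch builds a tiny in-memory zip to inspect (file names and contents are made up).

```python
# Sketch: read zip member metadata without extracting the members.
import io, zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:   # build a small zip to inspect
    zf.writestr("a.csv", "x" * 100)
    zf.writestr("b.json", "{}")

with zipfile.ZipFile(buf, "r") as zf:
    meta = [(i.filename, i.file_size, i.date_time) for i in zf.infolist()]

print([m[0] for m in meta])  # ['a.csv', 'b.json']
```

For very large archives the likely bottleneck is not the listing but shipping the whole blob into the UDF: the central directory sits at the end of the zip, so a ranged read of the file's tail from ADLS can avoid downloading the entire archive. That is a design direction to verify, not something this snippet implements.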
Ulman
by New Contributor II
  • 1344 Views
  • 8 replies
  • 0 kudos

Switching to File Notification Mode with ADLS Gen2 - Encountering StorageException

Hello, we are currently utilizing Auto Loader in file listing mode for a stream, which is experiencing significant latency due to the non-incremental naming of files in the directory, a condition that cannot be altered. In an effort to mitigate this...

Data Engineering
ADLS gen2
autoloader
file notification mode
Latest Reply
Rah_Cencora
  • 0 kudos

You should also reevaluate your use of premium storage for your landing area files. Typically, storage for raw files does not need to be the fastest and most resilient and expensive tier. Unless you have a compelling reason for premium storage for la...

7 More Replies
ibrahim21124
by New Contributor III
  • 566 Views
  • 7 replies
  • 0 kudos

Autoloader File Notification Mode not working as expected

I am using this given code to read from a source location in an ADLS Gen2 Azure Storage container:

core_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("multiLine", "false")
    .option(...

Latest Reply
Rishabh_Tiwari
Community Manager
  • 0 kudos

Hi @ibrahim21124 , Thank you for reaching out to our community! We're here to help you.  To ensure we provide you with the best support, could you please take a moment to review the response and choose the one that best answers your question? Your fe...

6 More Replies
YFL
by New Contributor III
  • 4887 Views
  • 12 replies
  • 6 kudos

Resolved! When delta is a streaming source, how can we get the consumer lag?

Hi, I want to keep track of the streaming lag from the source table, which is a Delta table. I see that in the query progress logs there is some information about the last version and the last file in the version for the end offset, but this doesn't give ...

Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hey @Yerachmiel Feltzman, I hope all is well. Just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you. Thanks!

11 More Replies
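One way to turn the progress-log information the question mentions into a lag number is to compare the Delta source's endOffset with the table's latest version. This is a hedged sketch: the progress dict below is hand-made, the exact shape of endOffset (JSON string vs. nested object, and the reservoirVersion field) varies by runtime version, and the latest table version would come from something like DESCRIBE HISTORY ... LIMIT 1.

```python
# Sketch: estimate consumer lag in *table versions* for a Delta streaming
# source from a query.lastProgress-style dict.
import json

last_progress = {          # hand-made example of the reported shape
    "sources": [{
        "description": "DeltaSource[dbfs:/tables/events]",
        "endOffset": json.dumps({"reservoirVersion": 40, "index": 2}),
    }]
}

def version_lag(progress, latest_table_version):
    """Versions between the table head and the stream's end offset."""
    end = json.loads(progress["sources"][0]["endOffset"])
    return latest_table_version - end["reservoirVersion"]

print(version_lag(last_progress, 45))  # 5 versions behind
```

Version lag is a coarse proxy; if versions land at a steady cadence it can be converted to an approximate time lag, which is usually what alerting actually needs.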
lshar
by New Contributor III
  • 20315 Views
  • 8 replies
  • 5 kudos

Resolved! How do I pass arguments/variables from widgets to notebooks?

Hello, I am looking for a solution to this problem, which has been known for 7 years: https://community.databricks.com/s/question/0D53f00001HKHZfCAP/how-do-i-pass-argumentsvariables-to-notebooks What I need is to parametrize my notebooks using widget infor...

Latest Reply
T_Ash
Visitor
  • 5 kudos

Can we create paginated reports with multiple parameters (where one parameter dynamically changes another), or can we pass one variable from one dataset to another dataset, like a Power BI paginated report, using a Databricks dashboard? Please let me know...

7 More Replies
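The pattern this thread converges on is widgets plus dbutils.notebook.run, whose arguments map onto the callee's widgets. A hedged sketch of the callee side, with a fallback so the same code also runs outside a notebook (dbutils is only defined in a Databricks notebook context; the parameter name and default are placeholders):

```python
# Sketch: read a widget value, falling back to a default when dbutils
# is not defined (i.e. when running outside a Databricks notebook).

def get_param(name, default):
    try:
        return dbutils.widgets.get(name)  # noqa: F821 - notebook global
    except NameError:
        return default

print(get_param("env", "dev"))  # 'dev' outside a notebook
```

The caller side would then look like dbutils.notebook.run("notebookA", 60, {"env": "prod"}), with the passed mapping showing up as widget values in the called notebook.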