Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

databricksdev
by New Contributor II
  • 875 Views
  • 1 reply
  • 0 kudos

Capture Automatically Added tags

Can we capture automatically added tags (e.g., RunName) from an Azure Databricks job cluster into parameters or custom tags in Azure Data Factory?

  • 875 Views
  • 1 reply
  • 0 kudos
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @databricksdev,  Azure Databricks applies default tags to each cluster, including Vendor, Creator, ClusterName, and ClusterId. In addition, it applies two default tags to job clusters: RunName and JobId. However, these tags are only applied to...
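One way to surface those tags from inside the job itself is to read the cluster's Spark conf and hand the values back through the notebook exit value, which the Data Factory notebook activity can pick up from its output. A minimal sketch, assuming the usual spark.databricks.clusterUsageTags.clusterAllTags conf key is populated on the job cluster:

```python
import json

# The cluster's tags are exposed as a JSON-encoded list of {"key", "value"} pairs.
raw = spark.conf.get("spark.databricks.clusterUsageTags.clusterAllTags")
tags = {t["key"]: t["value"] for t in json.loads(raw)}

# Hand RunName/JobId back to the caller (e.g., Azure Data Factory reads this
# from the notebook activity's output).
dbutils.notebook.exit(json.dumps({
    "RunName": tags.get("RunName"),
    "JobId": tags.get("JobId"),
}))
```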

  • 0 kudos
ac0
by New Contributor III
  • 828 Views
  • 2 replies
  • 0 kudos

Resolved! Is it more performant to run optimize table commands on a serverless SQL warehouse or elsewhere?

Is it more performant to run optimize table commands on a serverless SQL warehouse or in a job or all-purpose compute cluster? I would presume a serverless warehouse would be faster, but I don't know how to test this.

  • 828 Views
  • 2 replies
  • 0 kudos
Latest Reply
Yeshwanth
Honored Contributor
  • 0 kudos

@ac0 Good day! Serverless SQL warehouses are likely to execute "optimize table" commands faster than job or all-purpose compute clusters due to their rapid startup time, quick upscaling for low latency, and efficient handling of varying query demand....
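To actually test this, as the question asks, one option is to time the identical statement on each compute type. A minimal sketch for the job/all-purpose side (the table name is a placeholder; on a SQL warehouse the same OPTIMIZE statement can be issued from the SQL editor or the Statement Execution API and timed there):

```python
import time

# Time the same maintenance command on each compute type and compare.
start = time.time()
spark.sql("OPTIMIZE catalog.schema.my_table")  # placeholder table name
print(f"OPTIMIZE took {time.time() - start:.1f}s")
```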

  • 0 kudos
1 More Replies
NTRT
by New Contributor III
  • 740 Views
  • 1 reply
  • 0 kudos

How to transform a json-stat 2 file to a Spark DataFrame? How to keep order in a MapType structure?

Hi, I am using different JSON files of type json-stat2. This kind of JSON file is quite commonly used by national statistics bureaus. It is multi-dimensional with multiple arrays. In a Python environment we can use the pyjstat package to easily transform json...
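For context, the pyjstat route mentioned in the post typically looks like this: read the json-stat file into a pandas DataFrame, then convert it to Spark. A minimal sketch, assuming a hypothetical file path:

```python
from pyjstat import pyjstat

# Parse a json-stat 2.0 document into a pandas DataFrame via pyjstat...
with open("/dbfs/mnt/data/stats.json") as f:  # hypothetical path
    dataset = pyjstat.Dataset.read(f.read())
pdf = dataset.write("dataframe")

# ...then hand it to Spark.
df = spark.createDataFrame(pdf)
```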

  • 740 Views
  • 1 reply
  • 0 kudos
Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

MapType does not maintain order (JSON objects do not guarantee order either). Can you apply the ordering yourself afterwards?
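One way to apply that ordering afterwards, as a minimal sketch (my_map is a placeholder column name): convert the map to an array of entries and sort it, which yields a deterministic key order.

```python
from pyspark.sql import functions as F

# map_entries() turns the map into an array of (key, value) structs;
# array_sort() orders the array by key, giving a stable, deterministic order.
ordered = df.select(F.array_sort(F.map_entries("my_map")).alias("entries"))
```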

  • 0 kudos
NTRT
by New Contributor III
  • 759 Views
  • 2 replies
  • 0 kudos

Can't read a JSON file of just 1.75 MiB?

Hi, I am relatively new to Databricks, although I am conscious of lazy evaluation, transformations and actions, and persistence. I have a complex, nested JSON file of about 1.73 MiB. When df = spark.read.option("multiLine", "false").json('dbfs:/mnt...

  • 759 Views
  • 2 replies
  • 0 kudos
Latest Reply
koushiknpvs
New Contributor III
  • 0 kudos

This can be resolved by redefining the schema structure explicitly and using that schema to read the file.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
# Define the schema according to the JSON structure
sch...
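The reply's code is cut off in the preview; a minimal sketch of the approach it describes, with hypothetical field names:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# Define the schema explicitly instead of letting Spark infer it, which
# avoids an expensive inference pass over deeply nested JSON.
schema = StructType([
    StructField("id", StringType()),
    StructField("label", StringType()),
    StructField("values", ArrayType(IntegerType())),
])

df = spark.read.schema(schema).option("multiLine", "false").json("dbfs:/mnt/...")  # path elided in the post
```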

  • 0 kudos
1 More Replies
NTRT
by New Contributor III
  • 1635 Views
  • 4 replies
  • 0 kudos

Resolved! Performance issues when reading json-stat2

Hi, I am relatively new to Databricks, although I am conscious of lazy evaluation, transformations and actions, and persistence. I have a complex, nested JSON file of about 1.73 MiB. When df = spark.read.option("multiLine", "false").json('dbfs:/mnt...

  • 1635 Views
  • 4 replies
  • 0 kudos
Latest Reply
koushiknpvs
New Contributor III
  • 0 kudos

Please give me a kudos if this works. Efficiency in Data Collection: Using .collect() on large datasets can lead to out-of-memory errors, as it collects all rows to the driver node. If the dataset is large, consider alternatives such as extracting only...
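A minimal sketch of the kind of alternatives the reply alludes to (column names and limits are illustrative):

```python
# Instead of df.collect(), bring back only what the driver actually needs:
sample = df.select("id", "label").limit(100).collect()  # few columns, bounded rows

# ...or keep the work distributed and write results out instead of collecting:
df.write.mode("overwrite").parquet("dbfs:/mnt/output/")  # hypothetical path
```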

  • 0 kudos
3 More Replies
Mathias_Peters
by Contributor
  • 778 Views
  • 2 replies
  • 0 kudos

Asset Bundles: Adding project_directory in DBT task breaks previous python task

Hi, I have a job consisting of three tasks:

tasks:
  - task_key: Kinesis_to_S3_new
    spark_python_task:
      python_file: ../src/kinesis.py
      parameters: ["${var.stream_region}", "${var.s3_base_path}"]
    j...

  • 778 Views
  • 2 replies
  • 0 kudos
Latest Reply
Mathias_Peters
Contributor
  • 0 kudos

Hi @Ajay-Pandey, thank you for the hints. I will try to recreate the job via the UI. I ran the tasks in a GitHub workflow. The file locations are mixed: the first two tasks (python and dlt) are located in the databricks/src folder. The dbt files come fro...

  • 0 kudos
1 More Replies
chandan_a_v
by Valued Contributor
  • 1868 Views
  • 2 replies
  • 1 kudos

Can't import local files under repo

I have a YAML file inside one of the subdirectories in Databricks, and I have appended the repo path to sys.path. Still, I can't access this file. https://docs.databricks.com/_static/notebooks/files-in-repos.html
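For reference, the pattern from the linked files-in-repos notebook looks roughly like this; a minimal sketch with a hypothetical repo path and file name:

```python
import sys
import yaml

# Make files/modules in the repo reachable from the notebook.
repo_root = "/Workspace/Repos/me@example.com/my-repo"  # hypothetical
sys.path.append(repo_root)

# Plain-file access then works with ordinary Python I/O.
with open(f"{repo_root}/conf/settings.yml") as f:  # hypothetical file
    settings = yaml.safe_load(f)
```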

  • 1868 Views
  • 2 replies
  • 1 kudos
Latest Reply
Abhishek10745
New Contributor III
  • 1 kudos

Hello @chandan_a_v, were you able to solve this issue? I am also experiencing the same thing, where I cannot move a file with extension .yml from a repo folder to a shared workspace folder. As per the documentation, this is a limitation or functionality of data...

  • 1 kudos
1 More Replies
zero234
by New Contributor III
  • 564 Views
  • 1 reply
  • 0 kudos

Delta Live Table is inserting data multiple times

So I have created a Delta Live Table which uses spark.sql() to execute a query and uses df.write.mode("append").insertInto() to insert data into the respective table, and at the end I return a dummy table, since this was the requirement. So now I have also ...

  • 564 Views
  • 1 reply
  • 0 kudos
Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

What's your source? Your sink is a Delta table, correct? How do you verify that there are no inserts happening?

  • 0 kudos
Meshynix
by New Contributor III
  • 4762 Views
  • 6 replies
  • 0 kudos

Resolved! Not able to create external table in a schema under a Catalog.

Problem Statement: Cluster 1 (Shared Cluster) is not able to read the file location at "dbfs:/mnt/landingzone/landingzonecontainer/Inbound/", and hence we are not able to create an external table in a schema inside the Enterprise Catalog. Cluster 2 (No Isola...

  • 4762 Views
  • 6 replies
  • 0 kudos
Latest Reply
Avi_Bricks
New Contributor II
  • 0 kudos

External table creation is failing with the error: UnityCatalogServiceException: [RequestId=**** ErrorClass=INVALID_PARAMETER_VALUE] Unsupported path operation PATH_CREATE_TABLE on volume. I am able to access and create files on the external location.

  • 0 kudos
5 More Replies
pshuk
by New Contributor III
  • 1045 Views
  • 2 replies
  • 0 kudos

Run MD5 using the CLI

Hi, I want to run an MD5 checksum on a file uploaded to Databricks. I can generate the MD5 on the local file, but how do I generate one on the uploaded file on Databricks using the CLI (command-line interface)? Any help would be appreciated. I tried running databr...

  • 1045 Views
  • 2 replies
  • 0 kudos
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @pshuk, Unfortunately, the databricks fs md5 command is not supported directly. You can run a Python script to compute the MD5 hash of the uploaded file. If your uploaded file is stored in Azure Blob Storage, you can use the azcopy tool to calcula...
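A minimal sketch of the Python route the reply suggests, assuming a hypothetical upload path (on a cluster, dbfs:/ paths are mirrored under the local /dbfs mount):

```python
import hashlib

# Stream the file in chunks so large uploads don't exhaust driver memory.
path = "/dbfs/mnt/uploads/myfile.csv"  # hypothetical upload location
md5 = hashlib.md5()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print(md5.hexdigest())
```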

  • 0 kudos
1 More Replies
Amit_Dass_Chmp
by New Contributor III
  • 591 Views
  • 1 reply
  • 0 kudos

On Unity Catalog - what is the best way to add members to groups

Hi All, on Unity Catalog, what is the best way to add members to groups: the API or the CLI? The API should be the best option, but I thought to check with you all.

  • 591 Views
  • 1 reply
  • 0 kudos
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Amit_Dass_Chmp, In general, both the API and the CLI can be used to manage members and groups in Unity Catalog. The choice between the two often depends on your specific use case and comfort level with each tool. APIs are often preferred for their...
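For the API route, a minimal sketch against the workspace SCIM Groups endpoint (the host, token, and IDs are hypothetical):

```python
import requests

host = "https://<workspace>.azuredatabricks.net"  # hypothetical workspace URL
token = "<personal-access-token>"                 # hypothetical PAT
group_id, user_id = "123", "456"                  # hypothetical SCIM IDs

# SCIM PatchOp that adds one user to the group's member list.
resp = requests.patch(
    f"{host}/api/2.0/preview/scim/v2/Groups/{group_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
        "Operations": [{"op": "add", "value": {"members": [{"value": user_id}]}}],
    },
)
resp.raise_for_status()
```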

  • 0 kudos
danial
by New Contributor II
  • 5000 Views
  • 3 replies
  • 1 kudos

Connect Databricks hosted on Azure, with RDS on AWS.

We have Databricks set up and running on Azure. Now we want to connect it with RDS (AWS) to transfer data from RDS to Azure Data Lake using Databricks. I could find documentation on how to do it within the same cloud (either AWS or Azure), but n...

  • 5000 Views
  • 3 replies
  • 1 kudos
Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Danial Malik, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so ...

  • 1 kudos
2 More Replies
Michael_Appiah
by Contributor
  • 5171 Views
  • 6 replies
  • 3 kudos

Resolved! Parameterized spark.sql() not working

Spark 3.4 introduced parameterized SQL queries, and Databricks also discussed this new functionality in a recent blog post (https://www.databricks.com/blog/parameterized-queries-pyspark). Problem: I cannot run any of the examples provided in the PySpark...
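For reference, the parameterized style from that blog post looks like this; a minimal sketch with illustrative values:

```python
# Named parameter markers (:name) are bound via the args dict in Spark >= 3.4.
df = spark.sql(
    "SELECT * FROM range(10) WHERE id < :limit",
    args={"limit": 5},
)
df.show()
```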

  • 5171 Views
  • 6 replies
  • 3 kudos
Latest Reply
Michael_Appiah
Contributor
  • 3 kudos

@Cas Unfortunately I do not have any information on this. However, I have seen that DBR 14.3 and 15.0 introduced some changes to spark.sql(). I have not checked whether those changes resolve the issue outlined here. Your best bet is probably to go ah...

  • 3 kudos
5 More Replies
bradleyjamrozik
by New Contributor III
  • 500 Views
  • 1 reply
  • 0 kudos

Autoloader Failure Creating EventSubscription

Posting this here too in case anyone else has run into this issue... Trying to set up Autoloader File Notifications but keep getting an "Internal Server Error" message.Failure on Write EventSubscription - Internal error - Microsoft Q&A

  • 500 Views
  • 1 reply
  • 0 kudos
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @bradleyjamrozik, Ensure that your service principal for Event Grid and your storage account have the necessary permissions. Specifically, grant the Contributor role to your service principal on both Event Grid and the storage account.

  • 0 kudos
Phuonganh
by New Contributor II
  • 961 Views
  • 2 replies
  • 2 kudos

Databricks SDK for Python: Errors with parameters for Statement Execution

Hi team, I'm using the Databricks SDK for Python to run SQL queries. I created a variable as below: param = [{'name' : 'a', 'value' : 'x'}, {'name' : 'b', 'value' : 'y'}] and passed it to the statement as below: _ = w.statement_execution.execute_statement( warehous...

  • 961 Views
  • 2 replies
  • 2 kudos
Latest Reply
DonkeyKong
New Contributor II
  • 2 kudos

@Kaniz_Fatma This does not help resolve the issue. I am experiencing the same issue when following the above pointers. Here is the statement: response = w.statement_execution.execute_statement( statement='ALTER TABLE users ALTER COLUMN :col_name S...
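A minimal sketch of parameter binding with the SDK (the warehouse ID is hypothetical). Note that parameter markers bind values, not identifiers, so a marker cannot stand in for a column name as in the ALTER COLUMN statement above:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import StatementParameterListItem

w = WorkspaceClient()
resp = w.statement_execution.execute_statement(
    warehouse_id="<warehouse-id>",  # hypothetical
    statement="SELECT * FROM users WHERE country = :country",
    parameters=[StatementParameterListItem(name="country", value="US")],
)
print(resp.status.state)
```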

  • 2 kudos
1 More Replies

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group