Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

arthurburkhardt
by New Contributor
  • 1492 Views
  • 2 replies
  • 1 kudos

Auto Loader changes the order of columns when inferring JSON schema (sorted lexicographically)

We are using Auto Loader to read JSON files from S3 and ingest data into the bronze layer. However, Auto Loader seems to struggle with schema inference: instead of preserving the order of columns from the JSON files, it sorts them lexicographically. Fo...

Data Engineering
auto.loader
json
schema
Latest Reply
Sidhant07
Databricks Employee
  • 1 kudos

Auto Loader's default behavior of sorting columns lexicographically during schema inference is indeed a limitation when preserving the original order of JSON fields is important. Unfortunately, there isn't a built-in option in Auto Loader to maintain...

  • 1 kudos
1 More Replies
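
For the column-order thread above, one possible workaround is to take the field order from a representative JSON document and reorder the inferred columns afterwards. The following is a minimal sketch in plain Python, not a built-in Auto Loader feature; the sample record and the helper names `column_order_from_sample` and `reorder_columns` are made up for illustration.

```python
import json


def column_order_from_sample(sample_json: str) -> list:
    """Return top-level field names in the order they appear in one JSON object."""
    # json.loads returns a dict, and dicts keep insertion order (Python 3.7+),
    # so the key order matches the order of fields in the document.
    record = json.loads(sample_json)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object at the top level")
    return list(record.keys())


def reorder_columns(inferred: list, preferred: list) -> list:
    """Put preferred columns first (when present), then any remaining inferred columns."""
    present = set(inferred)
    head = [c for c in preferred if c in present]
    seen = set(head)
    tail = [c for c in inferred if c not in seen]
    return head + tail


# Hypothetical sample record and a lexicographically sorted inferred column list.
sample = '{"zip": "10001", "name": "Ada", "age": 36}'
inferred = ["age", "name", "zip"]
ordered = reorder_columns(inferred, column_order_from_sample(sample))
print(ordered)
```

On a streaming DataFrame, the resulting list could then be passed to `df.select(*ordered)` before writing to the bronze table.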
simple89
by New Contributor
  • 938 Views
  • 1 replies
  • 0 kudos

Runtime increases exponentially from 11.3 to 13.3

Hello. I am using R on Databricks with the approach below. My Spark version: Single node: i3.2xlarge · On-demand · DBR: 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12) · us-east-1a; the job takes 1 hour. I install all R packages (including a geo...

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hello! It's possible that the increase in runtime when upgrading from Spark 3.3.0 (DBR 11.3) to Spark 3.4.1 (DBR 13.3) is due to changes in the underlying R runtime or package versions. When you upgrade to a new version of Spark, the R packages that ...

  • 0 kudos
rcostanza
by New Contributor III
  • 1036 Views
  • 1 replies
  • 1 kudos

Changing a Delta Live Table's schema

I have a Delta Live Table whose source is a Kafka stream. One of the columns is a Decimal and I need to change its precision. What's the correct approach to changing the DLT's schema? Just changing the column's precision in the DLT definition will resu...

Latest Reply
Sidhant07
Databricks Employee
  • 1 kudos

To change the precision of a Decimal column in a Delta Live Table (DLT) with a Kafka stream source, you can follow these steps:
1. Create a new column in the DLT with the desired precision.
2. Copy the data from the old column to the new column.
3. Dro...

  • 1 kudos
lprevost
by Contributor II
  • 948 Views
  • 1 replies
  • 0 kudos

sampleBy stream in DLT

I would like to create a sampleBy (stratified version of sample) copy/clone of my Delta table. Ideally, I'd like to do this using a DLT. My source table grows incrementally each month as batch files are added and Auto Loader picks them up. Id...

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

You can create a stratified sample of your Delta table using the `sampleBy` function in Databricks. However, DLT does not support the `sampleBy` function directly. To work around this, you can create a notebook that uses the `sampleBy` function to c...

  • 0 kudos
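
To make the idea behind stratified sampling in the thread above concrete, here is a rough plain-Python sketch of what a `sampleBy`-style operation does: each row is kept with the probability assigned to its stratum. This is an illustration only, not Spark's implementation; the `segment` column, the fractions, and the helper name `sample_by` are assumptions.

```python
import random


def sample_by(rows, key, fractions, seed=42):
    """Keep each row with the probability assigned to its stratum (the value of `key`)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    kept = []
    for row in rows:
        # Strata that are not listed get fraction 0.0 and are dropped entirely.
        fraction = fractions.get(row[key], 0.0)
        if rng.random() < fraction:
            kept.append(row)
    return kept


# Hypothetical rows: 100 in segment "a", 100 in segment "b", 10 in segment "c".
rows = (
    [{"id": i, "segment": "a"} for i in range(100)]
    + [{"id": i, "segment": "b"} for i in range(100, 200)]
    + [{"id": i, "segment": "c"} for i in range(200, 210)]
)

# Keep every "a" row, roughly half of the "b" rows, and no "c" rows.
sample = sample_by(rows, "segment", {"a": 1.0, "b": 0.5})
print(len(sample), "rows kept")
```

In PySpark the corresponding call would be `df.sampleBy("segment", fractions={...}, seed=42)`, run from a notebook as the reply suggests.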
zmwaris1
by New Contributor II
  • 850 Views
  • 1 replies
  • 2 kudos

Connect Databricks Delta table to Apache Kylin using JDBC

I am using Apache Kylin for Data Analytics and Databricks for data modelling and filtering. I have my final data in gold tables and I would like to integrate this data with Apache Kylin using JDBC where the gold table will be the Data Source. I would...

Latest Reply
Sidhant07
Databricks Employee
  • 2 kudos

Yes, it is possible to integrate your Databricks gold tables with Apache Kylin using JDBC. This integration allows you to use Apache Kylin's OLAP capabilities on the data stored in your Databricks environment. Here's how you can achieve this: ## Conn...

  • 2 kudos
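
As a starting point for the JDBC connection discussed above, the sketch below assembles a connection URL in the pattern commonly used with the Databricks JDBC driver (host, `httpPath`, token authentication). The host, HTTP path, and token are placeholders, `databricks_jdbc_url` is a made-up helper, and how Kylin consumes the URL is not covered by the visible part of the reply, so treat this as an assumption to check against the driver documentation.

```python
def databricks_jdbc_url(host: str, http_path: str, token: str) -> str:
    """Assemble a JDBC URL for token-based authentication against a SQL warehouse."""
    parts = [
        f"jdbc:databricks://{host}:443",  # workspace hostname, TLS port
        f"httpPath={http_path}",          # HTTP path of the SQL warehouse or cluster
        "AuthMech=3",                     # username/password style auth
        "UID=token",                      # literal user name "token"
        f"PWD={token}",                   # personal access token (placeholder here)
    ]
    return ";".join(parts)


# Placeholder values only; replace them with your own workspace details.
url = databricks_jdbc_url(
    host="<workspace-host>",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    token="<personal-access-token>",
)
print(url)
```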
YOUKE
by New Contributor III
  • 887 Views
  • 4 replies
  • 1 kudos

Resolved! Managed Tables on Azure databricks

Hi everyone, I was trying to understand: when a managed table is created, Databricks stores the metadata in the Hive metastore and the data in the cloud storage managed by it, which in the case of Azure Databricks will be an Azure Storage Account. But...

Latest Reply
BraydenJordan
New Contributor II
  • 1 kudos

Thank you so much for the solution.

  • 1 kudos
3 More Replies
Gusman
by New Contributor II
  • 305 Views
  • 1 replies
  • 1 kudos

Resolved! How to send BINARY parameters using the REST SQL API?

We are trying to send a SQL query to the REST API that includes a BINARY parameter, e.g. "INSERT INTO MyTable (BinaryField) VALUES(:binaryData)". We tried to encode the parameter as base64 and specify that it is a BINARY type, but it throws a mapping error; if w...

Latest Reply
cgrant
Databricks Employee
  • 1 kudos

Serializing as binary can be pretty challenging. Here's a way to do this with base64: the trick is to serialize the value as a base64 string and insert it as binary with unbase64.  databricks api post /api/2.0/sql/statements --json '{ "warehouse_id": "wa...

  • 1 kudos
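
The base64 trick from the reply above can be sketched in Python as follows: the raw bytes are encoded as a base64 string, sent as a STRING parameter, and converted back to binary on the SQL side with `unbase64`. The table and column names come from the question; the warehouse ID is a placeholder, `build_insert_payload` is a hypothetical helper, and the request body shape (`warehouse_id`, `statement`, `parameters`) reflects my understanding of the SQL Statement Execution API rather than the truncated command in the reply.

```python
import base64
import json


def build_insert_payload(warehouse_id: str, raw: bytes) -> dict:
    """Build a request body that inserts `raw` into a BINARY column via unbase64."""
    encoded = base64.b64encode(raw).decode("ascii")  # bytes -> base64 text
    return {
        "warehouse_id": warehouse_id,
        # unbase64() turns the string parameter back into binary inside SQL.
        "statement": "INSERT INTO MyTable (BinaryField) VALUES (unbase64(:binaryData))",
        "parameters": [
            {"name": "binaryData", "value": encoded, "type": "STRING"},
        ],
    }


raw_bytes = b"\x00\x01\xfe\xff"
payload = build_insert_payload("<your-warehouse-id>", raw_bytes)
print(json.dumps(payload, indent=2))
```

The printed JSON could then be posted to `/api/2.0/sql/statements`, for example with the `databricks api post` command shown in the reply.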
AcrobaticMonkey
by New Contributor II
  • 491 Views
  • 2 replies
  • 0 kudos

Alerts for Failed Queries in Databricks

How can we set up automated alerts to notify us when queries executed by a specific service principal fail in Databricks?

Latest Reply
AcrobaticMonkey
New Contributor II
  • 0 kudos

@Alberto_Umana Our service principal uses the SQL Statement API to execute queries. We want to receive notifications for each query failure. While SQL Alerts are an option, they do not provide immediate responses. Is there a better solution to achieve...

  • 0 kudos
1 More Replies
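
One option the question above leaves open is to react to failures in the calling code itself, since the client already receives the statement status from the API. The sketch below is only an illustration of that idea, not a recommendation from the thread: it assumes the response carries a `status.state` field with an optional `status.error.message`, and `notify` is a hypothetical callback (for example, something that posts to a chat webhook).

```python
def report_if_failed(response: dict, notify) -> bool:
    """Call `notify` with a short message when the statement finished in FAILED state."""
    status = response.get("status", {})
    if status.get("state") != "FAILED":
        return False  # still running, succeeded, or otherwise not a failure
    error = status.get("error", {})
    message = error.get("message", "no error message returned")
    statement_id = response.get("statement_id", "<unknown>")
    notify(f"Statement {statement_id} failed: {message}")
    return True


# Hypothetical response dict; a list stands in for the notification channel.
sent = []
failed = report_if_failed(
    {
        "statement_id": "stmt-1",
        "status": {"state": "FAILED", "error": {"message": "TABLE_OR_VIEW_NOT_FOUND"}},
    },
    sent.append,
)
print(failed, sent)
```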
seanstachff
by New Contributor II
  • 557 Views
  • 5 replies
  • 0 kudos

Resolved! Using FROM_CSV giving unexpected results

Hello, I am trying to use from_csv in the SQL warehouse, but I am getting unexpected results: As a small example, I am running: WITH your_table AS ( SELECT 'a,b,c\n1,"hello, world",3.14\n2,"goodbye, world",2.71' AS csv_column ) SELECT from_csv(csv_c...

Latest Reply
Takuya-Omi
Valued Contributor III
  • 0 kudos

@seanstachff Here is the code I used to produce the results shown in the image I shared earlier. It's a bit verbose, so I’m not entirely satisfied with it, but I hope it provides some helpful insights. %sql WITH your_table AS ( -- Examp...

  • 0 kudos
4 More Replies
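
To make the intended result in the thread above concrete, the snippet below parses the same sample string with Python's standard `csv` module, treating the first line as a header and respecting the quoted commas. It only shows the target shape of the data; it does not reproduce how `from_csv` behaves in a SQL warehouse, and `parse_with_header` is a made-up helper.

```python
import csv
import io

# Same sample text as in the question: a header line followed by two records.
csv_text = 'a,b,c\n1,"hello, world",3.14\n2,"goodbye, world",2.71'


def parse_with_header(text: str) -> list:
    """Parse CSV text whose first line is a header into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


rows = parse_with_header(csv_text)
for row in rows:
    print(row)
```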
martkev
by New Contributor
  • 924 Views
  • 1 replies
  • 0 kudos

Networking Setup in Standard Tier – VNet Integration and Proxy Issues

Hi everyone, We are working on an order forecasting model using Azure Databricks and an ML model from Hugging Face, and are running into an issue where the connection over SSL (port 443) fails during the handshake (EOF Error SSL 992). We suspect that a...

Latest Reply
arjun_kr
Databricks Employee
  • 0 kudos

It may depend on your UDR setup. If you have a UDR rule routing the traffic to a firewall appliance, the issue may be that the firewall does not allow this traffic. If there is no UDR, or the UDR rule routes this traffic to the Internet, it wou...

  • 0 kudos
Anonymous
by Not applicable
  • 16318 Views
  • 8 replies
  • 14 kudos

Resolved! MetadataChangedException

A Delta Lake table is created with an identity column, and I'm not able to load data into it in parallel from four processes; I'm getting the metadata exception error. I don't want to load the data into a temp table. I need to load directly and in parallel into the delta...

Latest Reply
cpc0707
New Contributor II
  • 14 kudos

I'm having the same issue: I need to load a large amount of data from separate files into a Delta table, and I want to do it with a for-each loop so I don't have to run it sequentially, which would take days. There should be a way to handle this.

  • 14 kudos
7 More Replies
Ulman
by New Contributor II
  • 3825 Views
  • 9 replies
  • 1 kudos

Switching to File Notification Mode with ADLS Gen2 - Encountering StorageException

Hello, We are currently using Auto Loader in file listing mode for a stream, which is experiencing significant latency due to the non-incremental naming of files in the directory, a condition that cannot be altered. In an effort to mitigate this...

Data Engineering
ADLS gen2
autoloader
file notification mode
Latest Reply
Rah_Cencora
New Contributor II
  • 1 kudos

You should also reevaluate your use of premium storage for your landing area files. Typically, storage for raw files does not need to be the fastest and most resilient and expensive tier. Unless you have a compelling reason for premium storage for la...

  • 1 kudos
8 More Replies
vanverne
by New Contributor II
  • 869 Views
  • 2 replies
  • 1 kudos

Assistance with Capturing Auto-Generated IDs in Databricks SQL

Hello, I am currently working on a project where I need to insert multiple rows into a table and capture the auto-generated IDs for each row. I am using the Databricks SQL connector. Here is a simplified version of my current workflow: I create a temporary...

Latest Reply
vanverne
New Contributor II
  • 1 kudos

Thanks for the reply, Alfonso. I noticed you mentioned "Below are a few alternatives...", however, I am not seeing those. Please let me know if I am missing something. Also, do you know if Databricks is working on supporting the RETURNING clause soon...

  • 1 kudos
1 More Replies
angelop
by New Contributor
  • 177 Views
  • 1 replies
  • 0 kudos

Databricks Clean Rooms creation

I am trying to create a Databricks Clean Rooms instance, following the video from the Databricks YouTube channel. As I only have one workspace, I added my own Clean Room sharing identifier to create the clean room. When I do that, I get the ...

Latest Reply
Takuya-Omi
Valued Contributor III
  • 0 kudos

@angelop I tried it as well and encountered the same error. A new collaborator needs to be set up. If that’s not feasible, it would be advisable to reach out to Databricks support. By the way, the following video provides a more detailed explanation a...

  • 0 kudos
The_Demigorgan
by New Contributor
  • 1492 Views
  • 1 replies
  • 0 kudos

Autoloader issue

I'm trying to ingest data from Parquet files using Auto Loader. I have my own custom schema and don't want to infer the schema from the Parquet files. During readStream everything is fine, but during writeStream it is somehow inferring the schema from...

Latest Reply
cgrant
Databricks Employee
  • 0 kudos

In this case, please make sure you specify the schema explicitly when reading the Parquet files and do not specify any inference options. Something like spark.readStream.format("cloudFiles").schema(schema)... If you want to more easily grab the schem...

  • 0 kudos
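
The advice in the reply above (pass the schema explicitly and set no inference options) might look roughly like the sketch below. It takes the `spark` session as an argument so it can be pasted into a notebook; the path and the column names in the DDL string are placeholders, and `build_parquet_stream` is a hypothetical helper name.

```python
def build_parquet_stream(spark, path: str, schema_ddl: str):
    """Return a streaming DataFrame that reads Parquet files with an explicit schema."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")  # source file format
        .schema(schema_ddl)                      # explicit schema; no inference options set
        .load(path)
    )


# Example call inside a Databricks notebook (placeholder path and columns):
# df = build_parquet_stream(
#     spark,
#     "/path/to/landing/",
#     "id BIGINT, name STRING, amount DECIMAL(18, 2)",
# )
```

Passing the schema as a DDL string keeps the example short; a `StructType` built from `pyspark.sql.types` can be passed to `.schema(...)` in the same way.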
