Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

maxutil
by New Contributor II
  • 16230 Views
  • 4 replies
  • 3 kudos

Invalid Characters in Column Names " ,;{}()\n\t="

I'm reading data into a dataframe with df = spark.read.json("s3://somepath/"). I've tried first creating a delta table using the DeltaTable API with: DeltaTable.createIfNotExists(spark).location(target_path).addColumns(df.sche...

Latest Reply
VZLA
Databricks Employee
  • 3 kudos

@jb1z @maxutil Have you tried it like this?

    import dlt

    @dlt.table(table_properties={'quality': 'bronze', 'delta.columnMapping.mode': 'name'})
    def netsuite_items_inventory_price():
        return (
            spark.readStream.format('cloudFiles') ...

3 More Replies
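
A minimal sketch of the column-mapping workaround discussed in this thread, assuming the df and target_path from the question; this is illustrative, not a confirmed fix:

    from delta.tables import DeltaTable

    # Column mapping mode 'name' decouples logical column names from the
    # physical Parquet names, so characters like " ,;{}()\n\t=" become legal.
    (DeltaTable.createIfNotExists(spark)
        .location(target_path)                         # path from the question
        .addColumns(df.schema)                         # df = spark.read.json("s3://somepath/")
        .property("delta.columnMapping.mode", "name")
        .property("delta.minReaderVersion", "2")       # column mapping needs reader v2 / writer v5+
        .property("delta.minWriterVersion", "5")
        .execute())
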
boskicl
by New Contributor III
  • 27589 Views
  • 6 replies
  • 10 kudos

Resolved! Table write command stuck "Filtering files for query."

Hello all. Background: I am having an issue today with Databricks, using pyspark-sql and writing a delta table. The dataframe is made by doing an inner join between two tables, and that is the table which I am trying to write to a delta table. The table ...

Labels: filtering, job_info, spill_memory
Latest Reply
timo199
New Contributor II
  • 10 kudos

Even if I vacuum and optimize, it keeps getting stuck. Cluster type is r6gd.xlarge (min: 4, max: 6); driver type is r6gd.2xlarge.

5 More Replies
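
A hedged sketch of the usual first remedies for a long "Filtering files for query" phase, which scans file statistics to prune the table before the write; the table and column names are hypothetical:

    # Compacting small files and clustering by the join key shrinks the number
    # of files whose statistics must be scanned before the write/merge starts.
    spark.sql("OPTIMIZE my_schema.my_table ZORDER BY (join_key)")
    spark.sql("VACUUM my_schema.my_table RETAIN 168 HOURS")  # remove stale files afterwards
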
bobbysidhartha
by New Contributor
  • 15491 Views
  • 2 replies
  • 0 kudos

How to parallelly merge data into partitions of databricks delta table using PySpark/Spark streaming?

I have a PySpark streaming pipeline which reads data from a Kafka topic; the data goes through various transformations and finally gets merged into a Databricks delta table. In the beginning we were loading data into the delta table by using the merge ...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@bobbysidhartha: When merging data into a partitioned Delta table in parallel, it is important to ensure that each job only accesses and modifies the files in its own partition to avoid concurrency issues. One way to achieve this is to use partition...

1 More Replies
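
A sketch of the partition-pruned merge pattern the reply describes, assuming a hypothetical events table partitioned by event_date and an updates dataframe holding a single partition's data:

    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, "events")  # hypothetical partitioned table
    # Pinning the partition value in the merge condition lets each concurrent
    # job touch only its own partition's files, avoiding write conflicts.
    (target.alias("t")
        .merge(updates.alias("s"),
               "t.event_date = '2023-01-01' AND t.event_date = s.event_date AND t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
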
tototox
by New Contributor III
  • 13420 Views
  • 4 replies
  • 2 kudos

how to check table size by partition?

I want to check the size of the delta table by partition. As you can see, only the size of the table can be checked, but not by partition.

Latest Reply
Carsten_Herbe
New Contributor II
  • 2 kudos

The previous two answers did not work for me (DBX 15.4). I found a hacky way using the delta log: find the latest (group of) checkpoint (parquet) file(s) in the delta log and use it as the source prefix `000000000000xxxxxxx.checkpoint`: SELECT partition_column_1,...

3 More Replies
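
A sketch of the checkpoint-based approach from the reply, assuming a hypothetical table path and a partition column named date; it reads only the latest checkpoint, so commits made after that checkpoint are not reflected:

    from pyspark.sql import functions as F

    # A Delta checkpoint Parquet file stores one row per action; 'add' rows
    # carry each data file's size and partition values. The file name below
    # is hypothetical -- use the latest checkpoint in _delta_log.
    chk = spark.read.parquet("/path/to/table/_delta_log/00000000000000000100.checkpoint.parquet")
    (chk.where("add IS NOT NULL")
        .select(F.col("add.partitionValues")["date"].alias("date"),
                F.col("add.size").alias("bytes"))
        .groupBy("date")
        .agg(F.sum("bytes").alias("partition_bytes"))
        .orderBy("date")
        .show())
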
MRTN
by New Contributor III
  • 10882 Views
  • 5 replies
  • 3 kudos

Resolved! Feature request delta tables : drop duplicate rows

A deltaTable.dropDuplicates(columns) would be a very nice feature, simplifying the complex procedures that are suggested online. Or am I missing any existing procedure that can be done without merge operations or similar?

Latest Reply
MRTN
New Contributor III
  • 3 kudos

I created a feature request in the delta table project: [Feature Request] data deduplication on existing delta table · Issue #1767 · delta-io/delta (github.com)

4 More Replies
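
Until such an API exists, one common workaround is to rewrite the table without the duplicates; a sketch with a hypothetical table and key column (for large tables the merge-based procedures the post alludes to scale better):

    # Delta reads are pinned to a snapshot, so reading and overwriting the
    # same table in one job is safe. Names are hypothetical.
    df = spark.read.table("my_table")
    (df.dropDuplicates(["business_key"])
       .write.format("delta")
       .mode("overwrite")
       .saveAsTable("my_table"))
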
Jana
by New Contributor III
  • 8094 Views
  • 9 replies
  • 4 kudos

Resolved! Parsing 5 GB json file is running long on cluster

I was creating a delta table from an ADLS JSON input file, but the job was running long while creating the delta table from the JSON. Below is my cluster configuration. Is the issue related to the cluster config? Do I need to upgrade the cluster config? The cluster ...

Latest Reply
-werners-
Esteemed Contributor III
  • 4 kudos

With multiline = true, the JSON is read as a whole and processed as such. I'd try a beefier cluster.

8 More Replies
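
To illustrate the reply: with multiline = true a JSON document cannot be split, so one task parses the whole 5 GB file. If the input can be produced as line-delimited JSON (one object per line), the default reader parallelizes it; a sketch with a hypothetical path and target table:

    # Line-delimited (NDJSON) input splits across tasks; multiline=true does not.
    df = (spark.read
          .option("multiline", "false")  # the default: each line is one JSON record
          .json("abfss://container@account.dfs.core.windows.net/input/"))
    df.write.format("delta").mode("overwrite").saveAsTable("bronze.events")
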
RKNutalapati
by Valued Contributor
  • 4004 Views
  • 5 replies
  • 4 kudos

Read and saving Blob data from oracle to databricks S3 is slow

I am trying to import a table from Oracle which has around 1.3 million rows, and one of the columns is a BLOB; the total size of the data on Oracle is around 250+ GB. Reading and saving to S3 as a delta table takes around 60 minutes. I tried with parallel (200 thread...

Latest Reply
vinita_mehta
New Contributor II
  • 4 kudos

Any update on this topic? What would be the best option to read from Oracle and write to ADLS?

4 More Replies
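
A hedged sketch of a partitioned JDBC read, usually the first lever for this kind of transfer; the connection details, bounds, and column names are hypothetical:

    # Spark issues numPartitions concurrent queries split on a numeric key;
    # fetchsize controls rows per round trip, which matters for wide BLOB rows.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
          .option("dbtable", "SCHEMA.BLOB_TABLE")
          .option("user", "username")
          .option("password", "password")
          .option("partitionColumn", "ID")
          .option("lowerBound", "1")
          .option("upperBound", "1300000")
          .option("numPartitions", "64")
          .option("fetchsize", "1000")
          .load())
    df.write.format("delta").save("s3://bucket/blob_table/")
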
elgeo
by Valued Contributor II
  • 3461 Views
  • 6 replies
  • 8 kudos

Clean up _delta_log files

Hello experts. We are trying to clarify how to clean up the large number of files that accumulate in the _delta_log folder (JSON, CRC, and checkpoint files). We went through the related posts in the forum and followed the below: SET spark.da...

Latest Reply
Brad
Contributor II
  • 8 kudos

Awesome, thanks for the response.

5 More Replies
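
For reference, _delta_log cleanup is driven by a table property rather than VACUUM; a sketch with a hypothetical table name:

    # Log JSON/checkpoint files older than logRetentionDuration are deleted
    # automatically when a new checkpoint is written; VACUUM only removes
    # data files.
    spark.sql("""
        ALTER TABLE my_schema.my_table SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 7 days'
        )
    """)
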
DJey
by New Contributor III
  • 14167 Views
  • 6 replies
  • 2 kudos

Resolved! MergeSchema Not Working

Hi all, I have a scenario where my existing delta table looks like below. Now I have incremental data with an additional column, i.e. owner (dataframe name: scdDF). Below is the code snippet to merge the incremental dataframe into targetTable, but the new...

Latest Reply
Amin112
New Contributor II
  • 2 kudos

In Databricks Runtime 15.2 and above, you can specify schema evolution in a merge statement using SQL or Delta table APIs:

    MERGE WITH SCHEMA EVOLUTION INTO target
    USING source
    ON source.key = target.key
    WHEN MATCHED THEN
      UPDATE SET *
    WHEN NOT MATCHED THEN
      I...

5 More Replies
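
On runtimes older than 15.2, a roughly equivalent sketch uses the session-level auto-merge flag with the Python merge API; targetTable and scdDF come from the post, while the key column is hypothetical:

    from delta.tables import DeltaTable

    # Enables schema evolution for merge, so the new 'owner' column is added
    # to the target table on write.
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

    (DeltaTable.forName(spark, "targetTable").alias("t")
        .merge(scdDF.alias("s"), "t.key = s.key")  # 'key' is a hypothetical join column
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
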
suresh1122
by New Contributor III
  • 13940 Views
  • 12 replies
  • 7 kudos

Dataframe takes an unusually long time to save as a delta table using SQL for a very small dataset with 30k rows. It takes around 2 hrs. Is there a solution for this problem?

I am trying to save a dataframe, after a series of data manipulations using UDF functions, to a delta table. I tried using this code: (df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').saveAsTable('output_table')) but this...

Latest Reply
Lakshay
Databricks Employee
  • 7 kudos

You should also look into the SQL plan to check whether the writing phase is indeed the part that is taking the time. Since Spark works on lazy evaluation, there might be some other phase that is taking the time.

11 More Replies
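
A sketch of the check the reply suggests, assuming the df from the question: materialize the dataframe first to see whether the UDF computation, not the Delta write, is the slow phase:

    # Lazy evaluation means saveAsTable triggers every upstream UDF.
    df.explain(mode="formatted")  # look for expensive UDF stages in the plan
    df = df.cache()
    df.count()                    # forces the UDF work once, before the write
    (df.write.format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .saveAsTable("output_table"))
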
YFL
by New Contributor III
  • 6780 Views
  • 11 replies
  • 6 kudos

Resolved! When delta is a streaming source, how can we get the consumer lag?

Hi, I want to keep track of the streaming lag from the source table, which is a delta table. I see that in the query progress logs there is some information about the last version and the last file in the version for the end offset, but this doesn't give ...

Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hey @Yerachmiel Feltzman, I hope all is well. Just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you. Thanks!

10 More Replies
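
One hedged way to derive the lag from the progress information the question mentions: compare the stream's end-offset version with the source table's latest version. The query handle and table name are hypothetical, and the Delta source offset layout may vary by version:

    import json

    # The Delta source offset carries 'reservoirVersion', the table version
    # the stream has read up to (it may arrive as a JSON string or a dict).
    end = query.lastProgress["sources"][0]["endOffset"]
    if isinstance(end, str):
        end = json.loads(end)
    processed_version = end["reservoirVersion"]

    latest_version = (spark.sql("DESCRIBE HISTORY source_table LIMIT 1")
                      .collect()[0]["version"])
    print(f"versions behind: {latest_version - processed_version}")
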
ptambe
by New Contributor III
  • 4983 Views
  • 6 replies
  • 3 kudos

Resolved! Is Concurrent Writes from multiple databricks clusters to same delta table on S3 Supported?

Does Databricks support writing to the same delta table from multiple clusters concurrently? I am specifically interested to know whether there is any solution for https://github.com/delta-io/delta/issues/41 implemented in Databricks, or if you have a...

Latest Reply
dennyglee
Databricks Employee
  • 3 kudos

Please note, the issue noted above ([Storage System] Support for AWS S3 (multiple clusters/drivers/JVMs)) is for Delta Lake OSS. As noted in that issue as well as Issue 324, as of this writing S3 lacks putIfAbsent transactional consistency. For Del...

5 More Replies
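
For Delta Lake OSS outside Databricks (whose managed commit service already coordinates writers), multi-cluster S3 writes use the DynamoDB-backed LogStore; a configuration sketch based on the delta-io documentation, with hypothetical table and region values:

    from pyspark.sql import SparkSession

    # S3DynamoDBLogStore supplies the put-if-absent semantics S3 lacks,
    # coordinating commits from multiple clusters through a DynamoDB table.
    spark = (SparkSession.builder
        .config("spark.delta.logStore.s3.impl",
                "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
        .getOrCreate())
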
Gary_Irick
by New Contributor III
  • 10136 Views
  • 9 replies
  • 10 kudos

Delta table partition directories when column mapping is enabled

I recently created a table on a cluster in Azure running Databricks Runtime 11.1. The table is partitioned by a "date" column. I enabled column mapping, like this: ALTER TABLE {schema}.{table_name} SET TBLPROPERTIES('delta.columnMapping.mode' = 'nam...

Latest Reply
talenik
New Contributor III
  • 10 kudos

Hi @Retired_mod, I have a few queries on directory names with column mapping. I have this delta table on ADLS and I am trying to read it, but I am getting the below error. How can we read delta tables with column mapping enabled with pyspark? Can you pleas...

8 More Replies
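
On the follow-up question: with column mapping enabled, the physical directory and file names are randomized, so the table must be read as Delta (by path or metastore name) rather than as raw Parquet with directory-based partition discovery; a sketch with a hypothetical ADLS path:

    # Reading through the Delta log maps the randomized physical names back
    # to logical columns; spark.read.parquet on the same path would fail.
    df = (spark.read.format("delta")
          .load("abfss://container@account.dfs.core.windows.net/tables/my_table"))
    df.where("date = '2023-01-01'").show()
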
deng77
by New Contributor III
  • 45091 Views
  • 11 replies
  • 2 kudos

Resolved! Using current_timestamp as a default value in a delta table

I want to add a column to an existing delta table with a timestamp for when the data was inserted. I know I can do this by including current_timestamp with my SQL statement that inserts into the table. Is it possible to add a column to an existing de...

Latest Reply
Vaibhav1000
New Contributor II
  • 2 kudos

Can you please provide information on the additional cost of using this feature compared to not using it at all?

10 More Replies
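
Since this thread was resolved, Delta has gained declarative column defaults (Databricks Runtime 13.1 and above, to the best of my knowledge); a sketch with a hypothetical table, noting that the default applies only to rows inserted after it is set:

    # Column defaults require the allowColumnDefaults table feature; the
    # default expression is evaluated at insert time when the column is omitted.
    spark.sql("""
        ALTER TABLE my_table SET TBLPROPERTIES (
            'delta.feature.allowColumnDefaults' = 'supported'
        )
    """)
    spark.sql("ALTER TABLE my_table ADD COLUMN inserted_at TIMESTAMP")
    spark.sql("ALTER TABLE my_table ALTER COLUMN inserted_at SET DEFAULT current_timestamp()")
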
Chris_Konsur
by New Contributor III
  • 19197 Views
  • 4 replies
  • 6 kudos

Resolved! Error: The associated location ... is not empty but it's not a Delta table

I am trying to create a table, but I get this error: AnalysisException: Cannot create table ('`spark_catalog`.`default`.`citation_all_tenants`'). The associated location ('dbfs:/user/hive/warehouse/citation_all_tenants') is not empty but it's not a Delta t...

Latest Reply
sachin_tirth
New Contributor II
  • 6 kudos

Hi team, I am facing the same issue. When we try to load data into the table in a production batch, we get an error that the table is not in delta format. There is no recent change to the table, and we are not trying any CREATE OR REPLACE TABLE. This is an existing table in pr...

3 More Replies
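
A recovery sketch for the original error, assuming the files under the conflicting location are disposable; always inspect before deleting:

    # The error means the table's default location already holds non-Delta files.
    location = "dbfs:/user/hive/warehouse/citation_all_tenants"
    display(dbutils.fs.ls(location))  # inspect what is actually there first
    dbutils.fs.rm(location, True)     # irreversible: only if the files are disposable
    # then re-run the original CREATE TABLE statement
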