Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Sen
by New Contributor
  • 9297 Views
  • 9 replies
  • 1 kudos

Resolved! Performance enhancement while writing dataframes into Parquet tables

Hi, I am trying to write the contents of a dataframe into a parquet table using the command below. df.write.mode("overwrite").format("parquet").saveAsTable("sample_parquet_table") The dataframe contains an extract from one of our source systems, which h...

Latest Reply
jhoon
New Contributor II
  • 1 kudos

Great discussion on performance optimization! Managing technical projects like these alongside academic work can be demanding. If you need expert academic support to free up time for your professional pursuits, Dissertation Help Services is here to a...

8 More Replies
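
The accepted resolution isn't visible in the excerpt above, but a common first lever for writes like this is controlling the number of output files. A minimal sketch in PySpark, assuming a stand-in dataframe for the poster's extract; the partition count of 64 is an arbitrary illustration, not a recommendation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the poster's source-system extract (assumption for illustration).
df = spark.range(1_000_000).withColumnRenamed("id", "value")

# Fewer, larger output files generally write and scan faster than many small
# ones; repartition() controls how many files saveAsTable produces.
num_output_files = 64  # arbitrary illustrative value; tune to your data volume

(df.repartition(num_output_files)
   .write
   .mode("overwrite")
   .format("parquet")
   .saveAsTable("sample_parquet_table"))
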
tarente
by New Contributor III
  • 3524 Views
  • 3 replies
  • 3 kudos

Partitioned parquet table (folder) with different structure

Hi, We have a parquet table (folder) in an Azure Storage Account. The table is partitioned by column PeriodId (which represents a day in the format YYYYMMDD) and has data from 20181001 until 20211121 (yesterday). We have a new development that adds a new column ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

I think the problem is in overwrite: when you overwrite, it overwrites all folders. The solution is to mix append with dynamic overwrite so it will overwrite only the folders which have data and doesn't affect old partitions: spark.conf.set("spark.sql.sources.pa...

2 More Replies
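
The config truncated in the reply is presumably spark.sql.sources.partitionOverwriteMode. A minimal sketch of the dynamic-overwrite approach in PySpark, assuming a Parquet folder partitioned by PeriodId as in the post; the path and sample row are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With "dynamic", an overwrite only rewrites the partitions present in the
# incoming dataframe; other PeriodId folders are left untouched (the default,
# "static", would wipe them all).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Hypothetical new day's data for illustration.
new_day = spark.createDataFrame([(20211122, "some value")], ["PeriodId", "Value"])

(new_day.write
    .mode("overwrite")
    .partitionBy("PeriodId")
    .format("parquet")
    .save("/mnt/storage/my_parquet_table"))  # hypothetical path
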
User16826987838
by Contributor
  • 1618 Views
  • 1 reply
  • 0 kudos

Refreshing external tables

After I vacuum the tables, do I need to update the manifest table and parquet table to refresh my external tables for integrations to work?

Latest Reply
Taha
Databricks Employee
  • 0 kudos

Manifest files need to be re-created when partitions are added or altered. Since a VACUUM only deletes files from historical versions, you shouldn't need to create an updated manifest file unless you are also running an OPTIMIZE.

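For reference, a manifest can be re-created on demand with Delta Lake's GENERATE command. A minimal sketch, assuming a Delta table at a hypothetical path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Re-create the symlink manifest after operations that add or alter partitions
# (e.g. OPTIMIZE); per the reply above, VACUUM alone should not require it.
spark.sql(
    "GENERATE symlink_format_manifest "
    "FOR TABLE delta.`/mnt/storage/my_delta_table`"  # hypothetical path
)
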
brickster_2018
by Databricks Employee
  • 1491 Views
  • 1 reply
  • 0 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The issue can happen if the Hive syntax for table creation is used instead of the Spark syntax. Read more here: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table-hiveformat.html The issue mentioned in t...

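The original question is not visible above, but the syntax distinction the reply points to can be sketched; table and column names here are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hive-format syntax (STORED AS) creates a Hive SerDe table, which can behave
# differently from a native Spark datasource table:
spark.sql("CREATE TABLE hive_style_tbl (id INT, name STRING) STORED AS PARQUET")

# Spark datasource syntax (USING) creates a native Parquet table:
spark.sql("CREATE TABLE spark_style_tbl (id INT, name STRING) USING PARQUET")
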
aladda
by Databricks Employee
  • 1055 Views
  • 1 reply
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Yes, CONVERT TO DELTA allows for converting a parquet table into Delta format in place by adding a transaction log, inferring the schema, and also collecting stats to improve query performance - https://docs.databricks.com/spark/latest/spark-sql/languag...

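A minimal sketch of the in-place conversion the reply describes; the path and partition column are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Converts the Parquet files in place by writing a Delta transaction log next
# to them. For a partitioned table, the partition schema must be declared:
spark.sql(
    "CONVERT TO DELTA parquet.`/mnt/storage/my_parquet_table` "
    "PARTITIONED BY (PeriodId INT)"  # hypothetical path and partition column
)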