Data Engineering

Forum Posts

irfanaziz
by Contributor II
  • 18057 Views
  • 7 replies
  • 8 kudos

Resolved! How to merge small parquet files into a single parquet file?

I have thousands of parquet files with the same schema, each containing one or more records, but reading these files with Spark is very slow. I want to know if there is any solution for merging the files before reading them with Spark? Or is there any ...
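A minimal PySpark sketch of one common workaround, with hypothetical paths: read the whole directory once and rewrite it as a small number of larger files.

```python
# `spark` is the notebook's pre-created SparkSession on Databricks.
# Hypothetical paths for illustration.
df = spark.read.parquet("/mnt/raw/events_small_files/")

# coalesce(1) produces a single output file; for bigger datasets,
# repartition(n) to a sensible file count is usually the safer choice.
df.coalesce(1).write.mode("overwrite").parquet("/mnt/raw/events_merged/")
```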

Latest Reply
mmore500
New Contributor II
  • 8 kudos

Give [*joinem*](https://github.com/mmore500/joinem) a try, available via PyPI: `python3 -m pip install joinem`. *joinem* provides a CLI for fast, flexible concatenation of tabular data using [polars](https://pola.rs). I/O is *lazily streamed* in order ...

6 More Replies
swatish0395
by New Contributor III
  • 571 Views
  • 0 replies
  • 0 kudos

I am working on parquet file-level column encryption and decryption based on user-specific permissions

I am able to encrypt and decrypt the data in multiple ways and to save the encrypted parquet file, but I want to decrypt the data only if the user has the specific permission; otherwise they should get the encrypted data. Is there any permanent solution to de...
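One hedged sketch of a common pattern for this, with all table, column, group, and key names hypothetical: keep the column encrypted at rest and expose a view that decrypts only for members of a privileged group. `aes_encrypt`/`aes_decrypt` are built-in functions in Spark 3.3+/Databricks, and `is_member()` is a Databricks SQL function; this is an illustration, not the poster's actual setup.

```python
# All names hypothetical. In practice the AES key should come from a
# secret scope, not a literal; '0123456789abcdef' is a 16-byte stand-in.
spark.sql("""
    CREATE OR REPLACE VIEW customers_secure AS
    SELECT
      id,
      CASE WHEN is_member('pii_readers')
           THEN CAST(aes_decrypt(unbase64(ssn_enc), '0123456789abcdef') AS STRING)
           ELSE ssn_enc
      END AS ssn
    FROM customers_encrypted
""")
```

Members of `pii_readers` see plaintext through the view; everyone else sees only the stored ciphertext.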

alm
by New Contributor III
  • 2133 Views
  • 2 replies
  • 2 kudos

Resolved! Vectorized reading of parquet file containing decimal type column(s)

I was trying to read a parquet file containing decimal type columns and write it to a Delta table. I encountered a problem that is pretty neatly described by this kb.databricks article, and which I solved by disabling the vector...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Alberte Mørk: The behavior you observed is due to a known issue in Apache Spark when vectorized reading is used with Parquet files that contain decimal type columns. As you mentioned, the issue can be resolved by disabling vectorized reading for th...
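The workaround referred to here is a one-line session config; a minimal sketch with hypothetical path and table names:

```python
# Disable the vectorized Parquet reader so decimal columns fall back
# to the slower but correct non-vectorized path.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

df = spark.read.parquet("/mnt/landing/decimal_file.parquet")  # hypothetical path
df.write.format("delta").saveAsTable("my_delta_table")        # hypothetical name
```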

1 More Replies
uv
by New Contributor II
  • 2706 Views
  • 3 replies
  • 2 kudos

Parquet to csv delta file

Hi Team, I have a parquet file in an S3 bucket which is a Delta file. I am able to read it, but I am unable to write it as a CSV file. I am getting the following error when I try to write: A transaction log for Databricks Delta was found at `s3://path/a...
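That error usually means the path is a Delta table, so it has to be read with the delta format before it can be rewritten as CSV. A sketch with hypothetical paths (the post's own path is elided):

```python
# The source path is a Delta table, so read it with the delta format.
df = spark.read.format("delta").load("s3://my-bucket/delta_table/")

# Then write it out as CSV to a separate location.
df.write.mode("overwrite").option("header", "true").csv("s3://my-bucket/csv_out/")
```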

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @yuvesh kotiala, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...

2 More Replies
ramz
by New Contributor II
  • 1827 Views
  • 4 replies
  • 1 kudos

High driver memory usage on loading parquet file

Hi, I am using PySpark to read a bunch of parquet files and do a count on each of them. Driver memory shoots up from about 6 GB to 8 GB. My setup: I have a cluster of 1 driver node and 2 worker nodes (all of them 16 cores, 128 GB RAM). This is th...
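One common cause of driver memory growth in this scenario is driver-side schema inference, which reads footers across many files. A hedged sketch, with a hypothetical schema and path, that supplies the schema up front:

```python
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical schema; providing it explicitly stops the driver from
# reading every file's footer just to infer one.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

df = spark.read.schema(schema).parquet("/mnt/data/events/")
print(df.count())
```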

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @ramz siva, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your feedback wi...

3 More Replies
BL
by New Contributor III
  • 2776 Views
  • 4 replies
  • 3 kudos

Error reading in Parquet file

I am trying to read a .parquet file from an ADLS Gen2 location in Azure Databricks, but I am facing the below error: spark.read.parquet("abfss://............/..._2023-01-14T08:01:29.8549884Z.parquet") org.apache.spark.SparkException: Job aborted due to stag...

Latest Reply
jose_gonzalez
Moderator
  • 3 kudos

Can you access the executor logs? When your cluster is up and running, you can access the executors' logs. For example, the error shows: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent ...

3 More Replies
wyzer
by Contributor II
  • 2591 Views
  • 2 replies
  • 12 kudos

Resolved! Add the creation date of a parquet file into a DataFrame

Currently I load multiple parquet files with this code: df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*") (Inside the Voucher folder there is one folder per date, each containing one parquet file.) How can I add a column to this DataFrame that...
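A hedged sketch using the hidden `_metadata` column available on file-source DataFrames in recent Spark/Databricks runtimes; `file_modification_time` is the closest built-in proxy for a file's creation date:

```python
from pyspark.sql.functions import col

# _metadata is only materialized when referenced explicitly.
df = (
    spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
    .select("*", col("_metadata.file_modification_time").alias("file_date"))
)
```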

Latest Reply
wyzer
Contributor II
  • 12 kudos

Thanks @Michail Karamanos​ 

1 More Replies
kkumar
by New Contributor III
  • 13822 Views
  • 3 replies
  • 7 kudos

Resolved! Can we update a Parquet file?

I have copied a table into a Parquet file; now can I update a row or a column in the parquet file without rewriting all the data (the data is huge), using Databricks or ADF? Thank you

Latest Reply
youssefmrini
Honored Contributor III
  • 7 kudos

With Parquet you can only append data, which is why you need to convert your parquet table to Delta. It will be much easier.
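A minimal sketch of that conversion, assuming a hypothetical path and column names; `CONVERT TO DELTA` builds a transaction log in place without rewriting the data files, after which row-level `UPDATE` works:

```python
# One-time, in-place conversion (hypothetical path).
spark.sql("CONVERT TO DELTA parquet.`/mnt/data/my_table`")

# Updates now rewrite only the affected files, not the whole table.
spark.sql("UPDATE delta.`/mnt/data/my_table` SET status = 'closed' WHERE id = 42")
```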

2 More Replies
ricperelli
by New Contributor II
  • 1500 Views
  • 0 replies
  • 1 kudos

How can I save a parquet file using pandas from a Data Factory-orchestrated notebook?

Hi guys, this is my first question; feel free to correct me if I'm doing something wrong. Anyway, I'm facing a really strange problem: I have a notebook in which I'm performing some pandas analysis, and after that I save the resulting dataframe in a parque...
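A frequent catch in this setup is the output path: pandas writes through the local filesystem, so on Databricks it needs the `/dbfs` fuse mount rather than a `dbfs:/` URI. A hedged sketch with a hypothetical path:

```python
import pandas as pd

pdf = pd.DataFrame({"value": [1, 2, 3]})

# pandas bypasses Spark entirely, so write via the /dbfs fuse mount.
pdf.to_parquet("/dbfs/mnt/output/result.parquet", index=False)
```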

learnerbricks
by New Contributor II
  • 753 Views
  • 2 replies
  • 0 kudos

How should I start with Databricks?

Hello guys, I am new to Databricks. I have tried to read as much of the documentation as I can, and now I want to jump in. What I want: I have stored my parquet file in the Databricks storage system. I want to load this file into a Data Lake table, and then want to do ...
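A minimal starting point, with hypothetical path and table names: read the parquet file with Spark, save it as a Delta table, and query it with SQL.

```python
df = spark.read.parquet("/mnt/landing/my_file.parquet")  # hypothetical path

# Save as a managed Delta table (hypothetical name), then query it.
df.write.format("delta").mode("overwrite").saveAsTable("my_schema.my_table")
spark.sql("SELECT COUNT(*) FROM my_schema.my_table").show()
```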

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Learner bricks, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...

1 More Replies
ta_db
by New Contributor
  • 1040 Views
  • 2 replies
  • 0 kudos

Databricks SQL Endpoint Failing to create an external table on a parquet file with Decimal or Timestamp datatype

I'm using the Databricks SQL Endpoint and I'm attempting to create an external table on top of an existing parquet file. I can do this so long as my table definition does not include a reference to a decimal or timestamp/date datatype. For example, this works: C...
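For reference, a sketch of the general shape of the DDL being described, with all names hypothetical; per the post, the string-only variant succeeds while decimal or timestamp columns trigger the failure on the SQL endpoint:

```python
# All names hypothetical; the poster reports this form failing once
# DECIMAL or TIMESTAMP columns appear in the definition.
spark.sql("""
    CREATE TABLE ext_sales (
        id     STRING,
        amount DECIMAL(18, 2),
        ts     TIMESTAMP
    )
    USING PARQUET
    LOCATION 's3://my-bucket/sales/'
""")
```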

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hey there @T A, hope everything is going great! Does @Kaniz Fatma's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? If not, would you be happy to give us more info...

1 More Replies
irfanaziz
by Contributor II
  • 5200 Views
  • 3 replies
  • 2 kudos

Resolved! Issue reading a parquet file in PySpark on Databricks.

One of the source systems generates, from time to time, a parquet file which is only 220 KB in size, but reading it fails: "java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet Caused by: org.apache.spark.sql.AnalysisExce...
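When a single malformed file breaks a larger read, two hedged options are skipping corrupt files or supplying the expected schema instead of inferring it per file; both shown with hypothetical names:

```python
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical expected schema for the feed.
expected = StructType([StructField("id", StringType())])

# Option 1: skip files whose footers cannot be read.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Option 2: bypass per-file schema inference with an explicit schema.
df = spark.read.schema(expected).parquet("/mnt/source/")
```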

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@nafri A​ - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek​'s answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks

2 More Replies
Nazar
by New Contributor II
  • 3550 Views
  • 5 replies
  • 5 kudos

Resolved! Incremental write

Hi all, I have a daily Spark job that reads and joins 3-4 source tables and writes the df in parquet format. This data frame consists of 100+ columns. As this job runs daily, our deduplication logic identifies the latest record from each of the source t...
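The usual direction for this kind of incremental, deduplicated write is a Delta MERGE upsert rather than rewriting parquet; a hedged sketch with hypothetical paths and key names:

```python
from delta.tables import DeltaTable

# Hypothetical daily extract after deduplication.
df_latest = spark.read.parquet("/mnt/staging/daily/")

target = DeltaTable.forPath(spark, "/mnt/silver/target")  # hypothetical path

(
    target.alias("t")
    .merge(df_latest.alias("s"), "t.business_key = s.business_key")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```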

Latest Reply
Nazar
New Contributor II
  • 5 kudos

Thanks werners

4 More Replies
User16790091296
by Contributor II
  • 1174 Views
  • 0 replies
  • 1 kudos

What is the most efficient way to read in a partitioned parquet file with pyspark?

I work with parquet files stored in AWS S3 buckets. They are multiple TB in size and partitioned by a numeric column containing integer values between 1 and 200, call it my_partition. I read in and perform compute actions on this data in Databricks w...
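A hedged sketch of the main lever here, with a hypothetical bucket path: filter on the partition column so Spark prunes directories and only reads the requested partitions.

```python
from pyspark.sql.functions import col

df = spark.read.parquet("s3://my-bucket/my_table/")  # hypothetical path

# The filter on the partition column is applied as directory pruning,
# so only my_partition=42 is actually read from S3.
slice_df = df.where(col("my_partition") == 42)
slice_df.count()
```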
