Data Engineering

Forum Posts

irfanaziz
by Contributor II
  • 18057 Views
  • 7 replies
  • 8 kudos

Resolved! How to merge small parquet files into a single parquet file?

I have thousands of parquet files with the same schema, each containing one or more records, but reading these files with Spark is very slow. I want to know if there is any solution for merging the files before reading them with Spark? Or is there any ...
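A minimal PySpark sketch of one common workaround, with hypothetical paths: read the whole directory once and rewrite it as a small number of larger files.

```python
# `spark` is the notebook's pre-created SparkSession on Databricks.
# Hypothetical paths for illustration.
df = spark.read.parquet("/mnt/raw/events_small_files/")

# coalesce(1) produces a single output file; for bigger datasets,
# repartition(n) to a sensible file count is usually the safer choice.
df.coalesce(1).write.mode("overwrite").parquet("/mnt/raw/events_merged/")
```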

Latest Reply
mmore500
New Contributor II
  • 8 kudos

Give [*joinem*](https://github.com/mmore500/joinem) a try, available via PyPI: `python3 -m pip install joinem`. *joinem* provides a CLI for fast, flexible concatenation of tabular data using [polars](https://pola.rs). I/O is *lazily streamed* in order ...

6 More Replies
swatish0395
by New Contributor III
  • 571 Views
  • 0 replies
  • 0 kudos

I am working on parquet file-level column encryption and decryption based on user-specific permissions

I am able to encrypt and decrypt the data in multiple ways and to save the encrypted parquet file, but I want to decrypt the data only if the user has the specific permission; otherwise they should get the encrypted data. Is there any permanent solution to de...
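One hedged sketch of a common pattern for this, with all table, column, group, and key names hypothetical: keep the column encrypted at rest and expose a view that decrypts only for members of a privileged group. `aes_encrypt`/`aes_decrypt` are built-in functions in Spark 3.3+/Databricks, and `is_member()` is a Databricks SQL function; this is an illustration, not the poster's actual setup.

```python
# All names hypothetical. In practice the AES key should come from a
# secret scope, not a literal; '0123456789abcdef' is a 16-byte stand-in.
spark.sql("""
    CREATE OR REPLACE VIEW customers_secure AS
    SELECT
      id,
      CASE WHEN is_member('pii_readers')
           THEN CAST(aes_decrypt(unbase64(ssn_enc), '0123456789abcdef') AS STRING)
           ELSE ssn_enc
      END AS ssn
    FROM customers_encrypted
""")
```

Members of `pii_readers` see plaintext through the view; everyone else sees only the stored ciphertext.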

alm
by New Contributor III
  • 2133 Views
  • 2 replies
  • 2 kudos

Resolved! Vectorized reading of parquet file containing decimal type column(s)

I was trying to read a parquet file containing decimal type columns and write it to a Delta table. I encountered a problem that is pretty neatly described by this kb.databricks article, and which I solved by disabling the vector...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Alberte Mørk: The behavior you observed is due to a known issue in Apache Spark when vectorized reading is used with Parquet files that contain decimal type columns. As you mentioned, the issue can be resolved by disabling vectorized reading for th...
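The workaround referred to here is a one-line session config; a minimal sketch with hypothetical path and table names:

```python
# Disable the vectorized Parquet reader so decimal columns fall back
# to the slower but correct non-vectorized path.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

df = spark.read.parquet("/mnt/landing/decimal_file.parquet")  # hypothetical path
df.write.format("delta").saveAsTable("my_delta_table")        # hypothetical name
```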

1 More Replies
uv
by New Contributor II
  • 2706 Views
  • 3 replies
  • 2 kudos

Parquet to csv delta file

Hi Team, I have a parquet file in an S3 bucket which is a Delta file. I am able to read it, but I am unable to write it as a CSV file. I am getting the following error when I try to write: A transaction log for Databricks Delta was found at `s3://path/a...
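That error usually means the path is a Delta table, so it has to be read with the delta format before it can be rewritten as CSV. A sketch with hypothetical paths (the post's own path is elided):

```python
# The source path is a Delta table, so read it with the delta format.
df = spark.read.format("delta").load("s3://my-bucket/delta_table/")

# Then write it out as CSV to a separate location.
df.write.mode("overwrite").option("header", "true").csv("s3://my-bucket/csv_out/")
```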

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @yuvesh kotiala, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...

2 More Replies
ramz
by New Contributor II
  • 1827 Views
  • 4 replies
  • 1 kudos

High driver memory usage on loading parquet file

Hi, I am using PySpark to read a bunch of parquet files and do a count on each of them. Driver memory shoots up from about 6 GB to 8 GB. My setup: I have a cluster of 1 driver node and 2 worker nodes (all of them 16 cores, 128 GB RAM). This is th...
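One common cause of driver memory growth in this scenario is driver-side schema inference, which reads footers across many files. A hedged sketch, with a hypothetical schema and path, that supplies the schema up front:

```python
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical schema; providing it explicitly stops the driver from
# reading every file's footer just to infer one.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

df = spark.read.schema(schema).parquet("/mnt/data/events/")
print(df.count())
```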

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @ramz siva, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your feedback wi...

3 More Replies
BL
by New Contributor III
  • 2776 Views
  • 4 replies
  • 3 kudos

Error reading in Parquet file

I am trying to read a .parquet file from an ADLS Gen2 location in Azure Databricks, but I am facing the below error: spark.read.parquet("abfss://............/..._2023-01-14T08:01:29.8549884Z.parquet") org.apache.spark.SparkException: Job aborted due to stag...

Latest Reply
jose_gonzalez
Moderator
  • 3 kudos

Can you access the executor logs? When your cluster is up and running, you can access the executors' logs. For example, the error shows: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent ...

3 More Replies
wyzer
by Contributor II
  • 2591 Views
  • 2 replies
  • 12 kudos

Resolved! Add the creation date of a parquet file into a DataFrame

Currently I load multiple parquet files with this code: df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*") (Inside the Voucher folder there is one folder per date, each containing one parquet file.) How can I add a column to this DataFrame that...
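A hedged sketch using the hidden `_metadata` column available on file-source DataFrames in recent Spark/Databricks runtimes; `file_modification_time` is the closest built-in proxy for a file's creation date:

```python
from pyspark.sql.functions import col

# _metadata is only materialized when referenced explicitly.
df = (
    spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
    .select("*", col("_metadata.file_modification_time").alias("file_date"))
)
```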

Latest Reply
wyzer
Contributor II
  • 12 kudos

Thanks @Michail Karamanos​ 

1 More Replies
kkumar
by New Contributor III
  • 13822 Views
  • 3 replies
  • 7 kudos

Resolved! Can we update a Parquet file?

I have copied a table into a Parquet file; now can I update a row or a column in the parquet file without rewriting all the data (the data is huge), using Databricks or ADF? Thank you

Latest Reply
youssefmrini
Honored Contributor III
  • 7 kudos

With Parquet you can only append data, which is why you need to convert your parquet table to Delta. It will be much easier.
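A minimal sketch of that conversion, assuming a hypothetical path and column names; `CONVERT TO DELTA` builds a transaction log in place without rewriting the data files, after which row-level `UPDATE` works:

```python
# One-time, in-place conversion (hypothetical path).
spark.sql("CONVERT TO DELTA parquet.`/mnt/data/my_table`")

# Updates now rewrite only the affected files, not the whole table.
spark.sql("UPDATE delta.`/mnt/data/my_table` SET status = 'closed' WHERE id = 42")
```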

2 More Replies
ricperelli
by New Contributor II
  • 1500 Views
  • 0 replies
  • 1 kudos

How can I save a parquet file using pandas from a Data Factory-orchestrated notebook?

Hi guys, this is my first question; feel free to correct me if I'm doing something wrong. Anyway, I'm facing a really strange problem: I have a notebook in which I'm performing some pandas analysis, and after that I save the resulting dataframe in a parque...
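A frequent catch in this setup is the output path: pandas writes through the local filesystem, so on Databricks it needs the `/dbfs` fuse mount rather than a `dbfs:/` URI. A hedged sketch with a hypothetical path:

```python
import pandas as pd

pdf = pd.DataFrame({"value": [1, 2, 3]})

# pandas bypasses Spark entirely, so write via the /dbfs fuse mount.
pdf.to_parquet("/dbfs/mnt/output/result.parquet", index=False)
```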

learnerbricks
by New Contributor II
  • 753 Views
  • 2 replies
  • 0 kudos

How should I start with Databricks?

Hello guys, I am new to Databricks. I have tried to read as much of the documentation as I can, and now I want to jump in. What I want: I have stored my parquet file in the Databricks storage system. I want to load this file into a Data Lake table, and then want to do ...
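A minimal starting point, with hypothetical path and table names: read the parquet file with Spark, save it as a Delta table, and query it with SQL.

```python
df = spark.read.parquet("/mnt/landing/my_file.parquet")  # hypothetical path

# Save as a managed Delta table (hypothetical name), then query it.
df.write.format("delta").mode("overwrite").saveAsTable("my_schema.my_table")
spark.sql("SELECT COUNT(*) FROM my_schema.my_table").show()
```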

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Learner bricks, hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...

1 More Replies
ta_db
by New Contributor
  • 1040 Views
  • 2 replies
  • 0 kudos

Databricks SQL Endpoint Failing to create an external table on a parquet file with Decimal or Timestamp datatype

I'm using the Databricks SQL Endpoint and I'm attempting to create an external table on top of an existing parquet file. I can do this so long as my table definition does not include a reference to a decimal or timestamp/date datatype. For example, this works: C...
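For reference, a sketch of the general shape of the DDL being described, with all names hypothetical; per the post, the string-only variant succeeds while decimal or timestamp columns trigger the failure on the SQL endpoint:

```python
# All names hypothetical; the poster reports this form failing once
# DECIMAL or TIMESTAMP columns appear in the definition.
spark.sql("""
    CREATE TABLE ext_sales (
        id     STRING,
        amount DECIMAL(18, 2),
        ts     TIMESTAMP
    )
    USING PARQUET
    LOCATION 's3://my-bucket/sales/'
""")
```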

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hey there @T A, hope everything is going great! Does @Kaniz Fatma's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? If not, would you be happy to give us more info...

1 More Replies
irfanaziz
by Contributor II
  • 5200 Views
  • 3 replies
  • 2 kudos

Resolved! Issue reading a parquet file in PySpark on Databricks.

One of the source systems generates, from time to time, a parquet file which is only 220 KB in size, but reading it fails: "java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet Caused by: org.apache.spark.sql.AnalysisExce...
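When a single malformed file breaks a larger read, two hedged options are skipping corrupt files or supplying the expected schema instead of inferring it per file; both shown with hypothetical names:

```python
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical expected schema for the feed.
expected = StructType([StructField("id", StringType())])

# Option 1: skip files whose footers cannot be read.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Option 2: bypass per-file schema inference with an explicit schema.
df = spark.read.schema(expected).parquet("/mnt/source/")
```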

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@nafri A​ - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek​'s answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks

2 More Replies
Nazar
by New Contributor II
  • 3550 Views
  • 5 replies
  • 5 kudos

Resolved! Incremental write

Hi all, I have a daily Spark job that reads and joins 3-4 source tables and writes the df in parquet format. This data frame consists of 100+ columns. As this job runs daily, our deduplication logic identifies the latest record from each of the source t...
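The usual direction for this kind of incremental, deduplicated write is a Delta MERGE upsert rather than rewriting parquet; a hedged sketch with hypothetical paths and key names:

```python
from delta.tables import DeltaTable

# Hypothetical daily extract after deduplication.
df_latest = spark.read.parquet("/mnt/staging/daily/")

target = DeltaTable.forPath(spark, "/mnt/silver/target")  # hypothetical path

(
    target.alias("t")
    .merge(df_latest.alias("s"), "t.business_key = s.business_key")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```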

Latest Reply
Nazar
New Contributor II
  • 5 kudos

Thanks werners

4 More Replies
User16790091296
by Contributor II
  • 1174 Views
  • 0 replies
  • 1 kudos

What is the most efficient way to read in a partitioned parquet file with pyspark?

I work with parquet files stored in AWS S3 buckets. They are multiple TB in size and partitioned by a numeric column containing integer values between 1 and 200, call it my_partition. I read in and perform compute actions on this data in Databricks w...
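A hedged sketch of the main lever here, with a hypothetical bucket path: filter on the partition column so Spark prunes directories and only reads the requested partitions.

```python
from pyspark.sql.functions import col

df = spark.read.parquet("s3://my-bucket/my_table/")  # hypothetical path

# The filter on the partition column is applied as directory pruning,
# so only my_partition=42 is actually read from S3.
slice_df = df.where(col("my_partition") == 42)
slice_df.count()
```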
