- 18057 Views
- 7 replies
- 8 kudos
I have thousands of Parquet files with the same schema, each containing one or more records, but reading these files with Spark is very slow. Is there a way to merge the files before reading them with Spark? Or is there any ...
Latest Reply
Give [*joinem*](https://github.com/mmore500/joinem) a try, available via PyPI: `python3 -m pip install joinem`. *joinem* provides a CLI for fast, flexible concatenation of tabular data using [polars](https://pola.rs). I/O is *lazily streamed* in order ...
6 More Replies
- 571 Views
- 0 replies
- 0 kudos
I am able to encrypt and decrypt the data in multiple ways and save the encrypted Parquet file, but I want to decrypt the data only if the user has a specific permission; otherwise they should get the encrypted data. Is there any permanent solution to de...
by alm • New Contributor III
- 2133 Views
- 2 replies
- 2 kudos
I was trying to read a Parquet file containing decimal-type columns and write it to a Delta table. I encountered a problem that is neatly described by this kb.databricks article, and which I solved by disabling the vector...
Latest Reply
@Alberte Mørk​: The behavior you observed is due to a known issue in Apache Spark when vectorized reading is used with Parquet files that contain decimal-type columns. As you mentioned, the issue can be resolved by disabling vectorized reading for th...
1 More Replies
by uv • New Contributor II
- 2706 Views
- 3 replies
- 2 kudos
Hi Team, I have a Parquet file in an S3 bucket that is part of a Delta table. I am able to read it, but I am unable to write it out as a CSV file. I get the following error when trying to write: A transaction log for Databricks Delta was found at `s3://path/a...
Latest Reply
Hi @yuvesh kotiala​, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Tha...
2 More Replies
by ramz • New Contributor II
- 1827 Views
- 4 replies
- 1 kudos
Hi, I am using PySpark to read a bunch of Parquet files and run a count on each of them. Driver memory shoots up to about 6-8 GB. My setup: a cluster with 1 driver node and 2 worker nodes (each with 16 cores and 128 GB RAM). This is th...
Latest Reply
Hi @ramz siva​, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking "Select As Best" if it does. Your feedback wi...
3 More Replies
by BL • New Contributor III
- 2776 Views
- 4 replies
- 3 kudos
I am trying to read a .parquet file from an ADLS Gen2 location in Azure Databricks, but I am facing the error below: spark.read.parquet("abfss://............/..._2023-01-14T08:01:29.8549884Z.parquet") org.apache.spark.SparkException: Job aborted due to stag...
Latest Reply
Can you access the executor logs? When your cluster is up and running, the executor logs are available. For example, the error shows: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent ...
3 More Replies
by wyzer • Contributor II
- 2591 Views
- 2 replies
- 12 kudos
Currently I load multiple Parquet files with this code: `df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")` (inside the Voucher folder there is one folder per date, each containing one Parquet file). How can I add a column to this DataFrame that...
by kkumar • New Contributor III
- 13822 Views
- 3 replies
- 7 kudos
I have copied a table into a Parquet file. Can I update a row or a column in the Parquet file without rewriting all the data (the data is huge), using Databricks or ADF? Thank you.
Latest Reply
With Parquet you can only append data; that's why you need to convert your Parquet table to Delta. Updates will be much easier.
2 More Replies
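On Databricks, the conversion the reply suggests is a single SQL statement that writes only a transaction log next to the existing Parquet files, without rewriting the data. A sketch (the table path is hypothetical, and the `spark.sql` call assumes a Delta-enabled runtime, so it is left commented out here):

```python
# Hypothetical path to the existing Parquet table.
path = "/mnt/data/my_table"

# CONVERT TO DELTA adds a _delta_log to the folder in place; after that,
# UPDATE / DELETE / MERGE statements become available on the table.
convert_sql = f"CONVERT TO DELTA parquet.`{path}`"
# spark.sql(convert_sql)  # run on Databricks or with delta-spark installed
```

After conversion, an in-place row update is e.g. `UPDATE delta.`/mnt/data/my_table` SET col = ... WHERE ...`, and Delta rewrites only the affected files.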
- 1500 Views
- 0 replies
- 1 kudos
Hi guys, this is my first question, so feel free to correct me if I'm doing something wrong. Anyway, I'm facing a really strange problem: I have a notebook in which I'm performing some pandas analysis, and after that I save the resulting dataframe in a parque...
- 753 Views
- 2 replies
- 0 kudos
Hello guys, I am new to Databricks. I have tried to read as much of the documentation as I can, and now I want to jump in. What I want: I have stored my Parquet file in the Databricks storage system. I want to load this file into a data lake table, and then want to do ...
Latest Reply
Hi @Learner bricks​, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Tha...
1 More Replies
by ta_db • New Contributor
- 1040 Views
- 2 replies
- 0 kudos
I'm using the Databricks SQL Endpoint and I'm attempting to create an external table on top of an existing Parquet file. I can do this so long as my table definition does not include a reference to a decimal or timestamp/date datatype. For example, this works: C...
Latest Reply
Hey there @T A​, hope everything is going great! Does @Kaniz Fatma​'s response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? If not, would you be happy to give us more info...
1 More Replies
- 5200 Views
- 3 replies
- 2 kudos
From time to time, one of the source systems generates a Parquet file that is only 220 KB in size, but reading it fails: "java.io.IOException: Could not read or convert schema for file: 1-2022-00-51-56.parquet Caused by: org.apache.spark.sql.AnalysisExce...
Latest Reply
@nafri A​ - Howdy! My name is Piper, and I'm a community moderator for Databricks. Would you be happy to mark @Hubert Dudek​'s answer as best if it solved the problem? That will help other members find the answer more quickly. Thanks
2 More Replies
by Nazar • New Contributor II
- 3550 Views
- 5 replies
- 5 kudos
Hi All, I have a daily Spark job that reads and joins 3-4 source tables and writes the df in a Parquet format. This data frame consists of 100+ columns. As this job runs daily, our deduplication logic identifies the latest record from each of source t...
- 1174 Views
- 0 replies
- 1 kudos
I work with parquet files stored in AWS S3 buckets. They are multiple TB in size and partitioned by a numeric column containing integer values between 1 and 200, call it my_partition. I read in and perform compute actions on this data in Databricks w...