Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Spark 3.3.1 supports the brotli compression codec, but when I use it to read parquet files from S3, I get: INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI
Example code:
df = (spark.read.format("parquet")
    .option("compression", "brotli")...
Given the new information I appended, I looked into the Delta caching and found I can disable it: .option("spark.databricks.io.cache.enabled", False). This works as a workaround while I read these files in to save them locally in DBFS, but does it have perfo...
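A minimal sketch of that workaround, assuming a hypothetical s3://my-bucket/brotli-data/ source path and that the cluster allows changing the IO cache config at runtime; it disables the disk cache before reading the brotli-compressed files and rewrites them with a more widely supported codec:

# Disable the Databricks disk (IO) cache so the parquet reader itself decodes
# the brotli pages rather than the cache layer (assumption: this config can be
# changed on a running cluster).
spark.conf.set("spark.databricks.io.cache.enabled", "false")

# Hypothetical paths, for illustration only.
src = "s3://my-bucket/brotli-data/"
dst = "dbfs:/tmp/brotli-recompressed/"

df = spark.read.parquet(src)

# Rewrite with a codec the rest of the platform handles without issue.
df.write.mode("overwrite").option("compression", "snappy").parquet(dst)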
I am doing the "Data Engineering with Databricks V2" learning path. I cannot run "DE 4.2 - Providing Options for External Sources", as the first code cell does not run successfully:
%run ../Includes/Classroom-Setup-04.2
Screenshot 1: Inside the setup note...
Good afternoon,
Attempting to run this statement:
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS dev_user_login (
event_name STRING,
datetime TIMESTAMP,
ip_address STRING,
acting_user_id STRING
)
PARTITIONED BY (date DATE)
STORED AS PARQUET
...
1. Change to the Spark-native catalog approach (not the Hive metastore); this works. The syntax is essentially:
CREATE TABLE IF NOT EXISTS dbName.tableName (
    column names and types
)
USING parquet
PARTITIONED BY (
    runAt STRING
)
LOCA...
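A runnable sketch of that Spark-native syntax, with placeholder table, column, and location names (the partition-column-with-type form mirrors the snippet above):

# Create a parquet-backed table in the Spark catalog (placeholder names/path).
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.user_login_events (
        event_name     STRING,
        event_datetime TIMESTAMP,
        ip_address     STRING,
        acting_user_id STRING
    )
    USING parquet
    PARTITIONED BY (event_date DATE)
    LOCATION 's3://example-bucket/user_login_events/'
""")

# If the location already holds partition directories, register them.
spark.sql("MSCK REPAIR TABLE dev.user_login_events")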
Hi, I have data in parquet format in GCS buckets, partitioned by name, e.g. gs://mybucket/name=ABCD/. I am trying to create a table in Databricks as follows:
DROP TABLE IF EXISTS name_test;
CREATE TABLE name_test
USING parquet
LOCATION "gs://mybucket/name=*/...
Hi @M Baig, the error doesn't tell me much, but you could try:
CREATE TABLE name_test
USING parquet
PARTITIONED BY ( name STRING)
LOCATION "gs://mybucket/";
I have S3 as a data source containing a sample TPC dataset (10G, 100G). I want to convert it into parquet files with an average size of ~256 MiB. What configuration parameter can I use to set that? I also need the data to be partitioned. And withi...
Hi @Vikas Goel, we haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if those suggestions helped you. If you have found a solution, please share it with the community, as it can be helpful to o...
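One common way to approximate a target output file size, sketched below with a hypothetical source path and a date-style partition column; maxRecordsPerFile caps rows per file, so the byte target has to be translated into a row count estimated from the data:

# Hypothetical paths / partition column, for illustration.
src = "s3://tpc-source/store_sales/"
dst = "s3://tpc-parquet/store_sales/"

df = spark.read.parquet(src)

# Derive how many rows fit into a ~256 MiB file from an estimated average row
# size. (Rough heuristic, not an exact guarantee.)
target_file_bytes = 256 * 1024 * 1024
approx_bytes_per_row = 200          # assumed figure; measure on your own data
rows_per_file = target_file_bytes // approx_bytes_per_row

(df.repartition("sold_date")                     # spread work by partition column
   .write
   .mode("overwrite")
   .option("maxRecordsPerFile", rows_per_file)   # cap rows (hence bytes) per file
   .partitionBy("sold_date")                     # directory-style partitioning
   .parquet(dst))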
I am able to encrypt and decrypt the data in multiple ways and can save the encrypted parquet file, but I want to decrypt the data only if the user has a specific permission; otherwise they should get the encrypted data. Is there any permanent solution to de...
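One pattern that fits this requirement is a view that only decrypts for members of a privileged group. A sketch, assuming the table, column, key, and group names below are placeholders and that the runtime provides aes_decrypt and is_member (and that the data was encrypted with the matching aes_encrypt settings):

# Placeholder names throughout; in practice fetch the key from a secret scope
# rather than embedding it in the view text.
spark.sql("""
    CREATE OR REPLACE VIEW customer_data_guarded AS
    SELECT
        customer_id,
        CASE
            WHEN is_member('pii_readers')
                THEN CAST(aes_decrypt(ssn_encrypted, '<key-from-secret-scope>') AS STRING)
            ELSE base64(ssn_encrypted)
        END AS ssn
    FROM customer_data_encrypted
""")

Grant users access to the view only, not the underlying table, so the permission check cannot be bypassed.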
Why do Spark/Delta Lake choose Parquet over the ORC file format? I learnt that ORC is much faster when querying, is more compression-efficient than parquet, and has most of the features parquet has. Why not choose ORC? Am I missing something? Ple...
When querying in Delta I am unable to see the previous partition, whereas when reading the data using the parquet file format it shows the whole partition data column.
Delta format: spark.read.format("delta").load("")
Parquet format: spark.read.parquet("...
Hi @Gaurav Rawat Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...
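The difference comes from the Delta transaction log: format("delta") only surfaces the files referenced by the current table version, while a raw parquet read of the same directory can pick up data files that older versions left behind. A sketch, with a hypothetical path, of the supported way to look at an earlier state:

path = "s3://example-bucket/events_delta/"   # hypothetical table location

# Current snapshot: only files referenced by the latest transaction-log version.
current_df = spark.read.format("delta").load(path)

# Earlier snapshots (and their partitions) via time travel, rather than reading
# the directory as raw parquet.
older_df = spark.read.format("delta").option("versionAsOf", 0).load(path)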
Hi Team, I have a parquet file in an S3 bucket which is a Delta file. I am able to read it, but I am unable to write it as a CSV file. I am getting the following error when I try to write: A transaction log for Databricks Delta was found at `s3://path/a...
Hi @yuvesh kotiala Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...
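That error usually means the path is (or sits inside) a Delta table, so the usual workaround is to read it through the delta format and write the CSV to a separate, non-Delta location. A sketch with placeholder paths:

src = "s3://path/to/delta-table/"   # placeholder for the Delta table path
dst = "s3://path/to/csv-export/"    # placeholder, outside the Delta directory

df = spark.read.format("delta").load(src)

(df.write
   .mode("overwrite")
   .option("header", "true")
   .csv(dst))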
Problem: Reading nearly equivalent parquet tables in a directory, where some have column X with type float and some with type double, fails.
Attempts at resolving: using streaming files; removing delta caching and vectorization; using .cache() explicitly.
Notes: This...
Hi @Erik Louie Help us build a vibrant and resourceful community by recognizing and highlighting insightful contributions. Mark the best answers and show your appreciation! Regards
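One way around a float/double mismatch like this is to read the two groups of files separately, upcast the narrower column to double, and union them; the directory layout below is an assumption for illustration:

from pyspark.sql.functions import col

# Hypothetical sub-paths: one batch written with X as float, the other as double.
float_df  = spark.read.parquet("s3://bucket/table/batch_float/")
double_df = spark.read.parquet("s3://bucket/table/batch_double/")

# Upcast the float column so both sides share one schema, then combine.
unified = (float_df.withColumn("X", col("X").cast("double"))
                   .unionByName(double_df))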
Hi All, I have exported all tables from a Postgres snapshot into S3 in parquet format. I am trying to read a table using Databricks and I am unable to do so. I get the following error: "Unable to infer schema for Parquet. It must be specified manually....
Hi @shiva charan velichala Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that bes...
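That error typically appears when the path Spark lists contains no readable parquet footers (empty directory, wrong prefix, or non-parquet files from the export). A sketch of both checks, with placeholder paths and columns:

from pyspark.sql.types import StructType, StructField, LongType, StringType

path = "s3://exports/public.users/"   # placeholder export prefix

# 1. Confirm the prefix actually contains the parquet data files.
display(dbutils.fs.ls(path))

# 2. If the files are there, supplying a schema skips footer-based inference.
schema = StructType([
    StructField("id",    LongType(),   True),
    StructField("email", StringType(), True),
])
df = spark.read.schema(schema).parquet(path)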
Hi, while developing an ETL for a large dataset I want to get a sample of the top rows to check that the pipeline "just runs", so I add a limit clause when reading the dataset. I'm surprised to see that instead of creating a single task, as in a sho...
It's been a while since the question was asked, and in the meantime Delta Lake 2.2.0 hit the shelves with the exact feature the OP asked about, i.e. LIMIT pushdown: LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT cl...
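For reference, the pattern the pushdown applies to looks like the sketch below (placeholder paths); on Delta Lake 2.2.0+ the scan should only read as much data as is needed to satisfy the limit:

# Placeholder table path; limit(1000) is pushed into the Delta scan on 2.2.0+.
sample_df = (spark.read.format("delta")
                  .load("s3://example-bucket/big_table/")
                  .limit(1000))

sample_df.write.mode("overwrite").parquet("s3://example-bucket/sample_out/")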
Hi Fellas - I'm trying to load parquet data (in a GCS location) into a Postgres DB (Google Cloud). For bulk-uploading data into PG we are using the spark-postgres library: https://framagit.org/interhop/library/spark-etl/-/tree/master/spark-postgres/src/main/sc...
Hi @Kaniz Fatma, @Daniel Sahal - a few updates from my side. After many hits and trials, psycopg2 worked out in my case. We can process 200+ GB of data with a 10-node cluster (n2-highmem-4, 32 GB memory, 4 cores) and a driver with 32 GB memory, 4 cores with Run...
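A hedged sketch of the kind of psycopg2-based bulk load described above, using foreachPartition so each executor opens its own connection; the connection parameters, target table, and column list are placeholders, and values containing tabs/newlines would need escaping:

import io
import psycopg2

def load_partition(rows):
    # One connection per partition; COPY is the fastest bulk path into Postgres.
    conn = psycopg2.connect(host="10.0.0.5", dbname="analytics",
                            user="loader", password="...")   # placeholders
    buf = io.StringIO()
    for row in rows:
        # Tab-separated text matching the target column order; \N marks NULLs.
        buf.write("\t".join("\\N" if v is None else str(v) for v in row) + "\n")
    buf.seek(0)
    with conn, conn.cursor() as cur:
        cur.copy_expert(
            "COPY target_table (col_a, col_b, col_c) FROM STDIN WITH (FORMAT text)",
            buf,
        )
    conn.close()

df = spark.read.parquet("gs://mybucket/exports/")   # placeholder source path
df.foreachPartition(load_partition)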