Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

f2008700
by New Contributor III
  • 14118 Views
  • 7 replies
  • 7 kudos

Configuring average parquet file size

I have S3 as a data source containing a sample TPC dataset (10 GB, 100 GB). I want to convert that into Parquet files with an average size of about 256 MiB. What configuration parameter can I use to set that? I also need the data to be partitioned. And withi...

Latest Reply
Anonymous
Not applicable
  • 7 kudos

Hi @Vikas Goel, we haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if the suggestions helped you. Otherwise, if you have a solution, please share it with the community, as it can be helpful to o...

6 More Replies
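A minimal sketch of one common approach to the question above, assuming placeholder S3 paths, a hypothetical partition column, and a rough output-size estimate: plain Parquet writes have no single "target file size" knob, so you typically control the number of output files (and optionally cap rows per file).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths for the 10G TPC dataset.
src = "s3://my-bucket/tpch-10g/"
dst = "s3://my-bucket/tpch-10g-parquet/"

df = spark.read.parquet(src)

# Optional: cap rows per file; the value is an assumption and should be tuned
# so one file of your row width lands near 256 MiB.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5_000_000)

# Aim for roughly (estimated compressed output size / 256 MiB) output files.
estimated_output_mib = 10 * 1024  # assumption; adjust after a first run
num_files = max(1, estimated_output_mib // 256)

(df.repartition(int(num_files))
   .write
   .mode("overwrite")
   .partitionBy("partition_col")  # placeholder partition column
   .parquet(dst))
```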
pinaki1
by New Contributor III
  • 3709 Views
  • 5 replies
  • 0 kudos

Connect RDS from the Databricks SQL editor

Is it possible to connect to RDS and execute a query directly from the SQL editor without using Unity Catalog?

Latest Reply
luis_herrera
Contributor
  • 0 kudos

Hi there, yes, you can run federated queries from the DB SQL Editor. This is an experimental feature, though, and UC is actually not supported. You can read more here: https://docs.databricks.com/query-federation/index.html PS: check out the #DAIS2023 talks

4 More Replies
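The linked query-federation docs cover the SQL editor path; as a hedged alternative from a notebook (not the SQL editor itself), a plain JDBC read against the RDS instance also works without Unity Catalog. Host, database, credentials, and table below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details for an RDS PostgreSQL instance.
jdbc_url = "jdbc:postgresql://my-rds-host.rds.amazonaws.com:5432/mydb"

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "public.orders")   # placeholder table
      .option("user", "my_user")            # prefer a secret scope in practice
      .option("password", "my_password")
      .load())

# Query the RDS data with SQL once it is registered as a temp view.
df.createOrReplaceTempView("rds_orders")
spark.sql("SELECT count(*) FROM rds_orders").show()
```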
Pbarbosa154
by New Contributor III
  • 1150 Views
  • 2 replies
  • 0 kudos

What is the best way to ingest GCS data into Databricks and apply an Anomaly Detection Model?

I recently started exploring the field of Data Engineering and came across some difficulties. I have a bucket in GCS with millions of parquet files and I want to create an Anomaly Detection model with them. I was trying to ingest that data into Datab...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Pedro Barbosa: It seems like you are running out of memory when trying to convert the PySpark DataFrame to an H2O frame. One possible approach to solve this issue is to partition the PySpark DataFrame before converting it to an H2O frame. You can us...

1 More Replies
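A minimal sketch of the repartition-before-conversion idea from the reply, assuming the pysparkling (H2O Sparkling Water) API is installed and using a placeholder GCS path and partition count.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder GCS path; the bucket must be accessible from the cluster.
df = spark.read.parquet("gs://my-bucket/telemetry/")

# Repartition so the data is handed to H2O in many smaller chunks instead of
# a few huge partitions (512 is an assumption to tune for your cluster).
df = df.repartition(512)

# Assumes pysparkling is available; on older versions the call may take the
# SparkSession as an argument.
from pysparkling import H2OContext
hc = H2OContext.getOrCreate()
h2o_frame = hc.asH2OFrame(df)
```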
ramz
by New Contributor II
  • 2879 Views
  • 4 replies
  • 1 kudos

High driver memory usage on loading parquet file

Hi, I am using PySpark and I am reading a bunch of Parquet files and doing a count on each of them. Driver memory shoots up to about 6-8 GB. My setup: I have a cluster of 1 driver node and 2 worker nodes (all of them 16 cores, 128 GB RAM). This is th...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @ramz siva, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your feedback wi...

3 More Replies
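One pattern that can keep the driver lighter than looping over files and counting each one separately is a single read plus a per-file aggregation with input_file_name(); a hedged sketch with a placeholder path:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Placeholder path covering all the Parquet files at once.
df = spark.read.parquet("s3://my-bucket/events/")

# One job computes every file's row count; only the small (file, count)
# result ends up on the driver.
per_file_counts = df.groupBy(input_file_name().alias("file")).count()
per_file_counts.show(truncate=False)
```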
Erik_L
by Contributor II
  • 5283 Views
  • 1 reply
  • 0 kudos

How to merge Parquet files with different column types

Problem: I have a directory in S3 with a bunch of data files, like "data-20221101.parquet". They all have the same columns: timestamp, reading_a, reading_b, reading_c. In the earlier files, the readings are floats, but in the later ones they are double...

Latest Reply
mathan_pillai
Valued Contributor
  • 0 kudos

1) Can you let us know what the error message was when you don't set the schema and use mergeSchema? 2) What happens when you define the schema (with FloatType) and use mergeSchema? What error message do you get?

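If mergeSchema keeps tripping over the float/double mismatch, one hedged workaround (the path globs and the float-vs-double split are assumptions based on the question) is to read the two generations separately, upcast, and union:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Placeholder globs: older files store the readings as float, newer as double.
old_df = spark.read.parquet("s3://my-bucket/data/data-2022*.parquet")
new_df = spark.read.parquet("s3://my-bucket/data/data-2023*.parquet")

# Upcast the old float columns so both sides share one schema, then union.
for c in ["reading_a", "reading_b", "reading_c"]:
    old_df = old_df.withColumn(c, col(c).cast("double"))

merged = old_df.unionByName(new_df)
```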
JacintoArias
by New Contributor III
  • 6328 Views
  • 6 replies
  • 2 kudos

Resolved! Spark predicate pushdown on parquet files when using limit

Hi, while developing an ETL for a large dataset I want to get a sample of the top rows to check that my pipeline "just runs", so I add a limit clause when reading the dataset. I'm surprised to see that instead of creating a single task as in a sho...

Latest Reply
JacekLaskowski
New Contributor III
  • 2 kudos

It's been a while since the question was asked, and in the meantime Delta Lake 2.2.0 hit the shelves with the exact feature the OP asked about, i.e. LIMIT pushdown: LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT cl...

5 More Replies
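For reference, the kind of read the reply is describing, with a placeholder Delta table path; on Delta Lake 2.2+ the limit can be pushed into the scan instead of planning work for every file.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder Delta table path.
sample = (spark.read.format("delta")
          .load("s3://my-bucket/delta/events")
          .limit(100))

sample.show()
```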
Erik_L
by Contributor II
  • 3574 Views
  • 3 replies
  • 4 kudos

Resolved! Support for Parquet brotli compression or a work around

Spark 3.3.1 supports the Brotli compression codec, but when I use it to read Parquet files from S3, I get: INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI. Example code: df = (spark.read.format("parquet") .option("compression", "brotli")...

Latest Reply
Erik_L
Contributor II
  • 4 kudos

Given the new information I appended, I looked into the Delta caching and I can disable it: .option("spark.databricks.io.cache.enabled", False). This works as a workaround while I read these files in to save them locally in DBFS, but does it have perfo...

2 More Replies
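The same workaround expressed as a session configuration (spark.databricks.io.cache.enabled is the Databricks disk-cache setting mentioned in the reply); the paths and the choice of Snappy for the rewrite are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Disable the Databricks disk cache, which the reply above found to be the
# layer rejecting BROTLI pages.
spark.conf.set("spark.databricks.io.cache.enabled", "false")

# Placeholder paths: re-write the data once with a widely supported codec.
df = spark.read.parquet("s3://my-bucket/brotli-data/")
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("dbfs:/tmp/recompressed/"))
```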
DB_developer
by New Contributor III
  • 4254 Views
  • 4 replies
  • 7 kudos

Resolved! How are nulls stored in Delta Lake and Databricks?

In my findings I have found a lot of Delta tables in the lakehouse to be sparse, so I am just wondering what space the data lake takes to store null data; any suggestions for handling sparse data tables in the lakehouse would also be appreciated. I also want to o...

Latest Reply
Kaniz_Fatma
Community Manager
  • 7 kudos

Hi @Akash Ragothu, we haven't heard from you since the last response from @Ajay Pandey, and I was checking back to see if his suggestions helped you. Otherwise, if you have a solution, please share it with the community, as it can be helpful to othe...

3 More Replies
-werners-
by Esteemed Contributor III
  • 2358 Views
  • 2 replies
  • 17 kudos

Autoloader: how to avoid overlap in files

I'm thinking of using autoloader to process files being put on our data lake. Let's say, for example, every 15 minutes a parquet file is written. These files, however, contain overlapping data. Now, every 2 hours I want to process the new data (autoloader) and...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 17 kudos

What about foreachBatch and then MERGE? Alternatively, run another process that will clean up the updates using the window function, as you said.

1 More Replies
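A hedged sketch of the foreachBatch-plus-MERGE idea, with de-duplication inside each micro-batch; the table name, key column, timestamp column, and paths are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

def upsert_batch(batch_df, batch_id):
    # Keep only the latest record per key within this micro-batch.
    w = Window.partitionBy("key").orderBy(col("event_ts").desc())
    deduped = (batch_df.withColumn("rn", row_number().over(w))
                       .filter("rn = 1")
                       .drop("rn"))

    target = DeltaTable.forName(spark, "silver_events")  # placeholder table
    (target.alias("t")
           .merge(deduped.alias("s"), "t.key = s.key")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "parquet")
          .option("cloudFiles.schemaLocation", "dbfs:/chk/events/schema")  # placeholder
          .load("dbfs:/landing/events/"))                                  # placeholder

(stream.writeStream
       .foreachBatch(upsert_batch)
       .option("checkpointLocation", "dbfs:/chk/events/")
       .trigger(availableNow=True)  # schedule the job every 2 hours
       .start())
```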
elgeo
by Valued Contributor II
  • 961 Views
  • 0 replies
  • 3 kudos

Number of parquet files per delta table

Hello. We would like to understand how many Parquet files are created per Delta table. To be more specific, we refer to the current snapshot of the table. For example, we noticed that while we performed initial inserts to a table, one parquet file was...

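One way to check this is DESCRIBE DETAIL, which reports the file count and size of the current snapshot only (not older files retained for time travel); the table name is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("DESCRIBE DETAIL my_schema.my_table") \
     .select("numFiles", "sizeInBytes") \
     .show()
```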
Data_Engineer3
by Contributor III
  • 5966 Views
  • 4 replies
  • 4 kudos

Resolved! Unable to read file from dbfs location in databricks.

When I tried to read a file from DBFS, it throws an error - Caused by: FileReadException: Error while reading file dbfs:/.......................parquet is not a Parquet file. Expected magic number at tail [80, 65, 82, 49] but found [105, 108, 101, 115]. Bu...

Latest Reply
Kaniz_Fatma
Community Manager
  • 4 kudos

Hi @KARTHICK N, what is the one line of code you're using to read the file, and what precisely is the path? Can you confirm whether your file is a CSV or a Parquet file? Are you trying to read it in Python or Scala?

3 More Replies
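A quick, hedged check that often explains this error: a genuine Parquet file begins and ends with the magic bytes "PAR1" (80, 65, 82, 49), so peeking at the file shows whether it is really Parquet or plain text. The path is a placeholder; dbutils is available in Databricks notebooks.

```python
# Print the first bytes of the suspect file as text.
path = "dbfs:/path/to/suspect_file.parquet"  # placeholder
print(dbutils.fs.head(path, 100))

# If readable CSV/JSON text appears instead of "PAR1" plus binary data,
# the file extension does not match the actual content.
```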
Mayank
by New Contributor III
  • 10347 Views
  • 8 replies
  • 4 kudos

Resolved! Unable to load Parquet file using Autoloader. Can someone help?

I am trying to load Parquet files using Autoloader. Below is the code: def autoload_to_table(data_source, source_format, table_name, checkpoint_path): query = (spark.readStream .format('cloudFiles') .option('cl...

Latest Reply
Anonymous
Not applicable
  • 4 kudos

Hi again @Mayank Srivastava, thank you so much for getting back to us and marking the answer as best. We really appreciate your time. Wish you a great Databricks journey ahead!

7 More Replies
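For comparison, a minimal Auto Loader read of Parquet along the lines of the OP's function; cloudFiles.schemaLocation is required so the inferred schema is tracked across runs. The paths and table name are placeholders, not the OP's exact code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def autoload_to_table(data_source, source_format, table_name, checkpoint_path):
    query = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", source_format)
             .option("cloudFiles.schemaLocation", checkpoint_path)
             .load(data_source)
             .writeStream
             .option("checkpointLocation", checkpoint_path)
             .trigger(availableNow=True)
             .toTable(table_name))
    return query

query = autoload_to_table(
    data_source="dbfs:/landing/parquet/",    # placeholder
    source_format="parquet",
    table_name="bronze_events",              # placeholder
    checkpoint_path="dbfs:/chk/bronze_events/",
)
```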
Eyespoop
by New Contributor II
  • 17549 Views
  • 3 replies
  • 2 kudos

Resolved! PySpark: Writing Parquet Files to the Azure Blob Storage Container

Currently I am having some issues with writing the Parquet file to the storage container. I do have the code running, but whenever the DataFrame writer puts the Parquet into blob storage, instead of the Parquet file type it is created as a f...

Latest Reply
User16764241763
Honored Contributor
  • 2 kudos

Hello @Karl Saycon, can you try setting this config to prevent additional Parquet summary and metadata files from being written? The result of the DataFrame write to storage should be a single file. https://community.databricks.com/s/question/0D53f00001...

2 More Replies
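A hedged sketch of the usual pattern: Spark always writes a directory of part files, so to end up with one cleanly named .parquet blob you typically coalesce to a single partition, suppress the extra marker files, and rename the part file. The Hadoop settings below are the commonly cited ones (verify them for your DBR version); paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Commonly cited settings to skip _SUCCESS and Parquet summary files;
# treat as assumptions to confirm against the thread linked above.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("parquet.enable.summary-metadata", "false")
hconf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

out_dir = "wasbs://container@account.blob.core.windows.net/tmp_out/"  # placeholder
final_path = "wasbs://container@account.blob.core.windows.net/data/result.parquet"

df = spark.table("my_table")  # placeholder source
df.coalesce(1).write.mode("overwrite").parquet(out_dir)

# Copy the single part file to its final name and drop the temp directory.
part = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part, final_path)
dbutils.fs.rm(out_dir, True)
```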
vivek_sinha
by Contributor
  • 7353 Views
  • 4 replies
  • 4 kudos

Resolved! Getting Authentication Error while accessing Azure Blob table (wasb) URL using PySpark

I am trying to access the Azure Blob table using PySpark but am getting an authentication error. Here I am passing a SAS token (HTTP and HTTPS enabled), but it's working only with the WASBS (HTTPS) URL, not with the WASB (HTTP) URL. I even tried with the account key as...

Latest Reply
vivek_sinha
Contributor
  • 4 kudos

Hi @Arvind Ravish, the issue got fixed after passing the HTTP- and HTTPS-enabled token to the Spark executors. Thanks again for your help.

3 More Replies
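For reference, the way a SAS token is usually handed to the cluster (and therefore the executors) for the legacy WASB/WASBS driver is a per-container config; the account, container, and token below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder account/container; per the thread, the SAS token should be
# generated with both HTTP and HTTPS allowed.
spark.conf.set(
    "fs.azure.sas.mycontainer.myaccount.blob.core.windows.net",
    "<sas-token>",
)

df = spark.read.parquet(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/path/to/table/"
)
```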