Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

f2008700
by New Contributor III
  • 14118 Views
  • 7 replies
  • 7 kudos

Configuring average parquet file size

I have S3 as a data source containing a sample TPC dataset (10 GB, 100 GB). I want to convert that into Parquet files with an average size of about 256 MiB. What configuration parameter can I use to set that? I also need the data to be partitioned. And withi...

Latest Reply
Anonymous
Not applicable
  • 7 kudos

Hi @Vikas Goel, we haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if the suggestions helped you. Otherwise, if you have a solution, please share it with the community, as it can be helpful to o...

6 More Replies
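A minimal sketch of one common approach to the question above, assuming placeholder S3 paths, a hypothetical partition column, and a rough output-size estimate: plain Parquet writes have no single "target file size" knob, so you typically control the number of output files (and optionally cap rows per file).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths for the 10G TPC dataset.
src = "s3://my-bucket/tpch-10g/"
dst = "s3://my-bucket/tpch-10g-parquet/"

df = spark.read.parquet(src)

# Optional: cap rows per file; the value is an assumption and should be tuned
# so one file of your row width lands near 256 MiB.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5_000_000)

# Aim for roughly (estimated compressed output size / 256 MiB) output files.
estimated_output_mib = 10 * 1024  # assumption; adjust after a first run
num_files = max(1, estimated_output_mib // 256)

(df.repartition(int(num_files))
   .write
   .mode("overwrite")
   .partitionBy("partition_col")  # placeholder partition column
   .parquet(dst))
```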
pinaki1
by New Contributor III
  • 3709 Views
  • 5 replies
  • 0 kudos

Connect RDS from the Databricks SQL editor

Is it possible to connect to RDS and execute a query directly from the SQL editor without using Unity Catalog?

Latest Reply
luis_herrera
Contributor
  • 0 kudos

Hi there, yes, you can run federated queries from the DB SQL Editor. This is an experimental feature, though, and UC is actually not supported. You can read more here: https://docs.databricks.com/query-federation/index.html PS: check out the #DAIS2023 talks

4 More Replies
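The linked query-federation docs cover the SQL editor path; as a hedged alternative from a notebook (not the SQL editor itself), a plain JDBC read against the RDS instance also works without Unity Catalog. Host, database, credentials, and table below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details for an RDS PostgreSQL instance.
jdbc_url = "jdbc:postgresql://my-rds-host.rds.amazonaws.com:5432/mydb"

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "public.orders")   # placeholder table
      .option("user", "my_user")            # prefer a secret scope in practice
      .option("password", "my_password")
      .load())

# Query the RDS data with SQL once it is registered as a temp view.
df.createOrReplaceTempView("rds_orders")
spark.sql("SELECT count(*) FROM rds_orders").show()
```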
Pbarbosa154
by New Contributor III
  • 1150 Views
  • 2 replies
  • 0 kudos

What is the best way to ingest GCS data into Databricks and apply an Anomaly Detection Model?

I recently started exploring the field of Data Engineering and came across some difficulties. I have a bucket in GCS with millions of parquet files and I want to create an Anomaly Detection model with them. I was trying to ingest that data into Datab...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Pedro Barbosa: It seems like you are running out of memory when trying to convert the PySpark DataFrame to an H2O frame. One possible approach to solve this issue is to partition the PySpark DataFrame before converting it to an H2O frame. You can us...

1 More Replies
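A minimal sketch of the repartition-before-conversion idea from the reply, assuming the pysparkling (H2O Sparkling Water) API is installed and using a placeholder GCS path and partition count.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder GCS path; the bucket must be accessible from the cluster.
df = spark.read.parquet("gs://my-bucket/telemetry/")

# Repartition so the data is handed to H2O in many smaller chunks instead of
# a few huge partitions (512 is an assumption to tune for your cluster).
df = df.repartition(512)

# Assumes pysparkling is available; on older versions the call may take the
# SparkSession as an argument.
from pysparkling import H2OContext
hc = H2OContext.getOrCreate()
h2o_frame = hc.asH2OFrame(df)
```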
ramz
by New Contributor II
  • 2879 Views
  • 4 replies
  • 1 kudos

High driver memory usage on loading parquet file

Hi, I am using PySpark and I am reading a bunch of Parquet files and doing a count on each of them. Driver memory shoots up to about 6-8 GB. My setup: I have a cluster of 1 driver node and 2 worker nodes (all of them 16 cores, 128 GB RAM). This is th...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @ramz siva, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your feedback wi...

3 More Replies
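One pattern that can keep the driver lighter than looping over files and counting each one separately is a single read plus a per-file aggregation with input_file_name(); a hedged sketch with a placeholder path:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Placeholder path covering all the Parquet files at once.
df = spark.read.parquet("s3://my-bucket/events/")

# One job computes every file's row count; only the small (file, count)
# result ends up on the driver.
per_file_counts = df.groupBy(input_file_name().alias("file")).count()
per_file_counts.show(truncate=False)
```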
Erik_L
by Contributor II
  • 5283 Views
  • 1 reply
  • 0 kudos

How to merge Parquet files with different column types

Problem: I have a directory in S3 with a bunch of data files, like "data-20221101.parquet". They all have the same columns: timestamp, reading_a, reading_b, reading_c. In the earlier files, the readings are floats, but in the later ones they are double...

Latest Reply
mathan_pillai
Valued Contributor
  • 0 kudos

1) Can you let us know what the error message was when you don't set the schema and use mergeSchema? 2) What happens when you define the schema (with FloatType) and use mergeSchema? What error message do you get?

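If mergeSchema keeps tripping over the float/double mismatch, one hedged workaround (the path globs and the float-vs-double split are assumptions based on the question) is to read the two generations separately, upcast, and union:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Placeholder globs: older files store the readings as float, newer as double.
old_df = spark.read.parquet("s3://my-bucket/data/data-2022*.parquet")
new_df = spark.read.parquet("s3://my-bucket/data/data-2023*.parquet")

# Upcast the old float columns so both sides share one schema, then union.
for c in ["reading_a", "reading_b", "reading_c"]:
    old_df = old_df.withColumn(c, col(c).cast("double"))

merged = old_df.unionByName(new_df)
```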
JacintoArias
by New Contributor III
  • 6328 Views
  • 6 replies
  • 2 kudos

Resolved! Spark predicate pushdown on parquet files when using limit

Hi, while developing an ETL for a large dataset I want to get a sample of the top rows to check that my pipeline "just runs", so I add a limit clause when reading the dataset. I'm surprised to see that instead of creating a single task as in a sho...

Latest Reply
JacekLaskowski
New Contributor III
  • 2 kudos

It's been a while since the question was asked, and in the meantime Delta Lake 2.2.0 hit the shelves with the exact feature the OP asked about, i.e. LIMIT pushdown: LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT cl...

5 More Replies
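For reference, the kind of read the reply is describing, with a placeholder Delta table path; on Delta Lake 2.2+ the limit can be pushed into the scan instead of planning work for every file.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder Delta table path.
sample = (spark.read.format("delta")
          .load("s3://my-bucket/delta/events")
          .limit(100))

sample.show()
```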
Erik_L
by Contributor II
  • 3574 Views
  • 3 replies
  • 4 kudos

Resolved! Support for Parquet brotli compression or a work around

Spark 3.3.1 supports the Brotli compression codec, but when I use it to read Parquet files from S3, I get: INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI. Example code: df = (spark.read.format("parquet") .option("compression", "brotli")...

Latest Reply
Erik_L
Contributor II
  • 4 kudos

Given the new information I appended, I looked into the Delta caching and I can disable it: .option("spark.databricks.io.cache.enabled", False). This works as a workaround while I read these files in to save them locally in DBFS, but does it have perfo...

2 More Replies
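The same workaround expressed as a session configuration (spark.databricks.io.cache.enabled is the Databricks disk-cache setting mentioned in the reply); the paths and the choice of Snappy for the rewrite are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Disable the Databricks disk cache, which the reply above found to be the
# layer rejecting BROTLI pages.
spark.conf.set("spark.databricks.io.cache.enabled", "false")

# Placeholder paths: re-write the data once with a widely supported codec.
df = spark.read.parquet("s3://my-bucket/brotli-data/")
(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("dbfs:/tmp/recompressed/"))
```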
DB_developer
by New Contributor III
  • 4254 Views
  • 4 replies
  • 7 kudos

Resolved! How are nulls stored in Delta Lake and Databricks?

In my findings I have found a lot of Delta tables in the lakehouse to be sparse, so I am just wondering what space the data lake takes to store null data; any suggestions for handling sparse data tables in the lakehouse would also be appreciated. I also want to o...

Latest Reply
Kaniz_Fatma
Community Manager
  • 7 kudos

Hi @Akash Ragothu, we haven't heard from you since the last response from @Ajay Pandey, and I was checking back to see if his suggestions helped you. Otherwise, if you have a solution, please share it with the community, as it can be helpful to othe...

3 More Replies
-werners-
by Esteemed Contributor III
  • 2358 Views
  • 2 replies
  • 17 kudos

Autoloader: how to avoid overlap in files

I'm thinking of using autoloader to process files being put on our data lake. Let's say, for example, every 15 minutes a parquet file is written. These files, however, contain overlapping data. Now, every 2 hours I want to process the new data (autoloader) and...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 17 kudos

What about foreachBatch and then MERGE? Alternatively, run another process that will clean up the updates using the window function, as you said.

1 More Replies
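A hedged sketch of the foreachBatch-plus-MERGE idea, with de-duplication inside each micro-batch; the table name, key column, timestamp column, and paths are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

def upsert_batch(batch_df, batch_id):
    # Keep only the latest record per key within this micro-batch.
    w = Window.partitionBy("key").orderBy(col("event_ts").desc())
    deduped = (batch_df.withColumn("rn", row_number().over(w))
                       .filter("rn = 1")
                       .drop("rn"))

    target = DeltaTable.forName(spark, "silver_events")  # placeholder table
    (target.alias("t")
           .merge(deduped.alias("s"), "t.key = s.key")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "parquet")
          .option("cloudFiles.schemaLocation", "dbfs:/chk/events/schema")  # placeholder
          .load("dbfs:/landing/events/"))                                  # placeholder

(stream.writeStream
       .foreachBatch(upsert_batch)
       .option("checkpointLocation", "dbfs:/chk/events/")
       .trigger(availableNow=True)  # schedule the job every 2 hours
       .start())
```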
elgeo
by Valued Contributor II
  • 961 Views
  • 0 replies
  • 3 kudos

Number of parquet files per delta table

Hello. We would like to understand how many Parquet files are created per Delta table. To be more specific, we refer to the current snapshot of the table. For example, we noticed that while we performed initial inserts to a table, one parquet file was...

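One way to check this is DESCRIBE DETAIL, which reports the file count and size of the current snapshot only (not older files retained for time travel); the table name is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("DESCRIBE DETAIL my_schema.my_table") \
     .select("numFiles", "sizeInBytes") \
     .show()
```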
Data_Engineer3
by Contributor III
  • 5966 Views
  • 4 replies
  • 4 kudos

Resolved! Unable to read file from dbfs location in databricks.

When I tried to read a file from DBFS, it throws an error - Caused by: FileReadException: Error while reading file dbfs:/.......................parquet is not a Parquet file. Expected magic number at tail [80, 65, 82, 49] but found [105, 108, 101, 115]. Bu...

Latest Reply
Kaniz_Fatma
Community Manager
  • 4 kudos

Hi @KARTHICK N, what is the one line of code you're using to read the file, and what precisely is the path? Can you confirm whether your file is a CSV or a Parquet file? Are you trying to read it in Python or Scala?

3 More Replies
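A quick, hedged check that often explains this error: a genuine Parquet file begins and ends with the magic bytes "PAR1" (80, 65, 82, 49), so peeking at the file shows whether it is really Parquet or plain text. The path is a placeholder; dbutils is available in Databricks notebooks.

```python
# Print the first bytes of the suspect file as text.
path = "dbfs:/path/to/suspect_file.parquet"  # placeholder
print(dbutils.fs.head(path, 100))

# If readable CSV/JSON text appears instead of "PAR1" plus binary data,
# the file extension does not match the actual content.
```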
Mayank
by New Contributor III
  • 10347 Views
  • 8 replies
  • 4 kudos

Resolved! Unable to load Parquet file using Autoloader. Can someone help?

I am trying to load Parquet files using Autoloader. Below is the code: def autoload_to_table(data_source, source_format, table_name, checkpoint_path): query = (spark.readStream .format('cloudFiles') .option('cl...

Latest Reply
Anonymous
Not applicable
  • 4 kudos

Hi again @Mayank Srivastava, thank you so much for getting back to us and marking the answer as best. We really appreciate your time. Wish you a great Databricks journey ahead!

7 More Replies
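For comparison, a minimal Auto Loader read of Parquet along the lines of the OP's function; cloudFiles.schemaLocation is required so the inferred schema is tracked across runs. The paths and table name are placeholders, not the OP's exact code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def autoload_to_table(data_source, source_format, table_name, checkpoint_path):
    query = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", source_format)
             .option("cloudFiles.schemaLocation", checkpoint_path)
             .load(data_source)
             .writeStream
             .option("checkpointLocation", checkpoint_path)
             .trigger(availableNow=True)
             .toTable(table_name))
    return query

query = autoload_to_table(
    data_source="dbfs:/landing/parquet/",    # placeholder
    source_format="parquet",
    table_name="bronze_events",              # placeholder
    checkpoint_path="dbfs:/chk/bronze_events/",
)
```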
Eyespoop
by New Contributor II
  • 17549 Views
  • 3 replies
  • 2 kudos

Resolved! PySpark: Writing Parquet Files to the Azure Blob Storage Container

Currently I am having some issues with writing the Parquet file to the storage container. I do have the code running, but whenever the DataFrame writer puts the Parquet into blob storage, instead of the Parquet file type it is created as a f...

Latest Reply
User16764241763
Honored Contributor
  • 2 kudos

Hello @Karl Saycon, can you try setting this config to prevent additional Parquet summary and metadata files from being written? The result of the DataFrame write to storage should be a single file. https://community.databricks.com/s/question/0D53f00001...

2 More Replies
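A hedged sketch of the usual pattern: Spark always writes a directory of part files, so to end up with one cleanly named .parquet blob you typically coalesce to a single partition, suppress the extra marker files, and rename the part file. The Hadoop settings below are the commonly cited ones (verify them for your DBR version); paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Commonly cited settings to skip _SUCCESS and Parquet summary files;
# treat as assumptions to confirm against the thread linked above.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("parquet.enable.summary-metadata", "false")
hconf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

out_dir = "wasbs://container@account.blob.core.windows.net/tmp_out/"  # placeholder
final_path = "wasbs://container@account.blob.core.windows.net/data/result.parquet"

df = spark.table("my_table")  # placeholder source
df.coalesce(1).write.mode("overwrite").parquet(out_dir)

# Copy the single part file to its final name and drop the temp directory.
part = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part, final_path)
dbutils.fs.rm(out_dir, True)
```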
vivek_sinha
by Contributor
  • 7353 Views
  • 4 replies
  • 4 kudos

Resolved! Getting Authentication Error while accessing Azure Blob table (wasb) URL using PySpark

I am trying to access the Azure Blob table using PySpark but am getting an authentication error. Here I am passing a SAS token (HTTP and HTTPS enabled), but it's working only with the WASBS (HTTPS) URL, not with the WASB (HTTP) URL. I even tried with the account key as...

Latest Reply
vivek_sinha
Contributor
  • 4 kudos

Hi @Arvind Ravish, the issue got fixed after passing the HTTP- and HTTPS-enabled token to the Spark executors. Thanks again for your help.

3 More Replies
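For reference, the way a SAS token is usually handed to the cluster (and therefore the executors) for the legacy WASB/WASBS driver is a per-container config; the account, container, and token below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder account/container; per the thread, the SAS token should be
# generated with both HTTP and HTTPS allowed.
spark.conf.set(
    "fs.azure.sas.mycontainer.myaccount.blob.core.windows.net",
    "<sas-token>",
)

df = spark.read.parquet(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/path/to/table/"
)
```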