Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Spark 3.3.1 supports the brotli compression codec, but when I use it to read parquet files from S3, I get: INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI
Example code:
df = (spark.read.format("parquet")
    .option("compression", "brotli")...
Given the new information I appended, I looked into the Delta caching and found I can disable it: .option("spark.databricks.io.cache.enabled", False). This works as a workaround while I read these files in to save them locally in DBFS, but does it have perfo...
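A minimal sketch of that workaround, assuming a hypothetical s3://my-bucket/brotli-data/ source path and that the cluster allows changing the IO cache config at runtime; it disables the disk cache before reading the brotli-compressed files and rewrites them with a more widely supported codec:

# Disable the Databricks disk (IO) cache so the parquet reader itself decodes
# the brotli pages rather than the cache layer (assumption: this config can be
# changed on a running cluster).
spark.conf.set("spark.databricks.io.cache.enabled", "false")

# Hypothetical paths, for illustration only.
src = "s3://my-bucket/brotli-data/"
dst = "dbfs:/tmp/brotli-recompressed/"

df = spark.read.parquet(src)

# Rewrite with a codec the rest of the platform handles without issue.
df.write.mode("overwrite").option("compression", "snappy").parquet(dst)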
I am doing the "Data Engineering with Databricks V2" learning path. I cannot run "DE 4.2 - Providing Options for External Sources", as the first code cell does not run successfully:
%run ../Includes/Classroom-Setup-04.2
Screenshot 1: Inside the setup note...
Good afternoon,
Attempting to run this statement:
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS dev_user_login (
event_name STRING,
datetime TIMESTAMP,
ip_address STRING,
acting_user_id STRING
)
PARTITIONED BY (date DATE)
STORED AS PARQUET
...
1. Change to the Spark-native catalog approach (not the Hive metastore); this works. The syntax is essentially:
CREATE TABLE IF NOT EXISTS dbName.tableName (
    column names and types
)
USING parquet
PARTITIONED BY (
    runAt STRING
)
LOCA...
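A runnable sketch of that Spark-native syntax, with placeholder table, column, and location names (the partition-column-with-type form mirrors the snippet above):

# Create a parquet-backed table in the Spark catalog (placeholder names/path).
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.user_login_events (
        event_name     STRING,
        event_datetime TIMESTAMP,
        ip_address     STRING,
        acting_user_id STRING
    )
    USING parquet
    PARTITIONED BY (event_date DATE)
    LOCATION 's3://example-bucket/user_login_events/'
""")

# If the location already holds partition directories, register them.
spark.sql("MSCK REPAIR TABLE dev.user_login_events")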
Hi, I have data in parquet format in GCS buckets, partitioned by name, e.g. gs://mybucket/name=ABCD/. I am trying to create a table in Databricks as follows:
DROP TABLE IF EXISTS name_test;
CREATE TABLE name_test
USING parquet
LOCATION "gs://mybucket/name=*/...
Hi @M Baig, the error doesn't tell me much, but you could try:
CREATE TABLE name_test
USING parquet
PARTITIONED BY ( name STRING)
LOCATION "gs://mybucket/";
I have S3 as a data source containing a sample TPC dataset (10G, 100G). I want to convert it into parquet files with an average size of ~256 MiB. What configuration parameter can I use to set that? I also need the data to be partitioned. And withi...
Hi @Vikas Goel, we haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if those suggestions helped you. If you have found a solution, please share it with the community, as it can be helpful to o...
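One common way to approximate a target output file size, sketched below with a hypothetical source path and a date-style partition column; maxRecordsPerFile caps rows per file, so the byte target has to be translated into a row count estimated from the data:

# Hypothetical paths / partition column, for illustration.
src = "s3://tpc-source/store_sales/"
dst = "s3://tpc-parquet/store_sales/"

df = spark.read.parquet(src)

# Derive how many rows fit into a ~256 MiB file from an estimated average row
# size. (Rough heuristic, not an exact guarantee.)
target_file_bytes = 256 * 1024 * 1024
approx_bytes_per_row = 200          # assumed figure; measure on your own data
rows_per_file = target_file_bytes // approx_bytes_per_row

(df.repartition("sold_date")                     # spread work by partition column
   .write
   .mode("overwrite")
   .option("maxRecordsPerFile", rows_per_file)   # cap rows (hence bytes) per file
   .partitionBy("sold_date")                     # directory-style partitioning
   .parquet(dst))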
I am able to encrypt and decrypt the data in multiple ways and can save the encrypted parquet file, but I want to decrypt the data only if the user has a specific permission; otherwise they should get the encrypted data. Is there any permanent solution to de...
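One pattern that fits this requirement is a view that only decrypts for members of a privileged group. A sketch, assuming the table, column, key, and group names below are placeholders and that the runtime provides aes_decrypt and is_member (and that the data was encrypted with the matching aes_encrypt settings):

# Placeholder names throughout; in practice fetch the key from a secret scope
# rather than embedding it in the view text.
spark.sql("""
    CREATE OR REPLACE VIEW customer_data_guarded AS
    SELECT
        customer_id,
        CASE
            WHEN is_member('pii_readers')
                THEN CAST(aes_decrypt(ssn_encrypted, '<key-from-secret-scope>') AS STRING)
            ELSE base64(ssn_encrypted)
        END AS ssn
    FROM customer_data_encrypted
""")

Grant users access to the view only, not the underlying table, so the permission check cannot be bypassed.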
Why do Spark/Delta Lake choose Parquet over the ORC file format? I learnt that ORC is much faster when querying, is more compression-efficient than parquet, and has most of the features parquet has. Why not choose ORC? Am I missing something? Ple...
When querying in Delta I am unable to see the previous partition, whereas when reading the data using the parquet file format it shows the whole partition data column.
Delta format: spark.read.format("delta").load("")
Parquet format: spark.read.parquet("...
Hi @Gaurav Rawat Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...
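The difference comes from the Delta transaction log: format("delta") only surfaces the files referenced by the current table version, while a raw parquet read of the same directory can pick up data files that older versions left behind. A sketch, with a hypothetical path, of the supported way to look at an earlier state:

path = "s3://example-bucket/events_delta/"   # hypothetical table location

# Current snapshot: only files referenced by the latest transaction-log version.
current_df = spark.read.format("delta").load(path)

# Earlier snapshots (and their partitions) via time travel, rather than reading
# the directory as raw parquet.
older_df = spark.read.format("delta").option("versionAsOf", 0).load(path)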
Hi Team, I have a parquet file in an S3 bucket which is a Delta file. I am able to read it, but I am unable to write it as a CSV file. I am getting the following error when I try to write: A transaction log for Databricks Delta was found at `s3://path/a...
Hi @yuvesh kotiala Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...
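That error usually means the path is (or sits inside) a Delta table, so the usual workaround is to read it through the delta format and write the CSV to a separate, non-Delta location. A sketch with placeholder paths:

src = "s3://path/to/delta-table/"   # placeholder for the Delta table path
dst = "s3://path/to/csv-export/"    # placeholder, outside the Delta directory

df = spark.read.format("delta").load(src)

(df.write
   .mode("overwrite")
   .option("header", "true")
   .csv(dst))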
Problem: Reading nearly equivalent parquet tables in a directory, where some have column X with type float and some with type double, fails.
Attempts at resolving: using streaming files; removing delta caching and vectorization; using .cache() explicitly.
Notes: This...
Hi @Erik Louie Help us build a vibrant and resourceful community by recognizing and highlighting insightful contributions. Mark the best answers and show your appreciation! Regards
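One way around a float/double mismatch like this is to read the two groups of files separately, upcast the narrower column to double, and union them; the directory layout below is an assumption for illustration:

from pyspark.sql.functions import col

# Hypothetical sub-paths: one batch written with X as float, the other as double.
float_df  = spark.read.parquet("s3://bucket/table/batch_float/")
double_df = spark.read.parquet("s3://bucket/table/batch_double/")

# Upcast the float column so both sides share one schema, then combine.
unified = (float_df.withColumn("X", col("X").cast("double"))
                   .unionByName(double_df))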
Hi All, I have exported all tables from a Postgres snapshot into S3 in parquet format. I am trying to read a table using Databricks and I am unable to do so. I get the following error: "Unable to infer schema for Parquet. It must be specified manually....
Hi @shiva charan velichala Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that bes...
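That error typically appears when the path Spark lists contains no readable parquet footers (empty directory, wrong prefix, or non-parquet files from the export). A sketch of both checks, with placeholder paths and columns:

from pyspark.sql.types import StructType, StructField, LongType, StringType

path = "s3://exports/public.users/"   # placeholder export prefix

# 1. Confirm the prefix actually contains the parquet data files.
display(dbutils.fs.ls(path))

# 2. If the files are there, supplying a schema skips footer-based inference.
schema = StructType([
    StructField("id",    LongType(),   True),
    StructField("email", StringType(), True),
])
df = spark.read.schema(schema).parquet(path)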
Hi, while developing an ETL for a large dataset I want to get a sample of the top rows to check that the pipeline "just runs", so I add a limit clause when reading the dataset. I'm surprised to see that instead of creating a single task, as in a sho...
It's been a while since the question was asked, and in the meantime Delta Lake 2.2.0 hit the shelves with the exact feature the OP asked about, i.e. LIMIT pushdown: LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT cl...
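For reference, the pattern the pushdown applies to looks like the sketch below (placeholder paths); on Delta Lake 2.2.0+ the scan should only read as much data as is needed to satisfy the limit:

# Placeholder table path; limit(1000) is pushed into the Delta scan on 2.2.0+.
sample_df = (spark.read.format("delta")
                  .load("s3://example-bucket/big_table/")
                  .limit(1000))

sample_df.write.mode("overwrite").parquet("s3://example-bucket/sample_out/")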
Hi Fellas - I'm trying to load parquet data (in a GCS location) into a Postgres DB (Google Cloud). For bulk-uploading data into PG we are using the spark-postgres library: https://framagit.org/interhop/library/spark-etl/-/tree/master/spark-postgres/src/main/sc...
Hi @Kaniz Fatma, @Daniel Sahal - a few updates from my side. After many hits and trials, psycopg2 worked out in my case. We can process 200+ GB of data with a 10-node cluster (n2-highmem-4, 32 GB memory, 4 cores) and a driver with 32 GB memory, 4 cores with Run...
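A hedged sketch of the kind of psycopg2-based bulk load described above, using foreachPartition so each executor opens its own connection; the connection parameters, target table, and column list are placeholders, and values containing tabs/newlines would need escaping:

import io
import psycopg2

def load_partition(rows):
    # One connection per partition; COPY is the fastest bulk path into Postgres.
    conn = psycopg2.connect(host="10.0.0.5", dbname="analytics",
                            user="loader", password="...")   # placeholders
    buf = io.StringIO()
    for row in rows:
        # Tab-separated text matching the target column order; \N marks NULLs.
        buf.write("\t".join("\\N" if v is None else str(v) for v in row) + "\n")
    buf.seek(0)
    with conn, conn.cursor() as cur:
        cur.copy_expert(
            "COPY target_table (col_a, col_b, col_c) FROM STDIN WITH (FORMAT text)",
            buf,
        )
    conn.close()

df = spark.read.parquet("gs://mybucket/exports/")   # placeholder source path
df.foreachPartition(load_partition)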