cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

MBV3
by New Contributor III
  • 6848 Views
  • 6 replies
  • 7 kudos

Resolved! External table from parquet partition

Hi,I have data in parquet format in GCS buckets partitioned by name eg. gs://mybucket/name=ABCD/I am trying to create a table in Databaricks as followsDROP TABLE IF EXISTS name_test; CREATE TABLE name_testUSING parquetLOCATION "gs://mybucket/name=*/...

  • 6848 Views
  • 6 replies
  • 7 kudos
Latest Reply
Pat
Honored Contributor III
  • 7 kudos

Hi @M Baig​ ,the error doesn't tell me much, but you could try:CREATE TABLE name_test USING parquet PARTITIONED BY ( name STRING) LOCATION "gs://mybucket/";

  • 7 kudos
5 More Replies
johnb1
by New Contributor III
  • 10027 Views
  • 12 replies
  • 7 kudos

Problems with pandas.read_parquet() and path

I am doing the "Data Engineering with Databricks V2" learning path.I cannot run "DE 4.2 - Providing Options for External Sources", as the first code cell does not run successful:%run ../Includes/Classroom-Setup-04.2Screenshot 1: Inside the setup note...

MicrosoftTeams-image MicrosoftTeams-image (1) Capture Capture_2
  • 10027 Views
  • 12 replies
  • 7 kudos
Latest Reply
jonathanchcc
New Contributor III
  • 7 kudos

Thanks for sharing this helped me too 

  • 7 kudos
11 More Replies
f2008700
by New Contributor III
  • 4787 Views
  • 7 replies
  • 7 kudos

Configuring average parquet file size

I have S3 as a data source containing sample TPC dataset (10G, 100G).I want to convert that into parquet files with an average size of about ~256MiB. What configuration parameter can I use to set that?I also need the data to be partitioned. And withi...

  • 4787 Views
  • 7 replies
  • 7 kudos
Latest Reply
Anonymous
Not applicable
  • 7 kudos

Hi @Vikas Goel​ We haven't heard from you since the last response from @Werner Stinckens​ ​, and I was checking back to see if her suggestions helped you.Or else, If you have any solution, please share it with the community, as it can be helpful to o...

  • 7 kudos
6 More Replies
swatish0395
by New Contributor III
  • 623 Views
  • 0 replies
  • 0 kudos

i am working on the parquet file level column encryption and decryption on user specific permission

i am able to encrypt and decrypt the daat in multiple ways and able to save the encrypted parquet file, but i want to decrypt the data if the user has specific permission otherwise he will get the encrypted data,.is there any permanent solution to de...

  • 623 Views
  • 0 replies
  • 0 kudos
Swaroop
by New Contributor
  • 469 Views
  • 0 replies
  • 0 kudos

How to receive data from azure event hub in parquet ?

import asyncioimport osfrom azure.eventhub.aio import EventHubConsumerClientCONNECTION_STR = "Connection_string"EVENTHUB_NAME = "event_hub"async def on_event(partition_context, event):    # Put your code here.    # If the operation is i/o intensive, ...

  • 469 Views
  • 0 replies
  • 0 kudos
Ojas1990
by New Contributor
  • 665 Views
  • 0 replies
  • 0 kudos

Why not choose ORC over Parquet?

What Spark/Delta Lake choose ORC vs Parquet file format? I learnt ORC is much faster when querying, It is much compression efficient than parquet and has most the feature which parquet has on top of it? Why not choose ORC? Am I missing something? Ple...

  • 665 Views
  • 0 replies
  • 0 kudos
Gaurav_784295
by New Contributor III
  • 1160 Views
  • 2 replies
  • 1 kudos

In delta while query on delta unable to see previous partition where as while reading data using parquet file format it is showing whole partition data column .

In delta while query on delta unable to see previous partition where as while reading data using parquet file format it is showing whole partition data column .Delta Format = spark.read.format("delta").load("") Parquet Format ==> spark.read.parquet("...

While reading through parquet Delta_Table_Screenshot
  • 1160 Views
  • 2 replies
  • 1 kudos
Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Gaurav Rawat​ Thank you for posting your question in our community! We are happy to assist you.To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

  • 1 kudos
1 More Replies
uv
by New Contributor II
  • 3009 Views
  • 3 replies
  • 2 kudos

Parquet to csv delta file

Hi Team, I have a parquet file in s3 bucket which is a delta file I am able to read it but I am unable to write it as a csv file.​getting the following error when i am trying to write:​A transaction log for Databricks Delta was found at `s3://path/a...

  • 3009 Views
  • 3 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @yuvesh kotiala​ Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you.Tha...

  • 2 kudos
2 More Replies
Erik_L
by Contributor II
  • 1458 Views
  • 2 replies
  • 1 kudos

Resolved! Pyspark read multiple Parquet type expansion failure

ProblemReading nearly equivalent parquet tables in a directory with some having column X with type float and some with type double fails.Attempts at resolvingUsing streaming filesRemoving delta caching, vectorizationUsing ,cache() explicitlyNotesThis...

  • 1458 Views
  • 2 replies
  • 1 kudos
Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Erik Louie​ Help us build a vibrant and resourceful community by recognizing and highlighting insightful contributions. Mark the best answers and show your appreciation!Regards

  • 1 kudos
1 More Replies
shiva12494
by New Contributor II
  • 1780 Views
  • 2 replies
  • 2 kudos

Issue with reading exported tables stored in parquet

Hi All, I am exported all tables from postgres snapshot into S3 in parquet format. I am trying to read the table using databricks and i am unable to do so. I get the following error: "Unable to infer schema for Parquet. It must be specified manually....

  • 1780 Views
  • 2 replies
  • 2 kudos
Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @shiva charan velichala​ Thank you for posting your question in our community! We are happy to assist you.To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that bes...

  • 2 kudos
1 More Replies
JacintoArias
by New Contributor II
  • 4084 Views
  • 6 replies
  • 2 kudos

Resolved! Spark predicate pushdown on parquet files when using limit

Hi,While developing an ETL for a large dataset I want to get a sample of the top rows to check that my the pipeline "just runs", so I add a limit clause when reading the dataset.I'm surprised to see that instead of creating a single task as in a sho...

  • 4084 Views
  • 6 replies
  • 2 kudos
Latest Reply
JacekLaskowski
New Contributor II
  • 2 kudos

It's been a while since the question was asked, and in the meantime Delta Lake 2.2.0 hit the shelves with the exact feature the OP asked about, i.e. LIMIT pushdown:LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT cl...

  • 2 kudos
5 More Replies
Bie1234
by New Contributor III
  • 1367 Views
  • 2 replies
  • 3 kudos

Resolved! accidently delete paquet file in dbfs

I accidently  delete manual paquet file in dbfs how can I recovery this recovery this file

  • 1367 Views
  • 2 replies
  • 3 kudos
Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 3 kudos

Hi @pansiri panaudom​ ,There is no option restore deleted files in databricks .

  • 3 kudos
1 More Replies
Erik_L
by Contributor II
  • 2139 Views
  • 3 replies
  • 4 kudos

Resolved! Support for Parquet brotli compression or a work around

Spark 3.3.1 supports the brotli compression codec, but when I use it to read parquet files from S3, I get:INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLIExample code:df = (spark.read.format("parquet") .option("compression", "brotli")...

  • 2139 Views
  • 3 replies
  • 4 kudos
Latest Reply
Erik_L
Contributor II
  • 4 kudos

Given the new information I appended, I looked into the Delta caching and I can disable it:.option("spark.databricks.io.cache.enabled", False)This works as a work around while I read these files in to save them locally in DBFS, but does it have perfo...

  • 4 kudos
2 More Replies
explorer
by New Contributor III
  • 2988 Views
  • 6 replies
  • 3 kudos

Getting error while loading parquet data into Postgres (using spark-postgres library) ClassNotFoundException: Failed to find data source: postgres. Please find packages at http://spark.apache.org/third-party-projects.html Caused by: ClassNotFoundException

Hi Fellas - I'm trying to load parquet data (in GCS location) into Postgres DB (google cloud) . For bulk upload data into PG we are using (spark-postgres library)https://framagit.org/interhop/library/spark-etl/-/tree/master/spark-postgres/src/main/sc...

  • 2988 Views
  • 6 replies
  • 3 kudos
Latest Reply
explorer
New Contributor III
  • 3 kudos

Hi @Kaniz Fatma​ , @Daniel Sahal​ - Few updates from my side.After so many hits and trials , psycopg2 worked out in my case.We can process 200+GB data with 10 node cluster (n2-highmem-4,32 GB Memory, 4 Cores) and driver 32 GB Memory, 4 Cores with Run...

  • 3 kudos
5 More Replies
andrew0117
by Contributor
  • 594 Views
  • 1 replies
  • 2 kudos

How to sync the meta store info with the real data for external delta table

if I manually delete some parque files in location which the real data is stored in, so spark catalog still has the old version. How can I sync them?Thanks!

  • 594 Views
  • 1 replies
  • 2 kudos
Latest Reply
youssefmrini
Honored Contributor III
  • 2 kudos

You just need to create a new table and specify the location of the data for your case it's going to be an ADLS, S3...Example​Create table customer using delta location 'mnt/data./'

  • 2 kudos
Labels