Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

laus
by New Contributor III
  • 26256 Views
  • 3 replies
  • 2 kudos

Resolved! Getting "Py4JJavaError: An error occurred while calling o5082.csv." when trying to save to a CSV file

Hi, I'm trying to save a DataFrame to CSV with the code below: output.coalesce(1).write.mode('overwrite').option('header', 'true').csv(tmp_file_path) but I get a "Py4JJavaError: An error occurred while calling o5082.csv." error. Any idea how to solve...

Rahul_Samant
by Contributor
  • 16334 Views
  • 4 replies
  • 4 kudos

Resolved! Bucketing on Delta Tables

I'm getting the error below while creating buckets on a Delta table. Error in SQL statement: AnalysisException: Delta bucketed tables are not supported. I have had to fall back to a Parquet table because of this for some use cases. Is there any alternative for this? I have...

Latest Reply
Anonymous
Not applicable
  • 4 kudos

Hi @Rahul Samant, we checked internally on this. Due to certain limitations, bucketing is not supported on Delta tables. The only alternative to bucketing is to leverage Z-ordering; below is the link for reference: https://docs.databricks.com/de...

3 More Replies
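
The Z-ordering alternative mentioned in the reply above can be sketched as follows. This is a minimal illustration, not the exact statement from the linked documentation: the table name `sales.transactions` and the column names are hypothetical, and the helper `zorder_statement` is introduced here only for the example. It builds the Delta Lake `OPTIMIZE ... ZORDER BY` SQL text; running it needs a Databricks cluster.

```python
import re

# Plain or dotted identifiers only, so the generated SQL stays predictable.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def zorder_statement(table, columns):
    """Build an OPTIMIZE ... ZORDER BY statement for a Delta table."""
    if not columns:
        raise ValueError("at least one Z-order column is required")
    for name in [table, *columns]:
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# Hypothetical table and columns, chosen only for illustration.
statement = zorder_statement("sales.transactions", ["customer_id", "transaction_date"])
print(statement)

# On a Databricks cluster (assumption: `spark` is the active SparkSession):
# spark.sql(statement)
```

Z-ordering colocates related values in the same files, so it is usually applied to the columns that would otherwise have been the bucketing or join keys.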
Michael_Galli
by Databricks Partner
  • 6827 Views
  • 3 replies
  • 2 kudos

Resolved! Spark Streaming - only process new files in streaming path?

In our streaming jobs, we currently run streaming (cloudFiles format) on a directory with sales transactions coming every 5 minutes. In this directory, the transactions are ordered in the following format: <streaming-checkpoint-root>/<transaction_date>...

Latest Reply
Michael_Galli
Databricks Partner
  • 2 kudos

Update: It seems that maxFileAge was not a good idea. The following, with the option "includeExistingFiles" = False, solved my problem: streaming_df = ( spark.readStream.format("cloudFiles") .option("cloudFiles.format", extension) .option("...

2 More Replies
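
The reply above is cut off, so the following is only a sketch of the approach it describes, not the author's full code. It assumes the full Auto Loader option name is `cloudFiles.includeExistingFiles`; the helper names `autoloader_options` and `read_new_files` and the `source_path` parameter are hypothetical. The option dictionary is plain Python; the reader itself needs a Databricks runtime.

```python
def autoloader_options(file_format, include_existing=False):
    """Return Auto Loader reader options as strings."""
    return {
        "cloudFiles.format": file_format,
        # "false" asks Auto Loader to skip files already present in the path.
        "cloudFiles.includeExistingFiles": str(include_existing).lower(),
    }


def read_new_files(spark, source_path, file_format):
    """Open an Auto Loader stream over source_path (Databricks runtime only).

    Formats that rely on schema inference also need a schema or a
    `cloudFiles.schemaLocation` option, which this sketch leaves out.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(file_format))
        .load(source_path)
    )


print(autoloader_options("json"))
```

As far as I know, this option is only evaluated when a stream starts for the first time, so an existing checkpoint would have to be reset for a change to take effect.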
AvijitDey
by New Contributor III
  • 6825 Views
  • 3 replies
  • 4 kudos

Resolved! Azure Databricks SQL bulk insert to AZ SQL

Env: Azure Databricks. Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12). Worker type: 56 GB memory, 2-8 nodes (Standard D13_V2). No. of rows: 2470350, with 115 columns. Size: 2.2 GB. Time taken: approx. 9 min. Python code. What will be the best approach for...

Latest Reply
AvijitDey
New Contributor III
  • 4 kudos

Any further suggestions?

2 More Replies
reedzhang
by New Contributor III
  • 6090 Views
  • 4 replies
  • 3 kudos

Resolved! uninstalled libraries continue to get installed on cluster startup

We have been trying to update some library versions by uninstalling the old versions and installing new ones. However, the old libraries continue to get installed on cluster startup despite not showing up in the "libraries" tab of the cluster page. W...

Latest Reply
reedzhang
New Contributor III
  • 3 kudos

The issue seemed to go away on its own. At some point the libraries page started showing what was getting installed to the cluster, and removing libraries from the page caused them to stop getting installed on cluster startup. I'm guessing there was ...

3 More Replies
tomnguyen_195
by New Contributor III
  • 5029 Views
  • 4 replies
  • 7 kudos

Resolved! Increase input rate in Delta Live Tables

Hi, I need to ingest 60 million JSON files from S3 and have created a Delta Live Tables pipeline to ingest this data into a Delta table with Auto Loader. However, the input rate in my DLT is always around 8 records/second no matter how many workers I add to the DLT...

Latest Reply
Hubert-Dudek
Databricks MVP
  • 7 kudos

Please consider the following:
  • consider having a driver 2 times bigger than the workers,
  • check whether S3 is in the same region and communicates via the private gateway (local IPs),
  • enable S3 transfer acceleration,
  • for ingestion, please use Auto Loader as described here...

3 More Replies
Bill
by New Contributor III
  • 3608 Views
  • 5 replies
  • 2 kudos

Resolved! How to access tables created in 2017

In 2017, while working on my master's degree, I created some tables that I would like to access again. Back then I could just write SQL and find them, but today that doesn't work. I suspect it has something to do with Delta Lake. What do I have to do to...

Latest Reply
Bill
New Contributor III
  • 2 kudos

That did it. Thanks

4 More Replies
Anonymous
by Not applicable
  • 2061 Views
  • 1 replies
  • 1 kudos

Resolved! Unable to start cluster on E2 Workspace

Hello Community, I'm trying to create and start my first cluster on my E2 Databricks Workspace on AWS. The cluster is created, but immediately after STARTING, the cluster status goes to TERMINATING. Logs provided by Databricks show ...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Update: It was an error on my side with the KMS key.

Taha_Hussain
by Databricks Employee
  • 1686 Views
  • 1 replies
  • 6 kudos

Databricks Office Hours Our next Office Hours session is scheduled for May 18th from 8:00 am - 9:00am PT. Do you have questions about how to set up or...

Databricks Office Hours. Our next Office Hours session is scheduled for May 18th from 8:00 am - 9:00 am PT. Do you have questions about how to set up or use Databricks? Do you want to learn more about the best practices for deploying your use case or tip...

Latest Reply
Hubert-Dudek
Databricks MVP
  • 6 kudos

Just registered!

Hubert-Dudek
by Databricks MVP
  • 1898 Views
  • 0 replies
  • 20 kudos

From Databricks runtime 10.5 you can get metadata using the hidden _metadata column. Currently, the column contains input files information (file_path...

From Databricks Runtime 10.5, you can get file metadata using the hidden _metadata column. Currently, the column contains input file information (file_path, file_name, file_size and file_modification_time).
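
A minimal sketch of how the hidden column might be selected, assuming a DataFrame read from files on Databricks Runtime 10.5 or later. The helper names `metadata_columns` and `with_file_metadata` are hypothetical; only the four field names come from the post above.

```python
# Fields the post lists for the hidden _metadata column.
METADATA_FIELDS = ("file_path", "file_name", "file_size", "file_modification_time")


def metadata_columns(fields=METADATA_FIELDS):
    """Return dotted column references such as '_metadata.file_path'."""
    unknown = [field for field in fields if field not in METADATA_FIELDS]
    if unknown:
        raise ValueError(f"not a listed _metadata field: {unknown}")
    return [f"_metadata.{field}" for field in fields]


def with_file_metadata(df, fields=METADATA_FIELDS):
    """Append the chosen file metadata fields to a file-based DataFrame.

    Assumption: `df` comes from spark.read or spark.readStream on a runtime
    that exposes _metadata. The column is hidden, so "*" alone omits it.
    """
    return df.select("*", *metadata_columns(fields))


print(metadata_columns(["file_name", "file_size"]))
```

Because the column is hidden, it has to be named explicitly in the select; it does not show up in a plain `SELECT *`.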

Ashley1
by Contributor
  • 4351 Views
  • 5 replies
  • 1 kudos

Resolved! Can ADLS be mounted in DBFS using only ADLS account key?

I realise this is not an optimal configuration, but I'm trying to pull together a POC and I'm not at the point that I wish to ask the AAD admins to create an application for OAuth authentication. I have been able to use direct references to the ADLS co...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hey there @Ashley Betts, thank you for posting your question, and you found the solution. This is awesome! Would you be happy to mark the answer as best so that other members can find the solution more quickly? Cheers!

4 More Replies
Lincoln_Bergeso
by New Contributor II
  • 11072 Views
  • 8 replies
  • 4 kudos

Resolved! How do I read the contents of a hidden file in a Spark job?

I'm trying to read a file from a Google Cloud Storage bucket. The filename starts with a period, so Spark assumes the file is hidden and won't let me read it. My code is similar to this: from pyspark.sql import SparkSession spark = SparkSession.build...

Latest Reply
Dan_Z
Databricks Employee
  • 4 kudos

I don't think there is an easy way to do this. You would also break very basic functionality (like being able to read Delta tables) if you were able to get around these constraints. I suggest you employ a rename job and then read the renamed file.

7 More Replies
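
The rename step suggested in the reply above might look like the sketch below. It is an illustration on a local filesystem only: the original question concerns a Google Cloud Storage bucket, where the same idea would go through the storage tooling instead. The helper names and the sample file are hypothetical, and the sketch copies rather than renames so the source file stays untouched. It relies on Spark skipping file names that start with "." or "_".

```python
import os
import shutil
import tempfile


def visible_name(filename):
    """Strip leading '.' and '_' so Spark no longer treats the name as hidden."""
    stripped = filename.lstrip("._")
    if not stripped:
        raise ValueError(f"nothing left after stripping prefix: {filename!r}")
    return stripped


def copy_to_visible(src_path, dst_dir):
    """Copy a hidden file into dst_dir under a non-hidden name."""
    os.makedirs(dst_dir, exist_ok=True)
    dst_path = os.path.join(dst_dir, visible_name(os.path.basename(src_path)))
    shutil.copyfile(src_path, dst_path)
    return dst_path


# Demo with a throwaway directory and a made-up hidden file.
with tempfile.TemporaryDirectory() as tmp:
    hidden = os.path.join(tmp, ".daily_export.csv")
    with open(hidden, "w") as handle:
        handle.write("id,amount\n1,9.99\n")
    staged = copy_to_visible(hidden, os.path.join(tmp, "staged"))
    print(os.path.basename(staged))
    # Assumption: `spark` is an active SparkSession.
    # df = spark.read.option("header", "true").csv(staged)
```

Staging the copy in a separate directory keeps the hidden original in place, which matters if other tools depend on the dot-prefixed name.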