Databricks Community

sanjay · 08-28-2024

Hi, I am trying to remove duplicate records from a PySpark DataFrame and keep the latest one, but df.dropDuplicates(["id"]) keeps the first record instead of the latest. One option is to use pandas drop_duplicates. Is there any solution in PySpark...
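
The "keep the latest row per id" rule described in this question can be pinned down in plain Python before translating it to Spark. This is a minimal sketch, not the poster's code: the helper `keep_latest`, the sample rows, and the `updated_at` field that defines "latest" are all hypothetical, since the post does not name an ordering column.

```python
from typing import Any, Dict, Hashable, Iterable, List


def keep_latest(
    records: Iterable[Dict[str, Any]], key: str, order_by: str
) -> List[Dict[str, Any]]:
    """Return one record per `key` value: the one with the largest `order_by`."""
    latest: Dict[Hashable, Dict[str, Any]] = {}
    for rec in records:
        current = latest.get(rec[key])
        # Replace the stored record only when this one is strictly newer,
        # so ties keep the first record seen.
        if current is None or rec[order_by] > current[order_by]:
            latest[rec[key]] = rec
    return list(latest.values())


# Hypothetical sample data; `updated_at` is an assumed ordering column.
rows = [
    {"id": 1, "value": "a", "updated_at": "2024-08-01"},
    {"id": 1, "value": "b", "updated_at": "2024-08-15"},
    {"id": 2, "value": "c", "updated_at": "2024-08-10"},
]
deduped = keep_latest(rows, key="id", order_by="updated_at")
```

In PySpark the same rule is commonly expressed with `row_number()` over a window partitioned by `id` and ordered by the timestamp column descending, keeping only the rows numbered 1.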

sanjay · 08-28-2024

Hi, I am using streaming on Unity Catalog tables and trying to limit the number of records read in each batch. Here is my code, but it is not respecting maxFilesPerTrigger and instead reads all available data. (spark.readStream.option("skipChangeCommits",...

sanjay · 07-28-2024

Hi, I am deploying MLflow models using Databricks serverless serving, but it seems the servers scale down to 0 only after 30 minutes of inactivity. Is there any way to reduce this time? Also, is it possible to deploy multiple models under a single endpoint? I want...

sanjay · 06-28-2024

Hi, I have started getting the following error while running jobs in Databricks. They started failing a few days ago. I recently migrated to Unity Catalog; no other change was made. I am running on DBR 13.3 LTS. com.google.common.util.co...

sanjay · 02-19-2024

Hi, I am trying to deploy an MLflow model in SageMaker. My MLflow model is registered in Databricks. I followed the URL below to deploy, and it needs ECR for deployment. For ECR, either I can create a custom image and push it to ECR, or, as mentioned in the URL below, get...

sanjay · 08-29-2024

Thank you Witold, 2 was just an example. I have thousands of files arriving every second and want to limit the files per batch; otherwise the process gets stuck if there are too many files in a given batch. I am able to limit the batch size while running sin...

sanjay · 08-29-2024

I was able to resolve the issue. I am not sure what the cause was; it is working now without any code change.

sanjay · 08-28-2024

My table has multiple rows. For example, take a simple employee table with emp_id and emp_name columns, and use streaming to process any updates to this table. If there are more than 2 inserts, I want to process at most 2 rows at a time.

sanjay · 02-14-2024

Hi @Retired_mod, I would appreciate your help in resolving this issue. Regards, Sanjay

sanjay · 02-12-2024

Thank you @Retired_mod. As I am trying to remove duplicates on only a single column, I am specifying the column name in dropDuplicates. Still, it is very slow. Can you provide more context on the last point, i.e. Streamlining Your Data with Grouping and Aggregatio...

User Activity

Remove duplicate records using pyspark

maxFilesPerTrigger not working while loading data from Unity Catalogue table

how to reduce scale to zero time in MLFlow Serving

Error while running Job sparkSession is null while trying to executeCollectResult

Deploy mlflow model to Sagemaker

Re: maxFilesPerTrigger not working while loading data from Unity Catalogue table

Re: maxFilesPerTrigger not working while loading data from Unity Catalogue table

Re: maxFilesPerTrigger not working while loading data from Unity Catalogue table

Re: Performance issue while calling mlflow endpoint

Re: pyspark dropDuplicates performance issue