Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Edmondo
by New Contributor III
  • 7194 Views
  • 7 replies
  • 3 kudos

Resolved! Limiting parallelism when external APIs are invoked (e.g. MLflow)

We are applying a groupBy operation to a pyspark.sql.DataFrame and then training a single model with MLflow for each group. We see intermittent failures because the MLflow server replies with a 429 due to too many requests per second. What are the best pract...

Latest Reply
Edmondo
New Contributor III
  • 3 kudos

For me it's already resolved through professional services. The question I do have is how useful this community is if people with the right background aren't here, and if it takes a month to get a non-answer.

6 More Replies
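A minimal sketch of two common mitigations for this pattern, assuming a groupBy().applyInPandas() training loop; df, group_id, train_one_group, and result_schema are hypothetical names, and the thread's actual resolution (handled privately via professional services) is not shown:

    import time

    import mlflow
    from mlflow.exceptions import MlflowException

    # Cap parallelism: applyInPandas runs one task per shuffle partition,
    # so fewer shuffle partitions means fewer concurrent MLflow callers.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    # Retry with exponential backoff when the tracking server throttles (429).
    def log_with_backoff(metrics, retries=5):
        for attempt in range(retries):
            try:
                mlflow.log_metrics(metrics)
                return
            except MlflowException:
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
        raise RuntimeError("MLflow still throttling after retries")

    result = df.groupBy("group_id").applyInPandas(train_one_group, schema=result_schema)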
thushar
by Contributor
  • 5313 Views
  • 5 replies
  • 3 kudos

Resolved! dataframe.rdd.isEmpty() is throwing error in 9.1 LTS

Loaded a CSV file with five columns into a dataframe, then added around 15 more columns using the dataframe.withColumn method. After adding that many columns, running df.rdd.isEmpty() throws the error below: org.apache.spark.SparkExc...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@Thushar R​ - Thank you for your patience. We are looking for the best person to help you.

4 More Replies
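Two workarounds often suggested for this pattern, sketched under the assumption of DBR 9.1 (Spark 3.1, which predates DataFrame.isEmpty()); the extra columns are hypothetical:

    from pyspark.sql import functions as F

    # 1) Check emptiness without converting to an RDD.
    if len(df.take(1)) == 0:
        print("dataframe is empty")

    # 2) Add many columns in a single select() instead of chained
    #    withColumn() calls, which keeps the logical plan far smaller.
    new_cols = [F.lit(None).cast("string").alias(f"extra_{i}") for i in range(15)]
    df = df.select("*", *new_cols)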
hari
by Contributor
  • 3004 Views
  • 3 replies
  • 3 kudos

Resolved! Multi-cluster writes for Delta tables with S3 as the datastore

Does Delta currently support multi-cluster writes to a Delta table in S3? I see in the Databricks documentation that Databricks doesn't support writing to the same table from multiple Spark drivers, and thus multiple clusters. But S3Guard was also added...

2 More Replies
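For open-source Delta Lake (1.2+), multi-cluster S3 writes go through a DynamoDB-backed LogStore. A configuration sketch, assuming the delta-storage-s3-dynamodb artifact is on the classpath; the DynamoDB table name and region are placeholders (Databricks clusters coordinate S3 commits through their own service and don't need this):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.delta.logStore.s3.impl", "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
        .getOrCreate()
    )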
tonykun
by New Contributor
  • 4387 Views
  • 0 replies
  • 0 kudos

A dumb general question - why doesn't Databricks support a Java REPL?

I'm a new student to the programming world with a strong interest in data engineering and Databricks technology. I've tried this product; the UI, notebooks, and DBFS are very user-friendly and powerful. Recently, a doubt came to my mind: why Databricks doesn't s...

GMO
by New Contributor III
  • 3173 Views
  • 4 replies
  • 1 kudos

Resolved! Trigger.AvailableOnce in Pyspark?

There’s a new Trigger.AvailableOnce option in runtime 10.1 that we need in order to process a large folder bit by bit using Auto Loader, but I don’t see how to engage this from PySpark. Is this accessible from Scala only, or is it available in PySpark? Thanks...

Latest Reply
pottsork
New Contributor II
  • 1 kudos

Any update on this issue? I can see that one can use .trigger(availableNow=True) in DBR 10.3 (on Azure Databricks)... Unfortunately I can't get it to work with Auto Loader. Is this supported? Additionally, I can't find any answers when skimming through ...

3 More Replies
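A minimal PySpark sketch of the availableNow trigger with Auto Loader, assuming DBR 10.3+; the paths and table name are placeholders:

    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
        .load("/mnt/landing/big-folder")
    )

    (stream.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/big-folder")
        .trigger(availableNow=True)  # process everything available, then stop
        .toTable("bronze.big_folder"))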
enichante
by New Contributor
  • 3925 Views
  • 4 replies
  • 5 kudos

Resolved! Databricks: Report on SQL queries that are being executed

We have a SQL workspace with a running cluster that services a number of self-service reports against a range of datasets. We want to be able to analyse and report on the queries our self-service users are executing, so we can get better visibility of...

Latest Reply
Anonymous
Not applicable
  • 5 kudos

Looks like the people have spoken: the API is your best option! (Thanks @Werner Stinckens​, @Chris Grabiel​, and @Bilal Aslam​!) @eni chante​ Let us know if you have questions about the API! If not, please mark one of the replies above as the "best answ...

3 More Replies
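A sketch of the API route: pull recent queries from the SQL Query History endpoint (GET /api/2.0/sql/history/queries); the workspace host and token are placeholders:

    import requests

    resp = requests.get(
        "https://<workspace-host>/api/2.0/sql/history/queries",
        headers={"Authorization": "Bearer <personal-access-token>"},
        params={"max_results": 100},
    )
    resp.raise_for_status()
    for q in resp.json().get("res", []):
        print(q["user_name"], q["status"], q["query_text"][:80])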
cristianc
by Contributor
  • 5404 Views
  • 2 replies
  • 2 kudos

Resolved! Is VACUUM operation recorded in the history of the delta table?

Greetings, I have tried using Spark with DBR 9.1 LTS to run VACUUM on my Delta table, then DESCRIBE HISTORY to see the operation, but apparently the VACUUM operation was not in the history, despite what is stated in the documentation at: https://do...

Latest Reply
cristianc
Contributor
  • 2 kudos

That makes sense, thanks for the reply!

1 More Replies
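Newer Delta Lake releases do record VACUUM START and VACUUM END commits in the table history; a quick check, sketched with a placeholder table name:

    spark.sql("VACUUM my_db.my_table RETAIN 168 HOURS")

    (spark.sql("DESCRIBE HISTORY my_db.my_table")
        .filter("operation LIKE 'VACUUM%'")
        .select("version", "timestamp", "operation")
        .show(truncate=False))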
adnanzak
by New Contributor II
  • 3484 Views
  • 3 replies
  • 0 kudos

Resolved! Deploy Databricks Machine Learning Models on Power BI

Hi guys. I've implemented a machine learning model on Databricks and registered it with a model URL. I wanted to enquire whether I could use this model in Power BI. Basically, the model predicts industries based on client demographics. Ideally I would...

Latest Reply
adnanzak
New Contributor II
  • 0 kudos

Thank you @Werner Stinckens​  and @Joseph Kambourakis​  for your replies.

2 More Replies
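One route is to put the registered model behind a serving endpoint and have Power BI call it over REST (for example via Web.Contents() in Power Query). A sketch of the scoring call; the host, endpoint name, token, and feature columns are all hypothetical:

    import requests

    payload = {"dataframe_split": {
        "columns": ["age", "region", "revenue"],
        "data": [[42, "EMEA", 1.2e6]],
    }}
    resp = requests.post(
        "https://<workspace-host>/serving-endpoints/<endpoint-name>/invocations",
        headers={"Authorization": "Bearer <token>"},
        json=payload,
    )
    print(resp.json())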
DarshilDesai
by New Contributor II
  • 14398 Views
  • 1 reply
  • 3 kudos

Resolved! How to Efficiently Read Nested JSON in PySpark?

I am having trouble efficiently reading and parsing a large number of stream files in PySpark! Context: here is the schema of the stream file that I am reading in JSON. Blank spaces are edits for confidentiality purposes. root |-- location_info: ar...

Latest Reply
Chris_Shehu
Valued Contributor III
  • 3 kudos

I'm interested in seeing what others have come up with. Currently I'm using pandas json_normalize(), then taking any additional nested structures and using a loop to pull them out and re-combine them.

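A common Spark-native alternative to pandas-side flattening: supply an explicit schema (skipping inference across many files) and explode the nested array. The field names below are hypothetical stand-ins for the redacted schema:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    schema = StructType([
        StructField("location_info", ArrayType(StructType([
            StructField("city", StringType()),
            StructField("zip", StringType()),
        ]))),
    ])

    df = spark.read.schema(schema).json("/mnt/landing/stream-files/")
    flat = df.select(F.explode("location_info").alias("loc")).select("loc.*")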
umair
by New Contributor
  • 2726 Views
  • 1 reply
  • 1 kudos

Resolved! Cannot reproduce results with scikit-learn random forest

I'm running some machine learning experiments in Databricks. For the random forest algorithm, each time I restart the cluster the training output changes, even though the random state is set. Does anyone have a clue about this issue? Note: I tried the sam...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

RF is non-deterministic by its nature. However, as you mentioned, you can control this by using random_state. This will guarantee a deterministic result ON A CERTAIN SYSTEM, but not necessarily across systems. Stack Overflow has a topic about this; check it out, very ...

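A minimal illustration of the point: pinning random_state makes reruns repeatable in one environment, though results can still drift across library versions or hardware:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=1)
    clf.fit(X, y)
    print(clf.score(X, y))  # identical on reruns in the same environment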
Anonymous
by Not applicable
  • 2801 Views
  • 1 reply
  • 2 kudos

Issue in creating workspace - Custom AWS Configuration

We have tried to create a new workspace using "Custom AWS Configuration" with our own VPC (customer-managed VPC), but the workspace failed to launch. We are getting the error below and couldn't understand where the issue is. Workspace...

Latest Reply
Mitesh_Patel
New Contributor III
  • 2 kudos

I'm also getting the same issue. I'm trying to create an E2 workspace using Terraform with a customer-managed VPC in us-east-1 (using private subnets for 1a and 1b). We have one network rule attached to our subnets that looks like this:  Similar question ...

BasavarajAngadi
by Contributor
  • 4137 Views
  • 7 replies
  • 9 kudos

Resolved! Hi experts, I am new to Databricks. I want to know how to copy PySpark data into Databricks SQL Analytics?

If we use two different clusters, one for PySpark transformation code and one for SQL analytics, how do we make the permanent tables derived from the PySpark code available for running queries in Databricks SQL Analytics?

Latest Reply
BasavarajAngadi
Contributor
  • 9 kudos

@Aman Sehgal​ Can we write data from the Data Engineering workspace to a SQL endpoint in Databricks?

6 More Replies
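The usual pattern is to persist the transformed DataFrame as a metastore table, which any SQL endpoint on the same metastore can then query; the database and table names are placeholders:

    (df.write
       .format("delta")
       .mode("overwrite")
       .saveAsTable("analytics.customer_metrics"))

After this, SELECT * FROM analytics.customer_metrics works from Databricks SQL without copying data between clusters.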
Users-all
by New Contributor
  • 2763 Views
  • 0 replies
  • 0 kudos

xml module not found error

ModuleNotFoundError: No module named 'com.databricks.spark.xml'. I'm using Azure Databricks, and I've added what I think is the correct library: Status: Installed, Coordinate: com.databricks:spark-xml_2.12:0.13.0.

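The error usually means the package is being imported as a Python module; spark-xml is a JVM data source, so there is nothing to import in Python, and it is used through the DataFrame reader instead. A sketch with the rowTag and path as placeholders:

    df = (spark.read.format("com.databricks.spark.xml")
          .option("rowTag", "record")
          .load("dbfs:/mnt/landing/data.xml"))
    df.printSchema()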
alejandrofm
by Valued Contributor
  • 3053 Views
  • 3 replies
  • 1 kudos

Resolved! Recommendations to execute OPTIMIZE on tables

Hi, I have Databricks running on AWS and I'm looking for a way to know when it is a good time to run OPTIMIZE on partitioned tables. Taking into account that it's an expensive process, especially on big tables, how could I know if it's a good time to run it ...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Alejandro Martinez​ - If Jose's answer resolved your question, would you be happy to mark his answer as best? That helps other members find the answer more quickly.

2 More Replies
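One heuristic sketch: read numFiles and sizeInBytes from DESCRIBE DETAIL and run OPTIMIZE when the average file size drops below a threshold; the table name, Z-ORDER column, and threshold are hypothetical:

    detail = spark.sql("DESCRIBE DETAIL sales.events").first()
    avg_file_mb = detail["sizeInBytes"] / max(detail["numFiles"], 1) / 1024**2
    if avg_file_mb < 32:  # many-small-files signal
        spark.sql("OPTIMIZE sales.events ZORDER BY (event_date)")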
