Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

KKo
by Contributor III
  • 7279 Views
  • 3 replies
  • 4 kudos

Resolved! Reading multiple parquet files from same _delta_log under a path

I have a path that contains a _delta_log directory and 3 snappy.parquet files. I am trying to read all of those .parquet files using spark.read.format('delta').load(path), but I always get data from the same single file. Can't I read from all these files? If s...

Latest Reply
KKo
Contributor III
  • 4 kudos

@Werner Stinckens Thanks for the reply and explanation, that was helpful to understand the delta feature.

  • 4 kudos
2 More Replies
SailajaB
by Databricks Partner
  • 6047 Views
  • 5 replies
  • 4 kudos

Resolved! when and otherwise issue

Hi, in our scenario we are reading JSON files as input, and they contain a nested structure. A few of the attributes are of array-of-struct type, where we need to change the names of the nested fields. So we created a new structure and are casting to it. We are facing below pr...

Latest Reply
AmanSehgal
Honored Contributor III
  • 4 kudos

Can you provide the structure that you're using? Also, please share a more elaborate sample input and output.

  • 4 kudos
4 More Replies
SailajaB
by Databricks Partner
  • 22538 Views
  • 4 replies
  • 4 kudos

Unable to mount the blob storage account as soft delete got enabled

Hi Team, when we try to mount or access the blob storage account where soft delete is enabled, it fails with the error below: org.apache.hadoop.fs.FileAlreadyExistsException: Operation failed: "This endpoint does not support BlobStorageEvents or So...

Latest Reply
-werners-
Esteemed Contributor III
  • 4 kudos

Jeez, I was planning on enabling soft delete on our ADLS Gen2, but I think I will wait a while after reading this.

  • 4 kudos
3 More Replies
JoeWMP
by New Contributor III
  • 7224 Views
  • 5 replies
  • 1 kudos

Resolved! Databricks Job IDs increasing in massive sequence gaps

Has anyone seen something like this before? Today around midnight, our Job IDs started increasing in increments of quadrillions. Was this a new change to how Job IDs are generated?

Latest Reply
JoeWMP
New Contributor III
  • 1 kudos

Thank you Ravi! Glad that this confirms my understanding

  • 1 kudos
4 More Replies
Edmondo
by New Contributor III
  • 9900 Views
  • 7 replies
  • 3 kudos

Resolved! Limiting parallelism when external APIs are invoked (i.e. mlflow)

We are applying a groupby operation to a pyspark.sql.DataFrame and then training a single model per group, logged to MLflow. We see intermittent failures because the MLflow server replies with a 429 due to too many requests/s. What are the best pract...

Latest Reply
Edmondo
New Contributor III
  • 3 kudos

To me it's already resolved through professional services. The question I do have is how useful is this community if people with the right background aren't here, and if it takes a month to get a no-answer.

  • 3 kudos
6 More Replies
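
For readers arriving at this thread: one common way to cope with HTTP 429 responses from a rate-limited service is to cap concurrency and retry with exponential backoff. The sketch below is a generic, plain-Python illustration and not the resolution reached in this thread. `RateLimitedError`, `call_with_backoff`, `limited_call`, and `flaky_log_call` are hypothetical names, the concurrency limit of 2 is an assumed value, and a real MLflow client call would replace the stand-in function.

```python
import random
import threading
import time


class RateLimitedError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the server."""


def call_with_backoff(fn, max_retries=5, base_delay=0.01, max_delay=1.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitedError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitedError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Double the delay on each attempt, cap it, and add random jitter
            # so that many workers do not retry at the same instant.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))


# Cap how many threads in this process may call the service at once
# (the limit of 2 is an assumption chosen for illustration).
api_slots = threading.BoundedSemaphore(2)


def limited_call(fn):
    """Run fn under the concurrency cap, with backoff on rate-limit errors."""
    with api_slots:
        return call_with_backoff(fn)


# Hypothetical service call that is rejected twice, then succeeds.
attempts = {"count": 0}


def flaky_log_call():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitedError("429 Too Many Requests")
    return "logged"


result = limited_call(flaky_log_call)
```

Note that a semaphore like this only limits calls within a single Python process. In a Spark job, where each executor runs in its own process, the overall request rate would also depend on how many tasks run concurrently.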
thushar
by Databricks Partner
  • 7047 Views
  • 5 replies
  • 3 kudos

Resolved! dataframe.rdd.isEmpty() is throwing error in 9.1 LTS

I loaded a CSV file with five columns into a dataframe, and then added around 15+ columns using the dataframe.withColumn method. After adding this many columns, running df.rdd.isEmpty() throws the error below: org.apache.spark.SparkExc...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@Thushar R - Thank you for your patience. We are looking for the best person to help you.

  • 3 kudos
4 More Replies
hari
by Contributor
  • 3934 Views
  • 3 replies
  • 3 kudos

Resolved! Multi-cluster write for delta tables with s3 as the datastore

Does Delta currently support multi-cluster writes to a Delta table in S3? I see in the Databricks documentation that Databricks doesn't support writing to the same table from multiple Spark drivers, and thus from multiple clusters. But S3Guard was also added...

Latest Reply
nastasiya09
New Contributor II
  • 3 kudos

that's really good post for memobdroverizon wifi

  • 3 kudos
2 More Replies
tonykun
by New Contributor
  • 4864 Views
  • 0 replies
  • 0 kudos

A dumb general question - why does Databricks not support a Java REPL?

I'm new to the programming world and have a strong interest in data engineering and Databricks technology. I've tried this product; the UI, notebooks, and DBFS are very user-friendly and powerful. Recently, a doubt came to my mind why Databricks doesn't s...

GMO
by New Contributor III
  • 3936 Views
  • 4 replies
  • 1 kudos

Resolved! Trigger.AvailableOnce in Pyspark?

There’s a new Trigger.AvailableOnce option in runtime 10.1 that we need in order to process a large folder bit by bit using Autoloader. But I don’t see how to engage this from PySpark. Is this accessible from Scala only, or is it available in PySpark? Thanks...

Latest Reply
pottsork
Databricks Partner
  • 1 kudos

Any update on this issue? I can see that one can use .trigger(availableNow=True) in DBR 10.3 (on Azure Databricks). Unfortunately I can't get it to work with Autoloader. Is this supported? Additionally, can't find any answers when skimming through ...

  • 1 kudos
3 More Replies
enichante
by New Contributor
  • 4898 Views
  • 4 replies
  • 5 kudos

Resolved! Databricks: Report on SQL queries that are being executed

We have a SQL workspace with a cluster running that services a number of self-service reports against a range of datasets. We want to be able to analyse and report on the queries our self-service users are executing so we can get better visibility of...

Latest Reply
Anonymous
Not applicable
  • 5 kudos

Looks like the people have spoken: API is your best option! (thanks @Werner Stinckens, @Chris Grabiel and @Bilal Aslam!) @eni chante Let us know if you have questions about the API! If not, please mark one of the replies above as the "best answ...

  • 5 kudos
3 More Replies
cristianc
by Contributor
  • 6528 Views
  • 2 replies
  • 2 kudos

Resolved! Is VACUUM operation recorded in the history of the delta table?

Greetings, I have tried using Spark with DBR 9.1 LTS to run VACUUM on my Delta table and then DESCRIBE HISTORY to see the operation, but the VACUUM operation was apparently not in the history, despite what is stated in the documentation at: https://do...

Latest Reply
cristianc
Contributor
  • 2 kudos

That makes sense, thanks for the reply!

  • 2 kudos
1 More Replies
adnanzak
by New Contributor II
  • 4285 Views
  • 3 replies
  • 0 kudos

Resolved! Deploy Databricks Machine Learning Models On Power BI

Hi Guys. I've implemented a Machine Learning model on Databricks and have registered it with a Model URL. I wanted to enquire if I could use this model on Power BI. Basically the model predicts industries based on client demographics. Ideally I would...

Latest Reply
adnanzak
New Contributor II
  • 0 kudos

Thank you @Werner Stinckens and @Joseph Kambourakis for your replies.

  • 0 kudos
2 More Replies
DarshilDesai
by New Contributor II
  • 15751 Views
  • 1 replies
  • 3 kudos

Resolved! How to Efficiently Read Nested JSON in PySpark?

I am having trouble efficiently reading and parsing a large number of stream files in PySpark! Context: Here is the schema of the stream file that I am reading in JSON. Blank spaces are edits for confidentiality purposes. root |-- location_info: ar...

Latest Reply
Chris_Shehu
Valued Contributor III
  • 3 kudos

I'm interested in seeing what others have come up with. Currently I'm using json_normalize(), then taking any remaining nested structures and using a loop to pull them out and recombine them.

  • 3 kudos
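
The "pull out nested parts and recombine" idea from the reply above can be sketched in plain Python as a recursive flatten. This is only an illustration of the general technique, not the poster's code and not a PySpark solution. The `flatten` helper is hypothetical, and every field in the sample record other than `location_info` is made up, since the original schema is truncated.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted/indexed paths."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(child, child_prefix))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        # Leaf value (string, number, bool, or None): record it under its path.
        flat[prefix] = value
    return flat


# Hypothetical record; only the name "location_info" comes from the thread.
record = json.loads('{"id": 7, "location_info": [{"lat": 1.5, "tags": ["a", "b"]}]}')
flat = flatten(record)
```

With this sketch, empty lists and empty dicts contribute no keys at all, which may or may not be what a given pipeline wants.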
umair
by New Contributor
  • 3463 Views
  • 1 replies
  • 1 kudos

Resolved! Cannot Reproduce Result scikit-learn random forest

I'm running some machine learning experiments in Databricks. For the random forest algorithm, each time I restart the cluster the training output changes, even though the random state is set. Does anyone have a clue about this issue? Note: I tried the sam...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

RF is non-deterministic by its nature. However, as you mentioned, you can control this by using random_state. This will guarantee a deterministic result ON A CERTAIN SYSTEM, but not necessarily across systems. Stack Overflow has a topic about this, check it out, very ...

  • 1 kudos