Hi, is there any function in PySpark which can convert flattened JSON to nested JSON? For example, if the flattened attribute is a_b_c: 23, then unflattened it should be {"a":{"b":{"c":23}}}. Thank you.
As @Chuck Connell​ said, can you share more of your source JSON? That example is not valid JSON. Additionally, "flatten" usually means changing something like {"status": {"A": 1, "B": 2}} to {"status.A": 1, "status.B": 2}, which can be done easily with spark da...
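For the reverse direction there is no single built-in PySpark function; a minimal plain-Python sketch, assuming underscore-separated keys as in the question (the helper name `unflatten` is hypothetical):

```python
def unflatten(flat, sep="_"):
    """Turn {"a_b_c": 23} into {"a": {"b": {"c": 23}}}."""
    nested = {}
    for key, value in flat.items():
        parts = key.split(sep)
        node = nested
        for part in parts[:-1]:        # walk/create intermediate dicts
            node = node.setdefault(part, {})
        node[parts[-1]] = value        # set the leaf value
    return nested
```

Note that underscore separators are ambiguous when field names themselves contain underscores, which is one reason a dot separator is more common for flattened keys.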
I have observed very strange behavior with some of our integration pipelines. This week one of the CSV files was getting broken when read with the read function given below.

def ReadCSV(files,schema_struct,header,delimiter,timestampformat,encode="utf8...
Hello, we are new to Databricks and we would like to know if our working method is good. Currently, we are working like this:

spark.sql("CREATE TABLE Temp (SELECT avg(***), sum(***) FROM aaa LEFT JOIN bbb WHERE *** >= ***)")

With this method, are we us...
Spark will handle the map/reduce for you. So as long as you use Spark-provided functions, be it in Scala, Python, or SQL (or even R), you will be using distributed processing. You just care about what you want as a result. And afterwards, when you are more...
Hi all, I have custom code (PySpark & Spark SQL notebooks) which I want to deploy at a customer location and encapsulate so that end customers don't see the actual code. Currently we have all code in notebooks (PySpark/Spark SQL). Could you please l...
With notebooks that is not possible. You can write your code in Scala/Java and build a jar, which you then run with spark-submit (example). Or use Python and deploy a wheel (example). This can become quite complex when you have dependencies. Also: a jar et...
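A rough sketch of the two deployment routes mentioned above; all file, class, and package names here are placeholders, not from the original posts:

```shell
# Route 1: compile Scala/Java to a jar and run it with spark-submit
# (class and jar names are hypothetical)
spark-submit --class com.example.Main --master yarn my-job.jar

# Route 2: package the Python code as a wheel, then ship it at submit time
python -m build --wheel   # produces dist/my_job-0.1.0-py3-none-any.whl
spark-submit --py-files dist/my_job-0.1.0-py3-none-any.whl entry_point.py
```

Both routes distribute compiled or packaged artifacts rather than notebook source, which is what keeps the code out of the customer's sight.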
Hello. I cannot sign up for access to the Community Edition. I always get an error message: "CAPTCHA error... contact our sales team". I do not have this issue if I try to create a trial account for Databricks hosted on AWS. However, I do not have...
Hi, I am trying to load a pre-trained word2vec model which has been saved in .w2v format in Databricks. I am not able to load this file. Could you help me with the correct command?
Hi @sonam de​, to save models, use the MLflow functions log_model and save_model. You can also save models using their native APIs onto the Databricks File System (DBFS). For MLlib models, use ML Pipelines. To export models for serving individual predict...
I'm a beginner working with Spark SQL in the Java API. I have a dataset with duplicate clients grouped by ENTITY and DOCUMENT_ID, like this:

.withColumn("ROWNUMBER", row_number().over(Window.partitionBy("ENTITY", "ENTITY_DOC").orderBy("ID")))

I added a ROWN...
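The row_number-over-window pattern above, followed by filtering on ROWNUMBER = 1, is the standard dedup recipe. The same logic can be sketched in plain Python (column names taken from the question; the data and helper name are invented for illustration):

```python
def keep_first_per_group(rows):
    """Mimic row_number().over(Window.partitionBy("ENTITY", "ENTITY_DOC")
    .orderBy("ID")) followed by keeping only rows where ROWNUMBER == 1."""
    seen = set()
    result = []
    for row in sorted(rows, key=lambda r: r["ID"]):   # orderBy("ID")
        key = (row["ENTITY"], row["ENTITY_DOC"])      # partition key
        if key not in seen:                           # ROWNUMBER == 1
            seen.add(key)
            result.append(row)
    return result
```

In Spark itself the equivalent last step is a filter on the ROWNUMBER column followed by dropping it.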
I have a table on which I do an upsert, i.e. MERGE INTO table_name ..., after which I run OPTIMIZE table_name, which throws an error:

java.util.concurrent.ExecutionException: io.delta.exceptions.ConcurrentDeleteReadException: This transaction attempted to read...
You can try to change the isolation level: https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/isolation-level. In a merge, it is good to specify all partitions in the merge conditions. The error can also happen when the script is running concurrently.
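A hedged sketch of what "specify all partitions in the merge conditions" can look like, assuming a Delta table partitioned by a date column (table and column names here are hypothetical):

```sql
-- Pinning the partition in the ON clause narrows the files the MERGE
-- transaction reads, reducing conflicts with a concurrent OPTIMIZE.
MERGE INTO target t
USING updates u
  ON t.event_date = u.event_date        -- partition column in the condition
 AND t.event_date = '2022-01-15'        -- pin the partition being touched
 AND t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```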
In JupyterLab notebooks, we can: in edit mode, press Ctrl+Shift+Minus to split the current cell into two at the cursor position; in command mode, press A or B to add a cell above or below the current cell. Are there equivalent shortcuts...
I am trying to do the "Developer Foundations Capstone". The first step, as per the video, asks us to get the "Registration ID" from the LMS email and plug it into the notebook once you import the DBC. Two problems: #1 - I cannot locate any Registration ID at al...
Hi @Al Jo​, I've informed the concerned team so they can act on the observation you mentioned. I'm sure they'll hop in here soon to fix your problem. We'd like to thank you again for taking the time to write this review of our new LMS. All fee...
Hi all, I was wondering if there is any built-in function or code that I could use to convert a week-of-year integer (i.e. 1 to 52) into a value representing the month (i.e. 1-12)? The assumption is that a week starts on a Monday and ends on a...
We need the old parser, as the new one doesn't support weeks. Then we can map what we need using w (week of year) and u (day of the week, Monday first):

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
spark.sql("""
SELECT
  extract(
    month from to_date...
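The week-to-month mapping above can be sanity-checked in plain Python using the ISO-week strptime directives, which match the Monday-start assumption in the question (the helper name is illustrative):

```python
from datetime import datetime

def week_of_year_to_month(year, week):
    """Map an ISO week number (1-52/53) to the month of that week's Monday."""
    # %G = ISO year, %V = ISO week, %u = ISO weekday (1 = Monday)
    return datetime.strptime(f"{year}-W{week}-1", "%G-W%V-%u").month
```

For example, ISO week 1 of 2022 starts on Monday 3 January, so it maps to month 1.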
I converted a data job from RDD to Dataset, and I've found that, in prod, the job runs faster, which is nice. But the unit tests run 3x slower than before. My best guess is that Dataset spends time doing a lot of things like encoding, optimizing, query...
Is there a Spark command in Databricks that will tell me which Databricks workspace I am using? I'd like to parameterize my code so that I can update Delta Lake file paths automatically depending on the workspace (i.e. it picks up the dev workspace na...