Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Ancil
by Contributor II
  • 16530 Views
  • 11 replies
  • 1 kudos

Can anyone please suggest how we can effectively loop through a PySpark DataFrame?

Scenario: I have a DataFrame with more than 1000 rows, each row having a file path and a result data column. I need to loop through each row and write a file to the file path, with the data from the result column. What is the easiest and most time-effective way ...

Latest Reply
NhatHoang
Valued Contributor II
  • 1 kudos

Hi, I agree with Werners: try to avoid looping over a PySpark DataFrame. If your DataFrame is small, as you said, only about 1000 rows, you may consider using pandas. Thanks.
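A minimal sketch of that pandas-based approach, assuming the DataFrame is named df and has hypothetical columns file_path and result:

# Collecting to pandas is acceptable here because the DataFrame is small (~1000 rows).
# "file_path" and "result" are assumed column names; adjust them to the real schema.
pdf = df.select("file_path", "result").toPandas()

for row in pdf.itertuples(index=False):
    # Write each row's result data to its own target file.
    with open(row.file_path, "w") as f:
        f.write(str(row.result))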

10 More Replies
pabloaus
by New Contributor III
  • 5678 Views
  • 2 replies
  • 4 kudos

Resolved! How to read a SQL file from a Repo into a string

I am trying to read a SQL file in the repo into a string. I have tried: with open("/Workspace/Repos/xx@***.com//file.sql", "r") as queryFile: queryText = queryFile.read() And I get the following error: [Errno 1] Operation not permitted: '/Workspace/Repos/***@*...

Latest Reply
Senthil1
Contributor
  • 4 kudos

I checked on my Unity Catalog-enabled cluster; I am able to read and display the file in Repos.
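A minimal sketch of reading a Repo file into a string and running it; the path is illustrative, and it assumes the file contains a single SQL statement:

# Illustrative path; replace with the real file location under /Workspace/Repos.
sql_path = "/Workspace/Repos/user@example.com/my-repo/file.sql"

with open(sql_path, "r") as query_file:
    query_text = query_file.read()

# Assumes the file contains a single SQL statement.
result_df = spark.sql(query_text)
result_df.show()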

1 More Replies
shamly
by New Contributor III
  • 5782 Views
  • 9 replies
  • 2 kudos

Resolved! Need to remove a double-dagger delimiter from a CSV using Databricks

My CSV data looks like this: ‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡ I tried this code: dff = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "‡,").csv(f"/mnt/data/path/datafile.csv") But I...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 2 kudos

Hi @shamly pt, I took a somewhat different approach, since I guess no one can be sure of the encoding of the data you showed. Sample data I took: ‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡ ‡‡1‡‡,‡‡121212‡‡,‡‡R‡‡,‡‡1.0A‡‡,‡‡NA‡‡...
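A minimal sketch of one way to strip the ‡‡ quoting, assuming the sample layout above; the path and column list are illustrative:

from pyspark.sql import functions as F

# Read the raw lines, remove the double-dagger characters, then split on commas.
raw = spark.read.text("/mnt/data/path/datafile.csv")
cleaned = raw.select(F.regexp_replace("value", "‡", "").alias("line"))

cols = ["companyId", "empId", "regionId", "companyVersion", "Question"]
parts = F.split("line", ",")
parsed = cleaned.select(*[parts.getItem(i).alias(c) for i, c in enumerate(cols)])

# Drop the header line, which still contains the literal column names.
parsed = parsed.filter(F.col("companyId") != "companyId")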

8 More Replies
Andrei_Radulesc
by Contributor III
  • 2503 Views
  • 2 replies
  • 0 kudos

Terraform can set ALL_PRIVILEGES and USE_CATALOG on catalogs for 'account users', but not SELECT or USE_SCHEMA

Only the GUI seems to allow SELECT and USE_SCHEMA permissions for 'account users' on catalogs. Terraform gives me an error. Here is my Terraform config: resource "databricks_grants" "staging" { provider = databricks.workspace catalog = databricks_catalog....

Latest Reply
Pat
Honored Contributor III
  • 0 kudos

Hi @Andrei Radulescu-Banu, which version of the provider are you using? I checked the GitHub repo and it should work: https://github.com/databricks/terraform-provider-databricks/blob/d65ef3518074a48e079080d94e1ab33a80bf7e0f/catalog/resource_grants.go#L1...

1 More Replies
tom_shaffner
by New Contributor III
  • 12013 Views
  • 6 replies
  • 8 kudos

Resolved! Is there some form of enablement required to use Delta Live Tables (DLT)?

I'm trying to use Delta Live Tables, but even if I import the example notebooks I get `ModuleNotFoundError: No module named 'dlt'`. If I try to install it via pip, it attempts to install a deep learning framework of some sort. I checked ...

Latest Reply
Insight6
New Contributor II
  • 8 kudos

Here's the solution I came up with... Replace `import dlt` at the top of your first cell with the following: try: import dlt # When run in a pipeline, this package will exist (no way to import it here) except ImportError: class dlt...
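A slightly fuller sketch of that fallback pattern; the stub class name is illustrative and only covers the @dlt.table decorator:

try:
    import dlt  # Importable only when the notebook runs inside a DLT pipeline.
except ImportError:
    class _DltStub:
        # Minimal stand-in so the notebook can still be edited and run interactively.
        def table(self, *args, **kwargs):
            # Support both @dlt.table and @dlt.table(name=...) usage.
            if len(args) == 1 and callable(args[0]) and not kwargs:
                return args[0]
            def _passthrough(fn):
                return fn
            return _passthrough

    dlt = _DltStub()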

5 More Replies
dineshg
by New Contributor III
  • 3826 Views
  • 3 replies
  • 6 kudos

Resolved! pyspark - execute dynamically framed action statement stored in string variable

I need to execute a union statement which is framed dynamically and stored in a string variable. I framed the union statement, but I am stuck executing it. Does anyone know how to execute a union statement stored in a string variable? I'm using p...

Latest Reply
Shalabh007
Honored Contributor
  • 6 kudos

@Dineshkumar Gopalakrishnan, Python's exec() function can be used to execute a Python statement, which in your case could be the PySpark union statement. Refer to the sample code snippet below for reference: df1 = spark.sparkContext.parallelize([(1, 2...
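A small sketch of the exec() approach with illustrative DataFrames; passing an explicit namespace keeps the dynamically created name easy to retrieve:

df1 = spark.createDataFrame([(1, "a")], ["id", "val"])
df2 = spark.createDataFrame([(2, "b")], ["id", "val"])

# The union statement arrives as a plain string, e.g. built from a loop over table names.
union_stmt = "result_df = df1.union(df2)"

namespace = {"df1": df1, "df2": df2}
exec(union_stmt, namespace)          # Executes the statement inside the namespace dict.
namespace["result_df"].show()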

2 More Replies
BearInTheWoods
by New Contributor III
  • 2699 Views
  • 1 reply
  • 4 kudos

Importing Azure SQL data into Databricks

Hi, I am looking at building a data warehouse using Databricks. Most of the data will come from Azure SQL, and we now have Azure SQL CDC enabled to capture changes. I would also like to import this without paying for additional connectors like Fi...

Latest Reply
ravinchi
New Contributor III
  • 4 kudos

Hi @Bear Woods! Were you able to create DLT tables using the CDC feature from sources like SQL tables? I'm in a similar situation: you need to leverage the apply_changes() function and the create_streaming_live_table() function, but it requires an intermediate...
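A minimal sketch of that apply_changes() flow; the table, key and sequence column names are assumptions, and create_streaming_table is the newer name for the create_streaming_live_table function mentioned above:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="customers_cdc_raw")
def customers_cdc_raw():
    # Intermediate table holding the CDC feed landed from Azure SQL.
    return spark.readStream.table("landing.customers_changes")

dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_raw",
    keys=["customer_id"],                    # business key column(s)
    sequence_by=F.col("change_timestamp"),   # ordering column from the CDC feed
)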

g96g
by New Contributor III
  • 6438 Views
  • 8 replies
  • 0 kudos

Resolved! ADF pipeline fails when passing a parameter to Databricks

I have a project where I have to read data from NetSuite using its API. The Databricks notebook runs perfectly when I manually insert the table names I want to read from the source. I have a dataset (CSV) file in ADF with all the table names that I need to r...

Latest Reply
mcwir
Contributor
  • 0 kudos

Have you tried debugging the JSON payload of the ADF trigger? Maybe it conveys the table names incorrectly.
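For reference, a minimal sketch of how the notebook side usually picks up an ADF base parameter; the widget name table_name is an assumption and must match the parameter key set in the ADF Notebook activity:

# Declare the widget with a default so the notebook also runs interactively.
dbutils.widgets.text("table_name", "")
table_name = dbutils.widgets.get("table_name")

if not table_name:
    raise ValueError("No table name was passed from the ADF pipeline")

print(f"Fetching NetSuite table: {table_name}")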

7 More Replies
Ramabadran
by New Contributor II
  • 11604 Views
  • 3 replies
  • 4 kudos

java.lang.NoClassDefFoundError: scala/Product$class

Hi, I am getting a "java.lang.NoClassDefFoundError: scala/Product$class" error while using Deequ version 1.0.5. Please suggest a fix for this problem or any workaround. Error: Py4JJavaError Traceback (most recent call last) <command-2625366351750561> in...

Latest Reply
mcwir
Contributor
  • 4 kudos

It seems like a Maven issue.

2 More Replies
tanin
by Contributor
  • 2846 Views
  • 4 replies
  • 7 kudos

Does anybody feel that unit tests on Dataset are slow (much slower than RDD)? This is in Scala.

I profiled it, and it seems the slowness comes from Spark planning, especially for more complex jobs (e.g. 100+ joins). Is there a way to speed it up (e.g. by disabling certain optimizations)?

Latest Reply
mcwir
Contributor
  • 7 kudos

I had a similar feeling recently.

3 More Replies
Merchiv
by New Contributor III
  • 3867 Views
  • 3 replies
  • 1 kudos

Resolved! How to use a UUID in a SQL MERGE INTO statement

I have a MERGE INTO statement that I use to update existing entries or create new entries in a dimension table based on a natural business key. When creating new entries, I would also like to create a unique UUID for each entry that I can use to crossr...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

You might want to look into an identity column, which is now possible in Delta Lake: https://www.databricks.com/blog/2022/08/08/identity-columns-to-generate-surrogate-keys-are-now-available-in-a-lakehouse-near-you.html
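A minimal sketch of the identity-column route, with illustrative table and column names and an assumed "updates" source view; the GENERATED ALWAYS AS IDENTITY column is filled in automatically for rows inserted by the MERGE, so it is left out of the INSERT column list:

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        surrogate_id BIGINT GENERATED ALWAYS AS IDENTITY,
        business_key STRING,
        name STRING
    ) USING DELTA
""")

spark.sql("""
    MERGE INTO dim_customer AS t
    USING updates AS s
    ON t.business_key = s.business_key
    WHEN MATCHED THEN UPDATE SET t.name = s.name
    WHEN NOT MATCHED THEN INSERT (business_key, name) VALUES (s.business_key, s.name)
""")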

2 More Replies
KVNARK
by Honored Contributor II
  • 1762 Views
  • 3 replies
  • 11 kudos

Is there any limit on the number of SQL queries in the Databricks SQL workspace?

Is there any limit on the number of SQL queries in the Databricks SQL workspace?

Latest Reply
Rajeev_Basu
Contributor III
  • 11 kudos

The default limit is documented as 1000, though I have never verified this.

2 More Replies
Ajay-Pandey
by Esteemed Contributor III
  • 1797 Views
  • 2 replies
  • 9 kudos

Kafka integration with Databricks

Hi all, I want to integrate Kafka with Databricks. If anyone can share any docs or code, it would help me a lot. Thanks in advance.

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 9 kudos

This is the code that I am using to read from Kafka: inputDF = (spark .readStream .format("kafka") .option("kafka.bootstrap.servers", host) .option("kafka.ssl.endpoint.identification.algorithm", "https") .option("kafka.sasl.mechanism", "PLAIN") .option("ka...
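A fuller sketch of that readStream pattern; the broker host, credentials and topic name are placeholders, and the shaded JAAS class name is the one typically used on Databricks clusters:

input_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-host:9092")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
        'required username="API_KEY" password="API_SECRET";',
    )
    .option("subscribe", "my_topic")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as bytes; cast to strings before downstream parsing.
messages = input_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")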

1 More Replies

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.
