Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
We are using the Spark CSV reader to read a CSV file and convert it to a DataFrame, and we are running the job in yarn-client mode; it works fine in local mode.
We are submitting the Spark job from an edge node.
But when we place the file in a local file path instead...
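For context, a minimal sketch of the pattern being described, with hypothetical paths; in yarn-client mode the input must sit on storage the executors can reach (HDFS, S3, ADLS), not only on the edge node's local disk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-read-example").getOrCreate()

# Hypothetical paths: in yarn-client mode the executors cannot read a file that
# exists only on the edge node's local filesystem, so use distributed storage.
df_hdfs = spark.read.option("header", "true").csv("hdfs:///data/input/sample.csv")

# A file:// path only works when every node (or a single local-mode JVM) can see it.
df_local = spark.read.option("header", "true").csv("file:///home/user/sample.csv")

df_hdfs.show(5)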
The difference between Global and Temp is how the lifetime of the view is tied to the application: http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.createOrReplaceTempView.html?highlight=createorreplacetempview#pyspar...
Correct. A Temp View is scoped to the SparkSession and dropped when that session closes. Each notebook runs in its own SparkSession. The Global Temp View is scoped to the cluster and dropped when the cluster restarts or you drop it.
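A short PySpark sketch of the two scopes, using hypothetical view names; note that global temp views are always addressed through the reserved global_temp database:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("view-scope-example").getOrCreate()
df = spark.range(5)

# Session-scoped: visible only in this SparkSession, dropped when it closes.
df.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT * FROM my_temp_view").show()

# Cluster-scoped: visible to all sessions on the cluster until it restarts.
df.createOrReplaceGlobalTempView("my_global_view")
spark.sql("SELECT * FROM global_temp.my_global_view").show()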
Context: I'm using DataFrameWriter to load the dataset into Redshift. DataFrameWriter writes the dataset to S3, then loads the data from S3 into Redshift by issuing the Redshift COPY command. Issue: Frequently we observe that the data is present in t...
Hi @Kishorekumar Somasundaram, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question. Thanks.
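For reference, a hedged sketch of the write path being described, assuming the community spark-redshift connector and entirely hypothetical connection details; the connector stages the data under tempdir on S3 and then issues a COPY into Redshift:

# Hypothetical JDBC URL, table name, and S3 temp directory for illustration only.
(df.write
   .format("io.github.spark_redshift_community.spark.redshift")
   .option("url", "jdbc:redshift://example-cluster:5439/dev?user=me&password=secret")
   .option("dbtable", "public.target_table")
   .option("tempdir", "s3a://my-bucket/redshift-temp/")
   .option("forward_spark_s3_credentials", "true")
   .mode("append")
   .save())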
In Spark, is it possible to create a persistent view on a partitioned parquet file in Azure BLOB? The view must be available when the cluster is restarted, without having to re-create it, hence it cannot be a temp view. I can create a temp view, b...
Here is what worked for me, hope this helps someone else: https://stackoverflow.com/questions/72913913/spark-persistent-view-on-a-partition-parquet-file/72914245#72914245
CREATE VIEW test AS SELECT * FROM parquet.`/mnt/folder-with-parquet-file(s)/`
@Hu...
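The same statement can be issued from a notebook via spark.sql; a minimal sketch with a hypothetical mount path and view name:

# Hypothetical parquet location; the view definition is stored in the metastore,
# so it survives cluster restarts without re-registering anything.
spark.sql("""
  CREATE VIEW IF NOT EXISTS parquet_view AS
  SELECT * FROM parquet.`/mnt/folder-with-parquet-files/`
""")

spark.table("parquet_view").show(5)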
Is there a way to prevent the _SUCCESS and _committed files in my output? It's a tedious task to navigate to all the partitions and delete the files.
Note: the final output is stored in Azure ADLS.
Please find the below steps to remove the _SUCCESS, _committed and _started files. Set spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false") to remove the _SUCCESS file. Run the VACUUM command multiple times until the _committed and _started files...
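A compact sketch of those two steps, assuming the output is a Delta table at a hypothetical ADLS-mounted path:

# Stop Spark from writing _SUCCESS marker files for new jobs on this cluster.
spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false")

# Hypothetical output path; VACUUM removes files left behind by the commit
# protocol and may need to be run more than once, as the answer notes.
spark.sql("VACUUM '/mnt/output/my_table/'")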
We are getting the below error when trying to select nested columns (string type inside a struct) even though we don't have more than 1000 records in the data frame. The schema is very complex and has a few columns as struct type and a few as array typ...
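The error itself is truncated above, but for context, a minimal sketch of the kind of nested selection being described, using a hypothetical schema:

from pyspark.sql import functions as F

# Hypothetical nested schema: address is a struct containing string fields.
df = spark.createDataFrame(
    [(1, ("221B Baker St", "London"))],
    "id INT, address STRUCT<street: STRING, city: STRING>",
)

# Selecting a string field inside a struct with dot notation.
df.select(F.col("id"), F.col("address.street").alias("street")).show()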
I want to convert the DataFrame to nested JSON. Source data: the DataFrame has values like those shown in image 2. Expected output: I have to convert the DataFrame values to nested JSON as shown in image 1. Appreciate your help!
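Since the images are not reproduced here, a hedged sketch of the usual pattern with hypothetical column names: nest the flat columns into structs and serialize each row with to_json:

from pyspark.sql import functions as F

# Hypothetical flat input columns for illustration.
df = spark.createDataFrame(
    [("alice", "NY", "10001"), ("bob", "SF", "94105")],
    ["name", "city", "zip"],
)

# Build a nested structure, then render each row as a JSON string.
nested = df.select(
    F.to_json(
        F.struct(
            F.col("name"),
            F.struct(F.col("city"), F.col("zip")).alias("address"),
        )
    ).alias("json")
)
nested.show(truncate=False)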
Hi, in line with my question about OPTIMIZE, this is the next step. With a retention of 7 days I could execute VACUUM on all tables once a week; is this a recommended procedure? How can I know if I'll be getting any benefit from VACUUM, without DRY RU...
Ideally 7 days is recommended, but discuss with the data stakeholders to identify what's suitable: 7, 14 or 28 days. Before using VACUUM, first run some analytics on the behaviour of your data. Identify the % of operations that perform updates and deletes vs insert operati...
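A short sketch of checking the benefit first, using a hypothetical table name; DRY RUN lists the files that would be deleted without removing anything:

# Hypothetical Delta table; RETAIN 168 HOURS corresponds to 7 days of retention.
spark.sql("VACUUM my_schema.my_table RETAIN 168 HOURS DRY RUN").show(truncate=False)

# Once the output looks right, run the same statement without DRY RUN.
spark.sql("VACUUM my_schema.my_table RETAIN 168 HOURS")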
I'm a beginner working with Spark SQL in the Java API. I have a dataset with duplicate clients grouped by ENTITY and DOCUMENT_ID, like this: .withColumn("ROWNUMBER", row_number().over(Window.partitionBy("ENTITY", "ENTITY_DOC").orderBy("ID"))) I added a ROWN...
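The question is truncated, but the usual follow-up to that row_number step is to keep only the first row per group; a PySpark sketch of the same pattern (the Java API mirrors it), reusing the column names from the snippet:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("ENTITY", "ENTITY_DOC").orderBy("ID")

# Keep one row per (ENTITY, ENTITY_DOC), choosing the lowest ID.
deduped = (
    df.withColumn("ROWNUMBER", F.row_number().over(w))
      .filter(F.col("ROWNUMBER") == 1)
      .drop("ROWNUMBER")
)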
Hi, I'm interested to know whether multiple executors can append to the same Hive table using saveAsTable or insertInto in Spark SQL. Will that cause any data corruption? What configuration do I need to enable concurrent writes to the same Hive table? What about the s...
The Hive table will not like this, as the underlying data is in parquet format, which is not ACID compliant. Delta Lake, however, is: https://docs.delta.io/0.5.0/concurrency-control.html You can see that inserts do not give conflicts.
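A minimal sketch of the Delta alternative, with a hypothetical table name; blind appends to the same Delta table from concurrent jobs do not conflict with each other:

# Hypothetical target table; each concurrent job can safely run this append.
(df.write
   .format("delta")
   .mode("append")
   .saveAsTable("events_delta"))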
Hi, when creating a Spark view using Spark SQL ("CREATE VIEW AS SELECT ..."), by default this view is non-temporary: the view definition survives the Spark session as well as the Spark cluster. In PySpark I can use DataFrame.createOrReplaceTempView...
Why not create a managed table?
dataframe.write.mode("overwrite").saveAsTable("<example-table>")
# later, when we need the data
resultDf = spark.read.table("<example-table>")
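If a view (rather than a table copy) is the goal, the same permanence can be had by registering the data as a table and defining a plain view over it; a hedged sketch with hypothetical names:

# Hypothetical names: the view definition lives in the metastore, so it
# survives both the session and cluster restarts.
dataframe.write.mode("overwrite").saveAsTable("example_table")
spark.sql("CREATE OR REPLACE VIEW example_view AS SELECT * FROM example_table")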
(since Spark 3.0) Dataset.queryExecution.debug.toFile will dump the full plan to a file, without concatenating the output as a fully materialized Java string in memory.
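For PySpark users, an assumed rough equivalent (query and output path are hypothetical): EXPLAIN EXTENDED returns the plan as a single text column, which can be written to a file instead of being rendered in the notebook:

# Hypothetical query and output path; adjust to your own table and location.
plan_text = spark.sql("EXPLAIN EXTENDED SELECT * FROM my_table").collect()[0][0]

with open("/dbfs/tmp/query_plan.txt", "w") as f:
    f.write(plan_text)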
Notebooks really aren't the best method of viewing large files. Two methods you could employ are:
- Save the file to DBFS and then use the Databricks CLI to download the file
- Use the web terminal
In the web terminal option you can do something like "cat my_lar...
I have a SQL query which I am converting into Spark SQL in Azure Databricks, running in my Jupyter notebook. In my SQL query, a column named Type is created on the fly and has the value 'Goal' for every row: SELECT Type='Goal', Value FROM table. Now, when...
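The thread is truncated above, but for reference, the T-SQL "Type='Goal'" form is usually expressed in Spark SQL as a literal with an alias; a minimal sketch, keeping the table name from the snippet:

from pyspark.sql import functions as F

# Spark SQL form of the same projection: a constant column aliased as Type.
df_sql = spark.sql("SELECT 'Goal' AS Type, Value FROM `table`")

# Equivalent DataFrame form using a literal column.
df_api = spark.table("table").select(F.lit("Goal").alias("Type"), "Value")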
It depends. If you specify the schema it will be zero; otherwise it will do a full file scan, which doesn't work well when processing Big Data at a large scale. CSV files DataFrame Reader: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark...
As indicated, there are ways to manage the amount of data being sampled for inferring the schema. However, as a best practice for production workloads it's always best to define the schema explicitly for consistency, repeatability and robustness of the pipe...
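A minimal sketch of defining the schema explicitly for the CSV reader, with hypothetical column names and input path:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Hypothetical schema and input path; with an explicit schema Spark skips the
# sampling pass it would otherwise need to infer column types.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])

df = (spark.read
        .option("header", "true")
        .schema(schema)
        .csv("/mnt/raw/transactions.csv"))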