Data Engineering

Forum Posts

User15787040559
by New Contributor III
  • 17697 Views
  • 2 replies
  • 3 kudos

What's the difference between a Global view and a Temp view?

The difference between a Global view and a Temp view is how the lifetime of the view is tied to the application: http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.createOrReplaceTempView.html?highlight=createorreplacetempview#pyspar...

Latest Reply
ScottSmithDB
Contributor II
  • 3 kudos

Correct. A Temp View is scoped to the SparkSession and dropped when that session closes. Each notebook runs in its own SparkSession. A Global Temp View is scoped to the cluster and dropped when the cluster restarts or when you drop it. ---------------...
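For illustration, a minimal PySpark sketch of the two APIs (the view names are hypothetical):

# Session-scoped: visible only in this notebook's SparkSession
df = spark.range(5)
df.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT * FROM my_temp_view").show()

# Cluster-scoped: registered under the global_temp database,
# visible from other notebooks until the cluster restarts
df.createOrReplaceGlobalTempView("my_global_view")
spark.sql("SELECT * FROM global_temp.my_global_view").show()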

1 More Reply
kishorekumar
by New Contributor
  • 627 Views
  • 1 reply
  • 0 kudos

Silent failure in DataFrameWriter when loading data to Redshift

Context: I'm using DataFrameWriter to load a dataset into Redshift. DataFrameWriter writes the dataset to S3, then loads it from S3 into Redshift by issuing a Redshift COPY command. Issue: Intermittently, we observe that the data is present in t...
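For context, the write path in question looks roughly like the sketch below; the connector format string, JDBC URL, table name, and tempdir are placeholder assumptions, and the exact options depend on the Redshift connector version in use:

# Hypothetical sketch: the connector stages data in S3, then issues a Redshift COPY
(df.write
   .format("com.databricks.spark.redshift")  # connector name varies by distribution
   .option("url", "jdbc:redshift://host:5439/db?user=...&password=...")  # placeholder
   .option("dbtable", "target_table")        # hypothetical target table
   .option("tempdir", "s3a://bucket/tmp/")   # S3 staging area used by COPY
   .mode("append")
   .save())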

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Kishorekumar Somasundaram, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer for you. Thanks.

suman9872
by New Contributor II
  • 1183 Views
  • 1 reply
  • 1 kudos

How to dynamically convert a Spark DataFrame to nested JSON using Spark Scala

I want to convert a DataFrame to nested JSON. Source data: the DataFrame has values as shown in image 2. Expected output: I have to convert the DataFrame values to nested JSON as shown in image 1. Appreciate your help!

Latest Reply
Kaniz
Community Manager
  • 1 kudos

Hi @Suman Mishra, this article explains how to convert a flattened DataFrame to a nested structure by nesting a case class within another case class. You can use this technique to build a JSON file that can then be sent to an external API.
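The article's approach uses Scala case classes; as a rough PySpark equivalent (the flat column names here are hypothetical), you can nest columns with struct and serialize each row to JSON:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a1", "s1", "c1"), ("a2", "s2", "c2")],
    ["id", "street", "city"])  # hypothetical flat columns

# Nest street/city under an address struct, then emit one JSON string per row
nested = df.select("id", F.struct("street", "city").alias("address"))
for row in nested.toJSON().collect():
    print(row)  # e.g. {"id":"a1","address":{"street":"s1","city":"c1"}}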

sage5616
by Valued Contributor
  • 2884 Views
  • 3 replies
  • 4 kudos

Resolved! Spark persistent view on a partitioned parquet file

In Spark, is it possible to create a persistent view on a partitioned parquet file in Azure BLOB? The view must be available when the cluster is restarted, without having to re-create it, hence it cannot be a temp view. I can create a temp view, b...

Latest Reply
sage5616
Valued Contributor
  • 4 kudos

Here is what worked for me. Hope this helps someone else: https://stackoverflow.com/questions/72913913/spark-persistent-view-on-a-partition-parquet-file/72914245#72914245
CREATE VIEW test AS SELECT * FROM parquet.`/mnt/folder-with-parquet-file(s)/`
@Hu...
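The same idea from a notebook cell, as a minimal sketch (the mount path is the placeholder from the answer above):

# Persistent view over a partitioned parquet directory; survives cluster restarts
spark.sql("""
  CREATE VIEW IF NOT EXISTS test
  AS SELECT * FROM parquet.`/mnt/folder-with-parquet-file(s)/`
""")
spark.sql("SELECT COUNT(*) FROM test").show()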

2 More Replies
PradeepRavi
by New Contributor III
  • 25254 Views
  • 6 replies
  • 9 kudos

How do I prevent _success and _committed files in my write output?

Is there a way to prevent the _SUCCESS and _committed files in my output? It's a tedious task to navigate to all the partitions and delete the files. Note: the final output is stored in Azure ADLS.

Latest Reply
shan_chandra
Honored Contributor III
  • 9 kudos

Please find the below steps to remove the _SUCCESS, _committed and _started files:
1. spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false") to remove the _SUCCESS file.
2. Run the VACUUM command multiple times until the _committed and _started files...
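Put together, a minimal sketch of those two steps (the table name is hypothetical, and the retention shown is illustrative):

# 1. Stop Spark from writing _SUCCESS marker files
spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false")

# 2. Clean up leftover _committed/_started files (Delta table name is a placeholder)
spark.sql("VACUUM my_table RETAIN 168 HOURS")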

5 More Replies
cmotla
by New Contributor III
  • 1322 Views
  • 3 replies
  • 8 kudos

Issue with complex JSON-based DataFrame select

We are getting the below error when trying to select nested columns (a string type inside a struct), even though we don't have more than 1,000 records in the DataFrame. The schema is very complex, with a few columns of struct type and a few of array typ...

Latest Reply
Kaniz
Community Manager
  • 8 kudos

Hi @Chaitanya Motla, just a friendly follow-up. Do you still need help, or did you find the solution? Please let us know.

2 More Replies
alejandrofm
by Valued Contributor
  • 2449 Views
  • 2 replies
  • 3 kudos

Resolved! Running vacuum on each table

Hi, in line with my question about OPTIMIZE, this is the next step. With a retention of 7 days I could execute VACUUM on all tables once a week; is this a recommended procedure? How can I know if I'll be getting any benefit from VACUUM, without DRY RU...

Latest Reply
AmanSehgal
Honored Contributor III
  • 3 kudos

Ideally 7 days is recommended, but discuss with data stakeholders to identify what's suitable: 7, 14, or 28 days. To use VACUUM, first run some analytics on the behaviour of your data. Identify the % of operations that perform updates and deletes vs. insert operati...
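One way to run that analysis, sketched in PySpark (the Delta table name is a placeholder):

from pyspark.sql import functions as F

# Count operation types in the Delta transaction log
# to see the update/delete vs. insert mix
history = spark.sql("DESCRIBE HISTORY my_delta_table")
history.groupBy("operation").count().orderBy(F.desc("count")).show()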

1 More Reply
tusworten
by New Contributor II
  • 3711 Views
  • 5 replies
  • 4 kudos

Spark SQL Group by duplicates, collect_list in array of structs and evaluate rows in each group.

I'm a beginner working with Spark SQL in the Java API. I have a dataset with duplicate clients, grouped by ENTITY and DOCUMENT_ID, like this:
.withColumn("ROWNUMBER", row_number().over(Window.partitionBy("ENTITY", "ENTITY_DOC").orderBy("ID")))
I added a ROWN...
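For reference, the grouping pattern the title describes looks roughly like this in PySpark, given the thread's DataFrame df (the key columns are taken from the snippet above; the collected fields are assumptions):

from pyspark.sql import functions as F

# Collapse duplicate clients into one row per (ENTITY, ENTITY_DOC),
# collecting the remaining fields as an array of structs
grouped = (df.groupBy("ENTITY", "ENTITY_DOC")
             .agg(F.collect_list(F.struct("ID", "ROWNUMBER")).alias("rows")))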

Latest Reply
tusworten
New Contributor II
  • 4 kudos

Hi @Kaniz Fatma, her answer didn't solve my problem, but it was useful for learning more about UDFs, which I did not know about.

4 More Replies
Autel
by New Contributor II
  • 2144 Views
  • 4 replies
  • 1 kudos

Resolved! Concurrent update to the same Hive or Delta Lake table

Hi, I'm interested to know: if multiple executors append to the same Hive table using saveAsTable or insertInto in Spark SQL, will that cause any data corruption? What configuration do I need to enable concurrent writes to the same Hive table? What about the s...
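For what it's worth, Delta Lake is the usual answer here: its optimistic concurrency control allows concurrent blind appends to the same table without corruption, whereas plain Hive/parquet tables make no such guarantee. A minimal sketch (the table name is hypothetical):

# Concurrent appends from multiple jobs are safe on a Delta table
(df.write
   .format("delta")
   .mode("append")
   .saveAsTable("shared_events"))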

Latest Reply
Kaniz
Community Manager
  • 1 kudos

Hi @Weide Zhang, does @werners' reply answer your question?

3 More Replies
SankaraiahNaray
by New Contributor II
  • 19631 Views
  • 10 replies
  • 6 kudos

Resolved! Not able to read text file from local file path - Spark CSV reader

We are using the Spark CSV reader to read a CSV file into a DataFrame, and we are running the job in yarn-client mode; it works fine in local mode. We submit the Spark job from an edge node. But when we place the file in a local file path instead...

Latest Reply
Kaniz
Community Manager
  • 6 kudos

Hi @Sankaraiah Narayanasamy, this seems to be a bug in the spark-shell command when reading a local file, but there is a workaround: when running the spark-submit command, just specify --conf "spark.authenticate=false" in the command. See SPARK-23476 for reference.

9 More Replies
MartinB
by Contributor III
  • 6658 Views
  • 8 replies
  • 9 kudos

Resolved! Is there a way to create a non-temporary Spark View with PySpark?

Hi, when creating a Spark view using Spark SQL ("CREATE VIEW AS SELECT ...") the view is by default non-temporary: the view definition will survive the Spark session as well as the Spark cluster. In PySpark I can use DataFrame.createOrReplaceTempView...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 9 kudos

Why not create a managed table?
dataframe.write.mode(SaveMode.Overwrite).saveAsTable("<example-table>")
# later, when we need the data
resultDf = spark.read.table("<example-table>")
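Worth noting: PySpark can also create a permanent view via spark.sql, as long as the view references a permanent object such as the managed table above (a permanent view cannot reference a temp view). A minimal sketch reusing the placeholder table name:

# Permanent view over the managed table; survives the session and the cluster
spark.sql("CREATE OR REPLACE VIEW example_view AS SELECT * FROM `<example-table>`")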

7 More Replies
User15787040559
by New Contributor III
  • 1435 Views
  • 2 replies
  • 1 kudos

How can I get Databricks notebooks to stop cutting off the explain plans?

(Since Spark 3.0) Dataset.queryExecution.debug.toFile will dump the full plan to a file, without concatenating the output as a fully materialized Java string in memory.

Latest Reply
dazfuller
Contributor III
  • 1 kudos

Notebooks really aren't the best method of viewing large files. Two methods you could employ are:
  • Save the file to DBFS and then use the Databricks CLI to download the file
  • Use the web terminal
With the web terminal option you can do something like "cat my_lar...

1 More Reply
User15787040559
by New Contributor III
  • 784 Views
  • 2 replies
  • 0 kudos

What subset of MySQL SQL syntax do we support in Spark SQL?

https://spark.apache.org/docs/latest/sql-ref-syntax.html

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Spark 3 has experimental support for ANSI. Read more here: https://spark.apache.org/docs/3.0.0/sql-ref-ansi-compliance.html
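The ANSI dialect is opt-in via a session configuration, e.g.:

# Experimental in Spark 3.0: stricter ANSI SQL behaviour
# (e.g. runtime errors on overflow instead of silent nulls)
spark.conf.set("spark.sql.ansi.enabled", "true")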

1 More Reply
haseebkhan1421
by New Contributor
  • 1567 Views
  • 1 reply
  • 3 kudos

How can I create a column on the fly which has the same value for all rows in a Spark SQL query?

I have a SQL query which I am converting into Spark SQL in Azure Databricks, running in my Jupyter notebook. In my SQL query, a column named Type is created on the fly, which has the value 'Goal' for every row:
SELECT Type='Goal', Value FROM table
Now, when...

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 3 kudos

The correct syntax would be: SELECT 'Goal' AS Type, Value FROM table
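The DataFrame-API equivalent uses lit; a minimal sketch, assuming the source table has been loaded as df:

from pyspark.sql import functions as F

# Add a constant column: every row gets Type = 'Goal'
result = df.select(F.lit("Goal").alias("Type"), "Value")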

User15787040559
by New Contributor III
  • 2531 Views
  • 1 reply
  • 0 kudos

How many records does Spark use to infer the schema? The entire file, or just the first "X" records?

It depends. If you specify the schema it will read zero records to infer it; otherwise it will do a full file scan, which doesn't work well when processing Big Data at a large scale. CSV files DataFrameReader: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark...

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

As indicated, there are ways to manage the amount of data being sampled for inferring the schema. However, as a best practice for production workloads it's always best to define the schema explicitly, for the consistency, repeatability and robustness of the pipe...
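A minimal sketch of both options (the file path and column names are placeholders):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Preferred for production: explicit schema, so no records are read for inference
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.csv("/mnt/data/people.csv", header=True, schema=schema)

# Otherwise, bound the inference cost by sampling a fraction of rows
df_sampled = spark.read.csv("/mnt/data/people.csv", header=True,
                            inferSchema=True, samplingRatio=0.1)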
