Data Engineering

Forum Posts

Mado
by Valued Contributor II
  • 4664 Views
  • 4 replies
  • 4 kudos

Resolved! Difference between "spark.table" & "spark.read.table"?

Hi, I want to make a PySpark DataFrame from a table. I would like to ask about the difference between the following commands: spark.read.table(TableName) and spark.table(TableName). Both return a PySpark DataFrame and look similar. Thanks.
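For reference, a minimal sketch of the two calls side by side (the table name is illustrative). spark.table is effectively shorthand for spark.read.table; both resolve the same catalog table and return an equivalent DataFrame:

    # Both read the named catalog table into a pyspark.sql.DataFrame.
    df1 = spark.read.table("my_db.my_table")   # via the DataFrameReader
    df2 = spark.table("my_db.my_table")        # shorthand for the same lookup

    # The two DataFrames have the same schema and produce the same plan.
    assert df1.schema == df2.schema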

Latest Reply
Mado
Valued Contributor II
  • 4 kudos

Hi @Kaniz Fatma, I selected the answer from @Kedar Deshpande as the best answer.

3 More Replies
RamaSantosh
by New Contributor II
  • 2549 Views
  • 2 replies
  • 3 kudos

Data load from Azure databricks dataframe to cosmos db container

I am trying to load data from an Azure Databricks DataFrame into a Cosmos DB container using the command below: cfg = { "spark.cosmos.accountEndpoint" : cosmosEndpoint, "spark.cosmos.accountKey" : cosmosMasterKey, "spark.cosmos.database" : cosmosDatabaseName, "sp...
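For context, the usual write path with the Azure Cosmos DB Spark connector looks roughly like the sketch below; the container name and the df variable are placeholders, and cosmos.oltp is the format exposed by the azure-cosmos-spark connector:

    cfg = {
        "spark.cosmos.accountEndpoint": cosmosEndpoint,
        "spark.cosmos.accountKey":      cosmosMasterKey,
        "spark.cosmos.database":        cosmosDatabaseName,
        "spark.cosmos.container":       cosmosContainerName,
    }

    # Append the DataFrame rows as items in the target container.
    (df.write
       .format("cosmos.oltp")
       .options(**cfg)
       .mode("APPEND")
       .save())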

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hey @Rama Santosh Ravada, hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from y...

1 More Replies
anonymous1
by New Contributor III
  • 3841 Views
  • 7 replies
  • 5 kudos

How to implement a Source-to-Target ETL mapping sheet in PySpark using Delta tables

Schema design: Source: multiple CSV files (SourceFile1, SourceFile2). Target: Delta table (Target_Table). Excel file: ETL_Mapping_Sheet. File columns: SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic. The MappingLogic column cont...

Latest Reply
AmanSehgal
Honored Contributor III
  • 5 kudos

Following on from @Werner Stinckens' response, it would help if you could give an example. Ideally, you can read each row from the Excel file in Python and pass each column as a parameter to a function, e.g.: def apply_mapping_logic(SourceTable, SourceColumn,...
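A rough illustration of that idea, assuming pandas (with an Excel engine such as openpyxl) for the mapping sheet and Spark SQL expressions for the mapping logic; every name here is hypothetical:

    import pandas as pd
    from pyspark.sql.functions import expr

    # Read the mapping sheet; each row describes one source-to-target mapping.
    mapping = pd.read_excel("/dbfs/FileStore/ETL_Mapping_Sheet.xlsx")

    def apply_mapping_logic(source_table, source_column, target_column, mapping_logic):
        # Evaluate the mapping expression against the source table.
        return spark.table(source_table).select(expr(mapping_logic).alias(target_column))

    # One transformed column per mapping row; merge into the Delta target as needed.
    for row in mapping.itertuples():
        out = apply_mapping_logic(row.SourceTable, row.SourceColumn,
                                  row.TargetColumn, row.MappingLogic)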

6 More Replies
KNP
by New Contributor
  • 1887 Views
  • 2 replies
  • 0 kudos

Passing an array as a parameter to a pandas UDF

Hi Team, my Python DataFrame is as below. The raw data is quite a long series of approx. 5000 numbers. My requirement is to go through each row in the RawData column and calculate 2 metrics. I have created a function in Python and it works absolutely fine. ...
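For illustration, one way to apply a per-row function to an array column with a pandas UDF; the metric and DataFrame are made up, and only the RawData column name comes from the post:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def metric_one(raw: pd.Series) -> pd.Series:
        # Each element of `raw` is the full number array from one row of RawData.
        return raw.apply(lambda arr: float(sum(arr)) / len(arr))

    df = df.withColumn("metric1", metric_one("RawData"))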

Latest Reply
Vidula
Honored Contributor
  • 0 kudos

Hello @Kausthub NP, hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Tha...

1 More Replies
KumarShiv
by New Contributor III
  • 1132 Views
  • 2 replies
  • 2 kudos

Resolved! Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.

I am trying to get percentile values on different splits, but I found that the result of the Databricks PERCENTILE_DISC() function is not accurate. I have run the same query on MS SQL and get a different result set. Here are both result sets for PySpark ...

Latest Reply
artsheiko
Valued Contributor III
  • 2 kudos

The reason might be that in SQL Server, PERCENTILE_DISC is nondeterministic.
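Spark SQL (3.3+) also exposes percentile_disc directly, so the two engines can be compared on the same data. A sketch with illustrative table and column names; note that PERCENTILE_DISC returns an actual value from the ordered set, so engines can legitimately pick different rows near a boundary:

    result = spark.sql("""
        SELECT split_col,
               percentile_disc(0.5) WITHIN GROUP (ORDER BY value_col) AS p50
        FROM my_table
        GROUP BY split_col
    """)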

1 More Replies
NathanLaw
by New Contributor III
  • 2331 Views
  • 8 replies
  • 1 kudos

Model Training Data Adapter Error.

We are converting a PySpark DataFrame to TensorFlow using Petastorm and have encountered a "data adapter" error. What do you recommend for diagnosing and fixing this error? https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/...
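For orientation, the usual Petastorm conversion flow on Databricks is sketched below; the cache path, DataFrame, model, and the features/label column names are all placeholders. Keras raises "data adapter" errors when model.fit receives a type it cannot adapt, so mapping the dataset's named tuples to (features, label) pairs is a common first check:

    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Petastorm materializes the DataFrame as Parquet under this cache directory.
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
                   "file:///dbfs/tmp/petastorm/cache")

    converter = make_spark_converter(train_df)
    with converter.make_tf_dataset(batch_size=32) as ds:
        # model.fit expects (features, label) tuples, not Petastorm named tuples;
        # "features" and "label" are assumed column names here.
        ds = ds.map(lambda batch: (batch.features, batch.label))
        model.fit(ds, steps_per_epoch=100, epochs=5)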

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hey @Nathan Law, thank you so much for getting back to us. We will await your response. We really appreciate your time.

7 More Replies
Eyespoop
by New Contributor II
  • 11748 Views
  • 3 replies
  • 2 kudos

Resolved! PySpark: Writing Parquet Files to the Azure Blob Storage Container

Currently I am having some issues with writing a Parquet file to the storage container. I do have the code running, but whenever the DataFrame writer puts the Parquet into blob storage, instead of the Parquet file type it is created as a f...

Latest Reply
User16764241763
Honored Contributor
  • 2 kudos

Hello @Karl Saycon, can you try setting this config to prevent additional Parquet summary and metadata files from being written? The result of the DataFrame write to storage should be a single file. https://community.databricks.com/s/question/0D53f00001...
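The configs in question are presumably along these lines (a sketch; sc is the notebook's SparkContext and the output path is a placeholder). Spark always writes a directory of part files, so coalesce(1) is what reduces the output to one part file:

    # Suppress the _SUCCESS marker and Parquet summary files (Hadoop configs).
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    hadoop_conf.set("parquet.enable.summary-metadata", "false")

    (df.coalesce(1)                      # one partition -> one part file
       .write.mode("overwrite")
       .parquet("wasbs://container@account.blob.core.windows.net/output"))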

2 More Replies
SusuTheSeeker
by New Contributor III
  • 2113 Views
  • 8 replies
  • 3 kudos

Kernel switches to unknown when using PySpark

I am working in a notebook in JupyterHub. I am using a PySpark DataFrame for analyzing text; more precisely, I am doing sentiment analysis of newspaper articles. The code works until I get to some point where the kernel is busy, and after approximately...

Latest Reply
Kaniz
Community Manager
  • 3 kudos

Hi @Suad Hidbani, we haven't heard from you since our last responses, and I was checking back to see if you have a resolution yet. If you have any solution, please share it with the community, as it can be helpful to others. Otherwise, we will...

7 More Replies
RRO
by Contributor
  • 21768 Views
  • 7 replies
  • 7 kudos

Resolved! Performance of a PySpark DataFrame is very slow after using a @pandas_udf

Hello, I am currently working on a time series forecast with FBProphet. Since I have data with many time series groups (~3000), I use a @pandas_udf to parallelize the training: @pandas_udf(schema, PandasUDFType.GROUPED_MAP) def forecast_netprofit(pr...
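For orientation, the grouped-map pattern in its current form; applyInPandas supersedes the PandasUDFType.GROUPED_MAP decorator shown in the post, and the group column name here is assumed:

    def forecast_netprofit(pdf):
        # pdf is a pandas DataFrame holding one time-series group; fit Prophet
        # here and return the forecast as a pandas DataFrame matching `schema`.
        ...

    results = df.groupBy("group_id").applyInPandas(forecast_netprofit, schema=schema)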

Latest Reply
RRO
Contributor
  • 7 kudos

Thank you for the answers. Unfortunately this did not solve the performance issue. What I did now is save the results into a table: results.write.mode("overwrite").saveAsTable("db.results") This is probably not the best solution, but after I do that ...
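That helps because saveAsTable forces the expensive UDF to run once and materialize its output; later actions read the stored table instead of recomputing the whole plan. Roughly:

    results.write.mode("overwrite").saveAsTable("db.results")
    results = spark.table("db.results")   # subsequent actions hit the materialized table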

6 More Replies
Abeeya
by New Contributor II
  • 3334 Views
  • 2 replies
  • 3 kudos

Resolved! How to overwrite using PySpark's JDBC without losing constraints on table columns

Hello, my table has a primary key constraint on a particular column. I'm losing the primary key constraint on that column each time I overwrite the table. What can I do to preserve it? Any heads-up would be appreciated. Tried below: df.write.option("truncate", ...
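The truncate option is the documented route here: with mode("overwrite"), Spark's JDBC writer then issues TRUNCATE TABLE instead of DROP/CREATE, leaving the table definition, and thus its constraints, intact. A sketch with placeholder connection details:

    (df.write
       .mode("overwrite")
       .option("truncate", "true")   # TRUNCATE instead of DROP/CREATE keeps constraints
       .jdbc(url=jdbc_url, table="my_schema.my_table", properties=connection_props))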

Latest Reply
Kaniz
Community Manager
  • 3 kudos

Hi @Abeeya, how are you? Did @Hubert Dudek's answer help you in any way? Please let us know.

1 More Replies
DarshilDesai
by New Contributor II
  • 9982 Views
  • 3 replies
  • 3 kudos

Resolved! How to Efficiently Read Nested JSON in PySpark?

I am having trouble efficiently reading and parsing a large number of stream files in PySpark! Context: here is the schema of the stream file that I am reading in JSON. Blank spaces are edits for confidentiality purposes. root |-- location_info: ar...
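A typical approach, sketched below; only the location_info field name comes from the post, and the schema, path, and array element type are assumed. Supplying an explicit schema avoids an inference pass over every file, and explode plus select flattens the nested array:

    from pyspark.sql.functions import col, explode

    # An explicit schema skips schema inference across the many stream files.
    df = spark.read.schema(stream_schema).json("/mnt/streams/*.json")

    flat = (df.select(explode(col("location_info")).alias("loc"))
              .select("loc.*"))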

Latest Reply
Kaniz
Community Manager
  • 3 kudos

Hi @Darshil Desai, how are you? Were you able to resolve your problem?

2 More Replies
SailajaB
by Valued Contributor III
  • 4196 Views
  • 12 replies
  • 4 kudos

Resolved! JSON validation fails after writing a PySpark DataFrame to JSON format

Hi, we have to convert a transformed DataFrame to JSON format, so we used write with the json format on top of the final DataFrame to convert it to JSON. But when we validate the output JSON, it is not in proper JSON format. Could you please provide your suggestio...
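A likely cause, for context: df.write.json emits JSON Lines, one object per line spread across part files, which fails any validator expecting a single JSON document. A sketch of both shapes (paths are placeholders):

    # Standard Spark output: JSON Lines in part files, one object per line.
    df.write.mode("overwrite").json("/mnt/out/json_lines")

    # For one valid JSON array (small data only), collect and dump instead:
    import json
    rows = [r.asDict(recursive=True) for r in df.collect()]
    with open("/dbfs/mnt/out/result.json", "w") as f:
        json.dump(rows, f, default=str)   # default=str covers dates/decimals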

Latest Reply
Anonymous
Not applicable
  • 4 kudos

@Sailaja B, does @Aman Sehgal's most recent answer help solve the problem? If it does, would you be happy to mark their answer as best?

11 More Replies
hare
by New Contributor III
  • 1526 Views
  • 4 replies
  • 8 kudos

Azure DBR - Have to load a list of JSON files, but the column has special characters (ex: {"hydra:xxxx": {"hydra:value":"yyyy", "hydra:value1":"zzzzz"}})

Azure DBR - Have to load a list of JSON files into a DataFrame and then from the DataFrame into a Databricks table, but the column has special characters and we are getting the error below. Both the column (key) and the value (as a JSON record) have special characters in the JSON file. # Can...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 8 kudos

The best approach is just to define the schema manually, as sketched below. There is a nice article from someone who had exactly the same problem: https://towardsdev.com/create-a-spark-hive-meta-store-table-using-nested-json-with-invalid-field-names-505f215eb5bf
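In outline, with the hydra:* field names taken from the post and the types assumed: define the schema explicitly, then select the nested fields out under metastore-safe aliases, since metastore column names are typically restricted to letters, digits, and underscores:

    from pyspark.sql.functions import col
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("hydra:xxxx", StructType([
            StructField("hydra:value",  StringType(), True),
            StructField("hydra:value1", StringType(), True),
        ]), True),
    ])

    df = spark.read.schema(schema).json("/mnt/raw/*.json")

    # Backticks let us reference the colon-containing names; aliases make them safe.
    clean = df.select(
        col("`hydra:xxxx`.`hydra:value`").alias("hydra_value"),
        col("`hydra:xxxx`.`hydra:value1`").alias("hydra_value1"),
    )
    clean.write.format("delta").saveAsTable("target_table")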

3 More Replies