Data Engineering

Forum Posts

Mado
by Valued Contributor II
  • 4664 Views
  • 4 replies
  • 4 kudos

Resolved! Difference between "spark.table" & "spark.read.table"?

Hi, I want to make a PySpark DataFrame from a table. I would like to ask about the difference between the following commands: spark.read.table(TableName) and spark.table(TableName). Both return a PySpark DataFrame and look similar. Thanks.
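For reference, a minimal sketch of the two calls side by side (the table name is illustrative). spark.table is effectively shorthand for spark.read.table; both resolve the same catalog table and return an equivalent DataFrame:

    # Both read the named catalog table into a pyspark.sql.DataFrame.
    df1 = spark.read.table("my_db.my_table")   # via the DataFrameReader
    df2 = spark.table("my_db.my_table")        # shorthand for the same lookup

    # The two DataFrames have the same schema and produce the same plan.
    assert df1.schema == df2.schema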

Latest Reply
Mado
Valued Contributor II
  • 4 kudos

Hi @Kaniz Fatma, I selected the answer from @Kedar Deshpande as the best answer.

3 More Replies
RamaSantosh
by New Contributor II
  • 2549 Views
  • 2 replies
  • 3 kudos

Data load from Azure databricks dataframe to cosmos db container

I am trying to load data from an Azure Databricks DataFrame into a Cosmos DB container using the command below: cfg = { "spark.cosmos.accountEndpoint" : cosmosEndpoint, "spark.cosmos.accountKey" : cosmosMasterKey, "spark.cosmos.database" : cosmosDatabaseName, "sp...
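For context, the usual write path with the Azure Cosmos DB Spark connector looks roughly like the sketch below; the container name and the df variable are placeholders, and cosmos.oltp is the format exposed by the azure-cosmos-spark connector:

    cfg = {
        "spark.cosmos.accountEndpoint": cosmosEndpoint,
        "spark.cosmos.accountKey":      cosmosMasterKey,
        "spark.cosmos.database":        cosmosDatabaseName,
        "spark.cosmos.container":       cosmosContainerName,
    }

    # Append the DataFrame rows as items in the target container.
    (df.write
       .format("cosmos.oltp")
       .options(**cfg)
       .mode("APPEND")
       .save())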

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hey @Rama Santosh Ravada, hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from y...

1 More Replies
anonymous1
by New Contributor III
  • 3841 Views
  • 7 replies
  • 5 kudos

How to implement a Source-to-Target ETL mapping sheet in PySpark using Delta tables

Schema design: Source: multiple CSV files (SourceFile1, SourceFile2). Target: Delta table (Target_Table). Excel file: ETL_Mapping_Sheet. File columns: SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic. The MappingLogic column cont...

Latest Reply
AmanSehgal
Honored Contributor III
  • 5 kudos

Following on from @Werner Stinckens' response, it would help if you could give an example. Ideally, you can read each row from the Excel file in Python and pass each column as a parameter to a function, e.g.: def apply_mapping_logic(SourceTable, SourceColumn,...
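A rough illustration of that idea, assuming pandas (with an Excel engine such as openpyxl) for the mapping sheet and Spark SQL expressions for the mapping logic; every name here is hypothetical:

    import pandas as pd
    from pyspark.sql.functions import expr

    # Read the mapping sheet; each row describes one source-to-target mapping.
    mapping = pd.read_excel("/dbfs/FileStore/ETL_Mapping_Sheet.xlsx")

    def apply_mapping_logic(source_table, source_column, target_column, mapping_logic):
        # Evaluate the mapping expression against the source table.
        return spark.table(source_table).select(expr(mapping_logic).alias(target_column))

    # One transformed column per mapping row; merge into the Delta target as needed.
    for row in mapping.itertuples():
        out = apply_mapping_logic(row.SourceTable, row.SourceColumn,
                                  row.TargetColumn, row.MappingLogic)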

6 More Replies
KNP
by New Contributor
  • 1887 Views
  • 2 replies
  • 0 kudos

Passing an array as a parameter to a pandas UDF

Hi Team, my Python DataFrame is as below. The raw data is quite a long series of approx. 5000 numbers. My requirement is to go through each row in the RawData column and calculate 2 metrics. I have created a function in Python and it works absolutely fine. ...
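For illustration, one way to apply a per-row function to an array column with a pandas UDF; the metric and DataFrame are made up, and only the RawData column name comes from the post:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def metric_one(raw: pd.Series) -> pd.Series:
        # Each element of `raw` is the full number array from one row of RawData.
        return raw.apply(lambda arr: float(sum(arr)) / len(arr))

    df = df.withColumn("metric1", metric_one("RawData"))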

Latest Reply
Vidula
Honored Contributor
  • 0 kudos

Hello @Kausthub NP, hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Tha...

1 More Replies
KumarShiv
by New Contributor III
  • 1132 Views
  • 2 replies
  • 2 kudos

Resolved! Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.

I am trying to get percentile values on different splits, but I found that the result of the Databricks PERCENTILE_DISC() function is not accurate. I have run the same query on MS SQL and get a different result set. Here are both result sets for PySpark ...

Latest Reply
artsheiko
Valued Contributor III
  • 2 kudos

The reason might be that in SQL Server, PERCENTILE_DISC is nondeterministic.
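Spark SQL (3.3+) also exposes percentile_disc directly, so the two engines can be compared on the same data. A sketch with illustrative table and column names; note that PERCENTILE_DISC returns an actual value from the ordered set, so engines can legitimately pick different rows near a boundary:

    result = spark.sql("""
        SELECT split_col,
               percentile_disc(0.5) WITHIN GROUP (ORDER BY value_col) AS p50
        FROM my_table
        GROUP BY split_col
    """)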

1 More Replies
NathanLaw
by New Contributor III
  • 2331 Views
  • 8 replies
  • 1 kudos

Model Training Data Adapter Error.

We are converting a PySpark DataFrame to TensorFlow using Petastorm and have encountered a "data adapter" error. What do you recommend for diagnosing and fixing this error? https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/...
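For orientation, the usual Petastorm conversion flow on Databricks is sketched below; the cache path, DataFrame, model, and the features/label column names are all placeholders. Keras raises "data adapter" errors when model.fit receives a type it cannot adapt, so mapping the dataset's named tuples to (features, label) pairs is a common first check:

    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Petastorm materializes the DataFrame as Parquet under this cache directory.
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
                   "file:///dbfs/tmp/petastorm/cache")

    converter = make_spark_converter(train_df)
    with converter.make_tf_dataset(batch_size=32) as ds:
        # model.fit expects (features, label) tuples, not Petastorm named tuples;
        # "features" and "label" are assumed column names here.
        ds = ds.map(lambda batch: (batch.features, batch.label))
        model.fit(ds, steps_per_epoch=100, epochs=5)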

Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hey @Nathan Law, thank you so much for getting back to us. We will await your response. We really appreciate your time.

7 More Replies
Eyespoop
by New Contributor II
  • 11748 Views
  • 3 replies
  • 2 kudos

Resolved! PySpark: Writing Parquet Files to the Azure Blob Storage Container

Currently I am having some issues with writing a Parquet file to the storage container. I do have the code running, but whenever the DataFrame writer puts the Parquet into blob storage, instead of the Parquet file type it is created as a f...

Latest Reply
User16764241763
Honored Contributor
  • 2 kudos

Hello @Karl Saycon, can you try setting this config to prevent additional Parquet summary and metadata files from being written? The result of the DataFrame write to storage should be a single file. https://community.databricks.com/s/question/0D53f00001...
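The configs in question are presumably along these lines (a sketch; sc is the notebook's SparkContext and the output path is a placeholder). Spark always writes a directory of part files, so coalesce(1) is what reduces the output to one part file:

    # Suppress the _SUCCESS marker and Parquet summary files (Hadoop configs).
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    hadoop_conf.set("parquet.enable.summary-metadata", "false")

    (df.coalesce(1)                      # one partition -> one part file
       .write.mode("overwrite")
       .parquet("wasbs://container@account.blob.core.windows.net/output"))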

2 More Replies
SusuTheSeeker
by New Contributor III
  • 2113 Views
  • 8 replies
  • 3 kudos

Kernel switches to unknown when using PySpark

I am working in a notebook in JupyterHub. I am using a PySpark DataFrame for analyzing text; more precisely, I am doing sentiment analysis of newspaper articles. The code works until I get to some point where the kernel is busy, and after approximately...

Latest Reply
Kaniz
Community Manager
  • 3 kudos

Hi @Suad Hidbani, we haven't heard from you since our last responses, and I was checking back to see if you have a resolution yet. If you have any solution, please share it with the community, as it can be helpful to others. Otherwise, we will...

7 More Replies
RRO
by Contributor
  • 21768 Views
  • 7 replies
  • 7 kudos

Resolved! Performance of a PySpark DataFrame is very slow after using a @pandas_udf

Hello, I am currently working on a time series forecast with FBProphet. Since I have data with many time series groups (~3000), I use a @pandas_udf to parallelize the training: @pandas_udf(schema, PandasUDFType.GROUPED_MAP) def forecast_netprofit(pr...
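For orientation, the grouped-map pattern in its current form; applyInPandas supersedes the PandasUDFType.GROUPED_MAP decorator shown in the post, and the group column name here is assumed:

    def forecast_netprofit(pdf):
        # pdf is a pandas DataFrame holding one time-series group; fit Prophet
        # here and return the forecast as a pandas DataFrame matching `schema`.
        ...

    results = df.groupBy("group_id").applyInPandas(forecast_netprofit, schema=schema)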

Latest Reply
RRO
Contributor
  • 7 kudos

Thank you for the answers. Unfortunately this did not solve the performance issue. What I did now is save the results into a table: results.write.mode("overwrite").saveAsTable("db.results") This is probably not the best solution, but after I do that ...
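That helps because saveAsTable forces the expensive UDF to run once and materialize its output; later actions read the stored table instead of recomputing the whole plan. Roughly:

    results.write.mode("overwrite").saveAsTable("db.results")
    results = spark.table("db.results")   # subsequent actions hit the materialized table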

6 More Replies
Abeeya
by New Contributor II
  • 3334 Views
  • 2 replies
  • 3 kudos

Resolved! How to overwrite using PySpark's JDBC without losing constraints on table columns

Hello, my table has a primary key constraint on a particular column. I'm losing the primary key constraint on that column each time I overwrite the table. What can I do to preserve it? Any heads-up would be appreciated. Tried below: df.write.option("truncate", ...
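The truncate option is the documented route here: with mode("overwrite"), Spark's JDBC writer then issues TRUNCATE TABLE instead of DROP/CREATE, leaving the table definition, and thus its constraints, intact. A sketch with placeholder connection details:

    (df.write
       .mode("overwrite")
       .option("truncate", "true")   # TRUNCATE instead of DROP/CREATE keeps constraints
       .jdbc(url=jdbc_url, table="my_schema.my_table", properties=connection_props))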

Latest Reply
Kaniz
Community Manager
  • 3 kudos

Hi @Abeeya, how are you? Did @Hubert Dudek's answer help you in any way? Please let us know.

1 More Replies
DarshilDesai
by New Contributor II
  • 9982 Views
  • 3 replies
  • 3 kudos

Resolved! How to Efficiently Read Nested JSON in PySpark?

I am having trouble efficiently reading and parsing a large number of stream files in PySpark! Context: here is the schema of the stream file that I am reading in JSON. Blank spaces are edits for confidentiality purposes. root |-- location_info: ar...
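A typical approach, sketched below; only the location_info field name comes from the post, and the schema, path, and array element type are assumed. Supplying an explicit schema avoids an inference pass over every file, and explode plus select flattens the nested array:

    from pyspark.sql.functions import col, explode

    # An explicit schema skips schema inference across the many stream files.
    df = spark.read.schema(stream_schema).json("/mnt/streams/*.json")

    flat = (df.select(explode(col("location_info")).alias("loc"))
              .select("loc.*"))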

Latest Reply
Kaniz
Community Manager
  • 3 kudos

Hi @Darshil Desai, how are you? Were you able to resolve your problem?

2 More Replies
SailajaB
by Valued Contributor III
  • 4196 Views
  • 12 replies
  • 4 kudos

Resolved! JSON validation fails after writing a PySpark DataFrame to JSON format

Hi, we have to convert a transformed DataFrame to JSON format, so we used write with the json format on top of the final DataFrame to convert it to JSON. But when we validate the output JSON, it is not in proper JSON format. Could you please provide your suggestio...
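A likely cause, for context: df.write.json emits JSON Lines, one object per line spread across part files, which fails any validator expecting a single JSON document. A sketch of both shapes (paths are placeholders):

    # Standard Spark output: JSON Lines in part files, one object per line.
    df.write.mode("overwrite").json("/mnt/out/json_lines")

    # For one valid JSON array (small data only), collect and dump instead:
    import json
    rows = [r.asDict(recursive=True) for r in df.collect()]
    with open("/dbfs/mnt/out/result.json", "w") as f:
        json.dump(rows, f, default=str)   # default=str covers dates/decimals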

Latest Reply
Anonymous
Not applicable
  • 4 kudos

@Sailaja B, does @Aman Sehgal's most recent answer help solve the problem? If it does, would you be happy to mark their answer as best?

11 More Replies
hare
by New Contributor III
  • 1526 Views
  • 4 replies
  • 8 kudos

Azure DBR - Have to load a list of JSON files, but the column has special characters (ex: {"hydra:xxxx": {"hydra:value":"yyyy", "hydra:value1":"zzzzz"}})

Azure DBR - Have to load a list of JSON files into a DataFrame and then from the DataFrame into a Databricks table, but the column has special characters and we are getting the error below. Both the column (key) and the value (as a JSON record) have special characters in the JSON file. # Can...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 8 kudos

The best approach is just to define the schema manually, as sketched below. There is a nice article from someone who had exactly the same problem: https://towardsdev.com/create-a-spark-hive-meta-store-table-using-nested-json-with-invalid-field-names-505f215eb5bf
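In outline, with the hydra:* field names taken from the post and the types assumed: define the schema explicitly, then select the nested fields out under metastore-safe aliases, since metastore column names are typically restricted to letters, digits, and underscores:

    from pyspark.sql.functions import col
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("hydra:xxxx", StructType([
            StructField("hydra:value",  StringType(), True),
            StructField("hydra:value1", StringType(), True),
        ]), True),
    ])

    df = spark.read.schema(schema).json("/mnt/raw/*.json")

    # Backticks let us reference the colon-containing names; aliases make them safe.
    clean = df.select(
        col("`hydra:xxxx`.`hydra:value`").alias("hydra_value"),
        col("`hydra:xxxx`.`hydra:value1`").alias("hydra_value1"),
    )
    clean.write.format("delta").saveAsTable("target_table")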

3 More Replies