Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Mado
by Valued Contributor II
  • 14631 Views
  • 2 replies
  • 6 kudos

Resolved! Difference between "spark.table" & "spark.read.table"?

Hi, I want to create a PySpark DataFrame from a table. I would like to ask about the difference between the following commands: spark.read.table(TableName) and spark.table(TableName). Both return a PySpark DataFrame and look similar. Thanks.

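For reference, a minimal sketch of the two calls side by side (the table name is a placeholder); as far as the public API goes, both resolve the same table and return an equivalent DataFrame:

```python
# Both calls look up the named table through the session catalog and return a DataFrame.
# "my_table" is a placeholder table name.
df1 = spark.read.table("my_table")
df2 = spark.table("my_table")

# Same schema and contents either way.
df1.printSchema()
df2.printSchema()
```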
Latest Reply
Mado
Valued Contributor II
  • 6 kudos

Hi @Kaniz Fatma, I selected the answer from @Kedar Deshpande as the best answer.

1 More Replies
RamaSantosh
by New Contributor II
  • 3737 Views
  • 2 replies
  • 3 kudos

Data load from Azure databricks dataframe to cosmos db container

I am trying to load data from an Azure Databricks dataframe into a Cosmos DB container using the command below: cfg = { "spark.cosmos.accountEndpoint" : cosmosEndpoint, "spark.cosmos.accountKey" : cosmosMasterKey, "spark.cosmos.database" : cosmosDatabaseName, "sp...

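For context, a write with the Azure Cosmos DB Spark 3 OLTP connector typically looks roughly like the sketch below; every value is a placeholder and the exact options depend on the connector version installed on the cluster:

```python
# Placeholder connection details for the Cosmos DB account and container.
cosmosEndpoint = "https://<account>.documents.azure.com:443/"
cosmosMasterKey = "<account-key>"
cosmosDatabaseName = "<database>"
cosmosContainerName = "<container>"

cfg = {
    "spark.cosmos.accountEndpoint": cosmosEndpoint,
    "spark.cosmos.accountKey": cosmosMasterKey,
    "spark.cosmos.database": cosmosDatabaseName,
    "spark.cosmos.container": cosmosContainerName,
}

# Append the dataframe's rows as items in the target container.
(df.write
   .format("cosmos.oltp")
   .options(**cfg)
   .mode("APPEND")
   .save())
```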
Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hey @Rama Santosh Ravada, hope all is well! Just wanted to check in on whether you were able to resolve your issue, and if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from y...

1 More Replies
anonymous1
by New Contributor III
  • 6584 Views
  • 7 replies
  • 5 kudos

How to implement Source to Target ETL Mapping sheet in PySpark using Delta tables

Schema design: Source: multiple CSV files (SourceFile1, SourceFile2). Target: Delta table (Target_Table). Excel file: ETL_Mapping_Sheet. File columns: SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic. The MappingLogic column cont...

Latest Reply
AmanSehgal
Honored Contributor III
  • 5 kudos

Following on from @Werner Stinckens' response, it would be good if you could give an example. Ideally you can read each row from the Excel file in Python and pass each column as a parameter to a function, e.g. def apply_mapping_logic(SourceTable, SourceColumn, ...

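To make that concrete, here is a minimal sketch of the idea; the sheet path, column names, and the assumption that MappingLogic holds a SQL expression are all hypothetical:

```python
import pandas as pd
from pyspark.sql import functions as F

# Hypothetical mapping sheet with columns: SourceTable, SourceColumn,
# TargetTable, TargetColumn, MappingLogic (a SQL expression as text).
mapping = pd.read_excel("/dbfs/FileStore/ETL_Mapping_Sheet.xlsx")

def apply_mapping_logic(source_table, source_column, target_table, target_column, mapping_logic):
    # Apply the row's transformation expression to the source table.
    src_df = spark.table(source_table)
    return src_df.select(F.expr(mapping_logic).alias(target_column))

for row in mapping.itertuples(index=False):
    mapped_df = apply_mapping_logic(
        row.SourceTable, row.SourceColumn, row.TargetTable, row.TargetColumn, row.MappingLogic
    )
    # Land each mapped column set in the Delta target, e.g.:
    # mapped_df.write.format("delta").mode("append").saveAsTable(row.TargetTable)
```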
6 More Replies
KNP
by New Contributor
  • 2779 Views
  • 2 replies
  • 0 kudos

passing array as a parameter to PandasUDF

Hi Team, my Python dataframe is as below. The raw data is quite a long series of approx. 5,000 numbers. My requirement is to go through each row in the RawData column and calculate 2 metrics. I have created a function in Python and it works absolutely fine. ...

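For illustration, a minimal sketch of a pandas UDF that consumes an array column and returns one metric per row; the column name comes from the question, the metric itself is a placeholder:

```python
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Each element of the input pd.Series is one row's array (~5000 numbers).
@pandas_udf(DoubleType())
def metric_one(raw_data: pd.Series) -> pd.Series:
    # Placeholder metric: the mean of the series.
    return raw_data.apply(lambda arr: float(np.mean(arr)))

# df = df.withColumn("metric1", metric_one("RawData"))
```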
Latest Reply
Vidula
Honored Contributor
  • 0 kudos

Hello @Kausthub NP, hope all is well! Just wanted to check in on whether you were able to resolve your issue, and if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Tha...

1 More Replies
KumarShiv
by New Contributor III
  • 1800 Views
  • 2 replies
  • 2 kudos

Resolved! Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.

I am trying to get percentile values on different splits, but I found that the result of the Databricks PERCENTILE_DISC() function is not accurate. I have run the same query on MS SQL but am getting a different result set. Here are both result sets for PySpark ...

Latest Reply
artsheiko
Honored Contributor
  • 2 kudos

The reason might be that in SQL, PERCENTILE_DISC is nondeterministic.

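One way to narrow this down is to pin the query on the Databricks side and compare group by group; a sketch, assuming a runtime where percentile_disc supports the WITHIN GROUP syntax (table and column names are placeholders):

```python
# Hypothetical table scores(split, value). PERCENTILE_DISC returns an actual
# value from the group, so ties and sort order can legitimately produce a
# different pick than another engine's implementation.
spark.sql("""
    SELECT split,
           percentile_disc(0.5) WITHIN GROUP (ORDER BY value) AS median_value
    FROM scores
    GROUP BY split
""").show()
```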
1 More Replies
NathanLaw
by New Contributor III
  • 4151 Views
  • 5 replies
  • 1 kudos

Model Training Data Adapter Error.

We are converting a PySpark dataframe to TensorFlow using Petastorm and have encountered a "data adapter" error. What do you recommend for diagnosing and fixing this error? https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/...

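For reference, the usual Petastorm-on-Databricks conversion flow looks roughly like the sketch below; the cache path, batch size, and the `features`/`label` column names are assumptions, and inspecting the dtypes of the converted dataset is a reasonable first step when TensorFlow complains about a data adapter:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Petastorm materializes the dataframe as Parquet under this cache directory.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

converter = make_spark_converter(df)   # df: the PySpark dataframe to train on

with converter.make_tf_dataset(batch_size=64) as tf_dataset:
    # Batches arrive as named tuples; map them to the (features, label)
    # structure that model.fit expects (column names are assumptions here).
    dataset = tf_dataset.map(lambda batch: (batch.features, batch.label))
    # model.fit(dataset, steps_per_epoch=..., epochs=...)
```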
Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hey @Nathan Law, thank you so much for getting back to us. We will await your response. We really appreciate your time.

4 More Replies
Eyespoop
by New Contributor II
  • 17769 Views
  • 3 replies
  • 2 kudos

Resolved! PySpark: Writing Parquet Files to the Azure Blob Storage Container

Currently I am having some issues with writing a parquet file to the storage container. I do have the code running, but whenever the dataframe writer puts the parquet into blob storage, instead of a parquet file it is created as a f...

Latest Reply
User16764241763
Honored Contributor
  • 2 kudos

Hello @Karl Saycon, can you try setting this config to prevent additional parquet summary and metadata files from being written? The result of the dataframe write to storage should then be a single file. https://community.databricks.com/s/question/0D53f00001...

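If the goal is literally one parquet file rather than a folder of part files, a common workaround (separate from the summary/metadata config mentioned above) is to reduce the dataframe to a single partition before writing; a sketch with a placeholder path:

```python
# Spark always writes a directory; coalescing to one partition at least keeps
# the output to a single part-*.parquet file inside that directory.
(df.coalesce(1)
   .write
   .mode("overwrite")
   .parquet("wasbs://<container>@<account>.blob.core.windows.net/output/my_data"))
```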
2 More Replies
SusuTheSeeker
by New Contributor III
  • 3748 Views
  • 7 replies
  • 3 kudos

Kernel switches to unknown using pyspark

I am working in JupyterHub in a notebook. I am using a PySpark dataframe for analyzing text; more precisely, I am doing sentiment analysis of newspaper articles. The code works until I get to some point where the kernel is busy, and after approximately...

Latest Reply
-werners-
Esteemed Contributor III
  • 3 kudos

Do you actually run the code on a distributed environment (meaning a driver and multiple workers)? If not, there is no point in using PySpark, as all code will be executed locally.

6 More Replies
RRO
by Contributor
  • 29040 Views
  • 6 replies
  • 7 kudos

Resolved! Performance for pyspark dataframe is very slow after using a @pandas_udf

Hello, I am currently working on time series forecasting with FBProphet. Since I have data with many time series groups (~3000), I use a @pandas_udf to parallelize the training: @pandas_udf(schema, PandasUDFType.GROUPED_MAP) def forecast_netprofit(pr...

Latest Reply
RRO
Contributor
  • 7 kudos

Thank you for the answers. Unfortunately this did not solve the performance issue. What I did now is save the results into a table: results.write.mode("overwrite").saveAsTable("db.results"). This is probably not the best solution, but after I do that ...

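For what it's worth, a sketch of the same grouped-map pattern using the newer applyInPandas API; the group key, output schema, and Prophet fitting details are placeholders:

```python
import pandas as pd

# Hypothetical output schema; it must match what forecast_netprofit returns.
result_schema = "group_id string, ds timestamp, yhat double"

def forecast_netprofit(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: fit one Prophet model on this group's history and return
    # a frame with exactly the columns declared in result_schema.
    return pdf[["group_id", "ds", "yhat"]]

results = df.groupBy("group_id").applyInPandas(forecast_netprofit, schema=result_schema)

# Persisting the result (as in the reply above) also forces the UDF to run
# once, so later reads hit the table instead of recomputing the forecasts.
results.write.mode("overwrite").saveAsTable("db.results")
```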
5 More Replies
Abeeya
by New Contributor II
  • 5009 Views
  • 1 replies
  • 5 kudos

Resolved! How to overwrite using PySpark's JDBC without losing constraints on table columns

Hello, my table has a primary key constraint on a particular column, and I'm losing the primary key constraint on that column each time I overwrite the table. What can I do to preserve it? Any heads-up would be appreciated. Tried below: df.write.option("truncate", ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 5 kudos

@Abeeya, mode "truncate" is correct to preserve the table. However, when you want to add a new column (mismatched schema), it wants to drop the table anyway.

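For reference, the truncate path that the reply refers to looks roughly like this sketch; connection details are placeholders, and it only preserves constraints while the dataframe schema still matches the existing table:

```python
# Overwrite by truncating the existing table instead of dropping and
# recreating it, so constraints defined in the database survive.
(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")  # placeholder
   .option("dbtable", "dbo.my_table")                                # placeholder
   .option("user", "<user>")
   .option("password", "<password>")
   .option("truncate", "true")
   .mode("overwrite")
   .save())
```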
SailajaB
by Valued Contributor III
  • 7060 Views
  • 12 replies
  • 4 kudos

Resolved! JSON validation fails after writing PySpark dataframe to JSON format

Hi, we have to convert a transformed dataframe to JSON format, so we used write with the json format on top of the final dataframe to convert it to JSON. But when we validate the output JSON, it is not in proper JSON format. Could you please provide your suggestio...

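One detail worth noting when validating the output: Spark's JSON writer produces newline-delimited JSON (one object per line) rather than a single JSON array, so generic validators may flag it even though each line is valid JSON; a sketch with a placeholder path:

```python
# Each part file in the output directory contains one JSON object per line
# (JSON Lines), not a top-level array wrapping all records.
df.write.mode("overwrite").json("/mnt/output/final_json")

# Reading it back with Spark confirms the records round-trip correctly.
spark.read.json("/mnt/output/final_json").show(truncate=False)
```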
Latest Reply
Anonymous
Not applicable
  • 4 kudos

@Sailaja B - does @Aman Sehgal's most recent answer help solve the problem? If it does, would you be happy to mark their answer as best?

11 More Replies
hare
by New Contributor III
  • 2447 Views
  • 4 replies
  • 8 kudos

Azure DBR - Have to load a list of JSON files but the column has special characters (ex: {"hydra:xxxx": {"hydra:value":"yyyy", "hydra:value1":"zzzzz"}})

Azure DBR - Have to load a list of JSON files into a dataframe and then from the dataframe into a Databricks table, but the column has special characters and I am getting the below error. Both the column (key) and the value (as a JSON record) have special characters in the JSON file. # Can...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 8 kudos

The best approach is to just define the schema manually. There is a nice article from someone who had exactly the same problem: https://towardsdev.com/create-a-spark-hive-meta-store-table-using-nested-json-with-invalid-field-names-505f215eb5bf

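As an illustration of defining the schema manually, a sketch for the "hydra:"-prefixed fields from the question; the path and the flattened column names are placeholders, and the rename step is one way to avoid the invalid-field-name problem when saving to a table:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema for the nested "hydra:" fields shown in the question.
schema = StructType([
    StructField("hydra:xxxx", StructType([
        StructField("hydra:value", StringType(), True),
        StructField("hydra:value1", StringType(), True),
    ]), True),
])

raw = spark.read.schema(schema).json("/mnt/raw/hydra_files/*.json")  # placeholder path

# Colons are awkward in table column names, so flatten and rename before saving.
clean = raw.select(
    F.col("`hydra:xxxx`.`hydra:value`").alias("hydra_value"),
    F.col("`hydra:xxxx`.`hydra:value1`").alias("hydra_value1"),
)
# clean.write.format("delta").saveAsTable("my_db.hydra_table")       # placeholder
```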
3 More Replies
DarshilDesai
by New Contributor II
  • 12504 Views
  • 1 replies
  • 3 kudos

Resolved! How to Efficiently Read Nested JSON in PySpark?

I am having trouble efficiently reading and parsing a large number of stream files in PySpark! Context: here is the schema of the stream file that I am reading in JSON. Blank spaces are edits for confidentiality purposes. root |-- location_info: ar...

Latest Reply
Chris_Shehu
Valued Contributor III
  • 3 kudos

I'm interested in seeing what others have come up with. Currently I'm using json_normalize(), then taking any additional nested statements and using a loop to pull them out and re-combine them.

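A common pure-PySpark alternative is to keep the data in Spark and flatten it with explode plus nested field selection; a sketch that borrows the `location_info` array from the question, with a placeholder path and placeholder inner field names:

```python
from pyspark.sql import functions as F

# Read the JSON stream files, then flatten the nested array column.
df = spark.read.json("/mnt/raw/stream_files/*.json")        # placeholder path

flat = (df
        .withColumn("loc", F.explode("location_info"))      # one row per array element
        .select(
            F.col("loc.latitude").alias("latitude"),        # placeholder inner fields
            F.col("loc.longitude").alias("longitude"),
        ))
flat.show(truncate=False)
```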