Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Christine
by Contributor
  • 4515 Views
  • 9 replies
  • 5 kudos

Resolved! pyspark dataframe empties after it has been saved to delta lake.

Hi, I am facing a problem that I hope to get some help understanding. I have created a function that is supposed to check whether the input data already exists in a saved Delta table and, if not, perform some calculations and append the new data to...

Latest Reply
SharathE
New Contributor II
  • 5 kudos

Hi, I'm also having a similar issue. Does creating a temp view and reading it again after saving to a table work?

8 More Replies
FarBo
by New Contributor III
  • 3057 Views
  • 4 replies
  • 5 kudos

Spark issue handling data from json when the schema DataType mismatch occurs

Hi, I have encountered a problem using Spark when creating a DataFrame from a raw JSON source. I have defined a schema for my data, and the problem is that when there is a mismatch between one of the column values and its defined schema, Spark not onl...

Latest Reply
Anonymous
Not applicable
  • 5 kudos

@Farzad Bonabi: Thank you for reporting this issue. It seems to be a known bug in Spark when dealing with malformed decimal values. When a decimal value in the input JSON data is not parseable by Spark, it sets not only that column to null but also ...

3 More Replies
jonathan-dufaul
by Valued Contributor
  • 980 Views
  • 1 reply
  • 0 kudos

How do I specify column types when writing to a MSSQL server using the JDBC driver (

I have a PySpark DataFrame that I'm writing to an on-prem MSSQL server; it's a stopgap while we convert data warehousing jobs over to Databricks. The processes that use those tables in the on-prem server rely on the tables maintaining the identical s...

Latest Reply
dasanro
New Contributor II
  • 0 kudos

It's happening to me too! Did you find any solution, @jonathan-dufaul? Thanks!

Skv
by New Contributor II
  • 1839 Views
  • 2 replies
  • 1 kudos

Resolved! Snowflake query with time travel not working from Databricks while reading into Dataframe.

I am trying to read change data from a Snowflake query into a DataFrame using Databricks. The same query works in Snowflake but not in Databricks. The time zones and the timestamp format are the same on both sides. I am trying to implement changetrackin...

Latest Reply
sher
Valued Contributor II
  • 1 kudos

Your format is wrong; that's why you got an error. Try this:
SELECT * FROM TestTable CHANGES(INFORMATION => DEFAULT) AT(TIMESTAMP => TO_TIMESTAMP_TZ('2023-05-03 00:43:34.885','YYYY-MM-DD HH24:MI:SS.FF'))

1 More Replies
Rishitha
by New Contributor III
  • 1064 Views
  • 2 replies
  • 2 kudos

Resolved! Normalizing data from autoloader

I have data on S3 and I'm using Auto Loader to load it. My JSON docs have fields that are arrays of structs. When I don't specify a schema, all the data is stored as strings; even the arrays of structs are just a blob of string, making it ...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Rishitha Reddy, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us s...

1 More Replies
kll
by New Contributor III
  • 3169 Views
  • 2 replies
  • 3 kudos

Nested struct type not supported pyspark error

I am attempting to apply a function to a pyspark DataFrame and save the API response to a new column and then parse using `json_normalize`. This works fine in pandas, however, I run into an exception with `pyspark`.  import pyspark.pandas as ps   i...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hi @Keval Shah, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers yo...

1 More Replies
frank7
by New Contributor II
  • 1751 Views
  • 2 replies
  • 1 kudos

Resolved! Is it possible to write a pyspark dataframe to a custom log table in Log Analytics workspace?

I have a PySpark DataFrame that contains information about the tables that I have in a SQL database (creation date, number of rows, etc.). Sample data: { "Day":"2023-04-28", "Environment":"dev", "DatabaseName":"default", "TableName":"discount"...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Bruno Simoes: Yes, it is possible to write a PySpark DataFrame to a custom log table in a Log Analytics workspace using the Azure Log Analytics Workspace API. Here's a high-level overview of the steps you can follow: Create an Azure Log Analytics Works...
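The reply above is cut off, so its exact steps are not known. As a rough sketch only, and assuming it refers to the Azure Monitor HTTP Data Collector API (the snippet does not confirm this), the request for a custom log table could be prepared as follows. The workspace ID, shared key, and the `TableStats` log type are hypothetical placeholders; the record uses only the fields visible in the question. The code builds the request but does not send it.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timezone
from email.utils import format_datetime


def build_signature(workspace_id, shared_key, date, content_length,
                    method="POST", content_type="application/json",
                    resource="/api/logs"):
    """Build the SharedKey authorization header value (HMAC-SHA256)."""
    string_to_sign = (
        f"{method}\n{content_length}\n{content_type}\n"
        f"x-ms-date:{date}\n{resource}"
    )
    decoded_key = base64.b64decode(shared_key)
    digest = hmac.new(decoded_key, string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    encoded = base64.b64encode(digest).decode("utf-8")
    return f"SharedKey {workspace_id}:{encoded}"


def build_request(workspace_id, shared_key, log_type, rows):
    """Return (url, headers, body) for posting rows to a custom log table."""
    body = json.dumps(rows).encode("utf-8")
    # RFC 1123 date in GMT, as the API expects in the x-ms-date header.
    rfc1123_date = format_datetime(datetime.now(timezone.utc), usegmt=True)
    headers = {
        "Content-Type": "application/json",
        "Authorization": build_signature(workspace_id, shared_key,
                                         rfc1123_date, len(body)),
        "Log-Type": log_type,  # name of the custom log table (assumption)
        "x-ms-date": rfc1123_date,
    }
    url = (f"https://{workspace_id}.ods.opinsights.azure.com"
           "/api/logs?api-version=2016-04-01")
    return url, headers, body


# In a notebook the rows would come from the DataFrame, for example
# rows = [row.asDict() for row in df.collect()]  (assumes a small DataFrame).
sample_rows = [{"Day": "2023-04-28", "Environment": "dev",
                "DatabaseName": "default", "TableName": "discount"}]
```

The returned tuple can be passed to any HTTP client as a POST. Collecting to the driver is only reasonable here because a table-metadata DataFrame is small; a large DataFrame would need batching.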

1 More Replies
DeviJaviya
by New Contributor II
  • 1205 Views
  • 2 replies
  • 0 kudos

Trying to build subquery in Databricks notebook, similar to SQL in a data frame with the Top(1)

Hello everyone, I am new to Databricks, so I am at the learning stage. It would be very helpful if someone could help me resolve the issue, or rather help me fix my code. I have built a query that fetches the data based on CASE; in the CASE I have a ...

Latest Reply
DeviJaviya
New Contributor II
  • 0 kudos

Hello Rishabh, thank you for your suggestion. We tried LIMIT 1, but the output values come out the same for all the dates, which is not correct.

1 More Replies
brian_0305
by New Contributor II
  • 2388 Views
  • 3 replies
  • 2 kudos

Use JDBC connect to databrick default cluster and read table into pyspark dataframe. All the column turned into same as column name

I used code like the below to connect over JDBC to a Databricks default cluster and read a table into a PySpark DataFrame:
url = 'jdbc:databricks://[workspace domain]:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=[path];AuthMech=3;UID=token;PWD=[your_ac...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@yu zhang: It looks like the issue with the first code snippet you provided is that it is not specifying the correct query to retrieve the data from your database. When using the load() method with the jdbc data source, you need to provide a SQL quer...

2 More Replies
Vindhya
by New Contributor II
  • 1226 Views
  • 2 replies
  • 0 kudos

Dataframes to Pandas conversion step is failing with exception "java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))"

The DataFrame-to-pandas conversion step is failing with the exception "java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))". PFB screenshot for more details.

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Vindhya D, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...

1 More Replies
elgeo
by Valued Contributor II
  • 4357 Views
  • 2 replies
  • 0 kudos

Resolved! Iteration - Pyspark vs Pandas

Hello. Could someone please explain why iteration over a PySpark DataFrame is way slower than over a pandas DataFrame?
PySpark:
df_list = df.collect()
for index in range(0, len(df_list)):
.....
Pandas:
df_pnd = df.toPandas()
for index, row in df_p...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @ELENI GEORGOUSI, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us ...

1 More Replies
maartenvr
by New Contributor III
  • 14491 Views
  • 9 replies
  • 2 kudos

Resolved! Unable to clear cache using a pyspark session

Hi all, I am using a persist call on a Spark DataFrame inside an application to speed up computations. The DataFrame is used throughout my application, and at the end of the application I am trying to clear the cache of the whole Spark session by calli...

Latest Reply
maartenvr
New Contributor III
  • 2 kudos

No solution yet: Hi @Suteja Kanuri, thank you for thinking along and replying! Unfortunately, I have not found a solution yet. I am getting an error that there exists no `.getCache()` method on a Spark context. Also note that I have tried to do som...

8 More Replies
Mado
by Valued Contributor II
  • 3239 Views
  • 4 replies
  • 1 kudos

Resolved! How to set properties for a delta table when I want to write a DataFrame?

Hi, I have a PySpark DataFrame with 11 million records. I created the DataFrame on a cluster. It is not saved on DBFS or a storage account.
import pyspark.sql.functions as F
from pyspark.sql.functions import col, when, floor, expr, hour, minute, to_time...

Latest Reply
Lakshay
Esteemed Contributor
  • 1 kudos

Hi @Mohammad Saber, are you getting the error while writing the file to the table, or before that?

3 More Replies
uzairm
by New Contributor III
  • 5038 Views
  • 2 replies
  • 2 kudos

Resolved! ThreadPoolExecutor in Databricks

I am using a threadpool executor and running notebooks in parallel. However, these parallel notebooks are not using executors at all and all the load is going towards the driver node resulting in running out of memory for the driver node and eventual...
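For context, the pattern described in the question can be sketched roughly as below. This is a minimal illustration, not the poster's code: `run_notebook` is a hypothetical stand-in so the sketch runs anywhere (on Databricks it would wrap `dbutils.notebook.run`), and the notebook paths and `max_workers` value are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_notebook(path, timeout_seconds=600):
    """Hypothetical stand-in for a notebook run.

    On Databricks this body would be something like:
        return dbutils.notebook.run(path, timeout_seconds)
    """
    return f"finished {path}"


def run_in_parallel(paths, max_workers=4):
    """Run each notebook path on a thread and collect results by path."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its path so failures can be attributed.
        futures = {pool.submit(run_notebook, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one notebook fails
                results[path] = f"failed: {exc}"
    return results


# Example paths are placeholders, not taken from the original post.
outcome = run_in_parallel(["/jobs/load_a", "/jobs/load_b"], max_workers=2)
```

The threads themselves live on the driver; only the Spark actions inside each notebook are distributed to executors. Notebook code that does its work in plain Python (for example after collecting to the driver) therefore adds to driver memory for every concurrent thread, so lowering `max_workers` is one way to limit that pressure.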

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @uzair mustafa, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your feedbac...

1 More Replies
raghub1
by New Contributor II
  • 4774 Views
  • 5 replies
  • 3 kudos

Resolved! Writing PySpark DataFrame onto AWS Glue throwing error

I have followed the steps in this blog: https://www.linkedin.com/pulse/aws-glue-data-catalog-metastore-databricks-deepak-rajak/ but when I call saveAsTable(table_name), it fails with IllegalArgumentException: Path must be ...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hey @Raghu Bharadwaj Tallapragada, just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you. Thanks!

4 More Replies