Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

hare
by New Contributor III
  • 3299 Views
  • 4 replies
  • 3 kudos

Implementation of Late arriving dimension in databricks

Hi Team, can you please suggest how to implement a late arriving dimension or early arriving fact, with examples or a sample script for reference? I have to implement this using PySpark. Thanks.

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hi @Hare Krishnan, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Than...

3 More Replies
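A minimal PySpark sketch of one common way to handle the early-arriving fact / late-arriving dimension asked about above: insert an inferred ("unknown") dimension row for fact keys that have no dimension match yet, then update it when the real record arrives. All table, column, and sample values below are assumed for illustration, not taken from the thread.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data: fact rows referencing customer 42, whose dimension row has not arrived yet.
facts = spark.createDataFrame([(1, 10, 99.0), (2, 42, 15.5)], ["sale_id", "customer_id", "amount"])
dim = spark.createDataFrame([(10, "Acme")], ["customer_id", "customer_name"])

# 1) Find fact keys with no matching dimension row (the early-arriving fact).
missing = facts.join(dim, "customer_id", "left_anti").select("customer_id").distinct()

# 2) Add inferred placeholder dimension rows so the fact load can complete.
placeholders = (missing
    .withColumn("customer_name", F.lit("UNKNOWN"))
    .withColumn("is_inferred", F.lit(True)))
dim = dim.withColumn("is_inferred", F.lit(False)).unionByName(placeholders)
dim.show()

# 3) When the late dimension row finally arrives, update the inferred record
#    (on a Delta table this is typically a MERGE with whenMatchedUpdateAll).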
zak
by New Contributor II
  • 4296 Views
  • 1 replies
  • 1 kudos

add custom metadata to avro file with pyspark

Hello, I need to add custom metadata to an Avro file. The Avro file contains data. We have tried to use "option" within the write function, but it is not applied and no error is generated. df.write.format("avro").option("avro.codec", "snappy").option...

chanansh
by Contributor
  • 1573 Views
  • 1 replies
  • 0 kudos

QueryExecutionListener cannot be found in pyspark

According to the documentation, you can monitor a Spark Structured Streaming job using QueryExecutionListener. However, I cannot find it in PySpark. https://docs.databricks.com/structured-streaming/stream-monitoring.html#language-python

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Which DBR version are you using? Also, can you share a code snippet showing how you are using the QueryExecutionListener?

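For the monitoring question above, a hedged sketch from Python: recent Spark releases (3.4+ and the matching DBR versions) expose StreamingQueryListener in PySpark, while QueryExecutionListener itself remains a JVM-side API. The listener below only prints progress and is illustrative, not the setup from the thread.

from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.getOrCreate()

class SimpleListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"stream started: {event.id}")

    def onQueryProgress(self, event):
        # event.progress carries input rate, batch duration, and other metrics as JSON.
        print(f"progress: {event.progress.json}")

    def onQueryTerminated(self, event):
        print(f"stream terminated: {event.id}")

spark.streams.addListener(SimpleListener())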
lambarc
by New Contributor II
  • 14062 Views
  • 7 replies
  • 13 kudos

How to read file in pyspark with “]|[” delimiter

The data looks like this: pageId]|[page]|[Position]|[sysId]|[carId 0005]|[bmw]|[south]|[AD6]|[OP4 There are at least 50 columns and millions of rows. I tried to use the code below to read it: dff = sqlContext.read.format("com.databricks.spark.csv").option...

Latest Reply
rohit199912
New Contributor II
  • 13 kudos

You might also try the options below. 1) Use a different file format: you can try a file format that supports multi-character delimiters, such as text or JSON. 2) Use a custom Row class: you can write a custom Row class to parse the multi-...

6 More Replies
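A hedged PySpark sketch of one way to read the "]|[" delimited file from the question: load it as plain text and split each line on the escaped literal delimiter. The path is a placeholder, and the column names follow the sample header in the question.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.text("/path/to/input.txt")  # placeholder path

# Split on the literal "]|[" (escaped for the regex engine) and expand into named columns.
cols = ["pageId", "page", "Position", "sysId", "carId"]
parts = F.split(F.col("value"), r"\]\|\[")
df = raw.select(*[parts.getItem(i).alias(c) for i, c in enumerate(cols)])

# Drop the header row if the file contains one.
df = df.filter(F.col("pageId") != "pageId")
df.show()

On Spark 3.x the CSV reader also accepts a multi-character sep, so spark.read.option("sep", "]|[").option("header", True).csv(path) may work directly on newer runtimes.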
Callum
by New Contributor II
  • 12705 Views
  • 3 replies
  • 2 kudos

Pyspark Pandas column or index name appears to persist after being dropped or removed.

So, I have this code for merging dataframes with pyspark.pandas, and I want the index of the left dataframe to persist throughout the joins. Following suggestions from others wanting to keep the index after merging, I set the index to a column bef...

Latest Reply
Serlal
New Contributor III
  • 2 kudos

Hi! I tried debugging your code, and I think the error you get is simply because the column exists in two instances of your dataframe within your loop. I tried adding some extra debug lines to your merge_dataframes function, and after executing that...

2 More Replies
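A minimal pandas-on-Spark sketch of the keep-the-index-through-a-merge pattern discussed in this thread: move the index into a regular column, merge on that column, then set the index back. The frame and column names are assumed for illustration.

import pyspark.pandas as ps

left = ps.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]}).set_index("key")
right = ps.DataFrame({"key": [1, 2], "b": [10, 20]})

# Turn the index into a column so it survives the merge, then restore it afterwards.
merged = (left.reset_index()
              .merge(right, on="key", how="left")
              .set_index("key"))
print(merged.sort_index().to_string())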
BF
by New Contributor II
  • 6581 Views
  • 3 replies
  • 2 kudos

Resolved! Pyspark - How do I convert date/timestamp of format like /Date(1593786688000+0200)/ in pyspark?

Hi all, I have a dataframe with a CreateDate column in this format: CreateDate /Date(1593786688000+0200)/ /Date(1446032157000+0100)/ /Date(1533904635000+0200)/ /Date(1447839805000+0100)/ /Date(1589451249000+0200)/ and I want to convert that format to date/tim...

Latest Reply
Chaitanya_Raju
Honored Contributor
  • 2 kudos

Hi @Bruno Franco, can you please try the code below? Hope it works for you. from pyspark.sql.functions import from_unixtime from pyspark.sql import functions as F final_df = df_src.withColumn("Final_Timestamp", from_unixtime((F.regexp_extract(col("Cr...

2 More Replies
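A runnable sketch of the approach in the reply above: pull the millisecond epoch out of the /Date(...)/ string with regexp_extract, convert it to seconds, and cast via from_unixtime. The sample values come from the question; the +0200/+0100 offset is ignored in this sketch.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df_src = spark.createDataFrame(
    [("/Date(1593786688000+0200)/",), ("/Date(1446032157000+0100)/",)],
    ["CreateDate"],
)

# Extract the millisecond epoch, convert to seconds, then to a timestamp.
epoch_seconds = (F.regexp_extract(F.col("CreateDate"), r"\((\d+)\+", 1).cast("long") / 1000).cast("long")
final_df = df_src.withColumn("Final_Timestamp", F.from_unixtime(epoch_seconds).cast("timestamp"))
final_df.show(truncate=False)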
rammy
by Contributor III
  • 2734 Views
  • 2 replies
  • 3 kudos

How can we save a data frame in Docx format using pyspark?

I am trying to save a data frame to a document, but it fails with the error below: java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.htm #f_d...

Latest Reply
jose_gonzalez
Databricks Employee
  • 3 kudos

Hi, you cannot do this from PySpark, but you can try using Pandas to save to Excel. There is no Docx data source.

1 More Replies
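A small sketch of the workaround from the reply: collect the Spark dataframe to pandas on the driver and write it from there. Excel output needs the openpyxl package, and a .docx writer (for example python-docx) would be a separate, non-Spark step. The paths are placeholders and df is assumed to be the dataframe from the question.

# df is assumed to be the Spark dataframe from the question.
pdf = df.toPandas()  # collects to the driver; only suitable for data that fits in driver memory

# Excel output (requires openpyxl on the driver).
pdf.to_excel("/dbfs/tmp/output.xlsx", index=False)

# Plain CSV works without extra packages.
pdf.to_csv("/dbfs/tmp/output.csv", index=False)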
Ancil
by Contributor II
  • 3059 Views
  • 3 replies
  • 1 kudos

Resolved! PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.

I have a pandas_udf; it works for 1 row, but when I try it with more than one row I get the error below. PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output w...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

I was testing, and your function is correct. So there must be an error in the inputData type (it is all strings) or with result_json. Please also check the runtime version; I was using 11 LTS.

2 More Replies
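For reference, a minimal Scalar Iterator pandas UDF that satisfies the length contract in the error message: every batch it yields has exactly the same length as the input batch it came from. The function and column names are illustrative.

from typing import Iterator
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("long")
def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        # One output Series per input batch, same length as the input.
        yield batch + 1

spark.range(0, 4).select(plus_one("id")).show()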
KrishZ
by Contributor
  • 17781 Views
  • 3 replies
  • 3 kudos

[Pyspark.Pandas] PicklingError: Could not serialize object (this error is happening only for large datasets)

Context: I am using pyspark.pandas in a Databricks Jupyter notebook and doing some text manipulation within the dataframe. pyspark.pandas is the pandas API on Spark and can be used exactly the same as regular pandas. Error: PicklingError: Could not seria...

Latest Reply
ryojikn
New Contributor III
  • 3 kudos

@Krishna Zanwar, I'm receiving the same error. For me, the behavior occurs when trying to broadcast a random forest (sklearn 1.2.0) recently loaded from mlflow and using a Pandas UDF to run model predictions. However, the same code works perfectly on Spark 2....

2 More Replies
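A hedged sketch of the pattern being discussed: broadcast a fitted scikit-learn model and call it from a pandas UDF. The toy model and column names are assumptions for illustration (in the thread the model was loaded from mlflow); very large models can still exceed serialization limits, which is where this PicklingError tends to appear.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Illustrative model; in the thread it was loaded from mlflow instead.
model = RandomForestClassifier(n_estimators=10).fit([[0.0], [1.0]], [0, 1])
bc_model = spark.sparkContext.broadcast(model)

@pandas_udf("long")
def predict(feature: pd.Series) -> pd.Series:
    preds = bc_model.value.predict(feature.to_numpy().reshape(-1, 1))
    return pd.Series(preds)

df = spark.createDataFrame([(0.1,), (0.9,)], ["feature"])
df.select(predict("feature")).show()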
lmcglone
by New Contributor II
  • 6702 Views
  • 2 replies
  • 3 kudos

Comparing 2 dataframes and create columns from values within a dataframe

Hi, I have a dataframe that has name and company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["company","name"]
data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

You need to join and pivot:
(df.join(df2, on=[df.company == df2.job_company])
   .groupBy("company", "name")
   .pivot("job_company")
   .count())

1 More Replies
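A self-contained version of the join-and-pivot suggestion from the reply above; the second dataframe's shape (a job_company column) is assumed for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("company1", "Jon"), ("company2", "Steve"), ("company1", "Anna")],
    ["company", "name"],
)
df2 = spark.createDataFrame([("company1",), ("company2",)], ["job_company"])

result = (df.join(df2, on=[df.company == df2.job_company])
            .groupBy("company", "name")
            .pivot("job_company")
            .count())
result.show()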
shamly
by New Contributor III
  • 3877 Views
  • 4 replies
  • 2 kudos

How to replace LF with ' ' in a UTF-16 encoded CSV?

I have tried several snippets and nothing worked. An extra space or LF pushes content onto the next row in my output. Rows should end in CRLF, but some rows end in LF, and when reading the CSV this does not give the correct output. My CSV has a double dagger as d...

Latest Reply
sher
Valued Contributor II
  • 2 kudos

Try this:
val df = spark.read.format("csv")
  .option("header", true)
  .option("sep", "||")
  .load("file load")
display(df)

3 More Replies
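One hedged, driver-side way to approach the stray-LF problem from the question (only suitable when the file fits in driver memory): decode the UTF-16 file yourself, keep CRLF as the real row boundary, replace lone LFs with a space, and only then split on the delimiter. The path and the double-dagger delimiter are assumptions based on the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/dbfs/tmp/input.csv"  # placeholder path

# newline="" keeps the original CRLF/LF endings instead of normalising them.
with open(path, encoding="utf-16", newline="") as f:
    text = f.read()

# Protect real CRLF row boundaries, replace stray LFs with a space, then restore the boundaries.
text = text.replace("\r\n", "\x00").replace("\n", " ").replace("\x00", "\n")

rows = [line.split("‡") for line in text.splitlines() if line]  # "‡" delimiter assumed
header, records = rows[0], rows[1:]
df = spark.createDataFrame(records, header)
df.show()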
pjp94
by Contributor
  • 8845 Views
  • 9 replies
  • 7 kudos

Calling a python function (def) in databricks

Not sure if I'm missing something here, but running a task outside of a Python function runs much quicker than executing the same task inside a function. Is there something I'm missing with how Spark handles functions? 1) def task(x): y = dostuf...

Latest Reply
sher
Valued Contributor II
  • 7 kudos

Don't use a plain Python function; use a UDF in PySpark so that it will be faster.

8 More Replies
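For reference, a small sketch of the distinction behind the reply above: a plain Python def runs on the driver, a UDF runs distributed on the executors, and for simple transformations a built-in column expression is usually faster than either. Names and data are illustrative.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000)

# Plain Python function: runs on the driver and is not distributed by itself.
def add_one(x):
    return x + 1

# Wrapped as a UDF it runs on the executors, row by row.
add_one_udf = udf(add_one, LongType())
df.select(add_one_udf("id").alias("plus_one_udf")).show(3)

# A built-in column expression avoids Python serialization entirely.
df.select((F.col("id") + 1).alias("plus_one_builtin")).show(3)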
Rishabh-Pandey
by Esteemed Contributor
  • 1224 Views
  • 1 replies
  • 5 kudos

Privileges. SELECT: gives read access to an object. CREATE: gives ability to create an object (for example, a table in a schema). MODIFY: gives ability to...

Privileges. SELECT: gives read access to an object. CREATE: gives ability to create an object (for example, a table in a schema). MODIFY: gives ability to add, delete, and modify data to or from an object. USAGE: does not give any abilities, but is an add...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 5 kudos

thanks sir

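An illustrative way to apply the privileges described above from a notebook; the schema, table, and group names are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SELECT: read access on a table; USAGE: prerequisite on the containing schema.
spark.sql("GRANT USAGE ON SCHEMA sales TO `data_readers`")
spark.sql("GRANT SELECT ON TABLE sales.orders TO `data_readers`")

# MODIFY: add, delete, and change data in the object.
spark.sql("GRANT MODIFY ON TABLE sales.orders TO `data_engineers`")

# CREATE: create new objects (for example, tables) in the schema.
spark.sql("GRANT CREATE ON SCHEMA sales TO `data_engineers`")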
Rishabh-Pandey
by Esteemed Contributor
  • 3232 Views
  • 0 replies
  • 6 kudos

To connect Delta Lake with Microsoft Excel, you can use the Microsoft Power Query for Excel add-in. Power Query is a data connection tool that allows ...

To connect Delta Lake with Microsoft Excel, you can use the Microsoft Power Query for Excel add-in. Power Query is a data connection tool that allows you to connect to various data sources, including Delta Lake. Here's how to do it: Install the Micros...

MC006
by New Contributor III
  • 7216 Views
  • 4 replies
  • 2 kudos

Resolved! java.lang.NoSuchMethodError after upgrade to Databricks Runtime 11.3 LTS

Hi, I am using Databricks and want to upgrade to Databricks Runtime version 11.3 LTS, which now uses Spark 3.3. Current system environment: Operating System: Ubuntu 20.04.4 LTS; Java: Zulu 8.56.0.21-CA-linux64; Python: 3.8.10; Delta Lake: 1.1.0. Target system ...

Latest Reply
Meghala
Valued Contributor II
  • 2 kudos

Hi everyone, this information helped me, thanks.

3 More Replies