Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

twotwoiscute
by New Contributor
  • 1476 Views
  • 1 reply
  • 0 kudos

PySpark pandas_udf slower than single thread

I used @pandas_udf to write a function to speed up the process (parsing XML files) and then compared its speed with a single thread. Surprisingly, using @pandas_udf is two times slower than the single-threaded code. And the number of XML files I need to p...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @twotwoiscute! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the forum have an answer to your question first. Otherwise I will follow up shortly with a response...

HarisKhan
by New Contributor
  • 10270 Views
  • 2 replies
  • 0 kudos

Escape Backslash(/) while writing spark dataframe into csv

I am using Spark version 2.4.0. I know that backslash is the default escape character in Spark, but I am still facing the issue below. I am reading a CSV file into a Spark dataframe (using PySpark) and writing the dataframe back to CSV. I have so...

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

I'm confused - you say the escape is backslash, but you show forward slashes in your data. Don't you want the escape to be forward slash?

1 More Reply
Anbazhagananbut
by New Contributor II
  • 3202 Views
  • 2 replies
  • 0 kudos

Pyspark Convert Struct Type to Map Type

Hello, could you please advise on the scenario below, in PySpark 2.4.3 on Databricks, to load the data into the Delta table? I want to load the dataframe with this column "data" into the table as MapType in the Databricks Spark Delta table. Could you ...

Latest Reply
sherryellis
New Contributor II
  • 0 kudos

You can do it by making an API request to /api/2.0/clusters/permanent-delete; I don't see an option to delete or edit an automated cluster from the UI.

1 More Reply
Anbazhagananbut
by New Contributor II
  • 6658 Views
  • 1 reply
  • 0 kudos

Get Size of a column in Bytes for a Pyspark Data frame

Hello all, I have a column in a dataframe which is a struct type. I want to find the size of the column in bytes; it is failing while loading into Snowflake. I can see size functions available to get the length. How to calculate the size in bytes fo...

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

There isn't one size for a column; it takes some amount of bytes in memory, but potentially a different amount when serialized on disk or stored in Parquet. You can work out the size in memory from its data type; an array of 100 bytes takes 100 byte...

Anbazhagananbut
by New Contributor II
  • 9415 Views
  • 1 reply
  • 1 kudos

How to handle Blank values in Array of struct elements in pyspark

Hello all, we have data in a column in a PySpark dataframe with an array-of-struct type having multiple nested fields. If the value is not blank, it will save the data in the same array-of-struct type in the Spark Delta table. Please advise on the bel...

Latest Reply
shyam_9
Valued Contributor
  • 1 kudos

Hi @Anbazhagan anbutech17, can you please try the approaches in the answers below: https://stackoverflow.com/questions/56942683/how-to-add-null-columns-to-complex-array-struct-in-spark-with-a-udf

RohiniMathur
by New Contributor II
  • 14286 Views
  • 1 reply
  • 0 kudos

Resolved! Length Value of a column in pyspark

Hello, I am using PySpark 2.12. After creating a dataframe, can we measure the length value for each row? For example, I am measuring the length of a value in column 2.

Input file:
|TYCO|1303|
|EMC |120989|
|VOLVO|102329|
|BMW|130157|
|FORD|004|

Output ...

Latest Reply
lee
Contributor
  • 0 kudos

You can use the length function for this:

from pyspark.sql.functions import length
mock_data = [('TYCO', '1303'), ('EMC', '120989'), ('VOLVO', '102329'), ('BMW', '130157'), ('FORD', '004')]
df = spark.createDataFrame(mock_data, ['col1', 'col2'])
df2 = d...

RohiniMathur
by New Contributor II
  • 20169 Views
  • 4 replies
  • 0 kudos

Removing non-ascii and special character in pyspark

I am running Spark 2.4.4 with Python 2.7, and my IDE is PyCharm. The input file (.csv) contains encoded values in some columns, like those given below.

File data:
COL1,COL2,COL3,COL4
CM, 503004, (d$όνυ$F|'.h*Λ!ψμ=(.ξ; ,.ʽ|!3-2-704

The output I am trying ...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Rohini Mathur, use the code below on the column containing non-ASCII and special characters:

df['column_name'].str.encode('ascii', 'ignore').str.decode('ascii')

3 More Replies
siddhu308
by New Contributor II
  • 5712 Views
  • 2 replies
  • 0 kudos

column wise sum in PySpark dataframe

I have a dataframe of 18,000,000 rows and 1322 columns with '0' and '1' values. I want to find how many '1's are in every column. Below is the dataset: se_00001 se_00007 se_00036 se_00100 se_0010p se_00250

Latest Reply
mathan_pillai
Valued Contributor
  • 0 kudos

Hi Siddhu, you can use df.select(sum("col1"), sum("col2"), sum("col3")), where col1, col2, col3 are the column names for which you would like to find the sum. Please let us know if this answers your question. Thanks!

1 More Reply
srchella
by New Contributor
  • 2748 Views
  • 1 reply
  • 0 kudos

How to take distinct of multiple columns (more than 2 columns) in a pyspark dataframe?

I have 10+ columns and want to take distinct rows with multiple columns taken into consideration. How can I achieve this using PySpark dataframe functions?

Latest Reply
Sandeep
Contributor III
  • 0 kudos

You can use dropDuplicates https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=distinct#pyspark.sql.DataFrame.dropDuplicates

vin007
by New Contributor
  • 6878 Views
  • 1 reply
  • 0 kudos

How to store a pyspark dataframe in S3 bucket.

I have a PySpark dataframe df containing 4 columns. How can I write this dataframe to an S3 bucket? I'm using PyCharm to execute the code, and what packages are required to be installed?

Latest Reply
AndrewSears
New Contributor III
  • 0 kudos

You shouldn't need any packages. You can mount the S3 bucket to the Databricks cluster. https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3 or this http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark...
