cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

Venkata_Krishna
by New Contributor
  • 10528 Views
  • 1 replies
  • 0 kudos

convert string dataframe column MM/dd/yyyy hh:mm:ss AM/PM to timestamp MM-dd-yyyy hh:mm:ss

How to convert string 6/3/2019 5:06:00 AM to timestamp in 24 hour format MM-dd-yyyy hh:mm:ss in python spark.

  • 10528 Views
  • 1 replies
  • 0 kudos
Latest Reply
lee
Contributor
  • 0 kudos

You would use a combination of the functions: pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss') (documentation) and pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss') (documentation)from pysp...

  • 0 kudos
MithuWagh
by New Contributor
  • 9230 Views
  • 1 replies
  • 0 kudos

How to deal with column name with .(dot) in pyspark dataframe??

We are streaming data from kafka source with json but in some column we are getting .(dot) in column names.streaming json data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

  • 9230 Views
  • 1 replies
  • 0 kudos
Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Mithu Wagh you can use backticks to enclose the column name.df.select("`col0.1`")

  • 0 kudos
KrisMusial
by New Contributor
  • 7020 Views
  • 2 replies
  • 0 kudos

Resolved! Saving to parquet with SaveMode.Overwrite throws exception

Hello, I'm trying to save DataFrame in parquet with SaveMode.Overwrite with no success. I minimized the code and reproduced the issue with the following two cells: > case class MyClass(val fld1: Integer, val fld2: Integer) > > val lst1 = sc.paralle...

  • 7020 Views
  • 2 replies
  • 0 kudos
Latest Reply
Guru421421
New Contributor II
  • 0 kudos

results.select("ValidationTable", "Results","Description","CreatedBy","ModifiedBy","CreatedDate","ModifiedDate").write.mode('overwrite').save("

  • 0 kudos
1 More Replies
NandhaKumar
by New Contributor II
  • 7201 Views
  • 3 replies
  • 0 kudos

How to specify multiple files in --py-files in spark-submit command for databricks job? All the files to be specified in --py-files present in dbfs: .

I have created a databricks in azure. I have created a cluster for python 3. I am creating a job using spark-submit parameters. How to specify multiple files in --py-files in spark-submit command for databricks job? All the files to be specified in ...

  • 7201 Views
  • 3 replies
  • 0 kudos
Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Nandha Kumar,please go through the below docs to pass python files as job,https://docs.databricks.com/dev-tools/api/latest/jobs.html#sparkpythontask

  • 0 kudos
2 More Replies
cfregly
by Contributor
  • 5343 Views
  • 4 replies
  • 0 kudos
  • 5343 Views
  • 4 replies
  • 0 kudos
Latest Reply
GeethGovindSrin
New Contributor II
  • 0 kudos

@cfregly​ : For DataFrames, you can use the following code for using groupBy without aggregations.Df.groupBy(Df["column_name"]).agg({})

  • 0 kudos
3 More Replies
tourist_on_road
by New Contributor
  • 7019 Views
  • 1 replies
  • 0 kudos

How to read binary data in pyspark

I'm reading binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark.from io importStringIO import array img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b",4106)def mapper(featur...

  • 7019 Views
  • 1 replies
  • 0 kudos
Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @tourist_on_road, please go through the below spark docs,https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles

  • 0 kudos
MikeK_
by New Contributor II
  • 15200 Views
  • 1 replies
  • 0 kudos

Resolved! SQL variables in a notebook

Hi, In an SQL notebook, using this link: https://docs.databricks.com/spark/latest/spark-sql/language-manual/set.html I managed to figure out to set values and how to get the value. SET my_val=10; //saves the value 10 for key my_val SET my_val; //dis...

  • 15200 Views
  • 1 replies
  • 0 kudos
Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Mike K.., you can do this with widgets and getArgument. Here's a small example of what that might look like: https://community.databricks.com/s/feed/0D53f00001HKHZfCAP

  • 0 kudos
kruhly
by New Contributor II
  • 38662 Views
  • 12 replies
  • 0 kudos

Resolved! Is there a better method to join two dataframes and not have a duplicated column?

I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straight forward because the real data may have many columns or the column names may not be known. A simple example belowllist = [(...

  • 38662 Views
  • 12 replies
  • 0 kudos
Latest Reply
TejuNC
New Contributor II
  • 0 kudos

This is an expected behavior. DataFrame.join method is equivalent to SQL join like thisSELECT*FROM a JOIN b ON joinExprsIf you want to ignore duplicate columns just drop them or select columns of interest afterwards. If you want to disambiguate you c...

  • 0 kudos
11 More Replies
Pierrek20
by New Contributor
  • 16600 Views
  • 2 replies
  • 0 kudos

How to loop over spark dataframe with scala ?

Hello ! I 'm rookie to spark scala, here is my problem : tk's in advance for your help my input dataframe looks like this : index bucket time ap station rssi 0 1 00:00 1 1 -84.0 1 1 00:00 1 3 -67.0 2 1 00:00 1 4 -82.0 3 1 00:00 1 2 -68.0 4 1 00:00...

  • 16600 Views
  • 2 replies
  • 0 kudos
Latest Reply
Eve
New Contributor III
  • 0 kudos

Looping is not always necessary, I always use this foreach method, something like the following: aps.collect().foreach(row => <do something>)

  • 0 kudos
1 More Replies
1stcommander
by New Contributor II
  • 10091 Views
  • 2 replies
  • 0 kudos

Parquet partitionBy - date column to nested folders

Hi, when writing a DataFrame to parquet using partitionBy(<date column>), the resulting folder structure looks like this: root |----------------- day1 |----------------- day2 |----------------- day3 Is it possible to create a structure like to foll...

  • 10091 Views
  • 2 replies
  • 0 kudos
Latest Reply
Saphira
New Contributor II
  • 0 kudos

Hey @1stcommander​ You'll have to create those columns yourself. If it's something you will have to do often you could always write a function. In any case, imho it's not that much work. Im not sure what your problem is with the partition pruning. It...

  • 0 kudos
1 More Replies
paourissi
by New Contributor
  • 10946 Views
  • 2 replies
  • 1 kudos

When to persist and when to unpersist RDD in Spark

Lets say i have the following:<code>val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK) val dataset3 = dataset2.map(.....)1) 1)If you do a transformation on the dataset2 then you have to persist it and pass it to dataset3 and unpersist ...

  • 10946 Views
  • 2 replies
  • 1 kudos
Latest Reply
Arun_KumarPT
New Contributor II
  • 1 kudos

It is well documented here : http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

  • 1 kudos
1 More Replies
AnandJ_Kadhi
by New Contributor II
  • 7158 Views
  • 2 replies
  • 1 kudos

Handle comma inside cell of CSV

We are using spark-csv_2.10 > version 1.5.0 and reading the csv file column which contains comma " , " as one of the character. The problem we are facing is like that it treats the rest of line after the comma as new column and data is not interpre...

  • 7158 Views
  • 2 replies
  • 1 kudos
Latest Reply
User16857282152
Contributor
  • 1 kudos

Take a look here for options, http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv If a csv file has commas then the tradition is to quote the string that contains the comma, In ...

  • 1 kudos
1 More Replies
SwapanSwapandee
by New Contributor II
  • 9180 Views
  • 2 replies
  • 0 kudos

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using script for CDC Merge in spark streaming. I wish to pass column values in selectExpr through a parameter as column names for each table would change. When I pass the columns and struct field through a string variable, I am getting error as...

  • 9180 Views
  • 2 replies
  • 0 kudos
Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Swapan Swapandeep Marwaha, Can you pass them as a Seq as in below code, keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols")

  • 0 kudos
1 More Replies
CaioIshizaka_Co
by New Contributor
  • 6031 Views
  • 1 replies
  • 0 kudos

Making HTTP post requests on Spark using foreachPartition

Need some help to understand the behaviour of the below in Spark (using Scala and Databricks) I have some dataframe (reading from S3 if that matters), and would send that data by making HTTP post requests in batches of 1000 (at most). So I reparti...

  • 6031 Views
  • 1 replies
  • 0 kudos
Latest Reply
melo08
New Contributor II
  • 0 kudos

Need some help to understand the behaviour of the below in Spark (using Scala and Databricks) I have some dataframe (reading from S3 if that matters), and would send that data by making HTTP post requests in batches of 1000 (at most). So I repa...

  • 0 kudos
rba76
by New Contributor
  • 21230 Views
  • 2 replies
  • 0 kudos

Python spark.read.text Path does not exist

Dear all, I want to read files with python from a storage account. I followed this instruction https://docs.microsoft.com/en-us/azure/azure-databricks/store-secrets-azure-key-vault. This is my python code: dbutils.fs.mount(source = "wasbs://contain...

  • 21230 Views
  • 2 replies
  • 0 kudos
Latest Reply
PRADEEPCHEEKATL
New Contributor II
  • 0 kudos

@rba76​  Make sure helloworld.txt file exists in the container1 folderI'm able to view the text file using the same commands as follows:Mount Blob Storage:dbutils.fs.mount( source = "wasbs://sampledata@azure.blob.core.windows.net/Azure", mount_po...

  • 0 kudos
1 More Replies

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now
Labels