Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

MithuWagh
by New Contributor
  • 9227 Views
  • 1 reply
  • 0 kudos

How to deal with a column name containing a .(dot) in a pyspark dataframe?

We are streaming data from a Kafka source with JSON, but in some columns we are getting a .(dot) in the column names. Streaming JSON data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`")

  • 0 kudos
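For reference, a minimal PySpark sketch of the backtick approach (the DataFrame and column names below are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy frame with a dot in a column name, as often happens with flattened JSON
df = spark.createDataFrame([("A14", "1")], ["pNum", "payload.TARGET"])

# Backticks make Spark treat the whole string as a single column name
df.select("`payload.TARGET`").show()

# Renaming the column avoids needing backticks in later expressions
df = df.withColumnRenamed("payload.TARGET", "payload_TARGET")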
KrisMusial
by New Contributor
  • 7020 Views
  • 2 replies
  • 0 kudos

Resolved! Saving to parquet with SaveMode.Overwrite throws exception

Hello, I'm trying to save a DataFrame as parquet with SaveMode.Overwrite with no success. I minimized the code and reproduced the issue with the following two cells: > case class MyClass(val fld1: Integer, val fld2: Integer) > > val lst1 = sc.paralle...

Latest Reply
Guru421421
New Contributor II
  • 0 kudos

results.select("ValidationTable", "Results","Description","CreatedBy","ModifiedBy","CreatedDate","ModifiedDate").write.mode('overwrite').save("

  • 0 kudos
1 More Replies
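As a rough PySpark sketch of the overwrite pattern the reply is using (the output path is a placeholder, and this does not attempt to reproduce the original exception):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "fld1")

# mode("overwrite") replaces any existing output at the target path
df.write.mode("overwrite").parquet("/tmp/myclass_parquet")

# Writing again to the same path succeeds because the old files are replaced
df.write.mode("overwrite").parquet("/tmp/myclass_parquet")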
NandhaKumar
by New Contributor II
  • 7200 Views
  • 3 replies
  • 0 kudos

How to specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in --py-files are present in dbfs:.

I have created a Databricks workspace in Azure and a cluster for Python 3. I am creating a job using spark-submit parameters. How do I specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in ...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Nandha Kumar, please go through the docs below on passing Python files as a job: https://docs.databricks.com/dev-tools/api/latest/jobs.html#sparkpythontask

  • 0 kudos
2 More Replies
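For illustration only, one way to hand multiple helper files to a job is a spark_submit_task whose parameters include --py-files with a comma-separated list of DBFS paths (no spaces between paths). Every file name, path, and cluster setting below is hypothetical and not taken from the linked docs:

# Sketch of a Jobs API jobs/create payload using a spark-submit task
job_payload = {
    "name": "my-python-job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",   # placeholder runtime
        "node_type_id": "Standard_DS3_v2",    # placeholder node type
        "num_workers": 2,
    },
    "spark_submit_task": {
        "parameters": [
            "--py-files",
            "dbfs:/libs/helpers.py,dbfs:/libs/models.py",
            "dbfs:/jobs/main.py",
        ]
    },
}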
cfregly
by Contributor
  • 5341 Views
  • 4 replies
  • 0 kudos
Latest Reply
GeethGovindSrin
New Contributor II
  • 0 kudos

@cfregly: For DataFrames, you can use the following code to use groupBy without aggregations: Df.groupBy(Df["column_name"]).agg({})

  • 0 kudos
3 More Replies
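A small PySpark illustration of the empty-aggregation trick from the reply (the sample data is invented): groupBy(...).agg({}) returns one row per distinct grouping key, much like a distinct() on that column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["column_name", "value"])

# groupBy with an empty aggregation dict keeps only the distinct keys
df.groupBy(df["column_name"]).agg({}).show()

# Equivalent result for a single column
df.select("column_name").distinct().show()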
tourist_on_road
by New Contributor
  • 7019 Views
  • 1 reply
  • 0 kudos

How to read binary data in pyspark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark: from io import StringIO import array img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106) def mapper(featur...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @tourist_on_road, please go through the Spark docs below: https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles

  • 0 kudos
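A hedged sketch of the fixed-length record pattern behind sc.binaryRecords, assuming a Databricks notebook where sc already exists; the 4106-byte record length comes from the question, but the record layout assumed below (a 10-byte ASCII id followed by 1024 float32 values) is purely illustrative:

import array

records = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def parse(record):
    # record is a bytes object of exactly 4106 bytes
    asin = record[:10].decode("ascii", errors="ignore").strip()
    features = array.array("f", record[10:])  # 4096 bytes -> 1024 floats
    return asin, list(features)

parsed = records.map(parse)
print(parsed.take(1))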
MikeK_
by New Contributor II
  • 15199 Views
  • 1 reply
  • 0 kudos

Resolved! SQL variables in a notebook

Hi, in an SQL notebook, using this link: https://docs.databricks.com/spark/latest/spark-sql/language-manual/set.html I managed to figure out how to set values and how to get the value. SET my_val=10; //saves the value 10 for key my_val SET my_val; //dis...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Mike K., you can do this with widgets and getArgument. Here's a small example of what that might look like: https://community.databricks.com/s/feed/0D53f00001HKHZfCAP

  • 0 kudos
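Since the linked example may move, here is a minimal sketch of the widget idea in Python, assuming a Databricks notebook where dbutils and spark exist (the widget, table, and column names are made up):

# Create a text widget with a default value
dbutils.widgets.text("my_val", "10")

# Read the value back in Python and use it in a query
my_val = dbutils.widgets.get("my_val")
spark.sql(f"SELECT * FROM my_table WHERE some_col = {my_val}").show()

# In a %sql cell the same widget can be read with getArgument("my_val")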
kruhly
by New Contributor II
  • 38661 Views
  • 12 replies
  • 0 kudos

Resolved! Is there a better method to join two dataframes and not have a duplicated column?

I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straightforward because the real data may have many columns or the column names may not be known. A simple example below: llist = [(...

Latest Reply
TejuNC
New Contributor II
  • 0 kudos

This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this: SELECT * FROM a JOIN b ON joinExprs. If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate you c...

  • 0 kudos
11 More Replies
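A small PySpark illustration with toy data: joining on a column name (or list of names) keeps a single copy of the key, while joining on an expression keeps both copies and one has to be dropped.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "a_val"])
b = spark.createDataFrame([(1, "p"), (2, "q")], ["id", "b_val"])

# Joining on the column name keeps only one "id" column
a.join(b, "id").show()

# Joining on an expression keeps both, so drop one afterwards
a.join(b, a["id"] == b["id"]).drop(b["id"]).show()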
Pierrek20
by New Contributor
  • 16599 Views
  • 2 replies
  • 0 kudos

How to loop over a Spark dataframe with Scala?

Hello! I'm a rookie to Spark Scala, here is my problem (thanks in advance for your help). My input dataframe looks like this: index bucket time ap station rssi 0 1 00:00 1 1 -84.0 1 1 00:00 1 3 -67.0 2 1 00:00 1 4 -82.0 3 1 00:00 1 2 -68.0 4 1 00:00...

Latest Reply
Eve
New Contributor III
  • 0 kudos

Looping is not always necessary. I always use this foreach method, something like the following: aps.collect().foreach(row => <do something>)

  • 0 kudos
1 More Replies
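The thread is Scala, but the same caution carries over to PySpark: collect() pulls every row to the driver, so loop that way only on small results and prefer distributed operations otherwise. A rough sketch with made-up data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
aps = spark.createDataFrame(
    [(0, 1, "00:00", 1, 1, -84.0), (1, 1, "00:00", 1, 3, -67.0)],
    ["index", "bucket", "time", "ap", "station", "rssi"])

# Fine for small frames: bring the rows to the driver and loop over them
for row in aps.collect():
    print(row["station"], row["rssi"])

# For large frames, push the per-row work to the executors instead
aps.foreach(lambda row: None)  # replace None with the per-row side effect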
1stcommander
by New Contributor II
  • 10090 Views
  • 2 replies
  • 0 kudos

Parquet partitionBy - date column to nested folders

Hi, when writing a DataFrame to parquet using partitionBy(<date column>), the resulting folder structure looks like this: root |----------------- day1 |----------------- day2 |----------------- day3 Is it possible to create a structure like the foll...

Latest Reply
Saphira
New Contributor II
  • 0 kudos

Hey @1stcommander, you'll have to create those columns yourself. If it's something you will have to do often you could always write a function. In any case, imho it's not that much work. I'm not sure what your problem is with the partition pruning. It...

  • 0 kudos
1 More Replies
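A hedged sketch of the reply's suggestion in PySpark: derive year, month, and day columns from the date column yourself, then partition by them (the column names and output path are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, year, month, dayofmonth

spark = SparkSession.builder.getOrCreate()
df = (spark.createDataFrame([("2019-05-01", 1), ("2019-06-02", 2)], ["event_date", "value"])
      .withColumn("event_date", to_date("event_date")))

# Build the partition columns explicitly to get nested year=/month=/day= folders
(df.withColumn("year", year("event_date"))
   .withColumn("month", month("event_date"))
   .withColumn("day", dayofmonth("event_date"))
   .write.partitionBy("year", "month", "day")
   .mode("overwrite")
   .parquet("/tmp/events_parquet"))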
paourissi
by New Contributor
  • 10944 Views
  • 2 replies
  • 1 kudos

When to persist and when to unpersist RDD in Spark

Let's say I have the following: val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK) val dataset3 = dataset2.map(.....) 1) If you do a transformation on the dataset2 then you have to persist it and pass it to dataset3 and unpersist ...

Latest Reply
Arun_KumarPT
New Contributor II
  • 1 kudos

It is well documented here: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

  • 1 kudos
1 More Replies
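In PySpark terms, a minimal sketch of the ordering the linked docs describe: persist the dataset that will be reused, run the actions that reuse it, and unpersist only once nothing downstream needs it (the data here is made up):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataset1 = spark.range(1000000)

# Persist the dataset that more than one downstream computation will reuse
dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
dataset3 = dataset2.selectExpr("id * 2 AS doubled")

# Both actions below benefit from the cached dataset2
dataset2.count()
dataset3.count()

# Release the cache once nothing else depends on dataset2
dataset2.unpersist()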
AnandJ_Kadhi
by New Contributor II
  • 7158 Views
  • 2 replies
  • 1 kudos

Handle comma inside cell of CSV

We are using spark-csv_2.10 version 1.5.0 and reading a CSV file with a column that contains a comma "," as one of its characters. The problem we are facing is that it treats the rest of the line after the comma as a new column and the data is not interpre...

Latest Reply
User16857282152
Contributor
  • 1 kudos

Take a look here for options: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv If a CSV file has commas then the convention is to quote the string that contains the comma. In ...

  • 1 kudos
1 More Replies
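A hedged PySpark sketch of the quoting convention the reply mentions: with the default double-quote character, a quoted field keeps its embedded comma inside one column (the sample lines are invented):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The second field of row 1 is quoted, so its comma stays inside one column
csv_lines = ['id,description', '1,"hello, world"', '2,plain']
rdd = spark.sparkContext.parallelize(csv_lines)
df = spark.read.csv(rdd, header=True, quote='"', escape='"')
df.show(truncate=False)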
SwapanSwapandee
by New Contributor II
  • 9180 Views
  • 2 replies
  • 0 kudos

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using a script for CDC merge in Spark Streaming. I wish to pass column values to selectExpr through a parameter, as the column names for each table would change. When I pass the columns and struct field through a string variable, I am getting an error as...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Swapan Swapandeep Marwaha, can you pass them as a Seq, as in the code below: keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols")

  • 0 kudos
1 More Replies
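The thread is Scala (hence the Seq), but the same idea in PySpark is to keep the expressions in a list and unpack it into selectExpr rather than passing one long string (the column names below are invented):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("k1", "k2", 5, "2020-01-01")],
                           ["col1", "col2", "offset", "KAFKA_TS"])

# One SQL expression per list element, then unpack the list into selectExpr
key_cols = ["col1", "col2"]
struct_cols = ["struct(offset, KAFKA_TS) AS otherCols"]
df.selectExpr(*(key_cols + struct_cols)).show(truncate=False)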
CaioIshizaka_Co
by New Contributor
  • 6031 Views
  • 1 reply
  • 0 kudos

Making HTTP post requests on Spark using foreachPartition

I need some help to understand the behaviour of the code below in Spark (using Scala and Databricks). I have a dataframe (read from S3, if that matters), and I would send that data by making HTTP post requests in batches of 1000 (at most). So I reparti...

Latest Reply
melo08
New Contributor II
  • 0 kudos

I need some help to understand the behaviour of the code below in Spark (using Scala and Databricks). I have a dataframe (read from S3, if that matters), and I would send that data by making HTTP post requests in batches of 1000 (at most). So I repa...

  • 0 kudos
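A hedged sketch of the batching pattern (the endpoint URL and payload shape are placeholders, df is assumed to be an existing DataFrame, and the requests library is assumed to be available on the workers): repartition so partitions are reasonably sized, then post at most 1000 rows per request from inside foreachPartition.

import json
import requests

def post_partition(rows):
    batch = []
    for row in rows:
        batch.append(row.asDict())
        if len(batch) == 1000:
            requests.post("https://example.com/ingest", data=json.dumps(batch))
            batch = []
    if batch:  # flush the final, smaller batch
        requests.post("https://example.com/ingest", data=json.dumps(batch))

df.repartition(64).foreachPartition(post_partition)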
rba76
by New Contributor
  • 21229 Views
  • 2 replies
  • 0 kudos

Python spark.read.text Path does not exist

Dear all, I want to read files with Python from a storage account. I followed these instructions: https://docs.microsoft.com/en-us/azure/azure-databricks/store-secrets-azure-key-vault. This is my Python code: dbutils.fs.mount(source = "wasbs://contain...

Latest Reply
PRADEEPCHEEKATL
New Contributor II
  • 0 kudos

@rba76 Make sure the helloworld.txt file exists in the container1 folder. I'm able to view the text file using the same commands as follows: Mount Blob Storage: dbutils.fs.mount( source = "wasbs://sampledata@azure.blob.core.windows.net/Azure", mount_po...

  • 0 kudos
1 More Replies
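For orientation, a hedged sketch of the mount-then-read flow, assuming a Databricks notebook where dbutils and spark exist; every storage account, container, secret scope, and file name below is a placeholder:

dbutils.fs.mount(
    source="wasbs://container1@mystorageacct.blob.core.windows.net",
    mount_point="/mnt/container1",
    extra_configs={
        "fs.azure.account.key.mystorageacct.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")
    },
)

# Listing the mount first shows whether the file really is where it is expected;
# a "Path does not exist" error usually means the file is not at the path being read
display(dbutils.fs.ls("/mnt/container1"))
df = spark.read.text("/mnt/container1/helloworld.txt")
df.show()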
desai_n_3
by New Contributor II
  • 17153 Views
  • 6 replies
  • 0 kudos

Cannot Convert Column to Bool error when converting a dataframe column from string to date type in Python

Hi all, I am trying to convert a dataframe column which is a string to date type in the format yyyy-MM-DD. I have written a SQL query and stored it in a dataframe: df3 = sqlContext.sql(sqlString2) df3.withColumn(df3['CalDay'], pd.to_datetime(df...

Latest Reply
JoshuaJames
New Contributor II
  • 0 kudos

Registered to post this so forgive the formatting nightmare. This is a Python Databricks script function that allows you to convert from string to datetime or date, utilising coalesce: from pyspark.sql.functions import coalesce, to_date def to_dat...

  • 0 kudos
5 More Replies
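A hedged sketch of the coalesce-plus-to_date idea from the reply, trying more than one format and keeping the first that parses (the column name and formats are only examples):

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, to_date, col

spark = SparkSession.builder.getOrCreate()
df3 = spark.createDataFrame([("2019-05-01",), ("01/06/2019",)], ["CalDay"])

def to_date_multi(c, formats=("yyyy-MM-dd", "dd/MM/yyyy")):
    # Try each format in turn; coalesce keeps the first non-null parse
    return coalesce(*[to_date(c, f) for f in formats])

df3 = df3.withColumn("CalDayDate", to_date_multi(col("CalDay")))
df3.show()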
