Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

tourist_on_road
by New Contributor
  • 6971 Views
  • 1 replies
  • 0 kudos

How to read binary data in pyspark

I'm reading binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark. from io import StringIO import array img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106) def mapper(featur...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @tourist_on_road, please go through the below Spark docs: https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
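For reference, a minimal PySpark sketch of that approach, assuming (as in the question) fixed-length records of 4106 bytes, i.e. a 10-character product id followed by 4096 one-byte feature values:

import array

records = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def parse_record(raw):
    asin = raw[:10].decode("utf-8")        # product id
    features = array.array("b", raw[10:])  # remaining 4096 bytes as signed integers
    return asin, list(features)

parsed = records.map(parse_record)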

MikeK_
by New Contributor II
  • 15166 Views
  • 1 replies
  • 0 kudos

Resolved! SQL variables in a notebook

Hi, In an SQL notebook, using this link: https://docs.databricks.com/spark/latest/spark-sql/language-manual/set.html I managed to figure out how to set values and how to get the value. SET my_val=10; //saves the value 10 for key my_val SET my_val; //dis...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Mike K., you can do this with widgets and getArgument. Here's a small example of what that might look like: https://community.databricks.com/s/feed/0D53f00001HKHZfCAP
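As a hedged sketch of that widget-based approach (the widget name is just illustrative): define the widget from a Python cell, then reference it from SQL via getArgument.

dbutils.widgets.text("my_val", "10")   # create a notebook widget with a default value
value = dbutils.widgets.get("my_val")  # read it back in Python
# In a %sql cell the same widget can be referenced, e.g.:
#   SELECT * FROM my_table WHERE id = getArgument("my_val")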

kruhly
by New Contributor II
  • 38441 Views
  • 12 replies
  • 0 kudos

Resolved! Is there a better method to join two dataframes and not have a duplicated column?

I would like to keep only one of the columns used to join the dataframes. Using select() after the join does not seem straightforward because the real data may have many columns or the column names may not be known. A simple example below: llist = [(...

Latest Reply
TejuNC
New Contributor II
  • 0 kudos

This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this: SELECT * FROM a JOIN b ON joinExprs. If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate you c...
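As a hedged illustration (dataframe and key names are placeholders), joining on the column name keeps a single copy of the key, while the explicit-expression form keeps both and lets you drop one side:

joined = df_a.join(df_b, on="id", how="inner")   # "id" appears only once in the result
joined_alt = df_a.join(df_b, df_a["id"] == df_b["id"]).drop(df_b["id"])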

11 More Replies
Pierrek20
by New Contributor
  • 16497 Views
  • 2 replies
  • 0 kudos

How to loop over a Spark dataframe with Scala?

Hello! I'm a rookie to Spark Scala; here is my problem (thanks in advance for your help). My input dataframe looks like this: index bucket time ap station rssi 0 1 00:00 1 1 -84.0 1 1 00:00 1 3 -67.0 2 1 00:00 1 4 -82.0 3 1 00:00 1 2 -68.0 4 1 00:00...

Latest Reply
Eve
New Contributor III
  • 0 kudos

Looping is not always necessary; I always use this foreach method, something like the following: aps.collect().foreach(row => <do something>)
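For completeness, the PySpark equivalent of that Scala snippet (collect() pulls every row to the driver, so this only suits small dataframes; column names taken from the question):

for row in aps.collect():
    print(row["station"], row["rssi"])   # replace with whatever per-row work is needed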

1 More Replies
1stcommander
by New Contributor II
  • 9994 Views
  • 2 replies
  • 0 kudos

Parquet partitionBy - date column to nested folders

Hi, when writing a DataFrame to parquet using partitionBy(<date column>), the resulting folder structure looks like this: root |----------------- day1 |----------------- day2 |----------------- day3 Is it possible to create a structure like to foll...

Latest Reply
Saphira
New Contributor II
  • 0 kudos

Hey @1stcommander, you'll have to create those columns yourself. If it's something you will have to do often, you could always write a function. In any case, imho it's not that much work. I'm not sure what your problem is with the partition pruning. It...
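A hedged sketch of deriving those columns before writing (the date column name is a placeholder):

from pyspark.sql.functions import year, month, dayofmonth

(df
 .withColumn("year", year("event_date"))
 .withColumn("month", month("event_date"))
 .withColumn("day", dayofmonth("event_date"))
 .write
 .partitionBy("year", "month", "day")     # produces nested year=/month=/day= folders
 .parquet("/mnt/output/events"))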

1 More Replies
paourissi
by New Contributor
  • 10882 Views
  • 2 replies
  • 1 kudos

When to persist and when to unpersist RDD in Spark

Let's say I have the following: val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK) val dataset3 = dataset2.map(.....) 1) If you do a transformation on the dataset2 then you have to persist it and pass it to dataset3 and unpersist ...

Latest Reply
Arun_KumarPT
New Contributor II
  • 1 kudos

It is well documented here: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
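In short, the pattern described there looks roughly like the following hedged sketch (the transformations are placeholders): persist the shared parent once, reuse it from several children, then unpersist when it is no longer needed.

from pyspark import StorageLevel

dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)  # cache the shared parent
dataset3 = dataset2.map(lambda x: transform_a(x))          # placeholder transformation
dataset4 = dataset2.filter(lambda x: keep(x))              # placeholder filter
dataset3.count()                                           # both actions reuse the cached data
dataset4.count()
dataset2.unpersist()                                       # release the cache afterwards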

1 More Replies
AnandJ_Kadhi
by New Contributor II
  • 7098 Views
  • 2 replies
  • 1 kudos

Handle comma inside cell of CSV

We are using spark-csv_2.10 (version 1.5.0) and reading a csv file with a column that contains a comma "," as one of its characters. The problem we are facing is that it treats the rest of the line after the comma as a new column and the data is not interpre...

Latest Reply
User16857282152
Contributor
  • 1 kudos

Take a look here for options: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv If a csv file has commas, then the convention is to quote the string that contains the comma. In ...
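A hedged PySpark sketch (file path and header option are assumptions); with quoted fields the embedded comma stays inside a single column:

df = (spark.read
      .option("header", "true")
      .option("quote", '"')    # fields containing commas are wrapped in quotes
      .option("escape", '"')   # doubled quotes inside a field are kept as literal quotes
      .csv("/mnt/data/input.csv"))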

1 More Replies
SwapanSwapandee
by New Contributor II
  • 9137 Views
  • 2 replies
  • 0 kudos

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using script for CDC Merge in spark streaming. I wish to pass column values in selectExpr through a parameter as column names for each table would change. When I pass the columns and struct field through a string variable, I am getting error as...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Swapan Swapandeep Marwaha, can you pass them as a Seq, as in the code below? keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols")
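The same idea in PySpark, hedged (the dataframe is a placeholder): keep the expressions in lists and unpack them into selectExpr, so the column list can be passed in as a parameter per table.

key_cols = ["col1", "col2"]
struct_cols = ["struct(offset, KAFKA_TS) as otherCols"]
result = df.selectExpr(*(key_cols + struct_cols))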

1 More Replies
CaioIshizaka_Co
by New Contributor
  • 5990 Views
  • 1 replies
  • 0 kudos

Making HTTP post requests on Spark using foreachPartition

Need some help to understand the behaviour of the below in Spark (using Scala and Databricks) I have some dataframe (reading from S3 if that matters), and would send that data by making HTTP post requests in batches of 1000 (at most). So I reparti...

Latest Reply
melo08
New Contributor II
  • 0 kudos

Need some help to understand the behaviour of the below in Spark (using Scala and Databricks) I have some dataframe (reading from S3 if that matters), and would send that data by making HTTP post requests in batches of 1000 (at most). So I repa...
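A hedged sketch of the pattern the question describes (the endpoint URL and partition count are placeholders, and the requests library is assumed to be available on the workers): repartition so each partition holds at most the desired batch size, then issue one POST per partition.

import requests

def post_partition(rows):
    payload = [row.asDict() for row in rows]   # materialise this partition's rows
    if payload:
        requests.post("https://example.com/ingest", json=payload, timeout=30)

df.repartition(200).foreachPartition(post_partition)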

rba76
by New Contributor
  • 21159 Views
  • 2 replies
  • 0 kudos

Python spark.read.text Path does not exist

Dear all, I want to read files with python from a storage account. I followed this instruction https://docs.microsoft.com/en-us/azure/azure-databricks/store-secrets-azure-key-vault. This is my python code: dbutils.fs.mount(source = "wasbs://contain...

Latest Reply
PRADEEPCHEEKATL
New Contributor II
  • 0 kudos

@rba76 Make sure the helloworld.txt file exists in the container1 folder. I'm able to view the text file using the same commands, as follows. Mount Blob Storage: dbutils.fs.mount( source = "wasbs://sampledata@azure.blob.core.windows.net/Azure", mount_po...
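A hedged sketch of that mount-and-read flow (storage account, container, secret scope, and key names are all placeholders):

dbutils.fs.mount(
    source="wasbs://container1@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/container1",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")})

df = spark.read.text("/mnt/container1/helloworld.txt")
df.show()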

1 More Replies
desai_n_3
by New Contributor II
  • 17023 Views
  • 6 replies
  • 0 kudos

Cannot Convert Column to Bool Error - when converting a dataframe column from string to date type in Python

Hi All, I am trying to convert a dataframe column which is in the format of string to date type format yyyy-MM-DD? I have written a sql query and stored it in dataframe. df3 = sqlContext.sql(sqlString2) df3.withColumn(df3['CalDay'],pd.to_datetime(df...

Latest Reply
JoshuaJames
New Contributor II
  • 0 kudos

Registered to post this, so forgive the formatting nightmare. This is a Python Databricks script function that allows you to convert from string to datetime or date, utilising coalesce: from pyspark.sql.functions import coalesce, to_date def to_dat...
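A hedged reconstruction of that helper (the format list is an assumption): try each format with to_date and keep the first one that parses via coalesce.

from pyspark.sql.functions import coalesce, to_date

def to_date_(col, formats=("yyyy-MM-dd", "MM/dd/yyyy")):
    # returns the first non-null parse across the candidate formats
    return coalesce(*[to_date(col, f) for f in formats])

df3 = df3.withColumn("CalDay", to_date_(df3["CalDay"]))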

5 More Replies
dbansal
by New Contributor
  • 15808 Views
  • 1 replies
  • 0 kudos

How can I add jars ("spark.jars") to pyspark notebook?

I want to add a few custom jars to the spark conf. Typically they would be submitted along with the spark-submit command but in Databricks notebook, the spark session is already initialized. So, I want to set the jars in "spark.jars" property in the...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @dbansal, install the libraries/jars while initialising the cluster. Please go through the documentation on the same below: https://docs.databricks.com/libraries.html#upload-a-jar-python-egg-or-python-wheel

asher
by New Contributor II
  • 9789 Views
  • 1 replies
  • 0 kudos

List all files in a Blob Container

I am trying to find a way to list all files, and related file sizes, in all folders and all sub folders. I guess these are called blobs, in the Databricks world. Anyway, I can easily list all files, and related file sizes, in one single folder, but ...

Latest Reply
asher
New Contributor II
  • 0 kudos

from azure.storage.blob import BlockBlobService
block_blob_service = BlockBlobService(account_name='your_acct_name', account_key='your_acct_key')
mylist = []
generator = block_blob_service.list_blobs('rawdata')
for blob in generator:
    mylist.append(...
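A hedged completion of that snippet using the same legacy azure-storage SDK (container name and credentials are placeholders): list_blobs returns every blob in the container, including those under "subfolder" prefixes, so collecting name and size covers all folders and subfolders.

from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='your_acct_name', account_key='your_acct_key')
files = []
for blob in block_blob_service.list_blobs('rawdata'):
    files.append((blob.name, blob.properties.content_length))  # full path and size in bytes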

ammobear
by New Contributor III
  • 81433 Views
  • 11 replies
  • 6 kudos

Resolved! How do I get the current cluster id?

I am adding Application Insights telemetry to my Databricks jobs and would like to include the cluster ID of the job run. How can I access the cluster id at run time? The requirement is that my job can programmatically retrieve the cluster id to in...

Latest Reply
EricBellet
New Contributor III
  • 6 kudos

I fixed it; it should be "'$DB_CLUSTER_ID'"
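For retrieving the cluster id from notebook code (rather than from an init-script environment variable), one commonly used option is the cluster usage tag exposed through the Spark conf; a hedged sketch:

cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
print(cluster_id)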

10 More Replies
LaurentThiebaud
by New Contributor
  • 6852 Views
  • 1 replies
  • 0 kudos

Sort within a groupBy with dataframe

Using a Spark DataFrame, e.g. myDf .filter(col("timestamp").gt(15000)) .groupBy("groupingKey") .agg(collect_list("aDoubleValue")), I want the collect_list to return the result, but ordered according to "timestamp", i.e. I want the GroupBy results...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Laurent Thiebaud, please use the below format to sort within a groupBy: import org.apache.spark.sql.functions._ df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
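The PySpark equivalent, hedged, with the column names from the question: wrapping each value in a struct keyed by timestamp makes sort_array order the collected list by timestamp.

from pyspark.sql.functions import col, collect_list, sort_array, struct

result = (myDf
          .filter(col("timestamp") > 15000)
          .groupBy("groupingKey")
          .agg(sort_array(collect_list(struct("timestamp", "aDoubleValue"))).alias("sorted_values")))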

