Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Leo_138525
by New Contributor II
  • 3168 Views
  • 4 replies
  • 1 kudos

Resolved! RDD not picking up spark configuration for azure storage account access

I want to open some CSV files as an RDD, do some processing and then load it as a DataFrame. Since the files are stored in an Azure blob storage account I need to configure the access accordingly, which for some reason does not work when using an RDD...

Latest Reply
Leo_138525
New Contributor II
  • 1 kudos

I decided to load the files into a DataFrame with a single column and then do the processing before splitting it into separate columns, and this works just fine. @Hyper Guy thanks for the link, I didn't try that but it seems like it would resolve the ...

3 More Replies
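A minimal PySpark sketch of the approach described in the accepted reply above: read the CSVs through the DataFrame reader as a single text column (so the storage-account setting made via spark.conf is honored), then split the column afterwards. The account, container, secret scope and column names are hypothetical; spark and dbutils are assumed to come from a Databricks notebook.

    from pyspark.sql import functions as F

    # Storage credentials from a secret scope rather than a literal key
    spark.conf.set(
        "fs.azure.account.key.myaccount.blob.core.windows.net",
        dbutils.secrets.get(scope="azure", key="storage-account-key"),
    )

    # Each file line arrives as one string column named "value"
    raw = spark.read.text("wasbs://mycontainer@myaccount.blob.core.windows.net/input/*.csv")

    # Do any row-level processing first, then split into real columns
    df = (
        raw.withColumn("parts", F.split(F.col("value"), ","))
           .select(
               F.col("parts").getItem(0).alias("id"),
               F.col("parts").getItem(1).alias("amount"),
           )
    )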
Vadim1
by New Contributor III
  • 3486 Views
  • 3 replies
  • 3 kudos

Resolved! Error on Azure-Databricks write RDD to storage account with wasbs://

Hi, I'm trying to write data from an RDD to the storage account. Adding the storage account key: spark.conf.set("fs.azure.account.key.y.blob.core.windows.net", "myStorageAccountKey"). Read and write to the same storage: val path = "wasbs://x@y.blob.core.windows....

Latest Reply
TheoDeSo
New Contributor III
  • 3 kudos

Hello @Vadim1 and @User16764241763. I'm wondering if you find a way to avoid adding the hardcoded key in the advanced options spark config section in the cluster configuration. Is there a similar command to spark.conf.set("spark.hadoop.fs.azure.accou...

2 More Replies
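A hedged sketch of one way to avoid the hard-coded key that the latest reply asks about, assuming a Databricks secret scope named "azure" holding the storage account key (both names are hypothetical):

    # Set the Hadoop-level property at runtime from a secret instead of pasting
    # the key into the cluster's Spark config.
    spark.conf.set(
        "spark.hadoop.fs.azure.account.key.y.blob.core.windows.net",
        dbutils.secrets.get(scope="azure", key="storage-account-key"),
    )

Alternatively, the cluster's Spark config can reference the secret with the {{secrets/azure/storage-account-key}} placeholder syntax, which also keeps the literal key out of the cluster configuration.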
Kash
by Contributor III
  • 7352 Views
  • 3 replies
  • 0 kudos

Linear Regression HELP! Pickle + Broadcast Variable Error

Hi there, I need some help with this example. We're trying to create a LinearRegression model that can parallelize across thousands of symbols per date. When we run this we get a PicklingError. Any suggestions would be much appreciated! K. Error: PicklingErro...

Latest Reply
Kash
Contributor III
  • 0 kudos

@Vidula Khanna​ Can you assist?

2 More Replies
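The original code is not shown here, but a common way to fit one regression per symbol in parallel without PicklingErrors is to construct the model inside a grouped pandas function, so nothing unpicklable is captured from the driver. A hedged sketch, not the poster's code, with a hypothetical input DataFrame df and columns symbol, x, y:

    import pandas as pd
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    result_schema = StructType([
        StructField("symbol", StringType()),
        StructField("slope", DoubleType()),
        StructField("intercept", DoubleType()),
    ])

    def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # Import and build the model on the worker so nothing needs pickling from the driver
        from sklearn.linear_model import LinearRegression
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({
            "symbol": [pdf["symbol"].iloc[0]],
            "slope": [float(model.coef_[0])],
            "intercept": [float(model.intercept_)],
        })

    results = df.groupBy("symbol").applyInPandas(fit_group, schema=result_schema)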
CDICSteph
by New Contributor
  • 2332 Views
  • 2 replies
  • 0 kudos

Need pattern for loading a million small XML files

Hi, looking for the right solution pattern for this scenario: We have millions of relatively small XML files (currently sitting in ADLS) that we have to load into delta lake. Each XML file has to be read, parsed, and pivoted before writing to a delta...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Steph Swierenga​ Thank you for posting your question in our community! We are happy to assist you.To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

1 More Replies
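A hedged sketch of one common pattern, assuming the spark-xml library is attached to the cluster and each file wraps its rows in a <record> element (the rowTag, paths and table name are hypothetical):

    # Read the many small XML files in one pass and let Spark parallelize the parsing
    df = (
        spark.read.format("xml")
             .option("rowTag", "record")
             .load("abfss://container@account.dfs.core.windows.net/xml/")
    )

    # ...parse and pivot as needed, then compact the millions of tiny inputs
    # into a reasonable number of Delta files on write
    (
        df.repartition(64)
          .write.format("delta")
          .mode("append")
          .saveAsTable("bronze.parsed_xml")
    )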
danniely
by New Contributor II
  • 12123 Views
  • 1 reply
  • 2 kudos

Pyspark RDD fails with pytest

When I call RDD APIs during pytest, it seems like the module "serializer.py" cannot find any other modules under pyspark. I've already looked this up on the internet, and it seems like pyspark modules are not properly importing the modules they refer to. I see ot...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@hyunho lee​ : It sounds like you are encountering an issue with PySpark's serializer not being able to find the necessary modules during testing with Pytest. One solution you could try is to set the PYTHONPATH environment variable to include the pat...

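Beyond setting PYTHONPATH, a setup that commonly avoids the serializer import problem is installing pyspark into the same virtual environment that runs pytest and creating the session from a fixture. A minimal, hedged conftest.py sketch:

    # conftest.py
    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # Local session for tests; requires pyspark installed in the test virtualenv
        session = (
            SparkSession.builder
            .master("local[2]")
            .appName("pytest-pyspark")
            .getOrCreate()
        )
        yield session
        session.stop()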
elgeo
by Valued Contributor II
  • 4390 Views
  • 2 replies
  • 0 kudos

Transform SQL Cursor using PySpark in Databricks

We have a cursor in DB2 which, in each loop, reads data from 2 tables. At the end of each loop, after inserting the data into a target table, we update the records related to that loop in these 2 tables before moving to the next loop. An indicative example i...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @ELENI GEORGOUSI​ Thank you for posting your question in our community! We are happy to assist you.To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

1 More Replies
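A hedged, generic sketch of the usual rewrite: replace the row-by-row cursor loop with set-based operations, that is, join the two source tables once, append the result to the target, then mark the processed rows with a Delta MERGE. All table and column names are hypothetical.

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    a = spark.table("source_a").where("processed_flag = 'N'")
    b = spark.table("source_b")

    # One set-based join instead of a per-row loop
    batch = a.join(b, a["business_key"] == b["business_key"]).select(
        a["business_key"], a["value"], b["attribute"]
    )

    batch.write.format("delta").mode("append").saveAsTable("target_table")

    # Flag the rows that were just processed, replacing the per-loop UPDATE
    DeltaTable.forName(spark, "source_a").alias("t").merge(
        batch.select("business_key").alias("s"),
        "t.business_key = s.business_key",
    ).whenMatchedUpdate(set={"processed_flag": F.lit("Y")}).execute()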
maartenvr
by New Contributor III
  • 26110 Views
  • 9 replies
  • 2 kudos

Resolved! Unable to clear cache using a pyspark session

Hi all, I am using a persist call on a Spark DataFrame inside an application to speed up computations. The DataFrame is used throughout my application, and at the end of the application I am trying to clear the cache of the whole Spark session by calli...

Latest Reply
maartenvr
New Contributor III
  • 2 kudos

No solution yet: Hi @Suteja Kanuri, thank you for thinking along and replying! Unfortunately, I have not found a solution yet. I am getting an error that there exists no .getCache() method on a Spark context. Also note that I have tried to do som...

8 More Replies
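For reference, the cache-clearing APIs that do exist (there is no .getCache() on a SparkContext) are DataFrame.unpersist() for a single DataFrame and spark.catalog.clearCache() for the whole session:

    df = spark.range(1000000).persist()
    df.count()                    # materializes the cache

    df.unpersist()                # drops just this DataFrame from the cache
    spark.catalog.clearCache()    # or drops every cached table/DataFrame in the session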
gpzz
by New Contributor II
  • 1742 Views
  • 2 replies
  • 1 kudos

MEMORY_ONLY not working

val doubledAmount = premiumCustomers.map(x=>(x._1, x._2*2)).persist(StorageLevel.MEMORY_ONLY) error: not found: value StorageLevel

Latest Reply
Chaitanya_Raju
Honored Contributor
  • 1 kudos

Hi @Gaurav Poojary, can you please try the approach displayed in the image? It is working for me without any issues. Happy Learning!!

1 More Replies
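The "not found: value StorageLevel" error in the post is usually just a missing import (in Scala, import org.apache.spark.storage.StorageLevel before calling persist). For reference, the PySpark equivalent with hypothetical sample data:

    from pyspark import StorageLevel

    premium_customers = sc.parallelize([("alice", 100.0), ("bob", 250.0)])  # hypothetical pair RDD

    doubled_amount = (
        premium_customers.map(lambda kv: (kv[0], kv[1] * 2))
                         .persist(StorageLevel.MEMORY_ONLY)
    )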
gpzz
by New Contributor II
  • 2416 Views
  • 1 reply
  • 3 kudos

pyspark code error

rdd4 = rdd3.reducByKey(lambda x,y: x+y) raises AttributeError: 'PipelinedRDD' object has no attribute 'reducByKey'. Please help me out with this.

Latest Reply
UmaMahesh1
Honored Contributor III
  • 3 kudos

Is it a typo or are you really using reducByKey instead of reduceByKey ?

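For completeness, the corrected call once the spelling is fixed (rdd3 is the poster's pair RDD):

    rdd4 = rdd3.reduceByKey(lambda x, y: x + y)   # note the spelling: reduceByKey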
Ovi
by New Contributor III
  • 2847 Views
  • 4 replies
  • 9 kudos

Construct Dataframe or RDD from S3 bucket with Delta tables

Hi all! I have an S3 bucket with Delta parquet files/folders, each with a different schema. I need to create an RDD or DataFrame from all those Delta tables that contains the path, name and schema of each. How could I do that? Thank you! P...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 9 kudos

You can mount the S3 bucket or read directly from it: access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key") secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key") sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", ac...

3 More Replies
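Building on the reply above, a hedged sketch that lists the top-level Delta folders in the bucket and collects each one's path, name and schema into a small DataFrame. The bucket name is hypothetical; the credentials come from the same "aws" secret scope used in the reply.

    access_key = dbutils.secrets.get(scope="aws", key="aws-access-key")
    secret_key = dbutils.secrets.get(scope="aws", key="aws-secret-key")
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

    rows = []
    for entry in dbutils.fs.ls("s3a://my-bucket/"):      # hypothetical bucket
        if entry.isDir():
            # Read only the schema of each Delta folder
            schema = spark.read.format("delta").load(entry.path).schema.simpleString()
            rows.append((entry.path, entry.name.rstrip("/"), schema))

    tables_df = spark.createDataFrame(rows, ["path", "name", "schema"])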
Gerhard
by New Contributor III
  • 1465 Views
  • 0 replies
  • 1 kudos

Read proprietary files and transform contents to a table - error resilient process needed

We have data stored in HDF5 files in a "proprietary" way. This data needs to be read, converted and transformed before it can be inserted into a Delta table. All of this transformation is done in a custom Python function that takes the HDF5 file an...

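A hedged sketch of one error-resilient pattern for this: distribute the file paths, wrap the custom parser in a try/except per file, and keep failures as rows instead of letting one bad file fail the whole job. parse_hdf5_file is a placeholder for the custom function described in the post, and the input path is hypothetical.

    def parse_hdf5_file(path):
        # Placeholder for the poster's custom HDF5-parsing function
        raise NotImplementedError

    def safe_parse(path):
        try:
            return [(path, None, row) for row in parse_hdf5_file(path)]
        except Exception as exc:          # record the failure and keep processing
            return [(path, str(exc), None)]

    paths = [f.path for f in dbutils.fs.ls("dbfs:/mnt/raw/hdf5/")]
    results = sc.parallelize(paths, len(paths)).flatMap(safe_parse).cache()

    failed = results.filter(lambda r: r[1] is not None)   # inspect or retry these later
    parsed = results.filter(lambda r: r[1] is None)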
Mradul07
by New Contributor II
  • 838 Views
  • 0 replies
  • 1 kudos

Spark behavior while dealing with Actions & Transformations ?

Hi, my question is: what happens to the initial RDD after an action is performed on it? Does it disappear, stay in memory, or does it need to be explicitly cached if we want to use it again? For example, if I execute this in a sequence: df_outp...

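A small illustration of the point in question: the source RDD is not kept around after an action; every action recomputes the lineage unless the RDD is explicitly cached or persisted. The input path is hypothetical.

    rdd = sc.textFile("dbfs:/mnt/raw/events/")
    errors = rdd.filter(lambda line: "ERROR" in line)

    errors.count()      # action 1: reads the source and filters
    errors.count()      # action 2: reads and filters again (recomputed)

    errors.cache()
    errors.count()      # this run materializes the cache
    errors.count()      # served from memory, no recomputation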
Matt101122
by Contributor
  • 2147 Views
  • 1 reply
  • 1 kudos

Resolved! why aren't rdds using all available cores of executor?

I'm extracting data from a custom format by day of month using a 32-core executor. I'm using RDDs to distribute work across the cores of the executor. I'm seeing an intermittent issue where sometimes for a run I see 31 cores being used as expected and ot...

Latest Reply
Matt101122
Contributor
  • 1 kudos

I may have figured this out! I'm explicitly setting the number of slices instead of using the default: days_rdd = sc.parallelize(days_to_process, len(days_to_process))

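Expanding on the accepted reply: without an explicit numSlices, sc.parallelize falls back to spark.default.parallelism, which may not match the number of work items, so some cores can sit idle. Passing the list length gives one partition per day. days_to_process here is hypothetical sample data.

    days_to_process = list(range(1, 32))

    default_rdd  = sc.parallelize(days_to_process)                        # partition count chosen by Spark
    explicit_rdd = sc.parallelize(days_to_process, len(days_to_process))  # one partition per work item

    print(default_rdd.getNumPartitions(), explicit_rdd.getNumPartitions())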
mick042
by New Contributor III
  • 1164 Views
  • 1 reply
  • 0 kudos

Optimal approach when using external script/executable for processing data

I need to process a number of files where I manipulate the file text utilising an external executable that operates on stdin/stdout. I am quite new to Spark. What I am attempting is to use rdd.pipe as in the following: exe_path = " /usr/local/bin/external...

Latest Reply
User16753725469
Contributor II
  • 0 kudos

Hi @Michael Lennon, can you please elaborate on the use case: what is the external app at exe_path doing?

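A hedged sketch of the rdd.pipe approach from the question: each partition's elements are written to the external command's stdin (one element per line) and the command's stdout lines become the new RDD. The executable must exist at the same path on every worker node; the paths below are hypothetical.

    exe_path = "/usr/local/bin/external_tool"     # hypothetical; must be installed on all workers

    lines = sc.textFile("dbfs:/mnt/raw/input/")
    processed = lines.pipe(exe_path)              # pipe() accepts a shell command string

    processed.saveAsTextFile("dbfs:/mnt/processed/output/")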
KateK
by New Contributor II
  • 2053 Views
  • 2 replies
  • 1 kudos

How do you correctly access the spark context in DLT pipelines?

I have some code that uses RDDs, and the sc.parallelize() and rdd.toDF() methods to get a dataframe back out. The code works in a regular notebook (and if I run the notebook as a job) but fails if I do the same thing in a DLT pipeline. The error mess...

Latest Reply
KateK
New Contributor II
  • 1 kudos

Thanks for your help Alex, I ended up re-writing my code with spark UDFs -- maybe there is a better solution with only the Dataframe API but I couldn't find it. To summarize my problem: I was trying to un-nest a large json blob (the fake data in my f...

1 More Replies
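A hedged sketch of the DataFrame-only rewrite the reply describes, which is the usual workaround in DLT (the RDD API and sc.parallelize are not available there): un-nest a JSON string column with from_json and explode instead of round-tripping through an RDD. The schema, column names and source table are hypothetical.

    import dlt
    from pyspark.sql import functions as F, types as T

    payload_schema = T.StructType([
        T.StructField("id", T.StringType()),
        T.StructField("items", T.ArrayType(T.StructType([
            T.StructField("name", T.StringType()),
            T.StructField("qty", T.IntegerType()),
        ]))),
    ])

    @dlt.table
    def unnested_items():
        raw = spark.read.table("raw_json_blobs")          # hypothetical source table
        return (
            raw.withColumn("parsed", F.from_json("payload", payload_schema))
               .withColumn("item", F.explode("parsed.items"))
               .select("parsed.id", "item.name", "item.qty")
        )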