I want to open some CSV files as an RDD, do some processing, and then load the result as a DataFrame. Since the files are stored in an Azure Blob Storage account, I need to configure access accordingly, which for some reason does not work when using an RDD...
I decided to load the files into a DataFrame with a single column and then do the processing before splitting it into separate columns, and this works just fine. @Hyper Guy thanks for the link, I didn't try that but it seems like it would resolve the ...
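For reference, a minimal sketch of that single-column workaround; the path, delimiter, and column names below are assumptions for illustration, not from the thread:

```python
from pyspark.sql import functions as F

# Read each CSV line as a single string column called "value".
raw_df = spark.read.text("wasbs://container@account.blob.core.windows.net/data/*.csv")

# Split the line on the delimiter and project it into separate columns.
parts = F.split(F.col("value"), ",")
parsed_df = raw_df.select(
    parts.getItem(0).alias("col_a"),   # hypothetical column names
    parts.getItem(1).alias("col_b"),
    parts.getItem(2).alias("col_c"),
)
```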
Hi, I'm trying to write data from an RDD to the storage account.
Adding the storage account key:
spark.conf.set("fs.azure.account.key.y.blob.core.windows.net", "myStorageAccountKey")
Read and write to the same storage:
val path = "wasbs://x@y.blob.core.windows....
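A PySpark sketch of the same setup for comparison. Note that `spark.conf.set` is generally only picked up by DataFrame reads/writes, while RDD APIs read the Hadoop configuration, which is likely why the key also needs to be set with the `spark.hadoop.` prefix or on the Hadoop configuration directly. The container `x`, account `y`, key, and path suffix are placeholders from the post:

```python
# Account key for DataFrame reads/writes (picked up from the Spark session conf).
spark.conf.set("fs.azure.account.key.y.blob.core.windows.net", "myStorageAccountKey")

# For RDD APIs the key generally has to live in the Hadoop configuration instead
# (or be set as a "spark.hadoop."-prefixed cluster setting).
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.y.blob.core.windows.net", "myStorageAccountKey"
)

path = "wasbs://x@y.blob.core.windows.net/data"   # hypothetical path suffix
rdd = spark.sparkContext.textFile(path)
```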
Hello @Vadim1 and @User16764241763. I'm wondering if you found a way to avoid adding the hardcoded key in the Advanced options Spark config section of the cluster configuration. Is there a similar command to spark.conf.set("spark.hadoop.fs.azure.accou...
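One common way to avoid hardcoding the key (an assumption here, not confirmed in the thread) is to keep it in a Databricks secret scope and read it at runtime; "my-scope" and "storage-account-key" below are hypothetical names:

```python
# Fetch the key from a secret scope instead of embedding it in the cluster config.
storage_key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")

spark.conf.set("fs.azure.account.key.y.blob.core.windows.net", storage_key)
```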
Hi there, I need some help with this example. We're trying to create a LinearRegression model that can parallelize over thousands of symbols per date. When we run this we get a PicklingError. Any suggestions would be much appreciated! Error: PicklingErro...
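Without seeing the full traceback it's hard to say what is being pickled, but a common pattern for fitting one model per group that avoids serializing Spark objects is a grouped pandas UDF. A sketch, assuming a DataFrame `df` with columns `symbol`, `x`, and `y` (all names here are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Output schema: one fitted model per symbol (column names are illustrative).
result_schema = "symbol string, intercept double, slope double"

def fit_symbol(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on the workers with plain pandas/sklearn objects, so nothing
    # unpicklable (e.g. the SparkContext) needs to be serialized.
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({
        "symbol": [pdf["symbol"].iloc[0]],
        "intercept": [float(model.intercept_)],
        "slope": [float(model.coef_[0])],
    })

models_df = df.groupBy("symbol").applyInPandas(fit_symbol, schema=result_schema)
```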
Hi, looking for the right solution pattern for this scenario: We have millions of relatively small XML files (currently sitting in ADLS) that we have to load into delta lake. Each XML file has to be read, parsed, and pivoted before writing to a delta...
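One pattern worth evaluating (an assumption on my part, not a confirmed answer from the thread) is to read the XML files in bulk with the spark-xml data source, do the parsing and pivoting with DataFrame operations, and write the result to Delta; the row tag, paths, and table location below are placeholders:

```python
# Requires the spark-xml library (or a runtime with native XML support) on the cluster.
xml_df = (
    spark.read.format("xml")
    .option("rowTag", "record")                     # hypothetical row tag
    .load("abfss://container@account.dfs.core.windows.net/xml/*.xml")
)

# ... parse/pivot with DataFrame operations here ...

xml_df.write.format("delta").mode("append").save("/mnt/delta/target_table")
```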
Hi @Steph Swierenga, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...
When I call RDD APIs during pytest, it seems like the module "serializer.py" cannot find any other modules under pyspark. I've already looked this up on the internet, and it seems like pyspark modules are not properly importing the modules they refer to. I see ot...
@hyunho lee: It sounds like you are encountering an issue with PySpark's serializer not being able to find the necessary modules during testing with pytest. One solution you could try is to set the PYTHONPATH environment variable to include the pat...
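A minimal sketch of that suggestion, e.g. in a conftest.py; SPARK_HOME handling and the fallback path are assumptions about the local setup:

```python
import glob
import os
import sys

# Put the PySpark sources and the bundled py4j zip on the path before tests import pyspark.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

# RDD APIs spawn Python worker processes that also need to resolve the pyspark modules,
# so propagate the same paths via PYTHONPATH before creating the SparkSession.
os.environ["PYTHONPATH"] = os.pathsep.join(sys.path)
```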
We have a cursor in DB2 which, on each loop iteration, reads data from 2 tables. At the end of each iteration, after inserting the data into a target table, we update the records related to that iteration in these 2 tables before moving on to the next one. An indicative example i...
Hi @ELENI GEORGOUSI, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...
Hi all, I am using a persist call on a Spark DataFrame inside an application to speed up computations. The DataFrame is used throughout my application, and at the end of the application I am trying to clear the cache of the whole Spark session by calli...
No solution yet: Hi @Suteja Kanuri, thank you for thinking along and replying! Unfortunately, I have not found a solution yet. I am getting an error that there exists no ```.getCache()``` method on a Spark context. Also note that I have tried to do som...
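For reference, the cache-clearing entry points that do exist in PySpark; a short sketch, assuming a DataFrame `df` that was persisted earlier:

```python
# Release a single persisted DataFrame.
df.unpersist()

# Drop every cached table/DataFrame in the current Spark session.
spark.catalog.clearCache()
```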
Hi all! I have an S3 bucket with Delta parquet files/folders, each with a different schema. I need to create an RDD or DataFrame from all those Delta tables that contains the path, name, and schema of each. How could I do that? Thank you! P...
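A rough sketch under assumptions: each top-level folder under the bucket prefix holds one Delta table, and only the path, name, and schema (as a string) are needed in the result; the bucket/prefix is a placeholder:

```python
base = "s3://my-bucket/delta/"            # hypothetical bucket/prefix

rows = []
for entry in dbutils.fs.ls(base):
    if entry.name.endswith("/"):          # directories are listed with a trailing slash
        schema = spark.read.format("delta").load(entry.path).schema
        rows.append((entry.path, entry.name.rstrip("/"), schema.simpleString()))

tables_df = spark.createDataFrame(rows, ["path", "name", "schema"])
```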
We have data stored in HDF5 files in a "proprietary" way. This data needs to be read, converted, and transformed before it can be inserted into a Delta table. All of this transformation is done in a custom Python function that takes the HDF5 file an...
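A sketch of one possible approach under assumptions: distribute the HDF5 file paths across the cluster and run the existing custom Python function on the workers; `convert_hdf5`, the mount point, and the target table path are stand-ins, and the function is assumed to return an iterable of row dicts:

```python
import glob

# Workers need a local-style path to open HDF5 files, hence the /dbfs prefix (assumed mount).
paths = glob.glob("/dbfs/mnt/raw/hdf5/*.h5")

rows_rdd = sc.parallelize(paths, len(paths)).flatMap(lambda p: convert_hdf5(p))
result_df = spark.createDataFrame(rows_rdd)

result_df.write.format("delta").mode("append").save("/mnt/delta/hdf5_table")
```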
Hi, my question is: what happens to the initial RDD after an action is performed on it? Does it disappear, does it stay in memory, or does it need to be explicitly cached if we want to use it again? For example, if I execute this in a sequence: df_outp...
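As a general illustration (the names and transformations below are not from the post): without an explicit cache, each action re-runs the lineage from the source; with cache()/persist(), the first action materializes the data and later actions reuse it.

```python
df = spark.read.parquet("/mnt/data/input").filter("amount > 0")   # hypothetical source

df.cache()                                # mark for caching; nothing materialized yet
df.count()                                # first action populates the cache
df.groupBy("country").count().show()      # reuses the cached data instead of re-reading

df.unpersist()                            # release it when no longer needed
```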
I'm extracting data from a custom format by day of month using a 32-core executor. I'm using RDDs to distribute work across the cores of the executor. I'm seeing an intermittent issue where for a run sometimes I see 31 cores being used as expected and ot...
I may have figured this out! I'm explicitly setting the number of slices instead of using the default: days_rdd = sc.parallelize(days_to_process, len(days_to_process))
I need to process a number of files where I manipulate file text utilising an external executable that operates on stdin/stdout. I am quite new to Spark. What I am attempting is to use rdd.pipe as in the following: exe_path = " /usr/local/bin/external...
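For context, a minimal sketch of piping records through an external stdin/stdout executable; the executable path and input/output locations are placeholders. rdd.pipe() sends each element as one line on stdin and yields each stdout line as a new RDD element:

```python
exe_path = "/usr/local/bin/external_tool"        # hypothetical executable

lines_rdd = sc.textFile("/mnt/data/input/*.txt")
transformed_rdd = lines_rdd.pipe(exe_path)

transformed_rdd.saveAsTextFile("/mnt/data/output")
```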
I have some code that uses RDDs and the sc.parallelize() and rdd.toDF() methods to get a DataFrame back out. The code works in a regular notebook (and if I run the notebook as a job) but fails if I do the same thing in a DLT pipeline. The error mess...
Thanks for your help Alex, I ended up rewriting my code with Spark UDFs; maybe there is a better solution with only the DataFrame API but I couldn't find it. To summarize my problem: I was trying to un-nest a large JSON blob (the fake data in my f...
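For comparison, un-nesting can often be done with the DataFrame API alone using from_json and explode; a sketch under assumed names (`df` is the input, `payload` holds the JSON string, `items` is a nested array), not the poster's actual schema:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Hypothetical schema for the nested blob; real data would need the actual structure.
schema = StructType([
    StructField("id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("value", StringType()),
    ]))),
])

parsed = df.withColumn("parsed", F.from_json(F.col("payload"), schema))

flat = (
    parsed
    .withColumn("item", F.explode("parsed.items"))   # one row per array element
    .select("parsed.id", "item.name", "item.value")
)
```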