Machine Learning

Issue in Converting PySpark DataFrame to Dictionary

Databricks3
Contributor

I have 3 questions listed below.

Q1. I need to install a third-party library on a Unity Catalog-enabled shared cluster, but I am not able to: the cluster is not accepting the DBFS path dbfs:/FileStore/jars/.

Q2. I have a requirement to load data from S3 files into Salesforce. I am using the simple-salesforce library to read from and write to Salesforce from Databricks. As per its documentation, we need to pass the data as dictionaries to the write function. When I try to convert the PySpark DataFrame, I get the error below.

from pyspark.sql.types import StructType, StructField, StringType

# Phone is declared as StringType below, so the phone numbers are passed as strings.
data2 = [
    ("Test_Conv1", "testmailconv1@yopmail.com", "Olivia", "A", "3000000000"),
    ("Test_Conv2", "testmailconv2@yopmail.com", "Jack", "B", "4000000000"),
    ("Test_Conv3", "testmailconv3@yopmail.com", "Williams", "C", "5000000000"),
    ("Test_Conv4", "testmailconv4@yopmail.com", "Jones", "D", "6000000000"),
    ("Test_Conv5", "testmailconv5@yopmail.com", "Brown", None, "9000000000"),
]
schema = StructType([
    StructField("LastName", StringType(), True),
    StructField("Email", StringType(), True),
    StructField("FirstName", StringType(), True),
    StructField("MiddleName", StringType(), True),
    StructField("Phone", StringType(), True),
])
df = spark.createDataFrame(data=data2, schema=schema)

# This line raises the security exception below on the UC-enabled shared cluster:
df_contact = df.rdd.map(lambda row: row.asDict()).collect()

# sf is a simple_salesforce Salesforce client created earlier
sf.bulk.Contact.insert(df_contact, batch_size=20000, use_serial=True)

Error message:

py4j.security.Py4JSecurityException: Method public org.apache.spark.rdd.RDD org.apache.spark.api.java.JavaRDD.rdd() is not whitelisted on class class org.apache.spark.api.java.JavaRDD

Could you please help me convert the DataFrame to a dictionary?

Q3. Even if there is a way to convert the DataFrame to a dictionary, it could hurt performance for large datasets. Is there a more optimized way to load the data into Salesforce?


4 REPLIES

-werners-
Esteemed Contributor III
(Accepted solution)

1. https://docs.databricks.com/dbfs/unity-catalog.html

To interact with files directly using DBFS, you must have ANY FILE permissions granted.
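For reference, a minimal sketch of how a workspace admin might grant that permission; the principal email is a placeholder:

# Hypothetical principal; run by a workspace admin.
spark.sql("GRANT SELECT ON ANY FILE TO `someone@example.com`")
spark.sql("GRANT MODIFY ON ANY FILE TO `someone@example.com`")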

2. Can you try one of these methods for converting the DataFrame to a list of dictionaries? (A sketch of two options follows below.)
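For example, a minimal sketch of two driver-side conversions; both assume the DataFrame fits in driver memory, and neither goes through df.rdd, the call that the security exception blocks:

# Option 1: collect Rows on the driver and convert each one to a dict.
df_contact = [row.asDict() for row in df.collect()]

# Option 2: go through pandas (requires pandas on the cluster).
df_contact = df.toPandas().to_dict(orient="records")

sf.bulk.Contact.insert(df_contact, batch_size=20000, use_serial=True)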

3. Depending on the size of the data, this will have an impact, but I think the bottleneck will be on the Salesforce side.
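If collecting everything to the driver becomes the bottleneck, a rough sketch of a more distributed pattern is below. It is hypothetical: it assumes simple-salesforce credentials are available to the workers (the values shown are placeholders), and it may require a cluster access mode that allows executor-side Python functions, since shared clusters restrict parts of this API:

from simple_salesforce import Salesforce

def push_partition(rows):
    # One Salesforce connection per partition; credentials are placeholders.
    sf = Salesforce(username="someone@example.com",
                    password="<password>",
                    security_token="<token>")
    batch = [row.asDict() for row in rows]
    if batch:
        sf.bulk.Contact.insert(batch, batch_size=10000, use_serial=True)

df.foreachPartition(push_partition)

This keeps the upload distributed instead of materializing the full dataset on the driver, at the cost of one Salesforce connection per partition.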

Databricks3
Contributor

This is not a permission issue. I have uploaded the third-party libraries to Databricks, but the cluster is not accepting the JAR paths.

-werners-
Esteemed Contributor III

Your third-party libs are in DBFS, so it might still be that issue.

Anonymous
Not applicable

Hi @SK ASIF ALI,

We haven't heard from you since the last response from @werners. Kindly share the requested information with us, and we will provide you with the necessary solution.

Thanks and Regards
