Data Engineering

Forum Posts

by tanjil • New Contributor III

01-08-2023 9:50:11 PM

3034 Views
4 replies
2 kudos

print(flush = True) not working

Hello, I have the following minimal working example using multiprocessing: from multiprocessing import Pool files_list = [('bla', 1, 3, 7), ('spam', 12, 4, 8), ('eggs', 17, 1, 3)] def f(t): print('Hello from child process', flush = Tr...

Latest Reply

tanjil
New Contributor III

01-10-2023 1:58:47 AM

2 kudos

No errors are generated. The code executes successfully, but the print statement for "Hello from child process" does not work.
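
A possible workaround, sketched under the assumption that text printed by the worker processes is simply not reaching the notebook cell's output: have each worker return its message and print it from the parent process. The `files_list` tuples come from the question; the body of `f` and the helper `run` are hypothetical stand-ins, since the original function is truncated above.

```python
from multiprocessing import Pool

# Input tuples copied from the question.
files_list = [('bla', 1, 3, 7), ('spam', 12, 4, 8), ('eggs', 17, 1, 3)]


def f(t):
    # Hypothetical stand-in for the truncated function in the question:
    # build the message and return it instead of printing in the child.
    name = t[0]
    return f"Hello from child process: {name}"


def run(items, processes=2):
    # Pool.map sends the workers' return values back to the parent,
    # in the same order as the inputs.
    with Pool(processes=processes) as pool:
        return pool.map(f, items)


if __name__ == "__main__":
    # Printing happens in the parent process.
    for message in run(files_list):
        print(message, flush=True)
```

Returning values rather than printing also keeps the output order deterministic, because `Pool.map` preserves input order regardless of which worker finishes first.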

by Volkan_Gumuskay • New Contributor III

09-06-2022 3:35:31 AM

8697 Views
6 replies
3 kudos

Resolved! Is there a way to run a single or selected lines in a notebook?

Assume we have a given cell: print('A') print('B') print('C'). I want to run only the line below: print('B'). Obviously, I can separate the cell into three and run the one I want, but this is time-consuming. This is a feature I use so often (e.g. in PyCharm) and wo...

Latest Reply

Tharun-Kumar
Databricks Employee

07-12-2023 5:28:56 AM

3 kudos

@Volkan_Gumuskay This is also available as an option in the notebook run options.

by shelly • New Contributor

03-28-2023 9:07:50 PM

2954 Views
3 replies
0 kudos

take() operation throwing index out of range error

x=[1,2,3,4,5,6,7] rdd = sc.parallelize(x) print (rdd.take(2)) Traceback (most recent call last): File "/usr/local/spark/python/pyspark/serializers.py", line 458, in dumps return cloudpickle.dumps(obj, pickle_protocol) ^^^^^^^^^^^^^^^^^^...

Latest Reply

Anonymous
Not applicable

04-03-2023 11:25:54 PM

0 kudos

Hi @Shelly Bhardwaj, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Th...

by Callum • New Contributor II

12-01-2022 7:05:53 AM

12650 Views
3 replies
2 kudos

Pyspark Pandas column or index name appears to persist after being dropped or removed.

So, I have this code for merging dataframes with pyspark pandas. And I want the index of the left dataframe to persist throughout the joins. So following suggestions from others wanting to keep the index after merging, I set the index to a column bef...

Latest Reply

Serlal
New Contributor III

01-31-2023 3:01:12 AM

2 kudos

Hi! I tried debugging your code and I think that the error you get is simply because the column exists in two instances of your dataframe within your loop. I tried adding some extra debug lines in your merge_dataframes function, and after executing that...

by RohitKulkarni • Contributor II

08-23-2022 6:25:00 AM

5215 Views
2 replies
1 kudos

Salesforce to Databricks

Hello Team, I am trying to run a Salesforce extract and pull the data. While doing so I am facing the issue below: SOURCE_SYSTEM_NAME = 'Salesforce' TABLE_NAME = 'XY' desc = eval("sf." + TABLE_NAME + ".describe()") print(desc) for field in desc['fields']...

Latest Reply

Vidula
Honored Contributor

09-12-2022 6:07:02 AM

1 kudos

Hi @Rohit Kulkarni, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Tha...
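
Separately from whatever error the truncated question reports, the `eval("sf." + TABLE_NAME + ".describe()")` lookup in it can usually be written with `getattr`, which avoids evaluating a string as code. The sketch below assumes `sf` is a client object that exposes tables as attributes, as the `eval` expression implies; `FakeClient` and `FakeTable` are hypothetical stand-ins so the example runs without a Salesforce connection, and the field names are made up.

```python
# Hypothetical stand-ins so the sketch runs without a Salesforce connection.
class FakeTable:
    def __init__(self, name):
        self.name = name

    def describe(self):
        # Mimics only the shape the question relies on: a dict with 'fields'.
        return {"name": self.name, "fields": [{"name": "Id"}, {"name": "Name"}]}


class FakeClient:
    def __getattr__(self, table_name):
        # Called for attributes not found normally, e.g. client.XY
        return FakeTable(table_name)


sf = FakeClient()
TABLE_NAME = 'XY'

# Same lookup as eval("sf." + TABLE_NAME + ".describe()"), without eval.
desc = getattr(sf, TABLE_NAME).describe()

field_names = [field["name"] for field in desc["fields"]]
print(field_names)
```

With `getattr`, a table name that arrives from configuration or user input is treated as a plain attribute name and never executed as an expression.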

by Data_Bricks1 • New Contributor III

10-13-2021 11:47:18 AM

4183 Views
7 replies
0 kudos

data from 10 BLOB containers and multiple hierarchical folders (every day and every hour folders) in each container to Delta Lake table in Parquet format - Incremental loading for latest data only insert no updates

I am able to load data for a single container by hard-coding it, but I am not able to load from multiple containers. I used a for loop, but the data frame ends up holding only the record from the last container's last folder. One more issue is that I have to flatten the data, when I ...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

10-14-2021 3:48:17 AM

0 kudos

For sure, the function (def) should be declared outside the loop; move it to just after the library imports. The logic is a bit complicated, so you need to debug it using display(Flatten_df2) (or .show()) and validate the JSON after each iteration (using break, sleep, etc.).
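
The advice above, together with the symptom in the question (only the last container's data survives the loop), can be sketched in plain Python. Lists of dicts stand in for DataFrames, and the container names, `read_container`, and `flatten` are hypothetical; this is an illustration of the pattern under those assumptions, not the original poster's code. In Spark, the per-container DataFrames would be collected in the same way and combined after the loop.

```python
def flatten(record, prefix=""):
    # Defined once, outside the loop: turns nested dicts into dotted keys.
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def read_container(container):
    # Hypothetical reader; a real job would load the container's folders here.
    return [{"container": container, "payload": {"hour": h, "value": h * 10}}
            for h in range(2)]


containers = ["container01", "container02", "container03"]

all_rows = []  # accumulates across iterations instead of being overwritten
for container in containers:
    rows = [flatten(r) for r in read_container(container)]
    print(container, len(rows))  # inspect each iteration while debugging
    all_rows.extend(rows)

print(len(all_rows))
```

The key point is that `all_rows` is created before the loop and extended inside it; assigning the loop's result to the same variable on every pass would keep only the final container.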

Databricks Community
