Data Engineering

Forum Posts

by User16752240150 • New Contributor II

06-04-2021 12:35:28 PM

1405 Views
1 reply
1 kudos

Resolved! If I write pandas code using Koalas and have Photon enabled, will my pandas code run on Photon?

Latest Reply

holly
Databricks Employee

04-08-2024 3:30:16 AM

1 kudos

Hi there! I appreciate this reply comes 3 years after the question was originally asked, but people may still be coming across it. A few things: Koalas was deprecated in Spark 3.2 (runtime 10.4). Instead, the recommendation is to use pandas on Spark with `im...

by MattPython • New Contributor

02-01-2023 5:20:15 AM

25753 Views
4 replies
0 kudos

How do you read files from the DBFS with OS and Pandas Python libraries?

I created translations for decoded values and want to save the dictionary object to the DBFS for mapping. However, I am unable to access the DBFS without using dbutils or the PySpark library. Is there a way to access the DBFS with OS and Pandas Python libra...

Latest Reply

User16789202230
Databricks Employee

12-21-2023 2:38:02 AM

0 kudos

db_path = 'file:///Workspace/Users/l<xxxxx>@databricks.com/TITANIC_DEMO/tested.csv'
df = spark.read.csv(db_path, header="True", inferSchema="True")
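For the original question (reading DBFS files with `os` and pandas rather than `dbutils` or PySpark), one commonly used route is the `/dbfs` FUSE mount that Databricks clusters typically expose on the local filesystem. The sketch below assumes that mount is available on the cluster; `to_local_dbfs_path` is a hypothetical helper name and the example path is made up for illustration.

```python
import posixpath


def to_local_dbfs_path(path: str) -> str:
    """Map a 'dbfs:/...' URI to its '/dbfs/...' local (FUSE) equivalent.

    Assumption: the cluster exposes DBFS at /dbfs on the local filesystem.
    Paths that already point under /dbfs are returned unchanged.
    """
    if path.startswith("dbfs:/"):
        # Drop the scheme and re-root the remainder under the /dbfs mount.
        relative = path[len("dbfs:/"):].lstrip("/")
        return posixpath.join("/dbfs", relative)
    if path == "/dbfs" or path.startswith("/dbfs/"):
        return path
    raise ValueError(f"not a DBFS path: {path!r}")


# Hypothetical example path, for illustration only.
local_path = to_local_dbfs_path("dbfs:/FileStore/mappings/translations.json")
print(local_path)  # prints /dbfs/FileStore/mappings/translations.json
```

On a cluster, the returned path can then be passed to ordinary file APIs such as `open(local_path)`, `os.listdir(...)`, or `pd.read_csv(...)`.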

3 More Replies

by hamzatazib96 • New Contributor III

08-18-2021 9:11:46 AM

73147 Views
21 replies
12 kudos

Resolved! Read file from dbfs with pd.read_csv() using databricks-connect

Hello all, As described in the title, here's my problem:
1. I'm using databricks-connect in order to send jobs to a Databricks cluster
2. The "local" environment is an AWS EC2
3. I want to read a CSV file that is in DBFS (Databricks) with pd.read_cs...

Latest Reply

so16
New Contributor II

07-19-2023 1:13:17 PM

12 kudos

Please, I need your help. I still have the same issue after reading all your comments. I am using Databricks Connect (version 13.1) in PyCharm and trying to load files that are on DBFS storage. spark = DatabricksSession.builder.remote( host=host...

20 More Replies

by gtyhchang • New Contributor II

05-12-2023 8:55:49 PM

1383 Views
2 replies
1 kudos

pandas issue

We identified a potential bug in either DBFS or pandas: writing a DataFrame with pandas `to_csv`, `to_parquet`, `to_pickle`, etc. to a mounted ADLS location using a read-only service principal did not throw a permission-denied exception. However, met...

Latest Reply

Anonymous
Not applicable

06-21-2023 12:14:34 AM

1 kudos

Hi @Yung-Hang Chang, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Th...

1 More Replies

by Yash_542965 • New Contributor II

05-16-2023 9:11:19 AM

8240 Views
2 replies
3 kudos

Resolved! Access Excel file in delta live pipeline

I'm having an issue accessing an Excel file through a DLT pipeline. The file is in ADLS, and I'm using pandas to read it. It seems pandas is not able to understand the abfss protocol. Is there any way to read Excel with pandas in a DLT pipeline? I'm getting thi...

Latest Reply

Yash_542965
New Contributor II

06-09-2023 12:16:13 AM

3 kudos

Thanks for the info. It works; I just needed to install an additional library using "%pip install openpyxl".

1 More Replies

by Vindhya • New Contributor II

04-18-2023 3:41:51 PM

2023 Views
1 reply
0 kudos

Dataframes to Pandas conversion step is failing with exception "java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))"

The Dataframes to Pandas conversion step is failing with the exception "java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))". Please find below a screenshot with more details.

Latest Reply

Anonymous
Not applicable

04-23-2023 9:14:00 PM

0 kudos

Hi @Vindhya D, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...

by elgeo • Valued Contributor II

02-21-2023 3:21:41 AM

8641 Views
1 reply
0 kudos

Iteration - Pyspark vs Pandas

Hello. Could someone please explain why iteration over a PySpark DataFrame is way slower than over a pandas DataFrame?
PySpark:
df_list = df.collect()
for index in range(0, len(df_list)):
    .....
pandas:
df_pnd = df.toPandas()
for index, row in df_p...

Latest Reply

Anonymous
Not applicable

04-22-2023 12:11:56 AM

0 kudos

Hi @ELENI GEORGOUSI, hope everything is going great. Just wanted to check in: were you able to resolve your issue? If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us ...

by afzi • New Contributor II

08-10-2022 10:40:47 PM

2654 Views
1 reply
1 kudos

Pandas DataFrame error when using to_csv

Hi Everyone, I would like to write a pandas DataFrame to /dbfs/FileStore/ using the to_csv method. Usually it would just write the DataFrame to the path described, but it has been giving me "FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStor...

Latest Reply

Avinash_94
New Contributor III

04-14-2023 12:31:19 AM

1 kudos

f = open("/dbfs/mnt/blob/myNames.txt", "r")
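One possible cause of a `FileNotFoundError` on write (not confirmed for this thread, where the failing path is truncated) is that the parent directory of the target file does not exist yet; pandas `to_csv` does not create missing directories. The following is a minimal sketch under that assumption. It writes with the standard `csv` module into a temporary directory so that it runs anywhere, and `write_rows` is a hypothetical helper name.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows to a CSV file, creating missing parent directories first."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


# Demo in a temporary directory; on Databricks the target would be a /dbfs/... path.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "exports", "2022", "names.csv")
    write_rows(target, [["id", "name"], [1, "Ada"]])
    with open(target, newline="") as handle:
        written = list(csv.reader(handle))

print(written)
```

With pandas, the same idea would be `os.makedirs(os.path.dirname(path), exist_ok=True)` followed by `df.to_csv(path)`.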

by mahesh_vardhan_ • New Contributor

03-02-2023 12:40:23 AM

4582 Views
2 replies
2 kudos

Resolved! How do I use numpy case when condition in pyspark.pandas?

I have some legacy pandas code that I want to migrate to Spark to leverage parallelization in Databricks. I see Databricks has launched a wrapper package on top of pandas that uses pandas nomenclature but uses the Spark engine in the backend. I comf...

Latest Reply

Anonymous
Not applicable

03-16-2023 10:30:40 PM

2 kudos

Hi @mahesh vardhan gandhi, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from ...

1 More Replies

by Callum • New Contributor II

12-01-2022 7:05:53 AM

12512 Views
3 replies
2 kudos

Pyspark Pandas column or index name appears to persist after being dropped or removed.

So, I have this code for merging dataframes with pyspark pandas. And I want the index of the left dataframe to persist throughout the joins. So following suggestions from others wanting to keep the index after merging, I set the index to a column bef...

Latest Reply

Serlal
New Contributor III

01-31-2023 3:01:12 AM

2 kudos

Hi! I tried debugging your code, and I think the error you get is simply because the column exists in two instances of your dataframe within your loop. I tried adding some extra debug lines in your merge_dataframes function, and after executing that...

2 More Replies

by Ancil • Contributor II

01-18-2023 10:46:57 AM

1800 Views
1 reply
1 kudos

PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.

I have a pandas_udf that works for 4 rows, but when I try it with more than 4 rows I get the error below. PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was...

Latest Reply

Ancil
Contributor II

01-22-2023 5:33:17 PM

1 kudos

@Kaniz Fatma Can you please help me with pandas_udf? In the scenario above I used regular expressions, for which there is a Spark method, but I have other pandas_udfs with the same issue.

by Ancil • Contributor II

01-17-2023 3:08:23 AM

2991 Views
3 replies
1 kudos

Resolved! PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.

I have a pandas_udf that works for 1 row, but when I try it with more than one row I get the error below. PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output w...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

01-17-2023 4:18:21 AM

1 kudos

I tested it, and your function is correct. So there must be an error in the inputData type (is it all string?) or in result_json. Please also check the runtime version. I was using 11 LTS.
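The error message in this thread describes a length contract: as I understand the PySpark documentation, a scalar iterator pandas UDF must produce the same total number of output rows as it receives input rows. Below is a plain-Python model of that contract, in which lists stand in for `pandas.Series` batches and every function name is hypothetical. It is not the poster's code and does not claim to identify their bug; it only shows the shape of function that triggers this message.

```python
from typing import Callable, Iterable, Iterator, List

Batch = List[str]


def per_row(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Correct shape: one output element per input element."""
    for batch in batches:
        yield [value.upper() for value in batch]


def collapsed(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Broken shape: reduces each batch to a single value."""
    for batch in batches:
        yield [",".join(batch)]


def check_lengths(func: Callable[[Iterable[Batch]], Iterator[Batch]],
                  batches: List[Batch]) -> None:
    """Raise if the total output length differs from the total input length."""
    total_in = sum(len(batch) for batch in batches)
    total_out = sum(len(out) for out in func(iter(batches)))
    if total_out != total_in:
        raise RuntimeError(
            f"the length of output was {total_out} "
            f"and the length of input was {total_in}"
        )


check_lengths(per_row, [["a", "b"]])    # passes silently
check_lengths(collapsed, [["a"]])       # a single row hides the problem
try:
    check_lengths(collapsed, [["a", "b"]])
except RuntimeError as err:
    message = str(err)
    print(message)
```

A function like `collapsed` appears to work on one row and fails as soon as a batch holds two, which matches the symptom reported here, though other causes are possible.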

2 More Replies

by Mado • Valued Contributor II

10-22-2022 3:38:00 AM

2096 Views
2 replies
3 kudos

How to apply Pandas functions on PySpark DataFrame?

Hi, I want to apply pandas functions (like isna, concat, append, etc.) on a PySpark DataFrame in such a way that computations are done on a multi-node cluster. I don't want to convert the PySpark DataFrame into a pandas DataFrame since, I think, only one node is...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

10-23-2022 2:00:08 PM

3 kudos

The best option is to use pandas on Spark. It is virtually interchangeable, so it is just a different API for the Spark DataFrame:
import pyspark.pandas as ps
psdf = ps.range(10)
sdf = psdf.to_spark().filter("id > 5")
sdf.show()

1 More Replies

by Dicer • Valued Contributor

07-02-2022 4:27:46 AM

21993 Views
12 replies
13 kudos

Resolved! Failed to convert Spark.sql to Pandas Dataframe using .toPandas()

I wrote the following code:
data = spark.sql(" SELECT A_adjClose, AA_adjClose, AAL_adjClose, AAP_adjClose, AAPL_adjClose FROM deltabase.a_30min_delta, deltabase.aa_30min_delta, deltabase.aal_30min_delta, deltabase.aap_30min_delta, deltabase.aapl_30m...

Latest Reply

Dicer
Valued Contributor

07-18-2022 11:39:47 PM

13 kudos

I just discovered a solution. Today, I opened Azure Databricks. When I imported Python libraries, Databricks told me that toPandas() was deprecated and suggested that I use toPandas. The following solution works: Use toPandas instead of toPandas() da...

11 More Replies

by sdaza • New Contributor III

05-29-2018 8:13:21 PM

24362 Views
12 replies
5 kudos

Displaying Pandas Dataframe

I had this issue when displaying pandas data frames. Any ideas on how to display a pandas dataframe?
display(mydataframe)
Exception: Cannot call display(<class 'pandas.core.frame.DataFrame'>)

Latest Reply

Tim_Green
New Contributor II

06-07-2022 2:13:21 PM

5 kudos

A simple way to get a nicely formatted table from a pandas dataframe:
displayHTML(df.to_html())
to_html has some parameters you can control the output with. If you want something less basic, try out this code that I wrote that adds scrolling and some ...
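The reply's own scrolling code is cut off in this listing, so the following is a separate, hypothetical sketch of the same idea rather than the author's implementation: wrap the HTML table in a `div` with a fixed maximum height and `overflow: auto`. The helper name `scrollable_html` and the default height are assumptions, and a literal HTML string stands in for `df.to_html()` so the example runs without pandas.

```python
def scrollable_html(table_html: str, max_height_px: int = 400) -> str:
    """Wrap an HTML table in a div that scrolls once it exceeds max_height_px."""
    style = f"max-height: {int(max_height_px)}px; overflow: auto;"
    return f'<div style="{style}">{table_html}</div>'


# Stand-in for df.to_html(); any HTML table string works here.
table = "<table><tr><th>a</th></tr><tr><td>1</td></tr></table>"
wrapped = scrollable_html(table, max_height_px=300)
print(wrapped)
```

In a Databricks notebook this would be used as `displayHTML(scrollable_html(df.to_html()))`, assuming `df` is a pandas DataFrame.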

11 More Replies