Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi, I have multiple datasets in my data lake that feature valid_from and valid_to columns indicating the validity period of each row. A row that is currently valid is indicated by valid_to = 9999-12-31 00:00:00. Example: Loading this into a Spark dataframe works fine...
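A minimal sketch of the pattern being described, assuming the data is registered as a table (the table name and the string-to-timestamp cast are illustrative, not from the original post):
from pyspark.sql import functions as F

# hypothetical table holding the SCD2-style data from the question
df = spark.read.table("lake.my_dataset")

# currently valid rows carry the 9999-12-31 high-date sentinel
current = df.filter(F.col("valid_to") == F.to_timestamp(F.lit("9999-12-31 00:00:00")))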
Be aware that in Databricks 15.2 LTS this behavior is broken. I cannot find the exact code, but it is most likely related to the following option: https://github.com/apache/spark/commit/c1c710e7da75b989f4d14e84e85f336bc10920e0#diff-f9ddcc6cba651c6ebfd34e29ef049c3...
Hi @Hubert Dudek, the Pandas API doesn't support the abfss protocol. You have three options: 1) If you need to use pandas, you can write the Excel file to the local file system (dbfs) and then move it to ABFSS (for example with dbutils), as sketched below. 2) Write as csv directly in abfss...
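A minimal sketch of the first option; the container, account, and file paths are placeholders, and to_excel assumes openpyxl is installed:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
# pandas only sees the driver-local file system, so write via the /dbfs FUSE mount
df.to_excel("/dbfs/tmp/report.xlsx", index=False)
# then move the file to ABFSS with dbutils
dbutils.fs.cp(
    "dbfs:/tmp/report.xlsx",
    "abfss://container@account.dfs.core.windows.net/reports/report.xlsx",
)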
I have a dataframe that inexplicably takes forever to write to an MS SQL Server, even though other dataframes, even much larger ones, write nearly instantly. I'm using this code:
my_dataframe.write.format("jdbc")
    .option("url", sqlsUrl)
    .optio...
Had a similar issue. I can do 1-4 million rows in 1 minute via SSIS ETL on SQL Server. The table is 15 fields wide. Looking at your code, it seems you have many fields, but nothing like the 300-400 fields that can affect performance. You can check SQL Server ...
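On the Spark side, a sketch of the same JDBC write with the two options that most often govern throughput; sqlsUrl is from the original post, while the table name and credentials are placeholders:
(
    my_dataframe.write.format("jdbc")
    .option("url", sqlsUrl)
    .option("dbtable", "dbo.my_table")
    .option("user", user)
    .option("password", password)
    .option("batchsize", 10000)    # rows per round trip; the default of 1000 is often too small
    .option("numPartitions", 8)    # number of parallel JDBC connections
    .mode("append")
    .save()
)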
I am curious what is going on under the hood when using the `multiprocessing` module to parallelize a function call and apply it to a Pandas DataFrame along the row axis. Specifically, how does it work with the Databricks architecture / compute? My cluster ...
@Keval Shah : When using the multiprocessing module in Python to parallelize a function call and apply it to a Pandas DataFrame along the row axis, the following happens under the hood: the Pool object is created with the specified number of processes...
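A minimal, self-contained sketch of that pattern. One point worth stressing: in a Databricks cluster these worker processes all run on the driver node, so multiprocessing does not use the Spark workers at all. The function and column names here are hypothetical:
import multiprocessing as mp
import pandas as pd

def transform(chunk: pd.DataFrame) -> pd.DataFrame:
    # hypothetical row-wise work; each process receives its own copy of the chunk
    chunk["doubled"] = chunk["value"] * 2
    return chunk

if __name__ == "__main__":
    df = pd.DataFrame({"value": range(1_000)})
    n_proc = 4
    # split the frame into row chunks, one per process
    chunks = [df.iloc[i::n_proc] for i in range(n_proc)]
    with mp.Pool(processes=n_proc) as pool:
        result = pd.concat(pool.map(transform, chunks)).sort_index()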
I trained my model and was able to get the batch prediction from that model as specified below. But I want to also get the probability scores for each prediction. Do you have any idea? Thank you!
logged_model = path_to_model
# Load model as a PyFuncMod...
Now you can log the model using this parameter:
mlflow.sklearn.log_model(
    ...,  # the usual params
    pyfunc_predict_fn="predict_proba"
)
which will apparently return the probabilities for the first class when using the model for inference (e.g. when...
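A fuller sketch of that approach, assuming a recent MLflow version in which mlflow.sklearn.log_model accepts pyfunc_predict_fn; the toy model and data are illustrative:
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        pyfunc_predict_fn="predict_proba",  # pyfunc inference will call predict_proba
    )

# loading the model as a PyFunc now yields probabilities instead of labels
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
proba = loaded.predict(X)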
The DataFrame-to-Pandas conversion step is failing with the exception "java.lang.IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))". PFB screenshot for more details.
Hi @Vindhya D Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...
Hi everyone, I would like to write a Pandas DataFrame to /dbfs/FileStore/ using the to_csv method. Usually it would just write the DataFrame to the path described, but it has been giving me "FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStor...
I am attempting to save a pandas DataFrame as csv to a directory I created in the Databricks workspace or in the `cwd`.
import pandas as pd
import os

df.to_csv("data.csv", index=False)
df.to_csv(str(os.getcwd()) + "/data.csv", index=False)
...
Hi @Keval Shah, you can save your dataframe to csv in dbfs storage. Please refer to the code below, which might help you:
import pandas as pd
from io import StringIO  # 'data' below is a CSV string from earlier in the thread

df = pd.read_csv(StringIO(data), sep=',')
#print(df)
df.to_csv('/dbfs/FileStore/ajay/file1.txt')
Hi everyone, I am wondering if anyone has experience using the bamboolib library within Databricks. I am currently using it for a client to display table data in the UI and potentially allow users to edit existing rows and insert new ones. While I hav...
Hi @Chhaya Vishwakarma I'm sorry you could not find a solution to your problem in the answers provided. Our community strives to provide helpful and accurate information, but sometimes an immediate solution may only be available for some issues. I sug...
Hi, I want to run df = pd.read_csv('/dbfs/FileStore/airlines1.csv'), but when I try to run it I get an error like FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/airlines1.csv'. Could you please help me out with how to run a pandas dataframe in...
Hi @Tinendra Kumar Hope all is well! Just wanted to check in if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...
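Two approaches that commonly resolve that FileNotFoundError, assuming the file really exists under FileStore (worth verifying first; the paths below mirror the question):
# confirm the file is where you expect it
display(dbutils.fs.ls("dbfs:/FileStore/"))

# option 1: pandas via the /dbfs FUSE mount (not available on every cluster type)
import pandas as pd
pdf = pd.read_csv("/dbfs/FileStore/airlines1.csv")

# option 2: read with Spark, then convert to pandas
pdf = spark.read.csv("dbfs:/FileStore/airlines1.csv", header=True).toPandas()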
Assume that I have a Spark DataFrame, and I want to see if records satisfy a condition. Example dataset:
# Prepare Data
data = [('A', 1),
        ('A', 2),
        ('B', 3)]

# Create DataFrame
columns = ['col_1', 'col_2']
df = spark.createDataFrame(data, columns)
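A sketch of one common way to answer the question itself, i.e. to test whether any record satisfies a condition (the condition col_2 > 2 is illustrative):
from pyspark.sql import functions as F

# limit(1) lets Spark stop scanning as soon as one match is found
has_match = df.filter(F.col('col_2') > 2).limit(1).count() > 0
print(has_match)  # True, because ('B', 3) matches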
Hi all, I want to plot multiple charts from a pandas dataframe. However, when I run the code below it says "Command result size exceeds limit: Exceeded 20971520 bytes (current = 20973124)". If I move line 11 and place it at 21 (outside of the functi...
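One pattern that can keep the notebook output under that limit when looping over many charts: render and close each figure as you go rather than letting them accumulate in a single command result (a sketch; the data and loop body are illustrative):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"a": range(100), "b": range(100)})
for col in df.columns:
    fig, ax = plt.subplots()
    df[col].plot(ax=ax, title=col)
    plt.show()
    plt.close(fig)  # release the figure so it isn't re-serialized into the output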
Hi, I have a few questions about "Pandas API on Spark". Thanks for your time to read my questions. 1) Are the inputs to these functions Pandas DataFrames or PySpark DataFrames? 2) When I use any pandas function (like isna, size, apply, where, etc.), does it ru...
Hi @Mohammad Saber , a Pandas dataset lives on a single machine and is naturally iterable locally within that machine. However, a pandas-on-Spark dataset lives across multiple machines, and it is computed in a distributed manner. It is difficu...
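A small sketch of the distinction, using the pyspark.pandas entry point (the values are toy data):
import pyspark.pandas as ps

psdf = ps.range(10)                  # a pandas-on-Spark DataFrame, partitioned across the cluster
psdf["double"] = psdf["id"] * 2      # pandas-style syntax, executed as Spark jobs
pdf = psdf.to_pandas()               # only this step collects the data onto one machine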
Along with several other issues I'm encountering, I am finding pandas DataFrame to_sql to be very slow. I am writing to an Azure SQL database and performance is woeful. This is a test database, and it has the S3 tier (100 DTU) and one user, me, as its configuratio...
Hi @Peter McLarty Does @Debayan Mukherjee's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!
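For the to_sql path specifically, a sketch of the change that most often helps against Azure SQL: enabling pyodbc's fast_executemany through SQLAlchemy so inserts are batched instead of sent row by row (the connection string and table name are placeholders):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@server.database.windows.net/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,  # batch parameterized inserts on the client side
)
df = pd.DataFrame({"a": range(1000)})
df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=10_000)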
I need to convert a Spark dataframe to a pandas dataframe with Arrow optimization:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
data_df = df.toPandas()
but I randomly get one of the errors below while doing so: Exception: arrow is not support...
Can you confirm this is a known issue? Running into the same issue; example to test in 1 cell:
# using Arrow fails on HighConcurrency-cluster with PassThrough in runtime 10.4 (and 10.5 and 11.0)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled",...