SQL connector from databricks-sql-connector takes too much time to convert to pandas

leon — Sun, 10 Jul 2022 09:11:01 GMT

Hello,

I am using querying my Delta Lake with SQL Connect and later want to explore the result in pandas.

with connection.cursor() as cursor:
        cur = cursor.execute("""
            SELECT DISTINCT sample_timestamp, value, name
            FROM default.raw_delta
            WHERE name in ( 'sensor-1', 'sensor-2, 'sensor-3','sensor-4')
             AND date >= 20200601
             AND date <= 20200731
            ORDER BY name, sample_timestamp
        """)
df = pd.DataFrame.from_records(cur.fetchall(), columns=['sample_timestamp', 'value', 'name'])
display(df)

While the query is really fast ~8s, the conversion to pandas takes almost 2 minutes.

I am running the code in local jupyter and also in databricks notebook, both with same performance.

What might cause the bad performance and is there a way to speed it up?

I also tried fetchall_arrow() but the pandas dimensions got mixed up (rows become columns).

Thanks,

Leon

Re: SQL connector from databricks-sql-connector takes too much time to convert to pandas

leon — Sun, 17 Jul 2022 08:28:59 GMT

thanks @Kaniz Fatma for the reply.

I am using sql.connector and do believe that the spark session is underline, is this config still relevant for sql.connector?

I overcome the fetchall_arrow() issue from my original question and do believe that I am using arrow implicitly now.

However, I don't see much improvements from fetchall and fetchall_arrow

topic SQL connector from databricks-sql-connector takes too much time to convert to pandas in Data Engineering

SQL connector from databricks-sql-connector takes too much time to convert to pandas

Re: SQL connector from databricks-sql-connector takes too much time to convert to pandas