Databricks Community

leon · ‎07-10-2022

Hello,

I am querying my Delta Lake with the Databricks SQL Connector and later want to explore the result in pandas.

import pandas as pd

# `connection` is an already-open databricks-sql-connector connection
with connection.cursor() as cursor:
    cursor.execute("""
        SELECT DISTINCT sample_timestamp, value, name
        FROM default.raw_delta
        WHERE name IN ('sensor-1', 'sensor-2', 'sensor-3', 'sensor-4')
          AND date >= 20200601
          AND date <= 20200731
        ORDER BY name, sample_timestamp
    """)
    # Fetch the rows while the cursor is still open
    rows = cursor.fetchall()
df = pd.DataFrame.from_records(rows, columns=['sample_timestamp', 'value', 'name'])
display(df)

While the query itself is fast (~8 s), the conversion to pandas takes almost 2 minutes.

I am running the code in a local Jupyter notebook and also in a Databricks notebook, and the performance is the same in both.

What might cause the bad performance and is there a way to speed it up?

I also tried fetchall_arrow(), but the dimensions of the resulting pandas DataFrame got mixed up (rows became columns).
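For reference, here is a minimal sketch of how the Arrow path can be converted without transposing the result. It assumes that fetchall_arrow() returns a pyarrow.Table, and that the table's own to_pandas() method is used instead of handing the table to pd.DataFrame.from_records. The helper name query_to_dataframe is hypothetical and only used for illustration.

```python
def query_to_dataframe(cursor, query):
    """Run `query` on an open cursor and return the result as a pandas DataFrame.

    Hypothetical helper: assumes `cursor` comes from databricks-sql-connector
    and that fetchall_arrow() returns a pyarrow.Table.
    """
    cursor.execute(query)
    # One Arrow column per selected field, one Arrow row per result row
    arrow_table = cursor.fetchall_arrow()
    # Table.to_pandas() converts column by column, so rows stay rows
    return arrow_table.to_pandas()
```

It would be called inside the same with connection.cursor() as cursor: block as above, for example df = query_to_dataframe(cursor, "SELECT ..."). The column names then come from the query itself, so no columns=[...] argument is needed.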

Thanks,

Leon

leon · ‎07-17-2022

Thanks, @Kaniz Fatma, for the reply.

I am using sql.connector and do believe that a Spark session is underlying it. Is this config still relevant for sql.connector?

I overcame the fetchall_arrow() issue from my original question and do believe that I am using Arrow implicitly now.

However, I don't see much improvement when switching from fetchall() to fetchall_arrow().
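To narrow down where the time actually goes, the stages can be timed separately. The following is only a rough sketch using the standard library; the helper names time_stage and profile_fetch are hypothetical, and it assumes the cursor exposes execute(), fetchall() and fetchall_arrow() as used above.

```python
import time

def time_stage(label, func, *args, **kwargs):
    """Call func, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed

def profile_fetch(cursor, query, fetch_method="fetchall"):
    """Time execute() and one fetch method (given by name) separately."""
    _, t_execute = time_stage("execute", cursor.execute, query)
    # Look up the fetch method by name so both paths can be compared
    rows, t_fetch = time_stage(fetch_method, getattr(cursor, fetch_method))
    return rows, {"execute": t_execute, fetch_method: t_fetch}
```

Running profile_fetch(cursor, query, "fetchall") and then profile_fetch(cursor, query, "fetchall_arrow"), and wrapping the pandas conversion in its own time_stage("to pandas", ...) call, would show whether the two minutes are spent in the fetch call or in building the DataFrame. If most of the time lands in the fetch itself, the slow part is retrieving the rows rather than pandas.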


SQL connector from databricks-sql-connector takes too much time to convert to pandas
