Hello,
I am using querying my Delta Lake with SQL Connect and later want to explore the result in pandas.
with connection.cursor() as cursor:
cur = cursor.execute("""
SELECT DISTINCT sample_timestamp, value, name
FROM default.raw_delta
WHERE name in ( 'sensor-1', 'sensor-2, 'sensor-3','sensor-4')
AND date >= 20200601
AND date <= 20200731
ORDER BY name, sample_timestamp
""")
df = pd.DataFrame.from_records(cur.fetchall(), columns=['sample_timestamp', 'value', 'name'])
display(df)
While the query is really fast ~8s, the conversion to pandas takes almost 2 minutes.
I am running the code in local jupyter and also in databricks notebook, both with same performance.
What might cause the bad performance and is there a way to speed it up?
I also tried fetchall_arrow() but the pandas dimensions got mixed up (rows become columns).
Thanks,
Leon