<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: slow Fetching results by client in databricks SQL calling from Azure Compute Instance (AML) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/slow-fetching-results-by-client-in-databricks-sql-calling-from/m-p/13365#M8064</link>
    <description>&lt;P&gt;So I ran a few tests. Since you said the Databricks SQL driver wasn't made to retrieve that amount of data, I moved to Spark.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I fired up a small Spark cluster; the query was as fast as on SQL Warehouse. Then I did a df.write.parquet("/my_path/"), which took 10 minutes (2 GiB of parquet files).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Then I used the Azure Storage API to download all the parquet files from the folder on Azure Storage and loaded them with Pandas.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The original technology I was using was Impala, which did this in 30 minutes. SQL Warehouse takes 4 hours. Spark + spark.write + file download to local/VM/pod + load into pandas takes 20 minutes.&lt;/P&gt;</description>
    <pubDate>Mon, 16 Jan 2023 14:48:30 GMT</pubDate>
    <dc:creator>Etyr</dc:creator>
    <dc:date>2023-01-16T14:48:30Z</dc:date>
    <item>
      <title>slow Fetching results by client in databricks SQL calling from Azure Compute Instance (AML)</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-fetching-results-by-client-in-databricks-sql-calling-from/m-p/13362#M8061</link>
      <description>&lt;P&gt;I'm using `databricks-sql-connector` in Python 3.8 to connect to a Databricks SQL Warehouse on Azure from an Azure Machine Learning Compute Instance.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have a query with a large result. Looking at the `query history`, I checked the time spent running the query versus sending the data to the client. The result is 27 GB. It took over 2 hours to get this data onto my Azure Compute Instance.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="first_time_query"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/905i382F69853FA30764/image-size/large?v=v2&amp;amp;px=999" role="button" title="first_time_query" alt="first_time_query" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Here you can see it actually took 1.88 minutes to "make" the data and 2.27 hours to send it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;When I rerun this query in the SQL editor in Databricks (removing the 1000-row limit), the `Fetching results by client` step is way faster, presumably because it stays local, which is normal.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So I assume the network is the cause here. But since Databricks and my compute instance are in the same subnet/network, I don't understand why the download is so slow.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have another hypothesis: the Python code freezes. After running for an hour, `free -m` on Linux shows that the data is in memory, but when I press CTRL+C the process won't stop and nothing happens. I have to press CTRL+Z to kill it, but that leaks memory: checking `free -m` again, the usage hasn't decreased.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any idea if databricks-sql-connector has issues when fetching "large" results?&lt;/P&gt;</description>
      <pubDate>Fri, 06 Jan 2023 11:45:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-fetching-results-by-client-in-databricks-sql-calling-from/m-p/13362#M8061</guid>
      <dc:creator>Etyr</dc:creator>
      <dc:date>2023-01-06T11:45:24Z</dc:date>
    </item>
    <item>
      <title>Re: slow Fetching results by client in databricks SQL calling from Azure Compute Instance (AML)</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-fetching-results-by-client-in-databricks-sql-calling-from/m-p/13363#M8062</link>
      <description>&lt;P&gt;Are you using fetchall()?&lt;/P&gt;&lt;P&gt;&lt;I&gt;"Returns all (or all remaining) rows of the query as a Python list of Row objects."&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I bet that 30 GB list will be a killer anyway.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please use cursor.fetchmany(10000) and append the chunks to your destination (a dataframe or whatever you are using). With that approach you can also monitor the progress of your code (for a test, just print the number and the time needed for each chunk).&lt;/P&gt;</description>
      <pubDate>Fri, 06 Jan 2023 11:54:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-fetching-results-by-client-in-databricks-sql-calling-from/m-p/13363#M8062</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2023-01-06T11:54:54Z</dc:date>
    </item>
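    <!-- Editor's note: the chunked-fetch pattern suggested in the reply above can be sketched as follows. `databricks-sql-connector` follows the Python DB-API (PEP 249), so the same `fetchmany` loop is shown here against the stdlib `sqlite3` module purely so the sketch runs anywhere; with Databricks you would obtain the cursor from `databricks.sql.connect(...)` instead. The table, row count, and chunk size are illustrative. -->

```python
import sqlite3
import time

# Stand-in DB-API connection; with Databricks this would instead be
# databricks.sql.connect(server_hostname=..., http_path=..., access_token=...)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(25)])

cur.execute("SELECT x FROM t ORDER BY x")

rows = []       # accumulated result (in practice, append chunks to a DataFrame)
chunk_no = 0
while True:
    start = time.time()
    chunk = cur.fetchmany(10)  # at most 10 rows per call; use e.g. 10000 for real data
    if not chunk:
        break
    chunk_no += 1
    rows.extend(chunk)
    print(f"chunk {chunk_no}: {len(chunk)} rows in {time.time() - start:.3f}s")

conn.close()
```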
    <item>
      <title>Re: slow Fetching results by client in databricks SQL calling from Azure Compute Instance (AML)</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-fetching-results-by-client-in-databricks-sql-calling-from/m-p/13364#M8063</link>
      <description>&lt;P&gt;Yes, I was using fetchall().&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here is what I get when I use fetchmany(100000) (not 10,000 but 100,000, otherwise there was too much printing):&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;&amp;lt;databricks.sql.client.Cursor object at 0x7fddfd233be0&amp;gt;
Fetching chunk of size 100000, 1
1.2840301990509033
Fetching chunk of size 100000, 2
1.7594795227050781
Fetching chunk of size 100000, 3
1.4387767314910889
Fetching chunk of size 100000, 4
1.9465265274047852
Fetching chunk of size 100000, 5
1.284682273864746
Fetching chunk of size 100000, 6
1.8754642009735107
Fetching chunk of size 100000, 7&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;So it takes around 1.5-2 seconds per chunk of 100,000 rows. From what I understand from the documentation, the chunk size is the number of rows fetched.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Since I have 1008342364 rows, it will take 10084 calls to fetchmany(100000).&lt;/P&gt;&lt;P&gt;10084 * 1.5 (seconds per fetchmany) = 15126 seconds to get all the data, so approx 4.2 hours.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If I do a fetchmany(1000000), each fetch takes 14 seconds.&lt;/P&gt;&lt;P&gt;1008342364 rows / 1000000 = 1009 fetches&lt;/P&gt;&lt;P&gt;1009 * 14 = 14126 seconds for all fetches =&amp;gt; 3.9 hours&lt;/P&gt;</description>
      <pubDate>Fri, 06 Jan 2023 16:30:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-fetching-results-by-client-in-databricks-sql-calling-from/m-p/13364#M8063</guid>
      <dc:creator>Etyr</dc:creator>
      <dc:date>2023-01-06T16:30:17Z</dc:date>
    </item>
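    <!-- Editor's note: the back-of-the-envelope estimate in the post above can be checked directly (rounding the last partial chunk up): -->

```python
import math

rows = 1_008_342_364

# fetchmany(100000) at roughly 1.5 s per chunk
chunks_small = math.ceil(rows / 100_000)   # 10084 chunks
secs_small = chunks_small * 1.5            # ~15126 s
hours_small = secs_small / 3600            # ~4.2 h

# fetchmany(1000000) at roughly 14 s per chunk
chunks_big = math.ceil(rows / 1_000_000)   # 1009 chunks
secs_big = chunks_big * 14                 # 14126 s
hours_big = secs_big / 3600                # ~3.9 h

print(chunks_small, secs_small, round(hours_small, 1))
print(chunks_big, secs_big, round(hours_big, 1))
```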
    <item>
      <title>Re: slow Fetching results by client in databricks SQL calling from Azure Compute Instance (AML)</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-fetching-results-by-client-in-databricks-sql-calling-from/m-p/13365#M8064</link>
      <description>&lt;P&gt;So I ran a few tests. Since you said the Databricks SQL driver wasn't made to retrieve that amount of data, I moved to Spark.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I fired up a small Spark cluster; the query was as fast as on SQL Warehouse. Then I did a df.write.parquet("/my_path/"), which took 10 minutes (2 GiB of parquet files).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Then I used the Azure Storage API to download all the parquet files from the folder on Azure Storage and loaded them with Pandas.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The original technology I was using was Impala, which did this in 30 minutes. SQL Warehouse takes 4 hours. Spark + spark.write + file download to local/VM/pod + load into pandas takes 20 minutes.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Jan 2023 14:48:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-fetching-results-by-client-in-databricks-sql-calling-from/m-p/13365#M8064</guid>
      <dc:creator>Etyr</dc:creator>
      <dc:date>2023-01-16T14:48:30Z</dc:date>
    </item>
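    <!-- Editor's note: the Spark-plus-parquet workaround described in the final post can be sketched as below. This is only a sketch: it assumes a live Spark session and the `azure-storage-blob` and `pandas` packages, and the query, storage path, container name, and connection string are all placeholders, so everything is wrapped in a function that nothing runs on import. -->

```python
import glob
import os


def export_via_parquet(spark, query, local_dir="/tmp/result"):
    """Run `query` on a Spark cluster, write the result as parquet,
    download the files via the Azure Storage API, and load them into one
    pandas DataFrame. All names and paths below are placeholders."""
    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    # 1. Run the query and write the result as parquet (fast, distributed).
    df = spark.sql(query)
    df.write.mode("overwrite").parquet("/my_path/")

    # 2. Download the parquet files from the storage container.
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    container = service.get_container_client("my-container")
    os.makedirs(local_dir, exist_ok=True)
    for blob in container.list_blobs(name_starts_with="my_path/"):
        target = os.path.join(local_dir, os.path.basename(blob.name))
        with open(target, "wb") as f:
            f.write(container.download_blob(blob.name).readall())

    # 3. Load all downloaded parquet parts into a single pandas DataFrame.
    parts = sorted(glob.glob(os.path.join(local_dir, "*.parquet")))
    return pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)
```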
  </channel>
</rss>

