10-13-2023 01:24 AM - edited 10-13-2023 01:31 AM
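import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Collections;

public void bigDataTest() throws Exception {
    int rowsCount = 100_000;
    int colSize = 1024;
    int colCount = 12;
    // Each extra column is a 1024-character string literal, so every row carries about 12K characters.
    String colValue = "'" + "x".repeat(colSize) + "'";
    String query = "select explode(sequence(1, " + rowsCount + "))," +
            String.join(",", Collections.nCopies(colCount, colValue));
    // Statement and result set are closed together with the connection.
    try (
        Connection conn = dataSource.getConnection();
        PreparedStatement ps = conn.prepareStatement(query, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
    ) {
        ps.setFetchSize(1);
        try (ResultSet rs = ps.executeQuery()) {
            int count = 0;
            while (rs.next()) {
                if (count++ % 100 == 0) {
                    LOG.info("Count = {}", count);
                }
            }
        }
    }
}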
With -Xmx200m I can read about 50_000 rows, and after that I receive:
Exception in thread "pool-12-thread-50" Exception in thread "pool-12-thread-1" java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
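To watch the heap grow while the rows are being read, a small helper like the one below could be called from inside the `while (rs.next())` loop, next to the existing `LOG.info` call. This is only a minimal sketch: `HeapProbe` and its method names are hypothetical and not part of the driver or of the test above. It relies only on the standard `java.lang.management` API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of the Databricks driver): reports JVM heap usage.
final class HeapProbe {
    private static final long MIB = 1024L * 1024L;

    private HeapProbe() {
    }

    /** Bytes of heap currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Formats a used/max pair; the JVM reports max as -1 when no limit is defined. */
    static String format(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return "heap used=" + (usedBytes / MIB) + " MiB (max undefined)";
        }
        long percent = usedBytes * 100 / maxBytes;
        return "heap used=" + (usedBytes / MIB) + " MiB of max=" + (maxBytes / MIB)
                + " MiB (" + percent + "%)";
    }

    /** One-line summary of the current heap state, suitable for a log message. */
    static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return format(heap.getUsed(), heap.getMax());
    }
}
```

For example, `LOG.info("Count = {}, {}", count, HeapProbe.snapshot());` would print the used heap next to the row counter, which makes it easier to see whether usage keeps climbing as rows are consumed or levels off after a garbage collection.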
The memory picture is classic for OOM:
What I can see in the heap dump:
10-13-2023 01:27 AM - edited 10-13-2023 01:30 AM
DATABRICKS_JDBC_URL = "jdbc:databricks://xxx.cloud.databricks.com:443/default;" +
"transportMode=http;" +
"ssl=1;" +
"httpPath=sql/protocolv1/o/xxxxx;AuthMech=3;MaxConsecutiveResultFileDownloadRetries=50;fetchsize=1"
Without a custom MaxConsecutiveResultFileDownloadRetries I received JDBC error 500638 and could read only about 20_000 rows.
databricksDriver = "com.databricks:databricks-jdbc:2.6.33"
10-13-2023 02:28 AM
I'd first ingest the raw data onto a data lake (using some ingest tool, databricks is not the best for this imo), then process the data using databricks.
10-13-2023 03:13 AM
Perhaps for some use cases this will be the solution.
But it does not cancel the fact that there is a memory leak bug in the driver.
10-13-2023 05:10 AM
not necessarily a memory leak. possibly the raw data is fetched and the query is processed in memory. don't know if that is the case though.
10-16-2023 12:28 AM
Ok, let's call it a temporary minor memory starvation issue causing the virtual machine to crash.
10-16-2023 12:31 AM
And here's another extremely minor issue leading to uncontrolled reproduction of threads. https://community.databricks.com/t5/data-engineering/thread-leakage-when-connection-cannot-be-establ...
For some reason nobody responds to it either....
10-16-2023 02:19 AM
That is, at least I think, because the jdbc driver is not part of the databricks platform itself (and closed source afaik).
Chances are small that someone in the community knows the ins and outs of the driver code.
Now, if you are convinced that there is an actual bug in the databricks driver, I suggest you open a ticket at databricks so someone can look into it.
Because maybe you stumbled upon something here.
10-17-2023 06:05 AM
I solved this issue, but it requires changing several classes. The final result:
10-17-2023 10:36 PM
Nice!
You might wanna share your improvements with the driver devs.
10-17-2023 11:35 PM
Yes, I really want to, but I have absolutely no idea how to send these edits to them.
They do not have a public repository or public ticket system.
10-18-2023 12:12 AM
@Retired_mod any idea?