OOM while loading a lot of data through JDBC
10-13-2023 01:24 AM - edited 10-13-2023 01:31 AM
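import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Collections;

// Assumes the enclosing test class provides `dataSource` (a javax.sql.DataSource) and a logger `LOG`.
public void bigDataTest() throws Exception {
    int rowsCount = 100_000;
    int colSize = 1024;
    int colCount = 12;
    // Each row carries colCount string literals of colSize characters (about 12 KB per row).
    String colValue = "'" + "x".repeat(colSize) + "'";
    String query = "select explode(sequence(1, " + rowsCount + "))," +
            String.join(",", Collections.nCopies(colCount, colValue));
    // try-with-resources closes the statement and connection even if reading fails.
    try (Connection conn = dataSource.getConnection();
         PreparedStatement ps = conn.prepareStatement(
                 query, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
        ps.setFetchSize(1);
        try (ResultSet rs = ps.executeQuery()) {
            int count = 0;
            while (rs.next()) {
                // Pre-increment so the logged value is the number of rows read so far.
                if (++count % 100 == 0) {
                    LOG.info("Count = {}", count);
                }
            }
        }
    }
}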
With -Xmx200m I can read about 50_000 rows, and after that I receive:
Exception in thread "pool-12-thread-50" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-12-thread-1" java.lang.OutOfMemoryError: Java heap space
The memory picture is classic for an OOM: [image not preserved]
What I can see in the heap dump: [image not preserved]
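To put numbers next to the row count while reproducing this, heap usage can be logged from inside the read loop. The following is a minimal sketch using the standard `java.lang.management` API; the class name `HeapProbe` and its methods are hypothetical helpers, not part of the original test or of the Databricks driver.

```java
import java.lang.management.ManagementFactory;
import java.util.Locale;

/** Hypothetical helper for logging heap usage while iterating a ResultSet. */
final class HeapProbe {
    private HeapProbe() {}

    /** Bytes of heap currently in use, as reported by the JVM. */
    static long usedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Maximum heap in bytes (roughly -Xmx), or -1 if the JVM reports it as undefined. */
    static long maxBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getMax();
    }

    /** One-line summary suitable for a log statement, e.g. every 1000 rows. */
    static String describe(long rowCount) {
        long mb = 1024L * 1024L;
        return String.format(Locale.ROOT, "rows=%d heapUsed=%dMB heapMax=%dMB",
                rowCount, usedBytes() / mb, maxBytes() / mb);
    }
}
```

Calling something like `LOG.info(HeapProbe.describe(count))` inside the `while (rs.next())` loop would show whether used heap grows steadily with the number of rows read or jumps in steps.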
10-13-2023 01:27 AM - edited 10-13-2023 01:30 AM
DATABRICKS_JDBC_URL = "jdbc:databricks://xxx.cloud.databricks.com:443/default;" +
"transportMode=http;" +
"ssl=1;" +
"httpPath=sql/protocolv1/o/xxxxx;AuthMech=3;MaxConsecutiveResultFileDownloadRetries=50;fetchsize=1"
Without a custom MaxConsecutiveResultFileDownloadRetries value I received JDBC error 500638 and could read only about 20_000 rows.
databricksDriver = "com.databricks:databricks-jdbc:2.6.33"
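When experimenting with different option values (for example, with and without `MaxConsecutiveResultFileDownloadRetries`), it can be easier to assemble the URL from a map than to edit a concatenated string. This is only a sketch: `DatabricksUrl.build` is a hypothetical helper, and the option names and values are the ones quoted in the post above.

```java
import java.util.Map;

/** Hypothetical helper that joins host, port, schema, and ";key=value" options into a JDBC URL. */
final class DatabricksUrl {
    private DatabricksUrl() {}

    static String build(String host, int port, String schema, Map<String, String> options) {
        StringBuilder sb = new StringBuilder("jdbc:databricks://")
                .append(host).append(':').append(port).append('/').append(schema);
        // Options are appended in the map's iteration order, so pass a LinkedHashMap
        // when the order should match the hand-written URL.
        for (Map.Entry<String, String> e : options.entrySet()) {
            sb.append(';').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

Filling a `LinkedHashMap` with `transportMode=http`, `ssl=1`, `httpPath=sql/protocolv1/o/xxxxx`, `AuthMech=3`, `MaxConsecutiveResultFileDownloadRetries=50`, and `fetchsize=1` reproduces the URL shown above; removing one entry gives the variant to compare against.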
10-13-2023 02:28 AM
I'd first ingest the raw data into a data lake (using some ingest tool; Databricks is not the best for this, IMO), then process the data using Databricks.
10-13-2023 03:13 AM
Perhaps for some use cases this will be the solution.
But it does not change the fact that there is a memory leak bug in the driver.
10-13-2023 05:10 AM
Not necessarily a memory leak. Possibly the raw data is fetched and the query is processed in memory. I don't know if that is the case, though.
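Nothing in this thread shows the driver's internals, so the following is only a generic illustration of the distinction being discussed: when result chunks are downloaded in the background, memory stays bounded only if the downloader is forced to wait for the reader. The sketch assumes a simple producer/consumer design with a fixed-capacity queue; `BoundedPrefetch` and its parameters are hypothetical and are not the Databricks driver's code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.IntFunction;

/** Hypothetical illustration: background download with a bounded buffer. */
final class BoundedPrefetch {
    private static final byte[] END = new byte[0]; // sentinel marking the end of the stream

    private BoundedPrefetch() {}

    /** Streams chunkCount chunks through a queue of the given capacity; returns total bytes read. */
    static long consume(int chunkCount, int capacity, IntFunction<byte[]> download) {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                try {
                    for (int i = 0; i < chunkCount; i++) {
                        queue.put(download.apply(i)); // blocks while the queue is full
                    }
                } finally {
                    queue.put(END); // always signal completion so the reader cannot hang
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();

        long total = 0;
        try {
            // At most `capacity` chunks wait in the queue, plus one being produced and one being read.
            for (byte[] chunk = queue.take(); chunk != END; chunk = queue.take()) {
                total += chunk.length;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading chunks", e);
        }
        return total;
    }
}
```

With an unbounded queue in place of `ArrayBlockingQueue`, the same code would let the downloader run arbitrarily far ahead of a slow reader. That would look like a leak in a heap dump even though every chunk is eventually consumed.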
10-16-2023 12:28 AM
Ok, let's call it a temporary minor memory starvation issue causing the virtual machine to crash.
10-16-2023 12:31 AM
And here's another extremely minor issue leading to uncontrolled reproduction of threads. https://community.databricks.com/t5/data-engineering/thread-leakage-when-connection-cannot-be-establ...
For some reason nobody responds to it either....
10-16-2023 02:19 AM
That is, at least I think, because the JDBC driver is not part of the Databricks platform itself (and is closed source, AFAIK).
Chances are small that someone in the community knows the ins and outs of the driver code.
Now, if you are convinced that there is an actual bug in the Databricks driver, I suggest you open a ticket with Databricks so someone can look into it.
Because maybe you stumbled upon something here.
10-17-2023 06:05 AM
I solved this issue, but it requires changing several classes. The final result:
10-17-2023 10:36 PM
Nice!
You might wanna share your improvements with the driver devs.
10-17-2023 11:35 PM
Yes, I really want to, but I have absolutely no idea how to send these edits to them.
They do not have a public repository or public ticket system.
10-18-2023 12:12 AM
@Retired_mod any idea?