Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

OOM while loading a lot of data through JDBC

krocodl
Contributor

    public void bigDataTest() throws Exception {
        int rowsCount = 100_000;
        int colSize = 1024;
        int colCount = 12;

        // Each row carries colCount copies of a 1 KiB string literal.
        String colValue = "'" + "x".repeat(colSize) + "'";
        String query = "select explode(sequence(1, " + rowsCount + ")), " +
                String.join(",", Collections.nCopies(colCount, colValue));

        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(query,
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(1); // hint the driver to fetch one row at a time
            try (ResultSet rs = ps.executeQuery()) {
                int count = 0;
                while (rs.next()) {
                    if (count++ % 100 == 0) {
                        LOG.info("Count = {}", count);
                    }
                }
            }
        }
    }

With -Xmx200m I can read about 50_000 rows, after which I receive:

Exception in thread "pool-12-thread-50" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-12-thread-1" java.lang.OutOfMemoryError: Java heap space

The memory picture is classic for OOM:

Screenshot 2023-10-13 at 08.10.08.png

What can I see in the heapdump:

Screenshot 2023-10-13 at 08.12.52.png

11 REPLIES

krocodl
Contributor
DATABRICKS_JDBC_URL = "jdbc:databricks://xxx.cloud.databricks.com:443/default;" +
        "transportMode=http;" +
        "ssl=1;" +
        "httpPath=sql/protocolv1/o/xxxxx;AuthMech=3;MaxConsecutiveResultFileDownloadRetries=50;fetchsize=1"

Without a custom MaxConsecutiveResultFileDownloadRetries I received JDBC error 500638 and could read only about 20_000 rows.

databricksDriver = "com.databricks:databricks-jdbc:2.6.33"
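Until the driver itself is fixed, one hedged workaround (a sketch of my own, not something from this thread; the table name below is hypothetical) is to page the query so that no single ResultSet forces the driver to buffer the whole result:

```java
import java.util.ArrayList;
import java.util.List;

public class PagedFetch {

    // Append a LIMIT/OFFSET page to a base query so each JDBC fetch
    // materializes at most pageSize rows client-side. LIMIT/OFFSET
    // syntax is an assumption about the SQL dialect; adjust as needed.
    static String buildPageQuery(String baseQuery, int pageSize, int pageIndex) {
        long offset = (long) pageIndex * pageSize;
        return baseQuery + " LIMIT " + pageSize + " OFFSET " + offset;
    }

    public static void main(String[] args) {
        // Hypothetical base query; in the test above this would be the
        // explode(sequence(...)) select.
        List<String> pages = new ArrayList<>();
        for (int page = 0; page < 3; page++) {
            pages.add(buildPageQuery("select * from events", 10_000, page));
        }
        pages.forEach(System.out::println);
    }
}
```

Each page runs as its own statement, so the heap held by the driver is bounded by pageSize rows instead of the full result. OFFSET paging gets slow for deep pages, so a keyset predicate (WHERE id > lastSeen, ORDER BY id) is preferable when a sortable key exists.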

-werners-
Esteemed Contributor III

I'd first ingest the raw data into a data lake (using some ingest tool; Databricks is not the best for this, imo), then process the data with Databricks.

Perhaps for some use cases this will be the solution.

But it does not change the fact that there is a memory-leak bug in the driver.

-werners-
Esteemed Contributor III

Not necessarily a memory leak. Possibly the raw data is fetched and the query is processed in memory; I don't know if that is the case, though.

Ok, let's call it a temporary minor memory-starvation issue causing the virtual machine to crash.
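Whether this is fetch-time buffering or a true leak can be probed empirically. A minimal sketch (my own diagnostic helper, not part of the driver) samples used heap between batches of rows: a leak keeps growing even after GC, while plain result buffering should plateau once the buffered batch is consumed.

```java
public class HeapProbe {

    // Used heap in MiB after suggesting a GC, so successive samples
    // approximate live-set growth rather than accumulated garbage.
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // a hint only, but good enough for a coarse trend
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        long before = usedHeapMiB();
        // Simulate rows retained by the driver between fetches.
        byte[] retained = new byte[32 * 1024 * 1024];
        long after = usedHeapMiB();
        // The delta reflects the retained allocation; in the real test,
        // call usedHeapMiB() every N rows inside the rs.next() loop.
        System.out.println("delta MiB = " + (after - before));
        if (retained.length == 0) { throw new AssertionError(); } // keep 'retained' live
    }
}
```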

krocodl
Contributor

And here's another extremely minor issue, leading to uncontrolled proliferation of threads: https://community.databricks.com/t5/data-engineering/thread-leakage-when-connection-cannot-be-establ...
For some reason, nobody responds to that one either.

-werners-
Esteemed Contributor III

That is, at least I think, because the JDBC driver is not part of the Databricks platform itself (and it is closed source, afaik).
Chances are small that someone in the community knows the ins and outs of the driver code.
If you are convinced that there is an actual bug in the Databricks driver, I suggest you open a ticket with Databricks so someone can look into it.
Because maybe you stumbled upon something here.

krocodl
Contributor

I solved this issue, but it required changing several classes. The final result:

Screenshot 2023-10-17 at 09.13.52.png

-werners-
Esteemed Contributor III

Nice!
You might wanna share your improvements with the driver devs.

Yes, I really want to, but I have absolutely no idea how to send these edits to them.

They do not have a public repository or public ticket system.

-werners-
Esteemed Contributor III

@Retired_mod any idea?
