<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: JDBC connection timeout on workflow cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12910#M7661</link>
<description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I finally applied the following workaround, adding a persist on the DataFrame when fetching the data from the DB.&lt;/P&gt;&lt;P&gt;Since the data volume is small, it works.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;    # Requires: import pyspark; from pyspark import StorageLevel
    def get_table_from_db(self, table_name) -&amp;gt; pyspark.sql.DataFrame:
        df = self.spark.read.format("jdbc").option("badRecordsPath", "/tmp/badRecordsPath").options(
            **self.kwargs[f"{self.db_name}_options"]) \
            .option("dbtable", table_name).load().persist(StorageLevel.DISK_ONLY)
        return df&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Tue, 24 Jan 2023 08:38:33 GMT</pubDate>
    <dc:creator>Fred_F</dc:creator>
    <dc:date>2023-01-24T08:38:33Z</dc:date>
    <item>
      <title>JDBC connection timeout on workflow cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12904#M7655</link>
<description>&lt;P&gt;Hi there,&lt;/P&gt;&lt;P&gt;I have a batch process configured in a workflow which fails due to a JDBC timeout on a Postgres DB.&lt;/P&gt;&lt;P&gt;I checked the JDBC connection configuration: when I query a table and do a df.show() in the process, it displays the fetched data, so the issue does not seem to come from there.&lt;/P&gt;&lt;P&gt;I tried a couple of configurations at the cluster level, but still hit the same issue.&lt;/P&gt;&lt;P&gt;The conf I tried:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.master local[*, 4]

spark.databricks.cluster.profile singleNode

spark.executor.heartbeatInterval 3600s

spark.network.timeout 4000s&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Note that, in the same process, &lt;U&gt;there is another connection to a MySQL DB which works with no noticeable issue&lt;/U&gt;.&lt;/P&gt;&lt;P&gt;The DB is hosted on GCP Cloud SQL, and our Databricks platform is on GCP as well.&lt;/P&gt;&lt;P&gt;Let me know if you have any hints at the Databricks configuration level, knowing as well that this process currently runs on another VM with a local pyspark.&lt;/P&gt;&lt;P&gt;Hereafter is the stack trace I get:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Py4JJavaError: An error occurred while calling o1829.checkpoint.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 33 in stage 58.0 failed 4 times, most recent failure: Lost task 33.3 in stage 58.0 (TID 673) (driver-656749566d-lxcst executor driver): org.postgresql.util.PSQLException: The connection attempt failed.
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:331)
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
	at org.postgresql.jdbc.PgConnection.&amp;lt;init&amp;gt;(PgConnection.java:223)
	at org.postgresql.Driver.makeConnection(Driver.java:400)
	at org.postgresql.Driver.connect(Driver.java:259)
	at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
	at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProviderBase.create(ConnectionProvider.scala:102)
	at org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1(JdbcDialects.scala:123)
	at org.apache.spark.sql.jdbc.JdbcDialect.$anonfun$createConnectionFactory$1$adapted(JdbcDialects.scala:119)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:277)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
	at o&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 09 Jan 2023 14:57:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12904#M7655</guid>
      <dc:creator>Fred_F</dc:creator>
      <dc:date>2023-01-09T14:57:28Z</dc:date>
    </item>
    <item>
      <title>Re: JDBC connection timeout on workflow cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12905#M7656</link>
<description>&lt;P&gt;Hi @Fred Foucart,&lt;/P&gt;&lt;P&gt;Could you share the code snippet here for better understanding?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;RK&lt;/P&gt;</description>
      <pubDate>Mon, 09 Jan 2023 15:08:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12905#M7656</guid>
      <dc:creator>RKNutalapati</dc:creator>
      <dc:date>2023-01-09T15:08:26Z</dc:date>
    </item>
    <item>
      <title>Re: JDBC connection timeout on workflow cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12906#M7657</link>
<description>&lt;P&gt;Hi @Rama Krishna N,&lt;/P&gt;&lt;P&gt;I use a class with a method get_table_from_db, as follows:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Requires: from enum import Enum; from pyspark import SparkConf; from pyspark.sql import SparkSession
class ExtractData:

    def __init__(self, db: Enum, **kwargs):
        self.db_name = db.DB_NAME.value
        self.db = db
        self.kwargs = kwargs
        self.spark = init_spark_session(kwargs["app_name"], **kwargs)
        self.loaded_tables = {}

    def get_table_from_db(self, table_name):
        spark = init_spark_session(self.kwargs["app_name"], **self.kwargs)
        return spark.read.format("jdbc").options(**self.kwargs[f"{self.db_name}_options"]) \
            .option("dbtable", table_name).load()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The kwargs contain the connection credentials.&lt;/P&gt;&lt;P&gt;init_spark_session is as follows:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;def init_spark_session(spark_app_name, **kwargs):
    """
    SQL Spark instantiation
    :param spark_app_name:
    :param kwargs:
    :return:
    """

    called_method = SparkSession.builder
    conf = SparkConf()
    for key, value in kwargs["spark_config"].items():
        conf.set(key.replace('"', ''), value)
    return called_method.appName(spark_app_name).config(conf=conf).enableHiveSupport().getOrCreate()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The connection arguments:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;mysql_options: {
  host = ${VSA_HOST}
  database = ${VSA_DATABASE}
  url = "jdbc:mysql://"${VSA_options.host}"/"${VSA_options.database}"?zeroDateTimeBehavior=convertToNull"
  driver = "com.mysql.jdbc.Driver"
  user = ${VSA_USER}
  password = ${VSA_PASSWORD}
}
post_options: {
  host = ${HOST}
  port = ${PORT}
  database = ${DATABASE}
  url = "jdbc:postgresql://"${CDB_options.host}":"${CDB_options.port}"/"${CDB_options.database}
  driver = "org.postgresql.Driver"
  user = ${USER}
  password = ${PASSWORD}
}&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 09 Jan 2023 15:16:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12906#M7657</guid>
      <dc:creator>Fred_F</dc:creator>
      <dc:date>2023-01-09T15:16:33Z</dc:date>
    </item>
    <item>
      <title>Re: JDBC connection timeout on workflow cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12907#M7658</link>
<description>&lt;P&gt;Hi @Fred Foucart,&lt;/P&gt;&lt;P&gt;The above code looks good to me. Can you try the code below as well?&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read \
    .format("jdbc") \
    .option("url", f"jdbc:postgresql://{host}/{database}") \
    .option("driver", "org.postgresql.Driver") \
    .option("user", username) \
    .option("password", password) \
    .option("dbtable", &amp;lt;TableName&amp;gt;) \
    .option("fetchsize", 5000) \
    .load()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;In case the table is huge, you can try parallel reads.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Params
table_name = "&amp;lt;Your Table Name&amp;gt;"
partitionColumn = "&amp;lt;Primary Key Numeric Column&amp;gt;"
lowerBound = 1
upperBound = 10000  # &amp;lt;Total Row Count&amp;gt;; our table contains over a billion rows!
fetchsize = 1000
num_partitions = 20  # Do some math on how many slices the data should be partitioned
                     # and read (Total Records / Partitions), i.e. 10000/20 = 500 rows per thread

# Read
source_df = spark.read \
    .format("jdbc") \
    .option("url", f"jdbc:postgresql://{host}/{database}") \
    .option("driver", "org.postgresql.Driver") \
    .option("user", username) \
    .option("password", password) \
    .option("dbtable", table_name) \
    .option("partitionColumn", partitionColumn) \
    .option("lowerBound", lowerBound) \
    .option("upperBound", upperBound) \
    .option("numPartitions", num_partitions) \
    .option("fetchsize", fetchsize) \
    .load()&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 09 Jan 2023 19:51:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12907#M7658</guid>
      <dc:creator>RKNutalapati</dc:creator>
      <dc:date>2023-01-09T19:51:10Z</dc:date>
    </item>
    <item>
      <title>Re: JDBC connection timeout on workflow cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12909#M7660</link>
<description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I tried all these configurations but still get the same issue.&lt;/P&gt;&lt;P&gt;We need to check the configuration at the Cloud SQL level.&lt;/P&gt;&lt;P&gt;I will keep you updated on the solution we find.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jan 2023 07:27:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12909#M7660</guid>
      <dc:creator>Fred_F</dc:creator>
      <dc:date>2023-01-12T07:27:50Z</dc:date>
    </item>
    <item>
      <title>Re: JDBC connection timeout on workflow cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12910#M7661</link>
<description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I finally applied the following workaround, adding a persist on the DataFrame when fetching the data from the DB.&lt;/P&gt;&lt;P&gt;Since the data volume is small, it works.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;    # Requires: import pyspark; from pyspark import StorageLevel
    def get_table_from_db(self, table_name) -&amp;gt; pyspark.sql.DataFrame:
        df = self.spark.read.format("jdbc").option("badRecordsPath", "/tmp/badRecordsPath").options(
            **self.kwargs[f"{self.db_name}_options"]) \
            .option("dbtable", table_name).load().persist(StorageLevel.DISK_ONLY)
        return df&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 24 Jan 2023 08:38:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/jdbc-connection-timeout-on-workflow-cluster/m-p/12910#M7661</guid>
      <dc:creator>Fred_F</dc:creator>
      <dc:date>2023-01-24T08:38:33Z</dc:date>
    </item>
  </channel>
</rss>

