
How to read a particular data type from Postgres into Databricks through JDBC

jeremy98
Contributor III

Hi Community,

I need to load data from PostgreSQL into Databricks through JDBC without changing the data type of a VARCHAR[] column in PostgreSQL; it should remain an array of strings in Databricks.

Previously, I used psycopg2, and it worked, but I had to first convert the data in Pandas and then in Spark before using it in Databricks. I’d like to switch to JDBC for a more direct approach.

However, when I load the data, the VARCHAR[] column appears as a plain string in the format:
"{LIKE,COME,LIKE_COME}"
Similarly, arrays of BIGINT appear as:
"{45,46}"

I understand that PostgreSQL array types are not natively mapped by the JDBC reader in Databricks, but what steps should I take to retain the same data types after loading via JDBC?

Any guidance would be appreciated!

ACCEPTED SOLUTION

jeremy98
Contributor III

Hi community,

Yesterday I found a solution: query Postgres through JDBC with a wrapping subquery that converts each array column to a delimited string with array_to_string, then split those strings back into arrays in Spark. Here is the code:

   query = f"""(SELECT *, array_to_string(columns_to_export, ',') AS columns_to_export_string, 
      array_to_string(ups, ',') AS ups_to_string from {pg_table_name}) as data_to_fetch"""
    postgres_df = spark.read.format("jdbc") \
      .option("url", connection_properties["jdbc_url"]) \
      .option("dbtable", query) \
      .option("user", connection_properties["db_username"]) \
      .option("password", connection_properties["db_password"]) \
      .option("fetchsize", FETCH_SIZE) \
      .load() \
      .withColumn("columns_to_export", split(col("columns_to_export_string"), ",")) \
      .withColumn("ups", split(col("ups_to_string"), ",").cast("array<bigint>")) \
      .drop("columns_to_export_string", "ups_to_string")

