PySpark - How to save the schema of a CSV file in a Delta table's column

tarente
New Contributor III

How to save the schema of a CSV file in a Delta table's column?

In a previous project implemented in Databricks using Scala notebooks, we stored the schema of CSV files as a "JSON string" in a SQL Server table.

When we needed to read or write a CSV and the source DataFrame had 0 rows, or the source CSV file did not exist, we used the schema stored in SQL Server to create either an empty DataFrame or an empty CSV file.

Now, I would like to implement something similar in Databricks, but using Python notebooks and storing the schema of the CSV files in a Delta table.

Any suggestions?

Thanks in advance,

Tiago.

1 ACCEPTED SOLUTION

RKNutalapati
Valued Contributor

Hi @Tiago Rente, hope the code below helps.

[The code was posted as a screenshot and is not reproduced here.]
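A minimal sketch of the idea, storing the schema as a JSON string in a Delta table and rebuilding it later (the table name file_schemas, its columns, and the file paths are illustrative assumptions, not from the original screenshot):

import json
from pyspark.sql.types import StructType

# Capture the schema of a CSV file as a JSON string
df = spark.read.option("header", True).csv("/path/to/product.csv")
schema_json = df.schema.json()

# Store the schema string in a Delta table (hypothetical table and columns)
spark.createDataFrame([("product.csv", schema_json)],
                      "file_name string, schema_json string") \
    .write.format("delta").mode("append").saveAsTable("file_schemas")

# Later: read the JSON string back and rebuild the StructType
row = spark.table("file_schemas").where("file_name = 'product.csv'").first()
schema = StructType.fromJson(json.loads(row.schema_json))

# Use it to read the CSV with an explicit schema, or to create an empty DataFrame
df2 = spark.read.schema(schema).option("header", True).csv("/path/to/product.csv")
empty_df = spark.createDataFrame([], schema)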


9 REPLIES

Hubert-Dudek
Esteemed Contributor III

After you read the CSV into a DataFrame with spark.read.csv(...), there are 3 ways:

DataFrame.schema - a StructType object

DataFrame.printSchema() - prints the schema as a tree

and the 3rd, tricky way is the DDL string:

DataFrame._jdf.schema().toDDL()

Usually DDL is the easiest to save somewhere and then reuse, as it is a simple string. Just insert it into some Delta table and select it back when needed.
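For example, a quick sketch of all three (the file path, header option, and column names are assumptions):

df = spark.read.option("header", True).csv("/path/to/product.csv")

st = df.schema                    # StructType object
df.printSchema()                  # prints the schema tree to stdout
ddl = df._jdf.schema().toDDL()    # plain string, e.g. "ProductId INT,ProductDesc STRING"

# The DDL string can be passed straight back to .schema() when re-reading:
df2 = spark.read.schema(ddl).option("header", True).csv("/path/to/product.csv")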

tarente
New Contributor III

Hi Hubert,

Thanks for your answer, but I was not able to make it work.

Let me ask the question in a different way.

I have a CSV file with the following basic structure:

  • ProductId - integer.
  • ProductDesc - string.
  • ProductCost - decimal.

In PySpark I would like to store the file schema in:

  1. In a variable to be used in spark.read.schema(schema).options(**fileOptions).load(...).
  2. Be able to store the file schema in a Delta table's column.

What kind of transformations do I need to do to the variable in 1. to be able to store it in 2., and vice versa?

Thanks in advance,

Tiago R.

Kaniz
Community Manager

Hi @Tiago Rente​ , Hope this would help.

# Read the CSV file into a DataFrame
csv_file = spark.read.csv("/path/to/input/data", header=True, sep=",")
# Write the DataFrame to a Delta location, overwriting any existing data and schema
csv_file.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/mnt/delta/product")
# Register the Delta location as a table
spark.sql("CREATE TABLE product USING DELTA LOCATION '/mnt/delta/product/'")

tarente
New Contributor III

Hi Kaniz,

Thanks for your answer, although it did not answer my questions.

RKNutalapati
Valued Contributor

Hi @Tiago Rente, hope the code below helps.

[The code was posted as a screenshot; see the accepted solution above.]

tarente
New Contributor III

Hi,

Thanks for your code, I will test it.

Regards,

Tiago.

Anonymous
Not applicable

@Tiago Rente​ - How did the test go?

tarente
New Contributor III

Hi Piper,

Unfortunately, I was not able to test it before I moved to a new employer, so I can no longer test it. However, I think it would work.

Regards,

Tiago R.

Anonymous
Not applicable

@tarente - Thanks for letting us know. 🙂
