PySpark - How to save the schema of a CSV file in a Delta table's column

tarente
New Contributor III

How to save the schema of a CSV file in a Delta table's column?

In a previous project implemented in Databricks using Scala notebooks, we stored the schema of CSV files as a "JSON string" in a SQL Server table.

When we needed to read or write a CSV and the source DataFrame had 0 rows, or the source CSV file did not exist, we used the schema stored in SQL Server to create either an empty DataFrame or an empty CSV file.

Now, I would like to implement something similar in Databricks, but using Python notebooks and storing the schema of the CSV files in a Delta table.

Any suggestions?

Thanks in advance,

Tiago.

1 ACCEPTED SOLUTION

RKNutalapati
Valued Contributor

Hi @Tiago Rente, hope the code below helps.

[The code was posted as a screenshot and is not reproduced here.]
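A minimal sketch of the idea, storing the schema as a JSON string in a Delta table and rebuilding it later (the table name file_schemas, its columns, and the file paths are illustrative assumptions, not from the original screenshot):

import json
from pyspark.sql.types import StructType

# Capture the schema of a CSV file as a JSON string
df = spark.read.option("header", True).csv("/path/to/product.csv")
schema_json = df.schema.json()

# Store the schema string in a Delta table (hypothetical table and columns)
spark.createDataFrame([("product.csv", schema_json)],
                      "file_name string, schema_json string") \
    .write.format("delta").mode("append").saveAsTable("file_schemas")

# Later: read the JSON string back and rebuild the StructType
row = spark.table("file_schemas").where("file_name = 'product.csv'").first()
schema = StructType.fromJson(json.loads(row.schema_json))

# Use it to read the CSV with an explicit schema, or to create an empty DataFrame
df2 = spark.read.schema(schema).option("header", True).csv("/path/to/product.csv")
empty_df = spark.createDataFrame([], schema)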


9 REPLIES

Hubert-Dudek
Esteemed Contributor III

After you read the CSV into a DataFrame with spark.read.csv(...), there are 3 ways:

DataFrame.schema - a StructType object

DataFrame.printSchema() - prints the schema as a tree

and the 3rd, tricky way is the DDL string:

DataFrame._jdf.schema().toDDL()

Usually DDL is the easiest to save somewhere and then reuse, as it is a simple string. Just insert it into some Delta table and select it back when needed.
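For example, a quick sketch of all three (the file path, header option, and column names are assumptions):

df = spark.read.option("header", True).csv("/path/to/product.csv")

st = df.schema                    # StructType object
df.printSchema()                  # prints the schema tree to stdout
ddl = df._jdf.schema().toDDL()    # plain string, e.g. "ProductId INT,ProductDesc STRING"

# The DDL string can be passed straight back to .schema() when re-reading:
df2 = spark.read.schema(ddl).option("header", True).csv("/path/to/product.csv")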

tarente
New Contributor III

Hi Hubert,

Thanks for your answer, but I was not able to make it work.

Let me ask the question in a different way.

I have a CSV file with the following basic structure:

  • ProductId - integer.
  • ProductDesc - string.
  • ProductCost - decimal.

In PySpark I would like to store the file schema in:

  1. In a variable to be used in spark.read.schema(schema).options(**fileOptions).load(...).
  2. Be able to store the file schema in a Delta table's column.

What kind of transformations do I need to do to the variable in 1. to be able to store it in 2., and vice versa?

Thanks in advance,

Tiago R.

Kaniz
Community Manager

Hi @Tiago Rente​ , Hope this would help.

# Read the CSV file into a DataFrame
csv_file = spark.read.csv("/path/to/input/data", header=True, sep=",")
# Write the DataFrame to a Delta location, overwriting any existing data and schema
csv_file.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/mnt/delta/product")
# Register the Delta location as a table
spark.sql("CREATE TABLE product USING DELTA LOCATION '/mnt/delta/product/'")

tarente
New Contributor III

Hi Kaniz,

Thanks for your answer, although it did not answer my questions.

RKNutalapati
Valued Contributor

Hi @Tiago Rente, hope the code below helps.

[The code was posted as a screenshot; see the accepted solution above.]

tarente
New Contributor III

Hi,

Thanks for your code, I will test it.

Regards,

Tiago.

Anonymous
Not applicable

@Tiago Rente​ - How did the test go?

tarente
New Contributor III

Hi Piper,

Unfortunately, I was not able to test it before I moved to a new employer, so I can no longer test it. However, I think it would work.

Regards,

Tiago R.

Anonymous
Not applicable

@tarente - Thanks for letting us know. 🙂
