cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Pyspark - how to save the schema of a csv file in a delta table's column

tarente
New Contributor III

How to save the schema of a csv file in a delta table's column?

In a previous project implemented in Databricks using Scala notebooks, we stored the schema of csv files as a "json string" in a SQL Server table.

When we needed to read or write the csv and the source dataframe das 0 rows, or the source csv does not exist, we use the schema stored in the SQL Server to either create an empty dataframe or empty csv file.

Now, I would like to implement something similar in Databricks but using Python notebook and store the schema of csv files in a delta table.

Any suggestions?

Thanks in advance,

Tiago.

1 ACCEPTED SOLUTION

Accepted Solutions

RKNutalapati
Valued Contributor

Hi @Tiago Rente​ , Hope below code would help.

image

View solution in original post

9 REPLIES 9

Hubert-Dudek
Esteemed Contributor III

After you read csv to dataframe spark.read.csv ... there are 3 ways

DataFrame.Schema

DataFrame.printSchema() - it is StructType

and 3rd tricky way is DDL string

DataFrame._jdf.schema().toDDL()

Usually DDL as it is simple string is easiest to save somewhere and than reuse. Just insert to some delta table schema and then select when needed.

tarente
New Contributor III

Hi Hubert,

Thanks for you answer, but I was not able to make it work.

Let me ask the question in a different way.

I have a csv file with the following basic estruture:

  • ProductId - integer.
  • ProductDesc - string.
  • ProductCost - decimal.

In PySpark I would like to store the file schema in:

  1. In a variable to be used in the spark.read.schema(schema).options(**fileOptions).schema(schema).load(...).
  2. Be able to store the file schema in a delta table's column.

What kind of transformations do I need to do to the variable in 1. to be able to stored in 2., and vice-versa?

Thanks in advance,

Tiago R.

Hi @Tiago Rente​ , Hope this would help.

csv_file= spark.read.csv("/path/to/input/data",header=True,sep=","); 
csv_file.write.format("delta").mode("overwrite").option('overwriteSchema','true').save("/mnt/delta/product") 
spark.sql("CREATE TABLE employee USING DELTA LOCATION '/mnt/delta/product/'")

tarente
New Contributor III

Hi Kaniz,

Thanks for your answer, although it did not answer my questions.

RKNutalapati
Valued Contributor

Hi @Tiago Rente​ , Hope below code would help.

image

tarente
New Contributor III

Hi,

Thanks for you code, I will test it.

Regards,

Tiago.

Anonymous
Not applicable

@Tiago Rente​ - How did the test go?

tarente
New Contributor III

Hi Piper,

Unfortunately, I was not able to test it before I changed to a new employer, so I can no longer test it. However, I think it would work.

Regards,

Tiago R.

Anonymous
Not applicable

@tarente - Thanks for letting us know. 🙂

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group