Data Engineering

How many records does Spark use to infer the schema? entire file or just the first "X" number of records?

User15787040559
New Contributor III

It depends.

If you specify the schema, zero records are read for inference. Otherwise, Spark's default for CSV is a full file scan, which doesn't work well when processing Big Data at scale.

For CSV files, the DataFrameReader https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.csv.html?h... has a samplingRatio option that lets you control how much of the data is sampled during inference.

1 REPLY

Anand_Ladda
Honored Contributor II

As indicated, there are ways to manage the amount of data sampled for schema inference. However, as a best practice for production workloads, it is always best to define the schema explicitly for consistency, repeatability, and robustness of the pipelines. It also helps with implementing effective data quality checks using features like schema enforcement and expectations in Delta Live Tables.
