cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

What is the difference between registerTempTable() and saveAsTable()?

cfregly
Contributor
 
15 REPLIES 15

cfregly
Contributor

registerTempTable()

registerTempTable() creates an in-memory table that is scoped to the cluster in which it was created. The data is stored using Hive's highly-optimized, in-memory columnar format.

This is important for dashboards as dashboards running in a different cluster (ie. the single Dashboard Cluster) will not have access to the temp tables registered in another cluster.

Re-registering a temp table of the same name (using overwrite=true) but with new data causes an atomic memory pointer switch so the new data is seemlessly updated and immediately accessble for querying (ie. from a Dashboard).

saveAsTable()

saveAsTable() creates a permanent, physical table stored in S3 using the Parquet format. This table is accessible to all clusters including the dashboard cluster. The table metadata including the location of the file(s) is stored within the Hive metastore.

Re-creating a permanent table of the same name (using overwrite=true) but with new data causes the old data to be deleted and the new data to be saved in the same underlying file on S3. This may lead to moments when the data is not available due to S3's eventual consistency model. There are on-going improvements to reduce this down time, however.

I'm a extreme beginner with Spark, so I'm probably missing something big. Using saveAsTable(), how can I specify where to store the parquet file(s) in S3? SaveAsTable accepts only a table name, and saves data in the dbfs at this location /user/hive/warehouse/. I already mounted S3 with dbutils.fs.mount in /mnt/lake. Thanks

Anonymous
Not applicable

@Claudio Beretta - you are likely looking for the

saveAsParquet()
operation. You can find out more about that and other operations in the API documentation for SchemaRDDs.

One important note:

SchemaRDD
will be changed to
DataFrame
in an upcoming release.

Thanks @Pat McDonough​ , I tried to use saveAsParquet(s"s3n://...") earlier but it complained with "java.lang.RuntimeException: Unsupported datatype TimestampType".

About saveAsTable() I liked that it persists the data and registers the table at the same time. If only it could save it to S3, as the answer states, it would be perfect for what I was trying to do.

@Claudio Beretta​  you're likely hitting this Spark SQL bug:https://issues.apache.org/jira/browse/SPARK-4987

it's been resolved in Spark 1.3 and will be picked up by Databricks Cloud in the next few weeks.

thanks for reporting this. possible workaround is to convert to Unix epoch time using to_unixtime(), from_unixtime().

sorry for the inconvenience.

-Chris

When I query on this S3 table, are they loaded in memory before any computation? What does the create table command do then? Just storing metadata in metastore?

anton
New Contributor II

Are parquet files stored on my s3? I can not see them, i was expecting something like "/_tmp_tables" folder, but couldn't find anything!

Anonymous
Not applicable

@anton - the root of DBFS is configured to point to an S3 bucket in the Databricks account. There is an update coming soon to allow for users to define where this root bucket should point to. In lieu of that, you'll want to point to a specific DBFS mount point or S3 URL when saving your data if you'd like to subsequently access it directly from S3.

anton
New Contributor II

In that case - what is the pricing model? What if parquet table is really big, does hosting it on DB account included in the subscription price? is there a limit on the size of the table?

Anonymous
Not applicable

@anton - we'll reach out to you offline to talk in more detail, but the main answer is that Databricks is only a compute service and uses other, existing storage systems. So you are only paying for compute nodes in Databricks.

max_tenerowicz
New Contributor II

This video show that registerTempTable can be used to share table between sessions. Is it Databricks platform specific feature or can do something like this in general? If so how is it done? As far as I know Hive temporary tables are bound to the specific session. Is it some kind of SQLContext injection?

I need to accomplish the same myself, different classes need to share a dataframe/temp table.

rlgarris
New Contributor III
New Contributor III

Hi @cfregly

One correction. registerTempTable does not create a "cached" in-memory table, but rather an alias or a reference to the DataFrame. It's akin to a pointer in C/C++ or a reference in Java.

Every time you query the temp table you are actually re-executing the DAG unless you explicitly cache the DataFrame.

Can you please elaborate how it is different from in-memory table, because dataframe lives in-memory as long as the program is running right?

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!