03-09-2015 05:27 PM
03-09-2015 05:33 PM
registerTempTable()
registerTempTable() creates an in-memory table that is scoped to the cluster in which it was created. The data is stored using Hive's highly-optimized, in-memory columnar format.
This is important for dashboards: a dashboard running in a different cluster (i.e. the single Dashboard Cluster) will not have access to the temp tables registered in another cluster.
Re-registering a temp table of the same name (using overwrite=true) but with new data causes an atomic memory-pointer switch, so the new data is seamlessly updated and immediately accessible for querying (i.e. from a Dashboard).
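For reference, a minimal sketch of the temp-table flow described above (Scala; sqlContext is the one provided in Databricks notebooks, and the paths and table name are hypothetical):

    // Register a table scoped to this cluster
    val df = sqlContext.jsonFile("/mnt/lake/events.json")
    df.registerTempTable("events")

    // Queryable from any notebook attached to the same cluster
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()

    // Re-registering the same name swaps in the new data
    val updated = sqlContext.jsonFile("/mnt/lake/events_v2.json")
    updated.registerTempTable("events")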
saveAsTable()
saveAsTable() creates a permanent, physical table stored in S3 using the Parquet format. This table is accessible to all clusters, including the dashboard cluster. The table metadata, including the location of the file(s), is stored in the Hive metastore.
Re-creating a permanent table of the same name (using overwrite=true) but with new data causes the old data to be deleted and the new data to be saved in the same underlying files on S3. This may lead to moments when the data is not available due to S3's eventual-consistency model. There are ongoing improvements to reduce this downtime, however.
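A minimal sketch of the permanent-table flow, assuming the Spark 1.3 DataFrame API and a HiveContext (as provided in Databricks notebooks); the paths and table name are hypothetical:

    import org.apache.spark.sql.SaveMode

    val df = sqlContext.jsonFile("/mnt/lake/events.json")

    // Creates a physical Parquet table and registers it in the Hive metastore
    df.saveAsTable("events_permanent")

    // Re-create the table with new data; the old files on S3 are replaced
    val updated = sqlContext.jsonFile("/mnt/lake/events_v2.json")
    updated.saveAsTable("events_permanent", SaveMode.Overwrite)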
03-12-2015 01:47 PM
I'm an extreme beginner with Spark, so I'm probably missing something big. Using saveAsTable(), how can I specify where to store the Parquet file(s) in S3? saveAsTable() accepts only a table name and saves the data in DBFS at /user/hive/warehouse/. I already mounted S3 with dbutils.fs.mount at /mnt/lake. Thanks
03-12-2015 05:42 PM
@Claudio Beretta - you are likely looking for the saveAsParquet() operation. You can find out more about that and other operations in the API documentation for SchemaRDDs.
One important note: SchemaRDD will be changed to DataFrame in an upcoming release.
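For example (a sketch; note that in the open-source API the method is spelled saveAsParquetFile, and the bucket path here is hypothetical):

    val df = sqlContext.jsonFile("/mnt/lake/events.json")

    // Write Parquet files directly to a location you control on S3,
    // instead of the metastore-managed /user/hive/warehouse/ location
    df.saveAsParquetFile("s3n://my-bucket/tables/events")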
03-12-2015 06:44 PM
Thanks @Pat McDonough, I tried to use saveAsParquet(s"s3n://...") earlier but it complained with "java.lang.RuntimeException: Unsupported datatype TimestampType".
About saveAsTable(), I liked that it persists the data and registers the table at the same time. If only it could save to S3, as the answer states, it would be perfect for what I was trying to do.
04-06-2015 10:12 AM
@Claudio Beretta, you're likely hitting this Spark SQL bug: https://issues.apache.org/jira/browse/SPARK-4987
It's been resolved in Spark 1.3 and will be picked up by Databricks Cloud in the next few weeks.
Thanks for reporting this. A possible workaround is to convert to Unix epoch time using to_unixtime()/from_unixtime().
Sorry for the inconvenience.
-Chris
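A sketch of that workaround, assuming a hypothetical table events with a TimestampType column ts and a second column value (in HiveQL the conversion functions are spelled unix_timestamp() and from_unixtime()):

    // Replace the timestamp column with a Unix epoch (bigint) before writing Parquet
    val epochDf = sqlContext.sql("SELECT unix_timestamp(ts) AS ts_epoch, value FROM events")
    epochDf.saveAsParquetFile("s3n://my-bucket/tables/events")

    // Convert back to a readable timestamp when querying:
    //   SELECT from_unixtime(ts_epoch) AS ts, value FROM events_parquet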
02-26-2016 05:48 PM
When I query this S3 table, is the data loaded into memory before any computation? What does the CREATE TABLE command do then? Just store metadata in the metastore?
11-06-2015 04:59 PM
Are the Parquet files stored in my S3? I cannot see them. I was expecting something like a "/_tmp_tables" folder, but couldn't find anything!
11-11-2015 09:22 AM
@anton - the root of DBFS is configured to point to an S3 bucket in the Databricks account. There is an update coming soon to allow users to define where this root bucket should point. Until then, you'll want to point to a specific DBFS mount point or S3 URL when saving your data if you'd like to subsequently access it directly from S3.
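For instance (a sketch, reusing the /mnt/lake mount from the question above; the subdirectory and bucket name are hypothetical):

    val df = sqlContext.jsonFile("/mnt/lake/events.json")

    // Save under a mount point backed by your own S3 bucket so the files
    // are visible directly in that bucket rather than under the DBFS root
    df.saveAsParquetFile("/mnt/lake/tables/events")

    // Equivalently, address the bucket by URL
    df.saveAsParquetFile("s3n://my-bucket/tables/events")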
11-16-2015 04:11 AM
In that case, what is the pricing model? What if the Parquet table is really big? Is hosting it in the Databricks account included in the subscription price? Is there a limit on the size of the table?
11-16-2015 04:21 PM
@anton - we'll reach out to you offline to talk in more detail, but the main answer is that Databricks is only a compute service and uses other, existing storage systems. So you are only paying for compute nodes in Databricks.
02-12-2016 04:17 PM
This video shows that registerTempTable can be used to share a table between sessions. Is it a Databricks-platform-specific feature, or can you do something like this in general? If so, how is it done? As far as I know, Hive temporary tables are bound to a specific session. Is it some kind of SQLContext injection?
03-14-2017 10:32 AM
I need to accomplish the same thing myself; different classes need to share a DataFrame/temp table.
02-17-2016 04:58 AM
Hi @cfregly
One correction: registerTempTable does not create a "cached" in-memory table, but rather an alias or a reference to the DataFrame. It's akin to a pointer in C/C++ or a reference in Java.
Every time you query the temp table, you are actually re-executing the DAG unless you explicitly cache the DataFrame.
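A small sketch of the distinction (table, path, and column names hypothetical):

    val df = sqlContext.jsonFile("/mnt/lake/events.json")

    // Registers only a name -> DataFrame mapping; nothing is materialized yet
    df.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()  // re-executes the full DAG

    // Explicitly cache to get a true in-memory columnar table
    sqlContext.cacheTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()  // first query materializes the cache
    sqlContext.sql("SELECT AVG(value) FROM events").collect() // later queries are served from memory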
03-14-2018 02:41 PM
Can you please elaborate on how it is different from an in-memory table? A DataFrame lives in memory as long as the program is running, right?