03-09-2015 05:27 PM
03-09-2015 05:33 PM
registerTempTable()
registerTempTable() creates an in-memory table that is scoped to the cluster in which it was created. The data is stored using Hive's highly-optimized, in-memory columnar format.
This is important for dashboards: a dashboard running in a different cluster (i.e. the single Dashboard Cluster) will not have access to the temp tables registered in another cluster.
Re-registering a temp table of the same name (using overwrite=true) but with new data causes an atomic memory-pointer switch, so the new data is seamlessly updated and immediately accessible for querying (i.e. from a Dashboard).
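For reference, a minimal sketch of the temp-table flow described above (Scala; sqlContext is the one provided in Databricks notebooks, and the paths and table name are hypothetical):

    // Register a table scoped to this cluster
    val df = sqlContext.jsonFile("/mnt/lake/events.json")
    df.registerTempTable("events")

    // Queryable from any notebook attached to the same cluster
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()

    // Re-registering the same name swaps in the new data
    val updated = sqlContext.jsonFile("/mnt/lake/events_v2.json")
    updated.registerTempTable("events")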
saveAsTable()
saveAsTable() creates a permanent, physical table stored in S3 using the Parquet format. This table is accessible to all clusters, including the dashboard cluster. The table metadata, including the location of the file(s), is stored in the Hive metastore.
Re-creating a permanent table of the same name (using overwrite=true) but with new data causes the old data to be deleted and the new data to be saved in the same underlying files on S3. This may lead to moments when the data is not available due to S3's eventual-consistency model. There are ongoing improvements to reduce this downtime, however.
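A minimal sketch of the permanent-table flow, assuming the Spark 1.3 DataFrame API and a HiveContext (as provided in Databricks notebooks); the paths and table name are hypothetical:

    import org.apache.spark.sql.SaveMode

    val df = sqlContext.jsonFile("/mnt/lake/events.json")

    // Creates a physical Parquet table and registers it in the Hive metastore
    df.saveAsTable("events_permanent")

    // Re-create the table with new data; the old files on S3 are replaced
    val updated = sqlContext.jsonFile("/mnt/lake/events_v2.json")
    updated.saveAsTable("events_permanent", SaveMode.Overwrite)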
03-12-2015 01:47 PM
I'm an extreme beginner with Spark, so I'm probably missing something big. Using saveAsTable(), how can I specify where to store the Parquet file(s) in S3? saveAsTable() accepts only a table name and saves the data in DBFS at /user/hive/warehouse/. I already mounted S3 with dbutils.fs.mount at /mnt/lake. Thanks
03-12-2015 05:42 PM
@Claudio Beretta - you are likely looking for the saveAsParquet() operation. You can find out more about that and other operations in the API documentation for SchemaRDDs.
One important note: SchemaRDD will be changed to DataFrame in an upcoming release.
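For example (a sketch; note that in the open-source API the method is spelled saveAsParquetFile, and the bucket path here is hypothetical):

    val df = sqlContext.jsonFile("/mnt/lake/events.json")

    // Write Parquet files directly to a location you control on S3,
    // instead of the metastore-managed /user/hive/warehouse/ location
    df.saveAsParquetFile("s3n://my-bucket/tables/events")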
03-12-2015 06:44 PM
Thanks @Pat McDonough, I tried to use saveAsParquet(s"s3n://...") earlier but it complained with "java.lang.RuntimeException: Unsupported datatype TimestampType".
About saveAsTable(), I liked that it persists the data and registers the table at the same time. If only it could save to S3, as the answer states, it would be perfect for what I was trying to do.
04-06-2015 10:12 AM
@Claudio Beretta, you're likely hitting this Spark SQL bug: https://issues.apache.org/jira/browse/SPARK-4987
It's been resolved in Spark 1.3 and will be picked up by Databricks Cloud in the next few weeks.
Thanks for reporting this. A possible workaround is to convert to Unix epoch time using to_unixtime()/from_unixtime().
Sorry for the inconvenience.
-Chris
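A sketch of that workaround, assuming a hypothetical table events with a TimestampType column ts and a second column value (in HiveQL the conversion functions are spelled unix_timestamp() and from_unixtime()):

    // Replace the timestamp column with a Unix epoch (bigint) before writing Parquet
    val epochDf = sqlContext.sql("SELECT unix_timestamp(ts) AS ts_epoch, value FROM events")
    epochDf.saveAsParquetFile("s3n://my-bucket/tables/events")

    // Convert back to a readable timestamp when querying:
    //   SELECT from_unixtime(ts_epoch) AS ts, value FROM events_parquet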
02-26-2016 05:48 PM
When I query this S3 table, is the data loaded into memory before any computation? What does the CREATE TABLE command do then? Just store metadata in the metastore?
11-06-2015 04:59 PM
Are the Parquet files stored in my S3? I cannot see them. I was expecting something like a "/_tmp_tables" folder, but couldn't find anything!
11-11-2015 09:22 AM
@anton - the root of DBFS is configured to point to an S3 bucket in the Databricks account. There is an update coming soon to allow users to define where this root bucket should point. Until then, you'll want to point to a specific DBFS mount point or S3 URL when saving your data if you'd like to subsequently access it directly from S3.
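For instance (a sketch, reusing the /mnt/lake mount from the question above; the subdirectory and bucket name are hypothetical):

    val df = sqlContext.jsonFile("/mnt/lake/events.json")

    // Save under a mount point backed by your own S3 bucket so the files
    // are visible directly in that bucket rather than under the DBFS root
    df.saveAsParquetFile("/mnt/lake/tables/events")

    // Equivalently, address the bucket by URL
    df.saveAsParquetFile("s3n://my-bucket/tables/events")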
11-16-2015 04:11 AM
In that case, what is the pricing model? What if the Parquet table is really big? Is hosting it in the Databricks account included in the subscription price? Is there a limit on the size of the table?
11-16-2015 04:21 PM
@anton - we'll reach out to you offline to talk in more detail, but the main answer is that Databricks is only a compute service and uses other, existing storage systems. So you are only paying for compute nodes in Databricks.
02-12-2016 04:17 PM
This video shows that registerTempTable can be used to share a table between sessions. Is it a Databricks-platform-specific feature, or can you do something like this in general? If so, how is it done? As far as I know, Hive temporary tables are bound to a specific session. Is it some kind of SQLContext injection?
03-14-2017 10:32 AM
I need to accomplish the same thing myself; different classes need to share a DataFrame/temp table.
02-17-2016 04:58 AM
Hi @cfregly
One correction: registerTempTable does not create a "cached" in-memory table, but rather an alias or a reference to the DataFrame. It's akin to a pointer in C/C++ or a reference in Java.
Every time you query the temp table, you are actually re-executing the DAG unless you explicitly cache the DataFrame.
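A small sketch of the distinction (table, path, and column names hypothetical):

    val df = sqlContext.jsonFile("/mnt/lake/events.json")

    // Registers only a name -> DataFrame mapping; nothing is materialized yet
    df.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()  // re-executes the full DAG

    // Explicitly cache to get a true in-memory columnar table
    sqlContext.cacheTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()  // first query materializes the cache
    sqlContext.sql("SELECT AVG(value) FROM events").collect() // later queries are served from memory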
03-14-2018 02:41 PM
Can you please elaborate on how it is different from an in-memory table? A DataFrame lives in memory as long as the program is running, right?