03-09-2015 05:27 PM
03-09-2015 05:33 PM
registerTempTable()
registerTempTable() creates an in-memory table that is scoped to the cluster in which it was created. The data is stored using Hive's highly-optimized, in-memory columnar format.
This matters for dashboards: a dashboard running in a different cluster (i.e., the single Dashboard Cluster) will not have access to temp tables registered in another cluster.
Re-registering a temp table of the same name (using overwrite=true) but with new data causes an atomic memory pointer switch, so the new data is seamlessly updated and immediately accessible for querying (e.g., from a Dashboard).
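To make the "pointer switch" idea concrete, here is a minimal toy model in plain Python. It is not Spark's implementation: the `TempTableCatalog` class and its `register` and `query` methods are hypothetical names invented for illustration, with one catalog instance standing in for one cluster.

```python
import threading


class TempTableCatalog:
    """Toy stand-in for a per-cluster temp table registry (not Spark code)."""

    def __init__(self):
        self._tables = {}              # table name -> immutable snapshot of rows
        self._lock = threading.Lock()

    def register(self, name, rows, overwrite=False):
        snapshot = tuple(rows)         # build the new data fully before publishing
        with self._lock:
            if name in self._tables and not overwrite:
                raise ValueError(f"temp table {name!r} already exists")
            self._tables[name] = snapshot   # a single reference swap

    def query(self, name):
        # A reader sees either the old snapshot or the new one, never a mix.
        return self._tables[name]


cluster_a = TempTableCatalog()
cluster_b = TempTableCatalog()         # a second "cluster" with its own registry

cluster_a.register("events", [("a", 1), ("b", 2)])
before = cluster_a.query("events")

cluster_a.register("events", [("c", 3)], overwrite=True)
after = cluster_a.query("events")

# The table was never registered in cluster_b, so it is not visible there.
visible_in_b = "events" in cluster_b._tables
```

A reader that already holds `before` keeps a consistent old snapshot, while any new call to `query` returns the new data. The separate `cluster_b` registry mirrors why a dashboard in another cluster cannot see the table.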
saveAsTable()
saveAsTable() creates a permanent, physical table stored in S3 using the Parquet format. This table is accessible to all clusters including the dashboard cluster. The table metadata including the location of the file(s) is stored within the Hive metastore.
Re-creating a permanent table of the same name (using overwrite=true) but with new data causes the old data to be deleted and the new data to be saved in the same underlying file on S3. This may lead to moments when the data is not available due to S3's eventual consistency model. There are ongoing improvements to reduce this downtime, however.
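The availability gap can be pictured with a small toy model. It captures only the simpler delete-then-write window, not S3's eventual consistency itself, and it does not describe how Databricks implements overwrites. The `ObjectStore` class, the `metastore` dict, the paths, and both overwrite functions are hypothetical. The write-then-repoint variant is included only as a contrast and is one common way to avoid such a gap, not something the answer above describes.

```python
class ObjectStore:
    """Toy key-value stand-in for S3 (hypothetical, no real S3 semantics)."""

    def __init__(self):
        self.objects = {}


def overwrite_in_place(store, path, new_rows, observe):
    # Delete first, then write to the same path; `observe` runs in the gap.
    store.objects.pop(path, None)
    observe()
    store.objects[path] = list(new_rows)


def overwrite_by_swap(store, metastore, table, new_path, new_rows, observe):
    # Write to a fresh path, then repoint the table; the old path stays readable.
    store.objects[new_path] = list(new_rows)
    observe()
    metastore[table] = new_path


store = ObjectStore()
metastore = {"sales": "warehouse/sales/v1"}    # table name -> data location
store.objects["warehouse/sales/v1"] = [1, 2, 3]


def readable():
    # Can a reader resolve the table's current location to existing data?
    return metastore["sales"] in store.objects


seen_in_place = []
overwrite_in_place(store, "warehouse/sales/v1", [4, 5],
                   lambda: seen_in_place.append(readable()))

seen_swap = []
overwrite_by_swap(store, metastore, "sales", "warehouse/sales/v2", [6, 7],
                  lambda: seen_swap.append(readable()))

final_rows = store.objects[metastore["sales"]]
```

In this model the in-place overwrite records a moment when the table cannot be read, while the swap variant stays readable throughout because the old location is untouched until the table is repointed.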
03-12-2015 01:47 PM
I'm an extreme beginner with Spark, so I'm probably missing something big. Using saveAsTable(), how can I specify where to store the Parquet file(s) in S3? saveAsTable() accepts only a table name and saves the data in DBFS at this location: /user/hive/warehouse/. I have already mounted S3 with dbutils.fs.mount at /mnt/lake. Thanks
03-12-2015 05:42 PM
@Claudio Beretta - you are likely looking for the saveAsParquet() operation. You can find out more about that and other operations in the API documentation for SchemaRDDs.
One important note: SchemaRDD will be changed to DataFrame in an upcoming release.
03-12-2015 06:44 PM
Thanks @Pat McDonough. I tried saveAsParquet(s"s3n://...") earlier, but it failed with "java.lang.RuntimeException: Unsupported datatype TimestampType".
As for saveAsTable(), I like that it persists the data and registers the table at the same time. If only it could save to S3, as the answer states, it would be perfect for what I was trying to do.