I have a table, a full scan of which takes ~20 minutes on my cluster. The table has a "Time" TIMESTAMP column and a "day" DATE column. The latter is computed (manually) as "Time" truncated to day and is used for partitioning. I query the table using predicate ...
Hi @Vladimir Ryabtsev, we haven’t heard from you since the last response from @Uma Maheswara Rao Desula, and I was checking back to see if their suggestions helped you. Otherwise, if you have any solution, please share it with the community, as it c...
We have streaming data written into Delta. We do not write to all the partitions every day. Hence I am thinking of running a compaction Spark job only on partitions that were modified yesterday. Is it possible to query the partitionsValues wr...
Hi @Gnanasoundari Soundarajan, just a friendly follow-up. Do you still need help, or did @Deepak Bhutada's response help you find the solution? Please let us know.
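For the compaction question above, one possible approach (not necessarily the one suggested in the thread) is to read the table's transaction log: each commit file under `_delta_log` is a sequence of JSON lines, and `add` actions carry a `partitionValues` map. The sketch below only parses lines that have already been read from a commit file. The function name `modified_partitions` and the sample records are made up for illustration, and selecting which commit files belong to "yesterday" is left out.

```python
import json
from typing import Iterable, Set, Tuple

# Made-up sample of commit-file lines, for illustration only.
SAMPLE_COMMIT = [
    '{"commitInfo": {"operation": "STREAMING UPDATE"}}',
    '{"add": {"path": "day=2023-03-01/part-0000.parquet", '
    '"partitionValues": {"day": "2023-03-01"}, "size": 1024, '
    '"modificationTime": 1677700000000, "dataChange": true}}',
    '{"add": {"path": "day=2023-03-01/part-0001.parquet", '
    '"partitionValues": {"day": "2023-03-01"}, "size": 2048, '
    '"modificationTime": 1677700001000, "dataChange": true}}',
    '{"add": {"path": "day=2023-02-27/part-0002.parquet", '
    '"partitionValues": {"day": "2023-02-27"}, "size": 4096, '
    '"modificationTime": 1677700002000, "dataChange": false}}',
]


def modified_partitions(commit_lines: Iterable[str]) -> Set[Tuple[Tuple[str, str], ...]]:
    """Collect the distinct partitionValues of 'add' actions that changed data."""
    seen = set()
    for line in commit_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is None:
            continue  # commitInfo, remove, metaData, etc.
        if not add.get("dataChange", True):
            continue  # files rewritten without changing data, e.g. by compaction
        values = add.get("partitionValues", {})
        # Sort the items so the same partition always yields the same tuple.
        seen.add(tuple(sorted(values.items())))
    return seen
```

Each resulting tuple could then be turned into a partition filter for the compaction job, so that only those partitions are rewritten.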
I'm new to Spark and trying to understand how some of its components work. I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, within their own memory (RAM). But I'm wondering whether the in...
Hi @Narek Margaryan, just a friendly follow-up. Do you still need help, or did the above response help you find the solution? Please let us know.
I am getting the error below while creating buckets on a Delta table. Error in SQL statement: AnalysisException: Delta bucketed tables are not supported. I have had to fall back to a Parquet table for some use cases because of this. Is there any alternative? I have...
Hi @Rahul Samant, we checked internally on this. Due to certain limitations, bucketing is not supported on Delta tables. The only alternative to bucketing is to leverage Z-ordering; below is the link for reference: https://docs.databricks.com/de...
(This is a copy of a question I asked on Stack Overflow here, but maybe this community is a better fit for the question.) Setting: Delta Lake, Databricks SQL compute used by Power BI. I am wondering about the following scenario: We have a column `timest...
In the query, I would filter first by date (generated from the timestamp we want to query) and then by the exact timestamp, so that the query benefits from partitioning.
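A minimal sketch of that idea: derive the date range from the timestamp range and put both conditions in the predicate, so the date condition can prune partitions while the timestamp condition does the exact filtering. The column names `event_date` and `event_ts` and the helper `build_predicate` are hypothetical, since the thread's real column names are not shown in full.

```python
from datetime import datetime, timedelta


def build_predicate(start: datetime, end: datetime,
                    date_col: str = "event_date",
                    ts_col: str = "event_ts") -> str:
    """Build a WHERE clause for the half-open range [start, end).

    date_col and ts_col are hypothetical column names: the date-typed
    partition column and the exact timestamp column.
    """
    if end <= start:
        raise ValueError("end must be after start")
    first_day = start.date()
    # end is exclusive, so the last day touched is the day of the
    # last instant strictly before end.
    last_day = (end - timedelta(microseconds=1)).date()
    return (
        f"{date_col} BETWEEN DATE '{first_day.isoformat()}' "
        f"AND DATE '{last_day.isoformat()}' "
        f"AND {ts_col} >= TIMESTAMP '{start.isoformat(sep=' ')}' "
        f"AND {ts_col} < TIMESTAMP '{end.isoformat(sep=' ')}'"
    )
```

The returned string is meant to be appended after `WHERE` in a query. Whether the engine prunes partitions with it still depends on the table actually being partitioned by the date column.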
I suspect there is an issue with a few partitions of the Delta file. OPTIMIZE runs fine and completes within a few minutes for other partitions, but for this particular partition it keeps running forever.
OPTIMIZE delta.`/mnt/prod-abc/Ini...
Partitioning is a way of distributing the data by keys so that you can restrict the amount of data scanned by each query and improve performance or avoid conflicts. General rules of thumb for choosing the right partition columns: Cardinality of a colu...
Partition pruning is the most efficient way to ensure data skipping. However, choosing the right column for partitioning is very important. It's common to see the wrong partitioning column cause a large number of small-file problems ...
Hi,
I need some guidelines for a performance issue with Parquet files:
I am loading a set of Parquet files using: df = sqlContext.parquetFile( folder_path )
My Parquet folder has 6 subdivision keys
It was initially OK with a first sample of data...
Having a large number of small files or folders can significantly degrade the performance of loading the data. The best approach is to keep the folders/files merged so that each file is around 64 MB in size. There are different ways to achieve this: your writ...
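As a rough illustration of the 64 MB guideline, the number of output files to aim for can be estimated from the total data size. This is only a sketch: the helper name `target_file_count` is made up, and the input size would have to come from your own storage listing.

```python
TARGET_FILE_BYTES = 64 * 1024 * 1024  # the ~64 MB per-file target mentioned above


def target_file_count(total_bytes: int, target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return how many files of roughly target_bytes are needed for total_bytes."""
    if total_bytes < 0 or target_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target positive")
    # Integer ceiling division; always write at least one file.
    return max(1, (total_bytes + target_bytes - 1) // target_bytes)
```

With Spark, a count like this could be passed to `repartition` or `coalesce` on the DataFrame before writing, so that the writer produces about that many files.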