Data Engineering

Forum Posts

Madison
by New Contributor II
  • 2822 Views
  • 3 replies
  • 0 kudos

AnalysisException: [ErrorClass=INVALID_PARAMETER_VALUE] Missing cloud file system scheme

I am trying to follow along with the Apache Spark Programming training module, where the instructor creates an events table from a parquet file like this: %sql CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "/mnt/training/ecommerce/events/events.par...

Data Engineering
Databricks SQL
Latest Reply
Madison
New Contributor II
  • 0 kudos

@Kaniz Thanks for your response. I didn't provide a cloud file system scheme in the path when creating the table with the DataFrame API, but I was still able to create the table. %python # File location and type file_location = "/mnt/training/ecommerce/...
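
The error in this thread is about a path that carries no scheme prefix. As a minimal sketch, and not something the thread itself proposes, a path can be checked for a scheme with the standard library before it is passed to a table definition. The function name `has_scheme` and the example paths are hypothetical.

```python
from urllib.parse import urlparse


def has_scheme(path: str) -> bool:
    """Return True when the path starts with a URI scheme (for example dbfs:, s3://, abfss://)."""
    return bool(urlparse(path).scheme)


# Hypothetical example paths, not taken from the thread.
for candidate in ["/mnt/example/events", "dbfs:/mnt/example/events", "s3://example-bucket/events"]:
    print(candidate, has_scheme(candidate))
```

This only inspects the string; whether a given scheme is accepted depends on the workspace and catalog configuration.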

2 More Replies
XavierPereVives
by New Contributor II
  • 1034 Views
  • 3 replies
  • 0 kudos

Azure Shared Clusters - Py4J Security Exception on non-whitelisted classes

When I try to use a third-party JAR on an Azure shared cluster (it is installed via Maven and I can import it successfully), I get the following message: py4j.security.Py4JSecurityException: Method public static org.apache.spark.sql.Column com.da...

Latest Reply
XavierPereVives
New Contributor II
  • 0 kudos

Thanks Kaniz. I must use a shared cluster because I'm reading from a DLT table stored in a Unity Catalog. https://docs.databricks.com/en/data-governance/unity-catalog/compute.html My understanding is that shared clusters are enforcing the Py4J policy I ...

2 More Replies
alemo
by New Contributor III
  • 959 Views
  • 3 replies
  • 1 kudos

Delta live table UC Kinesis: options overwriteschema, ignorechanges not supported for data sourc

I am trying to build a DLT in UC with Kinesis as the producer. My first table looks like: @dlt.create_table( table_properties={ "pipelines.autoOptimize.managed": "true" }, spark_conf={"spark.databricks.delta.schema.autoMerge.enabled": "true"},) def feed_chu...

Latest Reply
Corbin
New Contributor III
  • 1 kudos

If you use the "Preview" channel in the "Advanced" section of the DLT pipeline, this error should resolve itself. The fix is planned to reach the "Current" channel by Aug 31, 2023.

2 More Replies
vroste
by New Contributor III
  • 994 Views
  • 1 reply
  • 1 kudos

Resolved! Delta Live Tables maintenance schedule

I have a DLT that runs every day and an automatically executed maintenance job that runs within 24 hours every day. The maintenance operations are costly. Is it possible to change the schedule to once a week or so?

Latest Reply
Kaniz
Community Manager
  • 1 kudos

Hi @vroste, based on the information provided, it is not possible to directly change the frequency of the automatic maintenance tasks performed by Delta Live Tables (DLT) from every 24 hours to once a week. The system is designed to perform maintenance...

scvbelle
by New Contributor III
  • 1802 Views
  • 3 replies
  • 3 kudos

Resolved! DLT failure: ABFS does not allow files or directories to end with a dot

My DLT pipeline, outlined below, generically cleans identifier tables. After successfully creating the initial streaming tables from the append-only sources, it fails when trying to create the second cleaned tables with the following: It'**bleep** cl...

Data Engineering
abfss
azure
dlt
engineering
Latest Reply
Priyanka_Biswas
Valued Contributor
  • 3 kudos

Hi @scvbelle, the error message you're seeing is an IllegalArgumentException caused by the restriction in Azure Blob File System (ABFS) that does not allow files or directories to end with a dot. This error is thrown by the trailingPeriod...
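
The restriction described in this reply can be checked ahead of time. The following is a rough sketch, assuming that scanning each slash-separated segment of a path is enough for a pre-flight check; the function name `segments_ending_with_dot` and the example path are hypothetical, and this is not the check ABFS itself performs.

```python
from typing import List


def segments_ending_with_dot(path: str) -> List[str]:
    """Return every slash-separated segment of `path` that ends with a dot."""
    return [segment for segment in path.split("/") if segment.endswith(".")]


# Hypothetical path with one offending directory name ("cleaned.").
example = "abfss://container@account.dfs.core.windows.net/tables/cleaned./part-0000"
print(segments_ending_with_dot(example))
```

Note that the relative segments `.` and `..` are also reported, since they end with a dot as well.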

2 More Replies
kinsun
by New Contributor II
  • 7189 Views
  • 5 replies
  • 0 kudos

Resolved! DBFS and Local File System Doubts

Dear Databricks Expert, I have some doubts about working with DBFS and the local file system. Case 01: Copy a file from ADLS to DBFS. I am able to do so with the Python code below: #spark.conf.set("fs.azure.account.auth.type", "OAuth") spark.conf.set("fs.a...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @KS LAU, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your q...

4 More Replies
Nino
by Contributor
  • 4351 Views
  • 9 replies
  • 5 kudos

Resolved! Where in Hive Metastore can the s3 locations of Databricks tables be found?

I have a few Databricks clusters. Some share a single Hive Metastore (HMS); call them PROD_CLUSTERS. An additional cluster, ADHOC_CLUSTER, has its own HMS. All my data is stored in S3 as Databricks Delta tables: PROD_CLUSTERS have read-wri...

Data Engineering
HMS
metastore
Latest Reply
Kaniz
Community Manager
  • 5 kudos

Hi @Nino, to query HMS to get the full path for all data files of tables defined in that HMS, you can use the Hive MetaStore API. Specifically, you can use the GET_TABLE_FILES operation to retrieve the file metadata for a given table, including the ...

8 More Replies
Soma
by Valued Contributor
  • 2391 Views
  • 10 replies
  • 2 kudos

spark streaming listener is lagging

We use a PySpark streaming listener and it is lagging by 10 hrs. The data streamed at 10 AM IST is logged at 10 PM IST. Can someone explain how the logging listener interface works?

Latest Reply
jerrymark
New Contributor II
  • 2 kudos

When you're experiencing lag in Spark Streaming, it means that the system is not processing data in real-time, and there is a delay in data processing. This delay can be caused by various factors, and diagnosing and addressing the issue requires care...

9 More Replies
sourander
by New Contributor III
  • 9550 Views
  • 15 replies
  • 7 kudos

Resolved! Protobuf deserialization in Databricks

Hi, let's assume I have these things: a binary column containing protobuf-serialized data, and the .proto file including the message definition. What different approaches have Databricks users chosen to deserialize the data? Python is the programming language that...

Latest Reply
Amou
New Contributor II
  • 7 kudos

We've now added a native connector that parses directly with Spark DataFrames. https://docs.databricks.com/en/structured-streaming/protocol-buffers.html from pyspark.sql.protobuf.functions import to_protobuf, from_protobuf schema_registry_options = ...

14 More Replies
mjbobak
by New Contributor III
  • 11833 Views
  • 5 replies
  • 9 kudos

Resolved! How to import a helper module that uses databricks specific modules (dbutils)

I have a main Databricks notebook that runs a handful of functions. In this notebook, I import a helper.py file that is in the same repo, and when I execute the import everything looks fine. Inside my helper.py there's a function that leverages built-i...

Latest Reply
amitca71
Contributor II
  • 9 kudos

Hi, I'm facing a similar issue when deploying via dbx. I have a helper notebook that works fine when executed via jobs (without any includes), but when I deploy it via dbx (to the same cluster), the helper notebook results with dbutils.fs.ls(path) NameEr...
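
A NameError on `dbutils` inside imported code usually means the name exists only in the notebook's own namespace. One common workaround, which the truncated thread does not confirm as its accepted answer, is to pass the notebook's `dbutils` object into the helper as an argument. The sketch below assumes a hypothetical helper function `list_file_names`; it relies only on `dbutils.fs.ls(path)` returning entries with a `name` attribute.

```python
# helper.py (module name taken from the question; the function is hypothetical)
def list_file_names(dbutils, path):
    """List file names under `path` using the dbutils object supplied by the caller."""
    return [entry.name for entry in dbutils.fs.ls(path)]


# In the notebook, where dbutils is already defined:
#   from helper import list_file_names
#   names = list_file_names(dbutils, "/some/path")
```

Passing the object explicitly also makes the helper easy to exercise outside Databricks with a stand-in object.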

4 More Replies
RC
by Contributor
  • 1615 Views
  • 3 replies
  • 1 kudos

Error while creating table with Glue catalog

Hi, I have a Databricks cluster that was previously connected to the Hive metastore, and we have started migrating to the Glue catalog. I'm facing an issue while creating a table: Path must be absolute: <table-name>-__PLACEHOLDER__ We have provided full access to Glue and S3 in...

Latest Reply
Kaniz
Community Manager
  • 1 kudos

Hi @RC, the error message you're seeing suggests that the table path is not absolute. This could be due to how you create the table in the Glue Catalog. As per the given sources, when using AWS Glue Data Catalog as the metastore, it's recommended to...

2 More Replies
Nasreddin
by New Contributor
  • 4113 Views
  • 2 replies
  • 0 kudos

ColumnTransformer not fitted after sklearn Pipeline loaded from MLflow

I am building a machine learning model using an sklearn Pipeline that includes a ColumnTransformer as a preprocessor before the actual model. Below is the code showing how the pipeline is created. transformers = [] num_pipe = Pipeline(steps=[ ('imputer', Si...

Latest Reply
Kaniz
Community Manager
  • 0 kudos

Hi @Nasreddin, MLflow is compatible with sklearn Pipeline with multiple steps. The error you're encountering, "This ColumnTransformer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator." is likely because C...

1 More Replies
jagger9919
by New Contributor II
  • 3947 Views
  • 11 replies
  • 7 kudos

Resolved! Unable to login to community edition

Hello there, I have successfully created a Databricks account and tried to log in to the Community Edition with the exact same login credentials as my account, but it tells me that the email/password are invalid. I can login with these same exact creden...

Latest Reply
Kaniz
Community Manager
  • 7 kudos

Hi @jagger9919, please look at this link related to the Community Edition, which might solve your problem. I appreciate your interest in sharing your Community Edition query with us. However, at this time, we are not entertaining any Community-...

10 More Replies
lnights
by New Contributor II
  • 2135 Views
  • 5 replies
  • 2 kudos

High cost of storage when using structured streaming

Hi there, I read data from Azure Event Hub and, after manipulating the data, I write the dataframe back to Event Hub (I use this connector for that): #read data df = (spark.readStream .format("eventhubs") .options(**ehConf) ...

transactions in azure storage
Latest Reply
PetePP
New Contributor II
  • 2 kudos

I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...
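
The setting this reply refers to is most likely the Spark SQL option `spark.sql.shuffle.partitions`; that mapping is an assumption, since the reply is truncated. A minimal sketch of lowering it on an existing session follows. The wrapper name `set_shuffle_partitions` and the value 8 are hypothetical, and the right value depends on data volume and cluster size.

```python
def set_shuffle_partitions(spark, num_partitions: int) -> None:
    """Set spark.sql.shuffle.partitions on the given SparkSession."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))


# In a notebook, where `spark` is the active SparkSession:
#   set_shuffle_partitions(spark, 8)  # 8 is an illustrative value only
```

Fewer shuffle partitions generally means fewer small files per micro-batch, at the cost of less parallelism in shuffle stages.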

4 More Replies