RDD Parallelism without delta_log
12-14-2024 03:27 PM
I've set up my script to use a multi-node cluster, but I'm running into an issue when iterating over a list of .json files to sink to a SQL table via a JDBC driver. The error says that a delta_log file (which I can't see in my blob container) is causing my original file path list to break. Where is this delta_log file, and how do I avoid this failure?
12-14-2024 04:51 PM
Hi @Wildabeast,
Is your table a Delta table? And are you getting any failures?
12-15-2024 11:56 AM
No, it's an SSMS dbo table.
12-16-2024 05:22 AM
And what is the error that you are getting?
Can you run one test? You can filter out the _delta_log directory when listing the files. Here is an example of how you can do this in Python:
# dbutils is available by default in Databricks notebooks
# List everything in the directory
all_files = dbutils.fs.ls("path/to/your/directory")
# Filter out the _delta_log directory
json_files = [file.path for file in all_files if "_delta_log" not in file.path]
# Now you can iterate over json_files
for file_path in json_files:
    # Your code to process each JSON file goes here
    print(file_path)
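A minimal sketch of the same filter that can be tried outside a notebook. It assumes the entries returned by `dbutils.fs.ls` expose a `path` attribute, as the snippet above does; `FileInfo`, `keep_json_files`, and the sample paths are hypothetical stand-ins. It matches `_delta_log` as a whole path segment rather than a substring, and keeps only `.json` files regardless of case.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by dbutils.fs.ls
FileInfo = namedtuple("FileInfo", ["path", "name"])


def keep_json_files(entries):
    """Return paths of .json files, skipping anything under a _delta_log directory."""
    kept = []
    for entry in entries:
        segments = entry.path.rstrip("/").split("/")
        if "_delta_log" in segments:
            continue  # the Delta transaction log directory, or a file inside it
        if segments[-1].lower().endswith(".json"):
            kept.append(entry.path)
    return kept


# Made-up listing, for illustration only
sample = [
    FileInfo("dbfs:/data/feed/CompanyContacts.JSON", "CompanyContacts.JSON"),
    FileInfo("dbfs:/data/feed/_delta_log/", "_delta_log/"),
    FileInfo("dbfs:/data/feed/_delta_log/00000000000000000000.json", "00000000000000000000.json"),
    FileInfo("dbfs:/data/feed/notes.txt", "notes.txt"),
]
print(keep_json_files(sample))
```

Note that this only helps if `_delta_log` shows up in the listing itself; it does not change what Spark finds when it later reads each path.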
12-16-2024 05:56 AM
I actually submitted the error and my .py file, but for some reason it didn't post yesterday. Let me pull it up while I'm running your filter suggestion.
12-16-2024 06:17 AM - edited 12-16-2024 06:21 AM
It's in Azure; we just named it awsindividual because it's a migration.
Error:
Processing file: https://awsindividual.blob.core.windows.net/awsdashdata1/dash.test/Sync-45/Feed/Company/1570546/Comp...
File name: CompanyContacts.JSON, Table: dbo.CompanyContacts, Schema: StructType([StructField('Contacts', ArrayType(StructType([StructField('Address', StructType([StructField('AddressLine1', StringType(), True), StructField('AddressLine2', StringType(), True), StructField('City', StringType(), True), StructField('StateProvince', StringType(), True), StructField('Country', StringType(), True), StructField('PostalCode', StringType(), True), StructField('County', StringType(), True)]), True), StructField('BillingAddress', StructType([StructField('AddressLine1', StringType(), True), StructField('AddressLine2', StringType(), True), StructField('City', StringType(), True), StructField('StateProvince', StringType(), True), StructField('Country', StringType(), True), StructField('PostalCode', StringType(), True), StructField('County', StringType(), True)]), True), StructField('ContactID', IntegerType(), True), StructField('CorrespondenceEmail', StringType(), True), StructField('InquiryEmail', StringType(), True), StructField('MailingAddress', StructType([StructField('AddressLine1', StringType(), True), StructField('AddressLine2', StringType(), True), StructField('City', StringType(), True), StructField('StateProvince', StringType(), True), StructField('Country', StringType(), True), StructField('PostalCode', StringType(), True), StructField('County', StringType(), True)]), True), StructField('MainPhone', StructType([StructField('Number', StringType(), True), StructField('Extension', StringType(), True)]), True), StructField('OtherPhones', ArrayType(StringType(), True), True), StructField('Website', StringType(), True)]), True), True), StructField('CompanyID', IntegerType(), True)])
Error reading file https://awsindividual.blob.core.windows.net/awsdashdata1/dash.test/Sync-45/Feed/Company/1570546/Comp...: [DELTA_INVALID_FORMAT] Incompatible format detected.
A transaction log for Delta was found at `https://awsindividual.blob.core.windows.net/awsdashdata1/dash.test/Sync-45/Feed/Company/1570546/Comp..., but you are trying to read from `https://awsindividual.blob.core.windows.net/awsdashdata1/dash.test/Sync-45/Feed/Company/1570546/Comp... using format("json"). You must use 'format("delta")' when reading and writing to a delta table. To learn more about Delta, see https://docs.microsoft.com/azure/databricks/delta/index
Processing Results:
Skipped: https://awsindividual.blob.core.windows.net/awsdashdata1/dash.test/Sync-45/Feed/Company/1570546/Comp...
Processing Completed!
12-16-2024 07:07 AM
This is impossible, there's no delta_log:
Failed to process raw JSON: https://awsindividual.blob.core.windows.net/awsdashdata1/dash.test/Sync-3/Feed/Company/1014148/Compa... - [DELTA_INVALID_FORMAT] Incompatible format detected. A transaction log for Delta was found at `https://awsindividual.blob.core.windows.net/awsdashdata1/dash.test/Sync-3/Feed/Company/1014148/Compa...`, but you are trying to read from `https://awsindividual.blob.core.windows.net/awsdashdata1/dash.test/Sync-3/Feed/Company/1014148/Compa...` using format("text"). You must use 'format("delta")' when reading and writing to a delta table.
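The messages say a transaction log was found at some path, and that path need not be the folder holding the JSON file. As far as I know, this error is typically raised when a `_delta_log` directory exists at the path being read or at one of its parent directories, so the log may sit several levels above the files. The sketch below only lists the candidate locations to inspect; `candidate_delta_logs` and the example URL are hypothetical, and each candidate would still need to be checked by hand, for example with `dbutils.fs.ls`.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_delta_logs(file_url):
    """List each ancestor directory of file_url with '/_delta_log' appended,
    nearest ancestor first."""
    parts = urlsplit(file_url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    # Drop the file name, then walk up one directory at a time
    for depth in range(len(segments) - 1, 0, -1):
        directory = "/" + "/".join(segments[:depth])
        candidates.append(
            urlunsplit((parts.scheme, parts.netloc, directory + "/_delta_log", "", ""))
        )
    return candidates


# Hypothetical path, for illustration only
example_url = "https://example.blob.core.windows.net/container/root/Feed/Company/file.json"
for location in candidate_delta_logs(example_url):
    print(location)
```

If one of these locations does hold a `_delta_log` folder, that would explain why the log is not visible next to the JSON files themselves.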
12-16-2024 11:47 AM
Could it be that our cluster doesn't have the Delta Lake libraries loaded?
Maven JAR coordinates:
Maven: io.delta:delta-core_2.12:2.4.0
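One observation on this question: `[DELTA_INVALID_FORMAT]` appears to be an error raised by Delta itself, which suggests Delta support is already active on the cluster rather than missing. For reference, the coordinate splits into group, artifact, and version, and the `_2.12` suffix names the Scala version the artifact was built for. A small sketch, where `parse_maven_coordinate` is a hypothetical helper and not part of any library:

```python
def parse_maven_coordinate(coordinate):
    """Split 'group:artifact_scalaVersion:version' into its parts."""
    group, artifact, version = coordinate.split(":")
    name, sep, scala_version = artifact.rpartition("_")
    if not sep:
        # No Scala suffix on the artifact name
        name, scala_version = artifact, None
    return {"group": group, "artifact": name, "scala": scala_version, "version": version}


print(parse_maven_coordinate("io.delta:delta-core_2.12:2.4.0"))
```

Comparing the Scala suffix and the Delta version against the cluster's runtime is one way to rule out a library mismatch.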

