I have created an external table over Azure Data Lake Storage Gen2.
The container has about 200K JSON files.
The table matching the structure of the JSON files is created with:
```
CREATE EXTERNAL TABLE IF NOT EXISTS dbo.table(
ComponentInfo STRUCT<ComponentHost: STRING, ComponentId: STRING, ComponentName: STRING, ComponentVersion: STRING, SubSystem: STRING>,
CorrelationId STRING,
Event STRUCT<Category: STRING, EventName: STRING, MessageId: STRING, PublishTime: STRING, SubCategory: STRING>,
References STRUCT<CorrelationId: STRING>)
USING org.apache.spark.sql.json OPTIONS ('multiLine' = 'true')
LOCATION 'dbfs:/mnt/mnt'
```
Counting takes a very long time to run and is still at stage 62 with 754 tasks. Loading the top 200 rows is fine, but is there an incorrect setup here that needs to be addressed? I have worked with Spark on AWS and cut an INSERT OVERWRITE query's runtime in half, so I am wondering if there is a better way to set this up.
Should it be partitioned? If so, the sketch below is roughly what I had in mind.
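This is only a sketch of what I mean by partitioning, not something I have run. `EventDate` is a hypothetical column: it does not exist in the JSON today, so I would have to derive it from `Event.PublishTime` and rewrite the files into an `EventDate=.../` folder layout first.
```
-- Hypothetical partitioned variant of the table. EventDate is an assumed,
-- derived column; the files would need to be laid out as
-- .../EventDate=2023-01-01/... before this table could see the partitions.
CREATE EXTERNAL TABLE IF NOT EXISTS dbo.table_partitioned (
  ComponentInfo STRUCT<ComponentHost: STRING, ComponentId: STRING, ComponentName: STRING, ComponentVersion: STRING, SubSystem: STRING>,
  CorrelationId STRING,
  Event STRUCT<Category: STRING, EventName: STRING, MessageId: STRING, PublishTime: STRING, SubCategory: STRING>,
  References STRUCT<CorrelationId: STRING>,
  EventDate DATE)
USING org.apache.spark.sql.json
OPTIONS ('multiLine' = 'true')
PARTITIONED BY (EventDate)
LOCATION 'dbfs:/mnt/mnt-partitioned'
```
followed by `MSCK REPAIR TABLE dbo.table_partitioned` to register the partitions, if I understand that correctly. Is that the right direction, or is the sheer file count/format the real problem?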
Also, the Databricks workspace is in US East and the storage account is in US West 2 - could that be the culprit?
For reference, the count query is simply:
```
SELECT COUNT(*) FROM dbo.table
```