Along with several other issues I'm encountering, I'm finding pandas DataFrame.to_sql to be very slow.
I am writing to an Azure SQL database and performance is woeful. It's a test database, configured as S3 (100 DTU) with one user, me. It's quick enough for most other tasks, but terrible for this.
I don't see any issues in the database monitoring to suggest I've exceeded the DTU limit or anything similar.
This is my connection configuration, with appropriate redactions:
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

connection_string = "Driver={ODBC Driver 18 for SQL Server};" \
                    "Server=azureservername.privatelink.database.windows.net;" \
                    "Database=datawarehousename;TrustServerCertificate=yes;" \
                    "Encrypt=yes;UID=dbuser;PWD=<pwd>"
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
engine = create_engine(connection_url, fast_executemany=True)
This is my DataFrame.to_sql write to the database:
dfoutDB.to_sql("ResourceLogRpt", schema="reporting", con=engine, if_exists='append', chunksize=10000, index=False)
Changing chunksize isn't making a difference.
In between reading the data, I've manipulated URL data to extract directory elements and query parameters, roughly like the sketch below.
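For context, the extraction looks roughly like this; the "url" column and the derived column names here are illustrative, not my real schema:

from urllib.parse import urlsplit, parse_qs

# Illustrative sketch only -- "url" and the derived columns are placeholder
# names, not the real schema.
parts = dfoutDB["url"].map(urlsplit)
dfoutDB["path_segments"] = parts.map(lambda p: p.path.strip("/").split("/"))
dfoutDB["query_params"] = parts.map(lambda p: parse_qs(p.query))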
Is there a way to verify that fast_executemany=True is actually doing anything?
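From what I've read, one way to check might be to hook SQLAlchemy's before_cursor_execute event and inspect the raw pyodbc cursor; a minimal sketch, assuming the engine defined above:

from sqlalchemy import event

@event.listens_for(engine, "before_cursor_execute")
def check_fast_executemany(conn, cursor, statement, parameters, context, executemany):
    # pyodbc exposes fast_executemany on the raw cursor; if the flag took
    # effect, it should print True for the bulk INSERTs issued by to_sql.
    if executemany:
        print("fast_executemany:", cursor.fast_executemany)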
Any other thoughts on diagnosing what is causing this to be slow?
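One experiment I'm considering is timing the same write with and without fast_executemany against a throwaway table (the _test table name here is just illustrative):

import time

# Same connection_url as above; the _test table name is made up for the test.
engine_fast = create_engine(connection_url, fast_executemany=True)
engine_plain = create_engine(connection_url, fast_executemany=False)

sample = dfoutDB.head(10000)
for label, eng in [("fast_executemany=True", engine_fast),
                   ("fast_executemany=False", engine_plain)]:
    start = time.perf_counter()
    sample.to_sql("ResourceLogRpt_test", schema="reporting", con=eng,
                  if_exists="replace", index=False)
    print(label, round(time.perf_counter() - start, 2), "s")

If the two timings come out roughly the same, I'd take that to mean the fast path isn't being used and the bottleneck is elsewhere.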
My cluster is 56 GB and 8 cores (DBR 1.4 with Scala 3.2.1), which I would have thought was enough, as all other DataFrame manipulation is very fast. But if there's an impact I need to understand, I'd appreciate being told what I've misunderstood.
Thanks
Peter