Hi @twbde
This is a genuinely tricky problem. Here's the diagnosis and the best available workarounds:
Root Cause: useLargeVarTypes Is Not Wired Into transformWithStateInPandas
Your instinct is correct: the spark.sql.execution.arrow.useLargeVarTypes config is not respected by the transformWithStateInPandas serializer path. In Spark's Arrow infrastructure, useLargeVarTypes is plumbed through toPandas(), createDataFrame(), and the general Pandas UDF serializers, but transformWithStateInPandas has its own dedicated serializer (TransformWithStateInPandasSerializer), and the largeVarTypes flag is simply not passed through to its ArrowWriter.create(...) call.
The result: Arrow still uses the standard VarCharVector / VarBinaryVector, which have a hard 2GB cap per column per batch. Even maxRecordsPerBatch=1 won't help if a single file's content already exceeds 2GB, or if the cumulative buffer for the column within one batch hits that limit through Arrow's internal resizing logic.
Workarounds (in order of preference)
1. Pre-chunk large file content in your input stream (most reliable)
Before the data reaches transformWithStateInPandas, break the file-content column into chunks smaller than, say, 500MB each, tagging each chunk with a sequence number and the grouping key. Then reassemble in your StatefulProcessor's handleInputRows using a ListState or MapState keyed by chunk_seq, and only emit once all chunks have arrived.
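The chunk/reassemble bookkeeping can be sketched in plain Python, independent of Spark. `chunk_content` and `reassemble` are hypothetical helper names for illustration; in a real pipeline the chunking would happen in a transformation before transformWithStateInPandas, and the reassembly inside handleInputRows against your state store:

```python
# 500MB cap per chunk, comfortably below Arrow's 2GB per-column limit.
CHUNK_SIZE = 500 * 1024 * 1024


def chunk_content(key, content, chunk_size=CHUNK_SIZE):
    """Split one file's bytes into (key, chunk_seq, total_chunks, chunk) records."""
    total = max(1, -(-len(content) // chunk_size))  # ceiling division
    return [
        (key, seq, total, content[seq * chunk_size:(seq + 1) * chunk_size])
        for seq in range(total)
    ]


def reassemble(chunks):
    """Rebuild the original bytes once every chunk for a key has arrived.

    `chunks` mimics what you would accumulate in ListState/MapState keyed
    by chunk_seq. Returns None while chunks are still outstanding.
    """
    by_seq = {seq: data for (_key, seq, _total, data) in chunks}
    total = chunks[0][2]
    if len(by_seq) < total:
        return None  # still waiting; keep accumulating in state
    return b"".join(by_seq[seq] for seq in range(total))
```

Because chunks from a stream may arrive out of order, the reassembly joins them by sequence number rather than arrival order.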
2. Use BinaryType instead of StringType for the file content column
If your column is typed as StringType, Arrow uses VarCharVector; if you cast it to BinaryType, Arrow uses VarBinaryVector. The 2GB wall is the same, but the buffer growth patterns sometimes differ, so you can occasionally get slightly further. More importantly, BinaryType makes chunking and reassembly more natural for binary file content.
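One reason binary content chunks more naturally: a byte-level chunk boundary can fall in the middle of a multi-byte UTF-8 codepoint, so individual chunks of text may not be independently decodable, but reassembled bytes always decode exactly. A small illustration:

```python
text = "héllo" * 3           # "é" is 2 bytes in UTF-8
data = text.encode("utf-8")  # 18 bytes total

# Chunk at an arbitrary byte boundary; some chunks may split "é" in half
# and therefore are not valid UTF-8 on their own.
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

# Joining the raw bytes first, then decoding, recovers the text exactly.
roundtrip = b"".join(chunks).decode("utf-8")
```

This is why chunking the BinaryType representation and decoding only after reassembly is the safer pattern for text content as well.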
LR