<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pythonexception-runtimeerror-the-length-of-output-in-scalar/m-p/11274#M6280</link>
    <description>&lt;P&gt;@Kaniz Fatma Can you please help me with pandas_udf?&lt;/P&gt;&lt;P&gt;In the scenario above I used regular expressions, for which Spark has its own method, but I have other pandas_udf functions with the same issue.&lt;/P&gt;</description>
    <pubDate>Mon, 23 Jan 2023 01:33:17 GMT</pubDate>
    <dc:creator>Ancil</dc:creator>
    <dc:date>2023-01-23T01:33:17Z</dc:date>
    <item>
      <title>PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.</title>
      <link>https://community.databricks.com/t5/data-engineering/pythonexception-runtimeerror-the-length-of-output-in-scalar/m-p/11273#M6279</link>
      <description>&lt;P&gt;I have a pandas_udf that works for 4 rows, but when I try it with more than 4 rows I get the error below.&lt;/P&gt;&lt;P&gt;PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.&lt;/P&gt;&lt;P&gt;Please find the code below.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;data =[{"inputData":"&amp;lt;html&amp;gt;Tanuj is older than Eina. Chetan is older than Tanuj. Eina is older than Chetan. If the first 2 statements are true, the 3rd statement is"},{"inputData":"&amp;lt;html&amp;gt;Pens cost more than pencils. Pens cost less than eraser. Erasers cost more than pencils and pens. If the first two statements are true, the third statement is"},{"inputData":"&amp;lt;html&amp;gt;If we have a tree of n nodes, how many edges will it have?"}, {"inputData":"&amp;lt;div&amp;gt;Which of the following data structures can handle updates and queries in log(n) time on an array?"}]&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;df = spark.createDataFrame(data)&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;import re
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# removing HTML tags from the input text
@pandas_udf(StringType())
def clean_html(raw_htmls: Iterator[pd.Series]) -&amp;gt; Iterator[pd.Series]:
    pd.set_option('display.max_colwidth', 10000)
    for raw_html in raw_htmls:
        cleanr_regx = re.compile("&amp;lt;.*?&amp;gt;|&amp;amp;([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
        cleantext = re.sub(cleanr_regx, " ", raw_html.to_string(index=False))
        cleantext = re.sub(" +", " ", cleantext)
        yield pd.Series(cleantext)&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;df = df.withColumn("Question",clean_html("inputData"))
display(df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This works fine. But if I add one more row to data, I get the error mentioned above.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;data =[{"inputData":"&amp;lt;div&amp;gt;Look at this series: 36, 34, 30, 28, 24, … What number should come next?"},{"inputData":"&amp;lt;html&amp;gt;Tanuj is older than Eina. Chetan is older than Tanuj. Eina is older than Chetan. If the first 2 statements are true, the 3rd statement is"},{"inputData":"&amp;lt;html&amp;gt;Pens cost more than pencils. Pens cost less than eraser. Erasers cost more than pencils and pens. If the first two statements are true, the third statement is"},{"inputData":"&amp;lt;html&amp;gt;If we have a tree of n nodes, how many edges will it have?"}, {"inputData":"&amp;lt;div&amp;gt;Which of the following data structures can handle updates and queries in log(n) time on an array?"}]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;In my project I am reading data from a JSON file and the same issue occurs there: with 1 row it works, but with more than 1 row I get the same error.&lt;/P&gt;&lt;P&gt;Can anyone please help me? I have been stuck on this error for a week.&lt;/P&gt;&lt;P&gt;Cluster : 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jan 2023 18:46:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pythonexception-runtimeerror-the-length-of-output-in-scalar/m-p/11273#M6279</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2023-01-18T18:46:57Z</dc:date>
    </item>
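    <!--
A possible reading of the error in the question above: raw_html.to_string(index=False) joins a whole batch into a single string, so each batch yields a one-row Series no matter how many rows came in. The sketch below is not the poster's code; it assumes the goal is one cleaned string per input row and uses pandas' element-wise Series.str.replace so every output batch keeps the input length. The names clean_html_batches and TAG_OR_ENTITY are hypothetical, the trailing strip() is an added convenience, and the check runs on plain pandas without Spark.

```python
import re
from typing import Iterator

import pandas as pd

# Tags such as <div>, plus named, decimal, and hex character entities.
TAG_OR_ENTITY = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")


def clean_html_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Yield one cleaned Series per input batch, with the same length."""
    for batch in batches:
        # Series.str methods work element-wise, so N rows in give N rows out.
        cleaned = batch.str.replace(TAG_OR_ENTITY, " ", regex=True)
        # Collapse runs of spaces left behind by the removed markup.
        cleaned = cleaned.str.replace(r" +", " ", regex=True).str.strip()
        yield cleaned


# Plain-pandas check with two batches of different sizes.
batches = [
    pd.Series(["<div>one</div>", "<html>two &amp; three"]),
    pd.Series(["<p>four</p>", "five", "<b>six</b>"]),
]
results = list(clean_html_batches(iter(batches)))
print([len(r) for r in results])
```

On a cluster, the same generator body could presumably be decorated with @pandas_udf(StringType()) as in the question; that wiring is left out here so the sketch stays runnable without Spark.
    -->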
    <item>
      <title>Re: PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's; however, the length of output was 1 and the length of input was 2.'.</title>
      <link>https://community.databricks.com/t5/data-engineering/pythonexception-runtimeerror-the-length-of-output-in-scalar/m-p/11274#M6280</link>
      <description>&lt;P&gt;@Kaniz Fatma Can you please help me with pandas_udf?&lt;/P&gt;&lt;P&gt;In the scenario above I used regular expressions, for which Spark has its own method, but I have other pandas_udf functions with the same issue.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Jan 2023 01:33:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pythonexception-runtimeerror-the-length-of-output-in-scalar/m-p/11274#M6280</guid>
      <dc:creator>Ancil</dc:creator>
      <dc:date>2023-01-23T01:33:17Z</dc:date>
    </item>
  </channel>
</rss>

