Class ChangelogIterator

  • All Implemented Interfaces:
    java.util.Iterator<org.apache.spark.sql.Row>

    public class ChangelogIterator
    extends java.lang.Object
    implements java.util.Iterator<org.apache.spark.sql.Row>
    An iterator that transforms rows from changelog tables within a single Spark task. It assumes that rows are sorted by identifier columns and change type.

    It removes the carry-over rows. Carry-over rows are the result of a removal and insertion of the same row within an operation because of the copy-on-write mechanism. For example, given a file which contains row1 (id=1, data='a') and row2 (id=2, data='b'). A copy-on-write delete of row2 would require erasing this file and preserving row1 in a new file. The change-log table would report this as (id=1, data='a', op='DELETE') and (id=1, data='a', op='INSERT'), despite it not being an actual change to the table. The iterator finds the carry-over rows and removes them from the result.

    This iterator also finds delete/insert rows which represent an update, and converts them into update records. For example, these two rows

    • (id=1, data='a', op='DELETE')
    • (id=1, data='b', op='INSERT')

    will be marked as update-rows:

    • (id=1, data='a', op='UPDATE_BEFORE')
    • (id=1, data='b', op='UPDATE_AFTER')
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      static java.util.Iterator<org.apache.spark.sql.Row> create​(java.util.Iterator<org.apache.spark.sql.Row> rowIterator, org.apache.spark.sql.types.StructType rowType, java.lang.String[] identifierFields)
      Creates an iterator for records of a changelog table.
      boolean hasNext()  
      org.apache.spark.sql.Row next()  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • Methods inherited from interface java.util.Iterator

        forEachRemaining, remove
    • Method Detail

      • create

        public static java.util.Iterator<org.apache.spark.sql.Row> create​(java.util.Iterator<org.apache.spark.sql.Row> rowIterator,
                                                                          org.apache.spark.sql.types.StructType rowType,
                                                                          java.lang.String[] identifierFields)
        Creates an iterator for records of a changelog table.
        Parameters:
        rowIterator - the iterator of rows from a changelog table
        rowType - the schema of the rows
        identifierFields - the names of the identifier columns, which determine if rows are the same
        Returns:
        a new ChangelogIterator instance concatenated with the null-removal iterator
      • hasNext

        public boolean hasNext()
        Specified by:
        hasNext in interface java.util.Iterator<org.apache.spark.sql.Row>
      • next

        public org.apache.spark.sql.Row next()
        Specified by:
        next in interface java.util.Iterator<org.apache.spark.sql.Row>