KeyValueGroupedDataset

abstract class KeyValueGroupedDataset[K, V] extends Serializable

A Dataset that has been logically grouped by a user-specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupByKey on an existing Dataset.

Source: KeyValueGroupedDataset.scala
Since: 2.0.0
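A minimal sketch of the construction path described above, assuming a local SparkSession. The `Order` case class, the sample rows, and the `GroupByKeyExample` object are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{Dataset, KeyValueGroupedDataset, SparkSession}

object GroupByKeyExample {
  // Hypothetical record type, not part of the Spark API.
  final case class Order(customer: String, amount: Long)

  def run(spark: SparkSession): Map[String, Long] = {
    import spark.implicits._
    val orders: Dataset[Order] =
      Seq(Order("a", 10L), Order("b", 5L), Order("a", 7L)).toDS()
    // groupByKey is the supported way to obtain a KeyValueGroupedDataset.
    val grouped: KeyValueGroupedDataset[String, Order] = orders.groupByKey(_.customer)
    // count() yields one (key, numberOfItems) pair per unique key.
    grouped.count().collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```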

Linear Supertypes

Serializable, AnyRef, Any

Instance Constructors

new KeyValueGroupedDataset()

Abstract Value Members

abstract def aggUntyped(columns: TypedColumn[_, _]*): Dataset[_]
Internal helper function for building typed aggregations that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.
Attributes
protected
abstract def cogroupSorted[U, R](other: KeyValueGroupedDataset[K, U])(thisSortExprs: Column*)(otherSortExprs: Column*)(f: (K, Iterator[V], Iterator[U]) => IterableOnce[R])(implicit arg0: Encoder[R]): Dataset[R]
(Scala-specific) Applies the given function to each sorted cogrouped data. For each unique group, the function will be passed the grouping key and 2 sorted iterators containing all elements in the group from Dataset this and other. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This is equivalent to KeyValueGroupedDataset#cogroup, except that the iterators are sorted according to the given sort expressions. That sorting does not add computational complexity.
Since
3.4.0
See also
org.apache.spark.sql.api.KeyValueGroupedDataset#cogroup
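The following sketch illustrates the Scala cogroupSorted variant, assuming Spark 3.4.0 or later and a local SparkSession. The tuple data and the `CogroupSortedExample` object are hypothetical; the sort expressions refer to the default tuple column name `_2`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CogroupSortedExample {
  def run(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val clicks = Seq(("u1", 3), ("u1", 1), ("u2", 2)).toDS().groupByKey(_._1)
    val views = Seq(("u1", 20), ("u1", 10)).toDS().groupByKey(_._1)
    // Left iterator sorted ascending by _2, right iterator sorted descending by _2.
    clicks.cogroupSorted(views)(col("_2"))(col("_2").desc) { (user, cs, vs) =>
      val c = cs.map(_._2).mkString(",")
      val v = vs.map(_._2).mkString(",")
      Iterator(s"$user clicks=[$c] views=[$v]")
    }.collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
    try run(spark).foreach(println) finally spark.stop()
  }
}
```

A key that appears on only one side (here `u2`) still reaches the function, with an empty iterator for the other side.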
abstract def flatMapGroupsWithState[S, U](outputMode: OutputMode, timeoutConf: GroupStateTimeout, initialState: KeyValueGroupedDataset[K, S])(func: (K, Iterator[V], GroupState[S]) => Iterator[U])(implicit arg0: Encoder[S], arg1: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
outputMode
The output mode of the function.
timeoutConf
Timeout configuration for groups that do not receive data for a while.
initialState
The user-provided state that will be initialized when the first batch of data is processed in the streaming query. The user-defined function will be called on the state data even if there are no other values in the group. To convert a Dataset ds of type Dataset[(K, S)] to a KeyValueGroupedDataset[K, S], use
```
ds.groupByKey(x => x._1).mapValues(_._2)
```
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
func
Function to be called on every group.
Since
3.2.0
abstract def flatMapGroupsWithState[S, U](outputMode: OutputMode, timeoutConf: GroupStateTimeout)(func: (K, Iterator[V], GroupState[S]) => Iterator[U])(implicit arg0: Encoder[S], arg1: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
outputMode
The output mode of the function.
timeoutConf
Timeout configuration for groups that do not receive data for a while. See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
func
Function to be called on every group.
Since
2.2.0
abstract def flatMapSortedGroups[U](sortExprs: Column*)(f: (K, Iterator[V]) => IterableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and a sorted iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
This is equivalent to KeyValueGroupedDataset#flatMapGroups, except that the iterator is sorted according to the given sort expressions. That sorting does not add computational complexity.
Since
3.4.0
See also
org.apache.spark.sql.api.KeyValueGroupedDataset#flatMapGroups
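As a rough illustration of the sorted iterator, the sketch below keeps the two earliest events per user without materializing the whole group. It assumes Spark 3.4.0 or later and a local SparkSession; the `Event` case class and `FlatMapSortedGroupsExample` object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FlatMapSortedGroupsExample {
  // Hypothetical record type, not part of the Spark API.
  final case class Event(user: String, ts: Long)

  def run(spark: SparkSession): Seq[(String, Long)] = {
    import spark.implicits._
    val events =
      Seq(Event("u1", 30L), Event("u1", 10L), Event("u2", 5L), Event("u1", 25L)).toDS()
    events.groupByKey(_.user)
      // Each group's iterator arrives sorted by ts, so take(2) reads only a prefix.
      .flatMapSortedGroups(col("ts")) { (user, it) =>
        it.take(2).map(e => (user, e.ts))
      }
      .collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
    try run(spark).foreach(println) finally spark.stop()
  }
}
```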
abstract def keyAs[L](implicit arg0: Encoder[L]): KeyValueGroupedDataset[L, V]
Returns a new KeyValueGroupedDataset where the type of the key has been mapped to the specified type. The mapping of key columns to the type follows the same rules as the as method on Dataset.
Since
1.6.0
abstract def keys: Dataset[K]
Returns a Dataset that contains each unique key. This is equivalent to mapping over the Dataset to extract the keys and then running a distinct operation on them.
Since
1.6.0
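The equivalence stated above can be sketched as follows, assuming a local SparkSession. The sample strings, the first-letter key function, and the `KeysExample` object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object KeysExample {
  def run(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._
    val fruits = Seq("apple", "avocado", "banana").toDS()
    // Unique keys taken from the grouped Dataset.
    val viaKeys = fruits.groupByKey(_.substring(0, 1)).keys.collect().toSeq.sorted
    // The same result built by hand: map to the key, then distinct.
    val viaDistinct = fruits.map(_.substring(0, 1)).distinct().collect().toSeq.sorted
    (viaKeys, viaDistinct)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```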
abstract def mapGroupsWithState[S, U](timeoutConf: GroupStateTimeout, initialState: KeyValueGroupedDataset[K, S])(func: (K, Iterator[V], GroupState[S]) => U)(implicit arg0: Encoder[S], arg1: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See org.apache.spark.sql.streaming.GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
timeoutConf
Timeout configuration; see GroupStateTimeout for more details.
initialState
The user provided state that will be initialized when the first batch of data is processed in the streaming query. The user defined function will be called on the state data even if there are no other values in the group. To convert a Dataset ds of type Dataset[(K, S)] to a KeyValueGroupedDataset[K, S] do
```
ds.groupByKey(x => x._1).mapValues(_._2)
```
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
func
Function to be called on every group.
Since
3.2.0
abstract def mapGroupsWithState[S, U](timeoutConf: GroupStateTimeout)(func: (K, Iterator[V], GroupState[S]) => U)(implicit arg0: Encoder[S], arg1: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See org.apache.spark.sql.streaming.GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
timeoutConf
Timeout configuration for groups that do not receive data for a while. See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
func
Function to be called on every group.
Since
2.2.0
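A minimal sketch of the static batch case described above, where the function is invoked once per group. It assumes a local SparkSession and uses GroupStateTimeout.NoTimeout(); the `countWords` function and `MapGroupsWithStateExample` object are hypothetical. On a streaming Dataset the same function would see the state saved by earlier triggers.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

object MapGroupsWithStateExample {
  // Running count per key; the state holds the count seen so far.
  val countWords: (String, Iterator[String], GroupState[Long]) => (String, Long) =
    (word, occurrences, state) => {
      // In a batch query no earlier state exists, so getOption is empty.
      val total = state.getOption.getOrElse(0L) + occurrences.size
      state.update(total)
      (word, total)
    }

  def run(spark: SparkSession): Map[String, Long] = {
    import spark.implicits._
    val words = Seq("a", "b", "a").toDS()
    words.groupByKey(w => w)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout())(countWords)
      .collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```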
abstract def mapGroupsWithState[S, U](func: (K, Iterator[V], GroupState[S]) => U)(implicit arg0: Encoder[S], arg1: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See org.apache.spark.sql.streaming.GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group. See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
Since
2.2.0
abstract def mapValues[W](func: (V) => W)(implicit arg0: Encoder[W]): KeyValueGroupedDataset[K, W]
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data. The grouping key is unchanged by this.
```
// Create values grouped by key from a Dataset[(K, V)]
ds.groupByKey(_._1).mapValues(_._2) // Scala
```
Since
2.1.0
abstract def reduceGroups(f: (V, V) => V): Dataset[(K, V)]
(Scala-specific) Reduces the elements of each group of data using the specified binary function. The given function must be commutative and associative or the result may be non-deterministic.
Since
1.6.0
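As a sketch of a commutative and associative reduction, the example below sums values per key, assuming a local SparkSession. The sample tuples and the `ReduceGroupsExample` object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ReduceGroupsExample {
  def run(spark: SparkSession): Map[String, Int] = {
    import spark.implicits._
    val sales = Seq(("a", 10), ("b", 5), ("a", 7)).toDS()
    sales.groupByKey(_._1)
      .mapValues(_._2)                          // keep only the amount as the value
      .reduceGroups((x: Int, y: Int) => x + y)  // addition is commutative and associative
      .collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

A non-commutative function such as subtraction could give different results from run to run, because the order in which elements are combined is not fixed.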
abstract def transformWithState[U, S](statefulProcessor: StatefulProcessorWithInitialState[K, V, U, S], eventTimeColumnName: String, outputMode: OutputMode, initialState: KeyValueGroupedDataset[K, S])(implicit arg0: Encoder[U], arg1: Encoder[S]): Dataset[U]
(Scala-specific) Invokes methods defined in the stateful processor used in arbitrary state API v2. Behaves like the transformWithState overload that takes a statefulProcessor, timeMode, outputMode, and initialState, except that it takes an eventTimeColumnName for the output instead of a timeMode.
U
The type of the output objects. Must be encodable to Spark SQL types.
S
The type of initial state objects. Must be encodable to Spark SQL types. Downstream operators use the specified eventTimeColumnName to calculate the watermark. Note that TimeMode is set to EventTime to ensure correct flow of the watermark.
statefulProcessor
Instance of statefulProcessor whose functions will be invoked by the operator.
eventTimeColumnName
The eventTime column in the output dataset. Any operations after transformWithState will use the new eventTimeColumn. The user needs to ensure that the eventTime for emitted output adheres to the watermark boundary; otherwise the streaming query will fail.
outputMode
The output mode of the stateful processor.
initialState
User provided initial state that will be used to initiate state for the query in the first batch. See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
abstract def transformWithState[U, S](statefulProcessor: StatefulProcessorWithInitialState[K, V, U, S], timeMode: TimeMode, outputMode: OutputMode, initialState: KeyValueGroupedDataset[K, S])(implicit arg0: Encoder[U], arg1: Encoder[S]): Dataset[U]
(Scala-specific) Invokes methods defined in the stateful processor used in arbitrary state API v2. Behaves like the transformWithState overload that takes a statefulProcessor, timeMode, and outputMode, but with an additional initial state.
U
The type of the output objects. Must be encodable to Spark SQL types.
S
The type of initial state objects. Must be encodable to Spark SQL types.
statefulProcessor
Instance of statefulProcessor whose functions will be invoked by the operator.
timeMode
The time mode semantics of the stateful processor for timers and TTL.
outputMode
The output mode of the stateful processor.
initialState
User provided initial state that will be used to initiate state for the query in the first batch. See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
abstract def transformWithState[U](statefulProcessor: StatefulProcessor[K, V, U], eventTimeColumnName: String, outputMode: OutputMode)(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Invokes methods defined in the stateful processor used in arbitrary state API v2. The user can act on a per-group set of input rows along with keyed state, and can choose to output zero or more rows. For a streaming DataFrame, the interface methods are invoked repeatedly for new rows in each trigger, and the user's state variables are stored persistently across invocations.
Downstream operators use the specified eventTimeColumnName to calculate the watermark. Note that TimeMode is set to EventTime to ensure correct flow of the watermark.
U
The type of the output objects. Must be encodable to Spark SQL types.
statefulProcessor
Instance of statefulProcessor whose functions will be invoked by the operator.
eventTimeColumnName
The eventTime column in the output dataset. Any operations after transformWithState will use the new eventTimeColumn. The user needs to ensure that the eventTime for emitted output adheres to the watermark boundary; otherwise the streaming query will fail.
outputMode
The output mode of the stateful processor. See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
abstract def transformWithState[U](statefulProcessor: StatefulProcessor[K, V, U], timeMode: TimeMode, outputMode: OutputMode)(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Invokes methods defined in the stateful processor used in arbitrary state API v2. The user can act on a per-group set of input rows along with keyed state, and can choose to output zero or more rows. For a streaming DataFrame, the interface methods are invoked repeatedly for new rows in each trigger, and the user's state variables are stored persistently across invocations.
U
The type of the output objects. Must be encodable to Spark SQL types.
statefulProcessor
Instance of statefulProcessor whose functions will be invoked by the operator.
timeMode
The time mode semantics of the stateful processor for timers and TTL.
outputMode
The output mode of the stateful processor. See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.

Concrete Value Members

final def !=(arg0: Any): Boolean
Definition Classes
AnyRef → Any
final def ##: Int
Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean
Definition Classes
AnyRef → Any
def agg[U1, U2, U3, U4, U5, U6, U7, U8](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4], col5: TypedColumn[V, U5], col6: TypedColumn[V, U6], col7: TypedColumn[V, U7], col8: TypedColumn[V, U8]): Dataset[(K, U1, U2, U3, U4, U5, U6, U7, U8)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Since
3.0.0
def agg[U1, U2, U3, U4, U5, U6, U7](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4], col5: TypedColumn[V, U5], col6: TypedColumn[V, U6], col7: TypedColumn[V, U7]): Dataset[(K, U1, U2, U3, U4, U5, U6, U7)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Since
3.0.0
def agg[U1, U2, U3, U4, U5, U6](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4], col5: TypedColumn[V, U5], col6: TypedColumn[V, U6]): Dataset[(K, U1, U2, U3, U4, U5, U6)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Since
3.0.0
def agg[U1, U2, U3, U4, U5](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4], col5: TypedColumn[V, U5]): Dataset[(K, U1, U2, U3, U4, U5)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Since
3.0.0
def agg[U1, U2, U3, U4](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4]): Dataset[(K, U1, U2, U3, U4)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Since
1.6.0
def agg[U1, U2, U3](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3]): Dataset[(K, U1, U2, U3)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Since
1.6.0
def agg[U1, U2](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2]): Dataset[(K, U1, U2)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Since
1.6.0
def agg[U1](col1: TypedColumn[V, U1]): Dataset[(K, U1)]
Computes the given aggregation, returning a Dataset of tuples for each unique key and the result of computing this aggregation over all elements in the group.
Since
1.6.0
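One way to obtain the TypedColumn arguments that agg expects is Aggregator.toColumn. The sketch below computes a sum and a maximum per key, assuming a local SparkSession; the `SumAmount` and `MaxAmount` aggregators and the `TypedAggExample` object are hypothetical and written only for this illustration.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

object TypedAggExample {
  // Hypothetical aggregator: sums the Int in the second tuple slot.
  object SumAmount extends Aggregator[(String, Int), Long, Long] {
    def zero: Long = 0L
    def reduce(acc: Long, row: (String, Int)): Long = acc + row._2
    def merge(a: Long, b: Long): Long = a + b
    def finish(acc: Long): Long = acc
    def bufferEncoder: Encoder[Long] = Encoders.scalaLong
    def outputEncoder: Encoder[Long] = Encoders.scalaLong
  }

  // Hypothetical aggregator: largest Int in the second tuple slot.
  object MaxAmount extends Aggregator[(String, Int), Int, Int] {
    def zero: Int = Int.MinValue
    def reduce(acc: Int, row: (String, Int)): Int = math.max(acc, row._2)
    def merge(a: Int, b: Int): Int = math.max(a, b)
    def finish(acc: Int): Int = acc
    def bufferEncoder: Encoder[Int] = Encoders.scalaInt
    def outputEncoder: Encoder[Int] = Encoders.scalaInt
  }

  def run(spark: SparkSession): Seq[(String, Long, Int)] = {
    import spark.implicits._
    val sales = Seq(("a", 10), ("b", 5), ("a", 7)).toDS()
    // Two typed columns give a Dataset[(K, U1, U2)].
    sales.groupByKey(_._1)
      .agg(SumAmount.toColumn, MaxAmount.toColumn)
      .collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
    try run(spark).foreach(println) finally spark.stop()
  }
}
```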
final def asInstanceOf[T0]: T0
Definition Classes
Any
def clone(): AnyRef
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
def cogroup[U, R](other: KeyValueGroupedDataset[K, U], f: CoGroupFunction[K, V, U, R], encoder: Encoder[R]): Dataset[R]
(Java-specific) Applies the given function to each cogrouped data. For each unique group, the function will be passed the grouping key and 2 iterators containing all elements in the group from Dataset this and other. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
Since
1.6.0
def cogroup[U, R](other: KeyValueGroupedDataset[K, U])(f: (K, Iterator[V], Iterator[U]) => IterableOnce[R])(implicit arg0: Encoder[R]): Dataset[R]
(Scala-specific) Applies the given function to each cogrouped data. For each unique group, the function will be passed the grouping key and 2 iterators containing all elements in the group from Dataset this and other. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
Since
1.6.0
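A short sketch of the Scala cogroup variant, assuming a local SparkSession. It reports, for every key found on either side, how many elements each side contributed; the sample data and the `CogroupExample` object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object CogroupExample {
  def run(spark: SparkSession): Seq[(String, Int, Int)] = {
    import spark.implicits._
    val left = Seq(("a", 1), ("a", 2), ("b", 3)).toDS().groupByKey(_._1)
    val right = Seq(("a", "x"), ("c", "y")).toDS().groupByKey(_._1)
    // The function runs once per key present in either grouped Dataset.
    left.cogroup(right) { (key, ls, rs) =>
      Iterator((key, ls.size, rs.size))
    }.collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
    try run(spark).foreach(println) finally spark.stop()
  }
}
```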
def cogroupSorted[U, R](other: KeyValueGroupedDataset[K, U], thisSortExprs: Array[Column], otherSortExprs: Array[Column], f: CoGroupFunction[K, V, U, R], encoder: Encoder[R]): Dataset[R]
(Java-specific) Applies the given function to each sorted cogrouped data. For each unique group, the function will be passed the grouping key and 2 sorted iterators containing all elements in the group from Dataset this and other. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This is equivalent to KeyValueGroupedDataset#cogroup, except that the iterators are sorted according to the given sort expressions. That sorting does not add computational complexity.
Since
3.4.0
See also
org.apache.spark.sql.api.KeyValueGroupedDataset#cogroup
def count(): Dataset[(K, Long)]
Returns a Dataset that contains a tuple with each key and the number of items present for that key.
Since
1.6.0
final def eq(arg0: AnyRef): Boolean
Definition Classes
AnyRef
def equals(arg0: AnyRef): Boolean
Definition Classes
AnyRef → Any
def flatMapGroups[U](f: FlatMapGroupsFunction[K, V, U], encoder: Encoder[U]): Dataset[U]
(Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
Since
1.6.0
def flatMapGroups[U](f: (K, Iterator[V]) => IterableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
Since
1.6.0
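The sketch below shows the Scala flatMapGroups variant, assuming a local SparkSession. It transforms the group iterator lazily with map instead of calling toList, in line with the memory caveat above; the sample data and the `FlatMapGroupsExample` object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object FlatMapGroupsExample {
  def run(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
    ds.groupByKey(_._1)
      // Emits one output element per input element, without materializing the group.
      .flatMapGroups { (key, values) => values.map(v => s"$key:${v._2}") }
      .collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
    try run(spark).foreach(println) finally spark.stop()
  }
}
```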
def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], outputMode: OutputMode, stateEncoder: Encoder[S], outputEncoder: Encoder[U], timeoutConf: GroupStateTimeout, initialState: KeyValueGroupedDataset[K, S]): Dataset[U]
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group.
outputMode
The output mode of the function.
stateEncoder
Encoder for the state type.
outputEncoder
Encoder for the output type.
timeoutConf
Timeout configuration for groups that do not receive data for a while.
initialState
The user-provided state that will be initialized when the first batch of data is processed in the streaming query. The user-defined function will be called on the state data even if there are no other values in the group. To convert a Dataset ds of type Dataset[(K, S)] to a KeyValueGroupedDataset[K, S], use
```
ds.groupByKey(x => x._1).mapValues(_._2)
```
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
Since
3.2.0
def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], outputMode: OutputMode, stateEncoder: Encoder[S], outputEncoder: Encoder[U], timeoutConf: GroupStateTimeout): Dataset[U]
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group.
outputMode
The output mode of the function.
stateEncoder
Encoder for the state type.
outputEncoder
Encoder for the output type.
timeoutConf
Timeout configuration for groups that do not receive data for a while. See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
Since
2.2.0
def flatMapSortedGroups[U](SortExprs: Array[Column], f: FlatMapGroupsFunction[K, V, U], encoder: Encoder[U]): Dataset[U]
(Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and a sorted iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
This is equivalent to KeyValueGroupedDataset#flatMapGroups, except that the iterator is sorted according to the given sort expressions. That sorting does not add computational complexity.
Since
3.4.0
See also
org.apache.spark.sql.api.KeyValueGroupedDataset#flatMapGroups
final def getClass(): Class[_ <: AnyRef]
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
def hashCode(): Int
Definition Classes
AnyRef → Any
Annotations
@IntrinsicCandidate() @native()
final def isInstanceOf[T0]: Boolean
Definition Classes
Any
def mapGroups[U](f: MapGroupsFunction[K, V, U], encoder: Encoder[U]): Dataset[U]
(Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function returns one element of an arbitrary type per group, and these elements are returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
Since
1.6.0
def mapGroups[U](f: (K, Iterator[V]) => U)(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function returns one element of an arbitrary type per group, and these elements are returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
Since
1.6.0
def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], outputEncoder: Encoder[U], timeoutConf: GroupStateTimeout, initialState: KeyValueGroupedDataset[K, S]): Dataset[U]
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group.
stateEncoder
Encoder for the state type.
outputEncoder
Encoder for the output type.
timeoutConf
Timeout configuration for groups that do not receive data for a while.
initialState
The user-provided state that will be initialized when the first batch of data is processed in the streaming query. The user-defined function will be called on the state data even if there are no other values in the group.
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
Since
3.2.0
def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], outputEncoder: Encoder[U], timeoutConf: GroupStateTimeout): Dataset[U]
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group.
stateEncoder
Encoder for the state type.
outputEncoder
Encoder for the output type.
timeoutConf
Timeout configuration for groups that do not receive data for a while.
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
Since
2.2.0
def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], outputEncoder: Encoder[U]): Dataset[U]
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group.
stateEncoder
Encoder for the state type.
outputEncoder
Encoder for the output type.
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
Since
2.2.0
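As a rough illustration of the per-group state semantics described for mapGroupsWithState, the following plain-Java sketch models a streaming query that keeps a running count per key. It does not call Spark: StatefulCountSketch, processTrigger and stateByKey are hypothetical names, and the in-memory map stands in for the state that Spark saves across invocations. Timeouts and initial state are not modeled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StatefulCountSketch {
    // Stand-in for the per-group state that is saved between triggers.
    private final Map<String, Long> stateByKey = new HashMap<>();

    // Processes one trigger: for each key present in the batch, reads the saved
    // count, adds the number of new rows, saves it back, and emits the new total.
    // Keys without new rows in this trigger are not invoked and emit nothing.
    Map<String, Long> processTrigger(List<String> batchKeys) {
        Map<String, Long> newRowsPerKey = new TreeMap<>();
        for (String key : batchKeys) {
            newRowsPerKey.merge(key, 1L, Long::sum);
        }
        Map<String, Long> output = new TreeMap<>();
        for (Map.Entry<String, Long> group : newRowsPerKey.entrySet()) {
            long updated = stateByKey.getOrDefault(group.getKey(), 0L) + group.getValue();
            stateByKey.put(group.getKey(), updated);
            output.put(group.getKey(), updated);
        }
        return output;
    }

    public static void main(String[] args) {
        StatefulCountSketch query = new StatefulCountSketch();
        System.out.println(query.processTrigger(List.of("a", "b", "a"))); // {a=2, b=1}
        System.out.println(query.processTrigger(List.of("a")));           // {a=3}
    }
}
```

Running the same logic over a static batch corresponds to a single processTrigger call, which matches the "invoked once per group" behavior described above.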
def mapValues[W](func: MapFunction[V, W], encoder: Encoder[W]): KeyValueGroupedDataset[K, W]
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data. The grouping key is unchanged by this.
```
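// Create Integer values grouped by String key from a Dataset<Tuple2<String, Integer>>.
// The explicit MapFunction casts avoid ambiguity with the Scala-function overloads.
Dataset<Tuple2<String, Integer>> ds = ...;
KeyValueGroupedDataset<String, Integer> grouped = ds
  .groupByKey((MapFunction<Tuple2<String, Integer>, String>) t -> t._1(), Encoders.STRING())
  .mapValues((MapFunction<Tuple2<String, Integer>, Integer>) t -> t._2(), Encoders.INT());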
```
Since
2.1.0
final def ne(arg0: AnyRef): Boolean
Definition Classes
AnyRef
final def notify(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
final def notifyAll(): Unit
Definition Classes
AnyRef
Annotations
@IntrinsicCandidate() @native()
def reduceGroups(f: ReduceFunction[V]): Dataset[(K, V)]
(Java-specific) Reduces the elements of each group of data using the specified binary function. The given function must be commutative and associative or the result may be non-deterministic.
Since
1.6.0
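The requirement that the reduce function be commutative and associative can be seen in a small plain-Java sketch. It does not call Spark: ReduceOrderSketch and fold are hypothetical names, and the two lists stand in for two different orders in which the same group's values might be combined.

```java
import java.util.List;
import java.util.function.BinaryOperator;

class ReduceOrderSketch {
    // Left-to-right fold over a non-empty list, standing in for one possible
    // order in which the values of a group are combined.
    static int fold(List<Integer> values, BinaryOperator<Integer> f) {
        int acc = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            acc = f.apply(acc, values.get(i));
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Integer> oneOrder = List.of(1, 2, 3);
        List<Integer> anotherOrder = List.of(3, 1, 2); // same values, different order

        BinaryOperator<Integer> sum = Integer::sum;        // commutative and associative
        BinaryOperator<Integer> minus = (a, b) -> a - b;   // neither

        System.out.println(fold(oneOrder, sum) + " " + fold(anotherOrder, sum));     // 6 6
        System.out.println(fold(oneOrder, minus) + " " + fold(anotherOrder, minus)); // -4 0
    }
}
```

Addition gives the same answer for either order, while subtraction does not, which is the kind of non-determinism the note above warns about.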
final def synchronized[T0](arg0: => T0): T0
Definition Classes
AnyRef
def toString(): String
Definition Classes
AnyRef → Any
def transformWithState[U, S](statefulProcessor: StatefulProcessorWithInitialState[K, V, U, S], outputMode: OutputMode, initialState: KeyValueGroupedDataset[K, S], eventTimeColumnName: String, outputEncoder: Encoder[U], initialStateEncoder: Encoder[S])(implicit arg0: Encoder[U], arg1: Encoder[S]): Dataset[U]
(Java-specific) Invokes methods defined in the stateful processor used in arbitrary state API v2. Functions like the transformWithState overload that takes an initial state and a timeMode, but with an additional eventTimeColumnName for output.
Downstream operators use the specified eventTimeColumnName to calculate the watermark. Note that TimeMode is set to EventTime to ensure the correct flow of the watermark.
U
The type of the output objects. Must be encodable to Spark SQL types.
S
The type of initial state objects. Must be encodable to Spark SQL types.
statefulProcessor
Instance of statefulProcessor whose functions will be invoked by the operator.
outputMode
The output mode of the stateful processor.
initialState
User-provided initial state that will be used to initialize state for the query in the first batch.
eventTimeColumnName
eventTime column in the output dataset. Any operations after transformWithState will use the new eventTimeColumn. The user needs to ensure that the eventTime for emitted output adheres to the watermark boundary; otherwise, the streaming query will fail.
outputEncoder
Encoder for the output type.
initialStateEncoder
Encoder for the initial state type.
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
def transformWithState[U, S](statefulProcessor: StatefulProcessorWithInitialState[K, V, U, S], timeMode: TimeMode, outputMode: OutputMode, initialState: KeyValueGroupedDataset[K, S], outputEncoder: Encoder[U], initialStateEncoder: Encoder[S])(implicit arg0: Encoder[U], arg1: Encoder[S]): Dataset[U]
(Java-specific) Invokes methods defined in the stateful processor used in arbitrary state API v2. Functions like the corresponding Scala-specific overload, but with an additional initialStateEncoder for state encoding.
U
The type of the output objects. Must be encodable to Spark SQL types.
S
The type of initial state objects. Must be encodable to Spark SQL types.
statefulProcessor
Instance of statefulProcessor whose functions will be invoked by the operator.
timeMode
The time mode semantics of the stateful processor for timers and TTL.
outputMode
The output mode of the stateful processor.
initialState
User-provided initial state that will be used to initialize state for the query in the first batch.
outputEncoder
Encoder for the output type.
initialStateEncoder
Encoder for the initial state type.
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
def transformWithState[U](statefulProcessor: StatefulProcessor[K, V, U], eventTimeColumnName: String, outputMode: OutputMode, outputEncoder: Encoder[U])(implicit arg0: Encoder[U]): Dataset[U]
(Java-specific) Invokes methods defined in the stateful processor used in arbitrary state API v2. The user can act on each group's set of input rows along with its keyed state, and can choose to output zero or more rows.
For a streaming DataFrame, the interface methods are invoked repeatedly for new rows in each trigger, and the user's state variables are stored persistently across invocations.
Downstream operators use the specified eventTimeColumnName to calculate the watermark. Note that TimeMode is set to EventTime to ensure the correct flow of the watermark.
U
The type of the output objects. Must be encodable to Spark SQL types.
statefulProcessor
Instance of statefulProcessor whose functions will be invoked by the operator.
eventTimeColumnName
eventTime column in the output dataset. Any operations after transformWithState will use the new eventTimeColumn. The user needs to ensure that the eventTime for emitted output adheres to the watermark boundary; otherwise, the streaming query will fail.
outputMode
The output mode of the stateful processor.
outputEncoder
Encoder for the output type.
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
def transformWithState[U](statefulProcessor: StatefulProcessor[K, V, U], timeMode: TimeMode, outputMode: OutputMode, outputEncoder: Encoder[U])(implicit arg0: Encoder[U]): Dataset[U]
(Java-specific) Invokes methods defined in the stateful processor used in arbitrary state API v2. The user can act on each group's set of input rows along with its keyed state, and can choose to output zero or more rows. For a streaming DataFrame, the interface methods are invoked repeatedly for new rows in each trigger, and the user's state variables are stored persistently across invocations.
U
The type of the output objects. Must be encodable to Spark SQL types.
statefulProcessor
Instance of statefulProcessor whose functions will be invoked by the operator.
timeMode
The time mode semantics of the stateful processor for timers and TTL.
outputMode
The output mode of the stateful processor.
outputEncoder
Encoder for the output type.
See org.apache.spark.sql.Encoder for more details on what types are encodable to Spark SQL.
final def wait(arg0: Long, arg1: Int): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])
final def wait(arg0: Long): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException]) @native()
final def wait(): Unit
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.InterruptedException])

Deprecated Value Members

def finalize(): Unit
Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws(classOf[java.lang.Throwable]) @Deprecated
Deprecated
(Since version 9)
