[Apache Flink]

기타

[Apache Flink] - 개요

파커초 2024. 3. 16. 17:49

개요

공식문서: flink.apache.org

Flink is an open-source stream processing framework for distributed, high-performing, always-available, and accurate

Streaming first application
- Streaming-first
- Fault-tolerant
- Scalable
- Performance

What is stream processing

Receive, Process, React in Real Time!!

Data Processing 은 아래와 같이 두종류로 나뉨

지금 저장하고 나중에 분석 (Batch processing)
- Finite unchanged datasets (week, month, year)
- 과거로부터 특징을 얻기 위해 어떤 프로덕트/프로모션 특성이 잘 동작했는지 (Learn from the past)
- Take big picture decisions
- Periodic reports
- Analyze trends
- Compare slices

받고 프로세싱하고 실시간으로 반응 (Streaming Processing)
- Infinite datasets which are added to continuously (Unbounded Datasets)
- Runs constantly as long as data is received (Continuous processing)
- Monitor and track events
- Take fast, tactical decisions
- Process individual events

Requirements of a Stream processing

이벤트가 도착할때마다 Filter, Transform, combine 오퍼레이션이 필요함
메세지 전송 요구사항
- Beffer event data
- Performant and persistent
- Decoupling multiple sources from processing
스트리밍 아키텍처에서 데이터 일관성 보장
- Checkpointing/periodic backups to manage failures
- Deal with out or order events
- Replay streams when required
총 요구사항
- 고가용, 낮은 지연
- Fault tolerance with low overhead
- Manage out of order events
- Easy to use, maintainable
- Replay streams

Flink Architecture

Stream-first (Flink)
- process 1 event at a time
- batch processing as a special case of stream processing
- deals with out of order data (순서가 맞지 않는 데이터 처리)
- 한번만 실행되는 것을 보장 (Exactly once processing)
- Great control over Windowing
  - window 단위로 operation 가능
- Lightweight fault tolerance and checkpointing
- Distributed (분산 처리 가능)

Parallelism in Flink

Stream processing 의 단계는 아래와 같음

(DataSource → Transformation → Data Sink) [Streaming Processing App]

여러 이벤트가 Stream processing App 에 들어올때 Transformation 로직이 복잡하면 복잡할 수록 이벤트 결과가 늦게 생성되어 밀릴 수 있음. 이를 해결하기 위해 Flink 는 Parellelism 을 지원함

각 Task 들은 서로 다른 스레드에서 동시에 실행됨

JobManger(Leader and standby)

Central co-ordinating process
Schedule tasks

TaskManagers

Tasks get assigned to workers
Run sub-tasks on separate threads

Data Representation

DataStream
- Unbounded
Dataset
- Bounded
두 데이터 모두 transformed 는 가능하지만 immutable 함
- Java Tuples, Scala Case Classes
- Java POJOs
- Primitive Types
- Regular Classes
- Values
- Hadoop Writable

Flink Transformations

filter
map
flatMap
reduce
sum
Combine / aggregate
keyBy
groupBy
Group data based on a key(s)

Lazy Evaluation

개발자가 operation chain 명시
해당 operation 이 플랜에 추가
execute() 메서드가 호출될 때 실행

Flink Program

동작 순서

환경 설정
Source 로 부터 데이터 로드
데이터 변환
Write data to sink
Trigger Execution (env.execute())

Functions

FilterFunction

FilterFunction 의 리턴 값이 False 이면 해당 Stream 을 날려버림

MapFunction

JS 의 map 과 동일함

FlatMapFunction

map 과 비슷한데 collector 개념을 사용해서 하나의 스트림을 여러개의 스트림으로 나눌 수 있음

Stateless vs Stateful Transformation

Stateless Transformation
- 하나의 인풋에 변환을 적용하는것
- filter, map, flatmap
Stateful Transformation
- 여러개의 인풋에 변환을 적용하여 하나의 아웃풋을 만드는 것
- reduce, sum
- Accumulate data
  - entire stream
  - window
  - pre key, per operator

Streams

Keyed Streams

key 값에 따라서 stream 을 논리적으로 분리할 수 있는 스트림
예를 들어서 위와 같은 tuple 데이터를 0번 인덱스 기준으로 묶고 1번 인덱스 기준으로 sum 연산 적용가능

Number Aggregations

sum,min,max

ReduceFunction

Applies on a Keyed Stream
JS 의 Reduce 와 동일함
- first: input
- second: cumulative