Docs: Modern Data Stack (MDS) reference for Smart City
- Data Ingestion: NiFi, Airbyte, Kafka, Flink, dlt
- Workflow Automation: Airflow, Kestra, n8n, OpenFN, Dagster
- Analytics & Transformation: dbt, Spark, RisingWave, Druid, ClickHouse
- BI & Visualization: Grafana, Superset, DataHub, Great Expectations
- Storage: MinIO, PostgreSQL/TimescaleDB, CrateDB, Iceberg, InfluxDB
- Architecture: MVP and Enterprise stacks for Smart City Martinique
references/modern-data-stack.md

# Modern Data Stack (MDS) - Smart City Digital Twin

## Overview

A modern stack for ingesting, processing, orchestrating, and visualizing IoT data from the Smart City digital twin (Martinique).

## 1. Data Ingestion

### Objective

Ingest IoT sensor data (AirQualityObserved, TrafficFlowObserved, WeatherObserved, etc.) from the MQTT brokers and the Context Brokers (Orion-LD, Stellio, FROST).

### Candidate tools

| Tool | Role | Smart City benefit | Available skill |
|------|------|--------------------|-----------------|
| **Apache NiFi** | Visual ingestion, routing, transformation | Drag-and-drop flows, error handling, replay | `apache-nifi` / `apache-nifi-workflow` |
| **Airbyte** | Open-source ELT, 300+ connectors | Connectors for PostgreSQL and similar sources; FIWARE needs a custom connector (see section 7); MQTT and InfluxDB connector availability to be verified | `airbyte-data-ingestion` |
| **Kafka / Redpanda** | Event streaming, message buffer | Decouples IoT producers from consumers | `apache-kafka`, `redpanda` |
| **Flink** | Stream processing (real-time) | Windowing, time-based aggregation of sensor data | `apache-flink`, `apache-kafka-flink-streaming` |
| **dlt (data load tool)** | Simple Python ETL | Lightweight, inline transformation, low overhead | `dlt` |
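
To make the "windowing, time-based aggregation" row concrete, here is a minimal sketch of a tumbling (fixed, non-overlapping) window average in plain Python. It is not Flink code; it only illustrates the computation a Flink job would perform on a sensor stream. The function name, the 60-second window, and the sample readings are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def tumbling_window_mean(readings, window_seconds=60):
    """Average (timestamp_seconds, value) readings per fixed, non-overlapping window."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Each reading belongs to exactly one window, keyed by the window start time.
        window_start = int(ts // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical readings: (seconds since start, sensor value)
readings = [(0, 10.0), (30, 20.0), (65, 30.0), (119, 50.0), (120, 5.0)]
result = tumbling_window_mean(readings)
```

A real stream processor additionally has to handle late and out-of-order events, which this sketch ignores.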

### Proposed architecture (Ingestion)

```
MQTT Brokers (EMQX, BunkerM)
    ↓
Apache NiFi (routing, cleaning, validation)
    ↓
Kafka / Redpanda (buffer, replay)
    ↓
Context Brokers (Orion-LD, Stellio, FROST)
    ↓
Data Lake (MinIO) / Data Warehouse (ClickHouse)
```
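
The cleaning and validation step between the MQTT brokers and the Context Brokers can be sketched as a small function that turns a raw sensor message into an NGSI-LD entity. This is only an illustration in plain Python, not a NiFi flow. The payload field names (`sensor_id`, `pm25`, `ts`) and the rule that negative readings are rejected are assumptions; the entity type comes from the objective above.

```python
import json

NGSI_LD_CORE_CONTEXT = "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"


def mqtt_to_ngsi_ld(payload: bytes):
    """Return an NGSI-LD AirQualityObserved entity, or None if the message is invalid."""
    try:
        msg = json.loads(payload)
        sensor_id = str(msg["sensor_id"])   # hypothetical payload field
        pm25 = float(msg["pm25"])           # hypothetical payload field
        observed_at = str(msg["ts"])        # hypothetical payload field (ISO 8601)
    except (ValueError, KeyError, TypeError):
        return None  # malformed JSON, missing field, or wrong type
    if pm25 < 0:
        return None  # assumed validation rule: concentrations cannot be negative
    return {
        "id": f"urn:ngsi-ld:AirQualityObserved:{sensor_id}",
        "type": "AirQualityObserved",
        "pm25": {"type": "Property", "value": pm25, "observedAt": observed_at},
        "@context": [NGSI_LD_CORE_CONTEXT],
    }


entity = mqtt_to_ngsi_ld(b'{"sensor_id": "fdf-001", "pm25": 12.5, "ts": "2026-05-05T10:00:00Z"}')
rejected = mqtt_to_ngsi_ld(b"not json")
```

Returning `None` for invalid messages keeps the routing decision (drop, retry, or dead-letter) outside the transformation itself.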

## 2. Workflow Automation (Orchestration)

### Objective

Orchestrate the data pipelines, maintenance tasks, and synchronization between components.

### Candidate tools

| Tool | Role | Smart City benefit | Available skill |
|------|------|--------------------|-----------------|
| **Apache Airflow** | DAG orchestration, scheduling | Industry standard, Python, monitoring | `apache-airflow` |
| **Kestra** | Event-driven orchestration | YAML-native, modern UI, less code | `kestra` |
| **n8n** | No-code/low-code workflow automation | Fast integration, webhooks, API | `n8n` |
| **OpenFN** | DPG for government automation | DPI alignment, public services | `openfn` |
| **Dagster** | Modern, asset-focused orchestration | Lineage, testability, modern alternative to Airflow | `dagster` |

### Proposed architecture (Workflows)

```
Triggers (Timer, Webhook, MQTT Event)
    ↓
Kestra / Airflow (DAG orchestration)
    ├→ Data Ingestion (NiFi / Airbyte)
    ├→ Context Broker Sync (Orion ↔ Stellio)
    ├→ Data Quality Checks (Great Expectations)
    ├→ Transformation (dbt)
    └→ Notification (Telegram, Email)
```
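
What "DAG orchestration" means in the diagram above can be sketched with the Python standard library (`graphlib`, Python 3.9+). The diagram does not state dependencies between the branches, so the chain below (each branch waits for the one listed before it) is an assumption, and the task callables are placeholders rather than real NiFi or dbt invocations.

```python
from graphlib import TopologicalSorter

# Assumed dependencies: each key runs only after the tasks in its set.
dependencies = {
    "context_broker_sync": {"data_ingestion"},
    "data_quality_checks": {"context_broker_sync"},
    "transformation": {"data_quality_checks"},
    "notification": {"transformation"},
}


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order and stop at the first one that fails."""
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        if not tasks[name]():
            break
        completed.append(name)
    return completed


execution_order = list(TopologicalSorter(dependencies).static_order())

# Placeholder tasks: every step succeeds except the quality checks.
tasks = {name: (lambda: True) for name in execution_order}
tasks["data_quality_checks"] = lambda: False
completed = run_pipeline(tasks, dependencies)
```

Kestra and Airflow add scheduling, retries, and monitoring on top of this ordering logic; the sketch only shows why a failed quality check prevents the transformation from running.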

## 3. Data Analytics & Transformation

### Objective

Transform, clean, and model the data for analysis (SQL, Python).

### Candidate tools

| Tool | Role | Smart City benefit | Available skill |
|------|------|--------------------|-----------------|
| **dbt (data build tool)** | SQL transformations, tests, documentation | MDS standard, versioning, modularity | `dbt-core`, `dbt-transformation` |
| **Apache Spark** | Distributed batch/stream processing | Large volumes, ML data preparation | `apache-spark` |
| **RisingWave** | Streaming database (PostgreSQL-compatible) | SQL queries on real-time streams | `risingwave` |
| **Apache Druid** | Real-time OLAP analytics | Sub-second queries, time series | `apache-druid` |
| **ClickHouse** | Columnar OLAP database | Fast analytics, compression, IoT | `clickhouse-analytics-db` |

### Proposed architecture (Analytics)

```
Raw Data (Kafka / Context Brokers)
    ↓
dbt (staging → intermediate → marts)
    ↓
Analytics DB (ClickHouse / Druid / RisingWave)
    ↓
Dashboards (Grafana / Superset)
```
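
The staging → intermediate → marts layering can be illustrated in plain Python. In dbt each layer would be a SQL model; the sketch below only mirrors the responsibilities of each layer. The field names (`id`, `observedAt`, `pm25`) and the choice of "peak hourly average per station" as the mart are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def staging(raw_rows):
    """Staging: rename fields, cast types, drop rows without a measurement."""
    return [
        {"station": r["id"], "hour": r["observedAt"][:13], "pm25": float(r["pm25"])}
        for r in raw_rows
        if r.get("pm25") is not None
    ]


def intermediate(staged_rows):
    """Intermediate: hourly mean per station."""
    groups = defaultdict(list)
    for r in staged_rows:
        groups[(r["station"], r["hour"])].append(r["pm25"])
    return [
        {"station": station, "hour": hour, "pm25_avg": mean(values)}
        for (station, hour), values in sorted(groups.items())
    ]


def mart(hourly_rows):
    """Mart: peak hourly average per station, ready for a dashboard."""
    peaks = {}
    for r in hourly_rows:
        current = peaks.get(r["station"], r["pm25_avg"])
        peaks[r["station"]] = max(current, r["pm25_avg"])
    return peaks


raw = [
    {"id": "A", "observedAt": "2026-05-05T10:05:00Z", "pm25": 10},
    {"id": "A", "observedAt": "2026-05-05T10:35:00Z", "pm25": 20},
    {"id": "A", "observedAt": "2026-05-05T11:05:00Z", "pm25": 30},
    {"id": "B", "observedAt": "2026-05-05T10:10:00Z", "pm25": None},
]
hourly = intermediate(staging(raw))
peaks = mart(hourly)
```

Keeping each layer a pure function of the previous one is what makes the models individually testable, which is the main argument for the layered convention.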

## 4. Data Visualization & BI (Business Intelligence)

### Objective

Build dashboards to monitor air quality, traffic, weather, and the state of the digital twin.

### Candidate tools

| Tool | Role | Smart City benefit | Available skill |
|------|------|--------------------|-----------------|
| **Grafana** | Metrics, monitoring, alerting | Already in use (digital-twin stack), time series | `grafana-superset-dashboards` |
| **Apache Superset** | Modern BI, SQL Lab, charts | Open-source, self-hosted, no license fees | `superset`, `grafana-superset-dashboards` |
| **DataHub** | Data catalog, metadata management | Data traceability, lineage, discovery | `datahub`, `openmetadata` |
| **Great Expectations** | Data quality testing | Automated tests, profiling, alerting | `great-expectations-data-quality` |
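
The kind of automated test listed for data quality can be sketched in plain Python. This is a stand-in that shows the idea of declarative checks with a pass/fail report; it is not the Great Expectations API. The column name and the 0 to 500 range are assumptions chosen for illustration.

```python
def check_not_null(rows, column):
    """Flag rows where `column` is missing or None."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} is not null", "success": not failed, "failed_rows": failed}


def check_between(rows, column, low, high):
    """Flag rows whose non-null `column` value falls outside [low, high]."""
    failed = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"check": f"{column} in [{low}, {high}]", "success": not failed, "failed_rows": failed}


# Hypothetical batch: one valid reading, one missing value, one out-of-range value.
rows = [{"pm25": 12.0}, {"pm25": None}, {"pm25": -3.0}]
report = [check_not_null(rows, "pm25"), check_between(rows, "pm25", 0, 500)]
all_passed = all(item["success"] for item in report)
```

An orchestrator can use `all_passed` to decide whether to continue to the transformation step or send a notification instead.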

### Proposed architecture (BI)

```
Analytics DB (ClickHouse / PostgreSQL + Timescale)
    ↓
Grafana (real-time, alerting) + Superset (ad-hoc analysis)
    ↓
Data Catalog (DataHub) for governance
```

## 5. Data Storage

### Candidate tools

| Tool | Type | Smart City use |
|------|------|----------------|
| **MinIO** | S3-compatible object storage | Data Lake (raw, processed) |
| **PostgreSQL + PostGIS + TimescaleDB** | RDBMS + spatial + time-series | Relational, geospatial, and IoT storage |
| **CrateDB** | Distributed SQL (IoT/time-series) | Distributed IoT queries |
| **Apache Iceberg / Delta Lake** | Open table formats | ACID transactions, time travel |
| **InfluxDB** | Time-series DB | Already used in the project |
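
One way to organize the raw and processed zones of the Data Lake is a date-partitioned object key. The sketch below builds such a key in Python. The layout (`zone/entity type/year=/month=/day=`), the function name, and the sensor identifier are all assumptions; the source only states that MinIO holds raw and processed data.

```python
from datetime import datetime, timezone

ZONES = ("raw", "processed")


def object_key(zone, entity_type, observed_at, sensor_id):
    """Build a date-partitioned object key for a single observation (hypothetical layout)."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    if observed_at.tzinfo is None:
        raise ValueError("observed_at must be timezone-aware")
    ts = observed_at.astimezone(timezone.utc)  # normalize so partitions are in UTC
    return (
        f"{zone}/{entity_type}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"{sensor_id}-{ts:%Y%m%dT%H%M%SZ}.json"
    )


key = object_key(
    "raw",
    "AirQualityObserved",
    datetime(2026, 5, 5, 10, 0, tzinfo=timezone.utc),
    "fdf-001",
)
```

Partitioning by date keeps listing and pruning cheap for time-bounded queries, and the same prefix scheme carries over if the files are later registered in an open table format.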

## 6. Recommended MDS architecture (Smart City Martinique)

### Minimal stack (MVP)

```
1. Ingestion      → Apache NiFi (visual flows, MQTT → Context Brokers)
2. Orchestration  → Kestra (YAML, event-driven, less code)
3. Transformation → dbt (SQL, versioning, tests)
4. Analytics DB   → ClickHouse (fast, columnar, IoT-friendly)
5. Visualization  → Grafana (existing) + Apache Superset (BI)
6. Storage        → MinIO (Data Lake) + PostgreSQL/TimescaleDB (relational)
7. Data Catalog   → DataHub (metadata, lineage)
8. Quality        → Great Expectations (quality tests)
```

### Full stack (Enterprise)

```
+ Apache Kafka / Redpanda (Event streaming backbone)
+ Apache Flink (Real-time stream processing)
+ Apache Airflow (Complex DAG orchestration)
+ Apache Spark (Big data processing)
+ Apache Druid (Real-time OLAP)
+ OpenMetadata (Data governance)
+ MindsDB (ML in database)
```

## 7. Alignment with the existing stack (digitribe.fr)

### Reuse

- **Traefik**: Reverse proxy for all MDS components (NiFi, Airflow, Superset, etc.)
- **Docker Compose**: Containerization of the MDS stack
- **PostgreSQL**: Already in use (Orion-LD, Stellio, FROST, OpenRemote) → add TimescaleDB
- **InfluxDB**: Already in use → complement with ClickHouse or Druid
- **MQTT (EMQX)**: Primary ingestion source
- **Grafana**: Already in use → extend with Superset
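
Putting the MDS components behind Traefik in Docker Compose mostly comes down to a consistent set of container labels. The helper below generates them in Python as a sketch. The label keys follow Traefik's Docker provider convention; the subdomains under digitribe.fr and the container ports are assumptions and would need to match the actual deployment.

```python
def traefik_labels(service, host, port):
    """Return Docker labels that expose one service through Traefik."""
    return [
        "traefik.enable=true",
        f"traefik.http.routers.{service}.rule=Host(`{host}`)",
        f"traefik.http.services.{service}.loadbalancer.server.port={port}",
    ]


# Hypothetical hostnames and ports for three of the components listed above.
components = {
    "nifi": ("nifi.digitribe.fr", 8080),
    "airflow": ("airflow.digitribe.fr", 8080),
    "superset": ("superset.digitribe.fr", 8088),
}
labels = {name: traefik_labels(name, host, port) for name, (host, port) in components.items()}
```

Each list can be pasted under the `labels:` key of the corresponding service in the Compose file; TLS and entrypoint settings are left out here because they depend on the existing Traefik configuration.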

### Context Broker integration

```
Context Brokers (Orion-LD, Stellio, FROST)
    ↓ (HTTP API / NGSI-LD)
Airbyte (custom connector) or NiFi (InvokeHTTP)
    ↓
Kafka (one topic per entity type: airquality, traffic, weather)
    ↓
dbt (modeling)
    ↓
ClickHouse (analytics)
    ↓
Grafana / Superset (dashboards)
```
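
The first two hops of the flow above (querying a Context Broker over NGSI-LD, then routing entities to one Kafka topic per entity type) can be sketched without any network or Kafka dependency. The mapping from entity types to the topic names, the broker base URL, and the `limit` value are assumptions; the sketch only builds the query URL and groups entities, leaving the HTTP call and the Kafka producer to the chosen tool.

```python
from urllib.parse import urlencode

# Assumed mapping between NGSI-LD entity types and the topic names in the diagram.
TOPIC_BY_TYPE = {
    "AirQualityObserved": "airquality",
    "TrafficFlowObserved": "traffic",
    "WeatherObserved": "weather",
}


def entities_url(base_url, entity_type, limit=100):
    """Build an NGSI-LD query URL that lists entities of one type."""
    query = urlencode({"type": entity_type, "limit": limit})
    return f"{base_url.rstrip('/')}/ngsi-ld/v1/entities?{query}"


def route_by_topic(entities):
    """Group entities by destination topic; entities of unknown type are skipped."""
    batches = {}
    for entity in entities:
        topic = TOPIC_BY_TYPE.get(entity.get("type"))
        if topic is not None:
            batches.setdefault(topic, []).append(entity)
    return batches


url = entities_url("http://orion-ld:1026/", "WeatherObserved")  # hypothetical broker address
batches = route_by_topic([
    {"id": "urn:ngsi-ld:WeatherObserved:w1", "type": "WeatherObserved"},
    {"id": "urn:ngsi-ld:AirQualityObserved:a1", "type": "AirQualityObserved"},
    {"id": "urn:ngsi-ld:Building:b1", "type": "Building"},
])
```

Skipping unknown types silently is a simplification; in practice they would more likely go to a catch-all or dead-letter topic so that nothing is lost.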

## 8. Next steps

1. **Confirm the MVP components**: weigh the section 6 recommendation (NiFi, Kestra, ClickHouse) against the alternatives (Airbyte, Airflow, Druid)
2. **Deploy a POC**: Docker Compose with 2-3 key components
3. **Build the pipelines**: MQTT → Context Broker → Analytics DB → Dashboard
4. **Documentation**: installation guides, Traefik configurations, tests

## 9. Resources

- MDS Landscape: https://datatechnologylifecycle.com/modern-data-stack-landscape/
- Airbyte Docs: https://docs.airbyte.com/
- Apache NiFi Docs: https://nifi.apache.org/docs.html
- dbt Docs: https://docs.getdbt.com/
- ClickHouse Docs: https://clickhouse.com/docs
- Grafana Stack: https://grafana.com/docs/
- Apache Superset: https://superset.apache.org/docs/

---

*Document created on 2026-05-05 for the Smart City Digital Twin project (Martinique)*