Senior Software Engineer — ML Data Platform
DuckDuckGoose
Software Engineering, Data Science
South Holland, Netherlands
Posted on Aug 16, 2025
Location: Delft (hybrid)
Type: Full-time
Start: ASAP
We protect citizens, enterprises, and governments from synthetic media fraud. Everything you see and hear online can now be manipulated — our job is to make sure people can trust what they see. As part of our forensics platform team, you’ll work on the data backbone that makes large-scale detection possible, from ingestion and versioning to training, evaluation, and production.
You’ll join a small, senior team where your work will have immediate impact, and you’ll have ownership over the systems you build.
What You’ll Drive
- Data platform architecture: Define unified schemas, lineage, and dataset versioning for large image/video + context data.
- Ingestion at scale: Build reliable pipelines from research repos, APIs, and internal generators; automate connectors and jobs.
- Quality & governance: Implement deduplication, validation, health dashboards, and drift/coverage checks with auditable lineage.
- Curation & access: Deliver one-command dataset builds, deterministic splits, and fast sampling tools for training/eval.
- Performance & cost: Tune S3/object storage layouts, partitioning, and lifecycle policies for speed and spend.
- Orchestration & ops: Productionize pipelines with CI/CD, containerization, scheduling/monitoring, and safe rollbacks.
- Reliability & operations: Build for simplicity and observability; participate in a planned, compensated support rotation.
- Engineering productivity: Create internal tools/CLIs, docs, and templates that make everyone faster.
What You Bring
- Strong software engineering foundation: Master’s in Computer Science, Data Engineering, or a related field.
- Production experience: 5–8+ years building and operating data platforms for large unstructured datasets (images/video).
- Data lifecycle ownership: Ingest → validate → catalog → version → sample/serve → monitor.
- Pipelines & orchestration: Experience with modern schedulers (e.g., Airflow/Prefect) and containerized jobs.
- Storage & formats: Hands-on with object storage (e.g., S3), columnar formats/partitioning, and performance tuning.
- Versioning & lineage: Experience with dataset versioning and reproducibility (e.g., DVC/lakeFS/Delta or equivalents).
- Quality at scale: Deduplication, schema/label checks, and automated QC gates in CI.
- Security & privacy: IAM, access controls, and privacy-aware workflows suitable for regulated customers.
- Domain awareness: Familiarity with digital forensics, misinformation threats, or synthetic media — and willingness to deepen expertise.
- Flexibility: Comfortable moving between data engineering, infra, and tooling tasks when needed.
- Mindset & delivery: Thrive in a fast-moving environment; proactive problem-solver; ship, measure, simplify.
- Communication: Excellent written and verbal skills; explain complex ideas clearly.
- Independence: Deliver quality work on time without constant oversight.
- Language: Fluent in English.
Nice to Have
- Streaming & events: Kafka/Kinesis or similar for near-real-time ingestion.
- Vector search: Experience with embedding stores or similarity search at scale.
- Synthetic data: Building pipelines to generate/stress-test rare scenarios.
- Cloud & on-prem: Terraform/CDK, Kubernetes, and hybrid/on-prem data deployments.
- FinOps: Cost monitoring and optimization for data workloads.
- Technical track record: A strong GitHub profile, open-source contributions, publications, patents, or public talks.
- Leadership: Mentoring and guiding technical direction.
- Dutch language: Fluency is a plus.
What Success Looks Like
- A unified schema + catalog with key datasets onboarded, versioned, and reproducibly built via one command.
- Automated QC gates (dedup/validation) with a red/amber/green dataset health dashboard and clear lineage.
- Fast sampling/curation tools for the ML team, plus cost controls (storage layouts, lifecycle policies) in place.
- Data migration: Inventory and migrate existing/legacy datasets into the new platform; reformat to the new schema, backfill metadata, validate checksums/lineage, and deprecate legacy paths with a rollback plan.
Why Join Us
- Own the backbone: Define schemas, lineage, and dataset versioning used across research and production.
- Company participation: Meaningful equity/virtual shares aligned with company growth.
- Flexible work: Hybrid (Delft), flexible hours, minimal ceremony, async-first collaboration.
- Data platform mandate: Real say in stack choices (orchestration, catalog, storage/layout) and time to implement them right.
- Repro & auditability: Space to enforce deterministic builds, splits, and traceable lineage; no heroics needed.
- Quality culture: Backing to implement dedup, drift/coverage checks, and dataset health dashboards org-wide.
- FinOps mindset: Budget and support to balance speed, reliability, and total cost.
- Pragmatic on-call: Planned, compensated rotation with automation-first recovery and rollback plans.
- Growth path: IC track to Staff/Principal; opportunities to mentor and codify data standards.
- Learning budget: Annual budget for courses/books + two data/ML-infra conferences per year.
- Home office: Modest stipend for an ergonomic setup; commuting support (public transport or mileage).
- Relocation + visa: Visa sponsorship and relocation support for internationals.
Join a company committed to creating a more secure and trustworthy digital future. Apply today and become part of our mission-driven team!