Job Description
We are seeking a specialized Data Engineer to design, build, and optimize data preparation and ingestion pipelines for Retrieval-Augmented Generation (RAG) systems and other Generative AI applications. This role focuses exclusively on the critical data foundation that powers AI systems—transforming raw, unstructured data into AI-ready formats through sophisticated parsing, chunking, embedding, and vector storage workflows.
You will work at the intersection of traditional data engineering and cutting-edge AI technologies, building scalable pipelines that process diverse document types (PDFs, Word docs, HTML, JSON) and prepare them for semantic search and retrieval. This is not a model training or tuning role; rather, you will build the essential data infrastructure that enables AI systems to access and utilize private, domain-specific knowledge effectively.
Responsibilities
RAG Data Pipeline Architecture & Development
Design and implement end-to-end data pipelines for RAG systems following industry-standard multi-stage processing:
Raw Data → Ingestion → Parsing → Enrichment → Chunking → Embedding → Vector Storage → Retrieval
Build scalable document processing workflows that handle diverse file formats including PDFs, Word documents, HTML, JSON, XML, and multimedia content
Develop robust parsing logic using libraries like PyPDF2, unstructured, BeautifulSoup, and OCR tools (Tesseract, Amazon Textract, Azure AI Vision) for accurate content extraction
Implement intelligent chunking strategies including fixed-size, semantic, and overlapping techniques optimized for retrieval performance and context preservation (a minimal chunking sketch follows this list)
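To make the chunking stage concrete, here is a minimal sketch of fixed-size chunking with overlap; the chunk size, overlap, and function name are illustrative assumptions, not requirements of the role.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap to preserve context across boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip empty or whitespace-only chunks
            chunks.append(piece)
    return chunks

# Example: a parsed document string becomes overlapping chunks ready for embedding.
# chunks = chunk_text(parsed_document)
```

A semantic chunking variant would replace the fixed character window with boundaries derived from sentence or section structure.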
Vector Database Management & Optimization
Design and implement vector storage solutions using modern vector databases (Pinecone, Milvus, Weaviate, Chroma, Qdrant) for efficient similarity search and retrieval
Optimize embedding generation workflows using sentence transformers and embedding APIs, ensuring consistent vector representations across document collections (see the embedding-and-storage sketch after this list)
Implement hybrid search capabilities combining vector similarity with traditional keyword search for enhanced retrieval accuracy
Monitor and tune vector database performance including indexing strategies, query optimization, and resource utilization
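As an illustration of the embedding-and-storage workflow described above, the sketch below pairs a sentence-transformers model with an in-memory Chroma collection; the model name, collection name, and sample texts are assumptions chosen for demonstration.

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")            # compact general-purpose embedding model
client = chromadb.Client()                                  # in-memory vector store for the sketch
collection = client.get_or_create_collection(name="docs")

chunks = ["Quarterly revenue grew 12%.", "The onboarding guide covers SSO setup."]
embeddings = model.encode(chunks).tolist()

# Store chunks together with their embeddings and simple metadata.
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
    metadatas=[{"source": "demo"} for _ in chunks],
)

# Retrieve the most similar chunk for a natural-language query.
query_embedding = model.encode(["How did revenue change?"]).tolist()
results = collection.query(query_embeddings=query_embedding, n_results=1)
print(results["documents"])
```

A hybrid search setup would combine the vector distances returned here with a keyword score such as BM25 before final ranking.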
Data Quality & Enrichment Systems
Develop comprehensive data validation frameworks ensuring high-quality input for AI systems, including schema validation, content quality checks, and automated error detection
Implement sophisticated deduplication algorithms at both document and chunk levels to eliminate redundant information and improve retrieval efficiency (see the deduplication sketch after this list)
Build metadata extraction and enrichment pipelines that automatically generate document-level, content-based, structural, and contextual metadata for enhanced searchability
Create data lineage tracking systems ensuring complete traceability from raw sources to vector storage for compliance and debugging
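Below is a minimal sketch of chunk-level deduplication via content hashing; near-duplicate detection (MinHash or embedding similarity) would build on the same pattern, and the names used are illustrative.

```python
import hashlib
from typing import Iterable, List

def dedupe_chunks(chunks: Iterable[str]) -> List[str]:
    """Drop chunks whose normalized content has already been seen."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = hashlib.sha256(chunk.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

print(dedupe_chunks(["Hello world", "hello world ", "Something else"]))
# ['Hello world', 'Something else']
```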
Production Pipeline Operations & Monitoring
Build and maintain production-grade data pipelines with 99.9%+ uptime requirements, implementing robust error handling, retry mechanisms, and graceful degradation
Develop comprehensive monitoring and alerting systems for data quality, pipeline performance, processing latency, and system health metrics
Implement incremental processing capabilities that efficiently handle new data without requiring full reprocessing of existing document collections (a minimal incremental-ingestion sketch follows this list)
Create automated testing frameworks for data pipeline validation, including unit tests, integration tests, and end-to-end workflow verification
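One simple approach to incremental processing is sketched below: only files whose content hash has changed since the last run are re-ingested. The manifest file and the process_document callback are hypothetical stand-ins for real pipeline steps.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("ingest_manifest.json")

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def incremental_ingest(source_dir: Path, process_document) -> None:
    """Re-process only files that are new or changed since the previous run."""
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    for path in source_dir.glob("**/*"):
        if not path.is_file():
            continue
        digest = file_hash(path)
        if manifest.get(str(path)) == digest:
            continue                      # unchanged: skip reprocessing
        process_document(path)            # parse -> chunk -> embed -> upsert
        manifest[str(path)] = digest
    MANIFEST.write_text(json.dumps(manifest, indent=2))

# incremental_ingest(Path("raw_docs"), process_document=lambda p: print(f"processing {p}"))
```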
Cross-Functional Collaboration & Technical Leadership
Partner closely with AI/ML engineers, product managers, and business stakeholders to understand data requirements and optimize pipeline performance for specific use cases
Collaborate with infrastructure teams to ensure scalable, cost-effective deployment of data processing workflows across cloud platforms (AWS, Azure, GCP)
Provide technical guidance on data architecture decisions, tool selection, and best practices for RAG system implementation
Document technical specifications, data schemas, and operational procedures for knowledge sharing and team onboarding
Requirements
Education & Experience
Bachelor's degree in Computer Science, Data Engineering, Software Engineering, or related technical field
3+ years of experience in data engineering, data pipeline development, or related roles with production system responsibility
1+ years of hands-on experience with AI/ML systems, particularly RAG implementations, vector databases, or document processing workflows
Proven track record of building and maintaining large-scale data processing systems handling millions of documents or records
Core Data Engineering Expertise
Programming Proficiency: Expert-level Python skills with experience in data processing libraries (Pandas, NumPy, Scikit-learn), plus working knowledge of SQL, Java, or Scala
Data Processing Frameworks: Hands-on experience with Apache Spark, Apache Airflow, Apache Kafka, or similar distributed processing and orchestration tools
Database Technologies: Strong experience with both relational databases (PostgreSQL, MySQL) and NoSQL systems (MongoDB, Cassandra, Elasticsearch)
Cloud Platform Expertise: Practical experience with at least one major cloud platform (AWS, Azure, GCP) including data services, storage solutions, and compute resources
AI/RAG-Specific Technical Skills
Vector Database Experience: Hands-on implementation experience with vector databases such as Pinecone, Milvus, Weaviate, Chroma, or Qdrant
Document Processing: Proven experience with document parsing libraries and tools including PyPDF2, unstructured, BeautifulSoup, lxml, and OCR technologies
RAG Framework Knowledge: Familiarity with RAG development frameworks such as LangChain, LlamaIndex, or custom pipeline implementations
Embedding Technologies: Understanding of sentence transformers, embedding models, and vector similarity concepts for semantic search applications (see the similarity sketch below)
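For context on the vector similarity concept referenced above, here is a short cosine similarity example with NumPy; the vectors are toy values.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.1, 0.8, 0.3])
b = np.array([0.2, 0.7, 0.4])
print(cosine_similarity(a, b))  # ~0.98: the vectors point in a similar direction
```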
Infrastructure & DevOps Skills
Containerization: Experience with Docker and Kubernetes for scalable application deployment and orchestration
CI/CD Pipelines: Proficiency with automated testing, deployment, and monitoring using tools like GitHub Actions, Jenkins, or similar platforms
Infrastructure as Code: Experience with Terraform, CloudFormation, or similar tools for reproducible infrastructure management
Monitoring & Observability: Hands-on experience with logging, monitoring, and alerting systems for production data pipelines
Preferred Qualifications
Advanced Technical Experience
Master's degree in Computer Science, Data Science, or related field with focus on distributed systems or machine learning
5+ years of data engineering experience with at least 2 years specifically in AI/ML data pipeline development
Experience with real-time data processing and streaming architectures for dynamic RAG systems
Knowledge of multimodal data processing including text, images, audio, and video content for comprehensive RAG implementations
Specialized Domain Knowledge
Experience with enterprise data governance frameworks, data privacy regulations (GDPR, HIPAA), and compliance requirements
Background in natural language processing (NLP) concepts, text preprocessing, and linguistic analysis techniques
Familiarity with MLOps practices including model versioning, experiment tracking, and automated model deployment pipelines
Experience with federated learning or distributed AI systems across multiple data sources and organizations
Industry & Leadership Experience
Previous experience in AI-first companies, technology consulting, or enterprise AI transformation initiatives
Track record of mentoring junior engineers and contributing to technical decision-making processes
Experience presenting technical concepts to non-technical stakeholders and translating business requirements into technical specifications
Contributions to open-source projects related to data engineering, AI/ML, or RAG system development
Success Metrics & Performance Indicators
Technical Excellence
Data Quality: Achieve 99%+ accuracy in document processing and content extraction across diverse file formats
Pipeline Reliability: Maintain 99.9%+ uptime for critical data processing workflows with minimal manual intervention
Processing Performance: Optimize pipeline throughput to handle increasing data volumes while maintaining sub-second retrieval times
Cost Efficiency: Implement resource optimization strategies that reduce cloud infrastructure costs by 20%+ while maintaining performance
Business Impact & Innovation
RAG System Performance: Contribute to measurable improvements in retrieval accuracy, relevance scores, and user satisfaction metrics
Time to Market: Reduce time required to onboard new data sources from weeks to days through automated pipeline development
Scalability Achievement: Successfully scale data processing capabilities to support 10x growth in document volume and user queries
Knowledge Sharing: Create reusable frameworks, documentation, and best practices that improve team productivity by 25%+
Tools & Technologies
Data Processing & Pipeline Development
Programming Languages: Python (primary), SQL (expert), Java/Scala (working knowledge)
Data Processing: Apache Spark, Apache Airflow, Apache Kafka, Pandas, NumPy
Workflow Orchestration: Airflow, Prefect, Dagster, or custom scheduling systems
Data Validation: Great Expectations, Deequ, custom validation frameworks
AI/ML & Vector Technologies
Vector Databases: Pinecone, Milvus, Weaviate, Chroma, Qdrant, FAISS
Document Processing: PyPDF2, unstructured, BeautifulSoup, lxml, pdfplumber
OCR & Content Extraction: Tesseract, Amazon Textract, Azure AI Vision, Google Cloud Vision API
RAG Frameworks: LangChain, LlamaIndex, Haystack, custom implementations
Cloud & Infrastructure
Cloud Platforms: AWS (S3, Lambda, ECS, RDS), Azure (Blob Storage, Functions, AKS), GCP (Cloud Storage, Cloud Functions, GKE)
Containerization: Docker, Kubernetes, container registries
Infrastructure as Code: Terraform, CloudFormation, Pulumi
Monitoring: Prometheus, Grafana, CloudWatch, Azure Monitor, Google Cloud Monitoring
Development & Collaboration
Version Control: Git, GitHub, GitLab with branching strategies and code review processes
CI/CD: GitHub Actions, Jenkins, Azure DevOps, GitLab CI
Documentation: Confluence, Notion, Sphinx, MkDocs
Communication: Slack, Microsoft Teams, Jira, Linear
Location
Remote