Job Description
We are seeking a specialized Data Engineer to design, build, and optimize data preparation and ingestion pipelines for Retrieval-Augmented Generation (RAG) systems and other Generative AI applications. This role focuses exclusively on the critical data foundation that powers AI systems—transforming raw, unstructured data into AI-ready formats through sophisticated parsing, chunking, embedding, and vector storage workflows.
You will work at the intersection of traditional data engineering and cutting-edge AI technologies, building scalable pipelines that process diverse document types (PDFs, Word docs, HTML, JSON) and prepare them for semantic search and retrieval. This is not a model training or tuning role; rather, you will build the essential data infrastructure that enables AI systems to access and utilize private, domain-specific knowledge effectively.
Responsibilities
RAG Data Pipeline Architecture & Development
Design and implement end-to-end data pipelines for RAG systems following industry-standard multi-stage processing:
Raw Data → Ingestion → Parsing → Enrichment → Chunking → Embedding → Vector Storage → Retrieval
Build scalable document processing workflows that handle diverse file formats including PDFs, Word documents, HTML, JSON, XML, and multimedia content
Develop robust parsing logic using libraries like PyPDF2, unstructured, BeautifulSoup, and OCR tools (Tesseract, Amazon Textract, Azure AI Vision) for accurate content extraction
Implement intelligent chunking strategies including fixed-size, semantic, and overlapping techniques optimized for retrieval performance and context preservation (a minimal chunking sketch follows this list)
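To make the chunking stage concrete, here is a minimal sketch of fixed-size chunking with overlap; the chunk size, overlap, and function name are illustrative assumptions, not requirements of the role.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap to preserve context across boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip empty or whitespace-only chunks
            chunks.append(piece)
    return chunks

# Example: a parsed document string becomes overlapping chunks ready for embedding.
# chunks = chunk_text(parsed_document)
```

A semantic chunking variant would replace the fixed character window with boundaries derived from sentence or section structure.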
Vector Database Management & Optimization
Design and implement vector storage solutions using modern vector databases (Pinecone, Milvus, Weaviate, Chroma, Qdrant) for efficient similarity search and retrieval
Optimize embedding generation workflows using sentence transformers and embedding APIs, ensuring consistent vector representations across document collections (see the embedding-and-storage sketch after this list)
Implement hybrid search capabilities combining vector similarity with traditional keyword search for enhanced retrieval accuracy
Monitor and tune vector database performance including indexing strategies, query optimization, and resource utilization
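As an illustration of the embedding-and-storage workflow described above, the sketch below pairs a sentence-transformers model with an in-memory Chroma collection; the model name, collection name, and sample texts are assumptions chosen for demonstration.

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")            # compact general-purpose embedding model
client = chromadb.Client()                                  # in-memory vector store for the sketch
collection = client.get_or_create_collection(name="docs")

chunks = ["Quarterly revenue grew 12%.", "The onboarding guide covers SSO setup."]
embeddings = model.encode(chunks).tolist()

# Store chunks together with their embeddings and simple metadata.
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
    metadatas=[{"source": "demo"} for _ in chunks],
)

# Retrieve the most similar chunk for a natural-language query.
query_embedding = model.encode(["How did revenue change?"]).tolist()
results = collection.query(query_embeddings=query_embedding, n_results=1)
print(results["documents"])
```

A hybrid search setup would combine the vector distances returned here with a keyword score such as BM25 before final ranking.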
Data Quality & Enrichment Systems
Develop comprehensive data validation frameworks ensuring high-quality input for AI systems, including schema validation, content quality checks, and automated error detection
Implement sophisticated deduplication algorithms at both document and chunk levels to eliminate redundant information and improve retrieval efficiency (see the deduplication sketch after this list)
Build metadata extraction and enrichment pipelines that automatically generate document-level, content-based, structural, and contextual metadata for enhanced searchability
Create data lineage tracking systems ensuring complete traceability from raw sources to vector storage for compliance and debugging
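Below is a minimal sketch of chunk-level deduplication via content hashing; near-duplicate detection (MinHash or embedding similarity) would build on the same pattern, and the names used are illustrative.

```python
import hashlib
from typing import Iterable, List

def dedupe_chunks(chunks: Iterable[str]) -> List[str]:
    """Drop chunks whose normalized content has already been seen."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = hashlib.sha256(chunk.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

print(dedupe_chunks(["Hello world", "hello world ", "Something else"]))
# ['Hello world', 'Something else']
```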
Production Pipeline Operations & Monitoring
Build and maintain production-grade data pipelines with 99.9%+ uptime requirements, implementing robust error handling, retry mechanisms, and graceful degradation
Develop comprehensive monitoring and alerting systems for data quality, pipeline performance, processing latency, and system health metrics
Implement incremental processing capabilities that efficiently handle new data without requiring full reprocessing of existing document collections (a minimal incremental-ingestion sketch follows this list)
Create automated testing frameworks for data pipeline validation, including unit tests, integration tests, and end-to-end workflow verification
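One simple approach to incremental processing is sketched below: only files whose content hash has changed since the last run are re-ingested. The manifest file and the process_document callback are hypothetical stand-ins for real pipeline steps.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("ingest_manifest.json")

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def incremental_ingest(source_dir: Path, process_document) -> None:
    """Re-process only files that are new or changed since the previous run."""
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    for path in source_dir.glob("**/*"):
        if not path.is_file():
            continue
        digest = file_hash(path)
        if manifest.get(str(path)) == digest:
            continue                      # unchanged: skip reprocessing
        process_document(path)            # parse -> chunk -> embed -> upsert
        manifest[str(path)] = digest
    MANIFEST.write_text(json.dumps(manifest, indent=2))

# incremental_ingest(Path("raw_docs"), process_document=lambda p: print(f"processing {p}"))
```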
Cross-Functional Collaboration & Technical Leadership
Partner closely with AI/ML engineers, product managers, and business stakeholders to understand data requirements and optimize pipeline performance for specific use cases
Collaborate with infrastructure teams to ensure scalable, cost-effective deployment of data processing workflows across cloud platforms (AWS, Azure, GCP)
Provide technical guidance on data architecture decisions, tool selection, and best practices for RAG system implementation
Document technical specifications, data schemas, and operational procedures for knowledge sharing and team onboarding
Requirements
Education & Experience
Bachelor's degree in Computer Science, Data Engineering, Software Engineering, or related technical field
3+ years of experience in data engineering, data pipeline development, or related roles with production system responsibility
1+ years of hands-on experience with AI/ML systems, particularly RAG implementations, vector databases, or document processing workflows
Proven track record of building and maintaining large-scale data processing systems handling millions of documents or records
Core Data Engineering Expertise
Programming Proficiency: Expert-level Python skills with experience in data processing libraries (Pandas, NumPy, Scikit-learn), plus working knowledge of SQL, Java, or Scala
Data Processing Frameworks: Hands-on experience with Apache Spark, Apache Airflow, Apache Kafka, or similar distributed processing and orchestration tools
Database Technologies: Strong experience with both relational databases (PostgreSQL, MySQL) and NoSQL systems (MongoDB, Cassandra, Elasticsearch)
Cloud Platform Expertise: Practical experience with at least one major cloud platform (AWS, Azure, GCP) including data services, storage solutions, and compute resources
AI/RAG-Specific Technical Skills
Vector Database Experience: Hands-on implementation experience with vector databases such as Pinecone, Milvus, Weaviate, Chroma, or Qdrant
Document Processing: Proven experience with document parsing libraries and tools including PyPDF2, unstructured, BeautifulSoup, lxml, and OCR technologies
RAG Framework Knowledge: Familiarity with RAG development frameworks such as LangChain, LlamaIndex, or custom pipeline implementations
Embedding Technologies: Understanding of sentence transformers, embedding models, and vector similarity concepts for semantic search applications (see the similarity sketch below)
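For context on the vector similarity concept referenced above, here is a short cosine similarity example with NumPy; the vectors are toy values.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.1, 0.8, 0.3])
b = np.array([0.2, 0.7, 0.4])
print(cosine_similarity(a, b))  # ~0.98: the vectors point in a similar direction
```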
Infrastructure & DevOps Skills
Containerization: Experience with Docker and Kubernetes for scalable application deployment and orchestration
CI/CD Pipelines: Proficiency with automated testing, deployment, and monitoring using tools like GitHub Actions, Jenkins, or similar platforms
Infrastructure as Code: Experience with Terraform, CloudFormation, or similar tools for reproducible infrastructure management
Monitoring & Observability: Hands-on experience with logging, monitoring, and alerting systems for production data pipelines
Preferred Qualifications
Advanced Technical Experience
Master's degree in Computer Science, Data Science, or related field with focus on distributed systems or machine learning
5+ years of data engineering experience with at least 2 years specifically in AI/ML data pipeline development
Experience with real-time data processing and streaming architectures for dynamic RAG systems
Knowledge of multimodal data processing including text, images, audio, and video content for comprehensive RAG implementations
Specialized Domain Knowledge
Experience with enterprise data governance frameworks, data privacy regulations (GDPR, HIPAA), and compliance requirements
Background in natural language processing (NLP) concepts, text preprocessing, and linguistic analysis techniques
Familiarity with MLOps practices including model versioning, experiment tracking, and automated model deployment pipelines
Experience with federated learning or distributed AI systems across multiple data sources and organizations
Industry & Leadership Experience
Previous experience in AI-first companies, technology consulting, or enterprise AI transformation initiatives
Track record of mentoring junior engineers and contributing to technical decision-making processes
Experience presenting technical concepts to non-technical stakeholders and translating business requirements into technical specifications
Contributions to open-source projects related to data engineering, AI/ML, or RAG system development
Success Metrics & Performance Indicators
Technical Excellence
Data Quality: Achieve 99%+ accuracy in document processing and content extraction across diverse file formats
Pipeline Reliability: Maintain 99.9%+ uptime for critical data processing workflows with minimal manual intervention
Processing Performance: Optimize pipeline throughput to handle increasing data volumes while maintaining sub-second retrieval times
Cost Efficiency: Implement resource optimization strategies that reduce cloud infrastructure costs by 20%+ while maintaining performance
Business Impact & Innovation
RAG System Performance: Contribute to measurable improvements in retrieval accuracy, relevance scores, and user satisfaction metrics
Time to Market: Reduce time required to onboard new data sources from weeks to days through automated pipeline development
Scalability Achievement: Successfully scale data processing capabilities to support 10x growth in document volume and user queries
Knowledge Sharing: Create reusable frameworks, documentation, and best practices that improve team productivity by 25%+
Tools & Technologies
Data Processing & Pipeline Development
Programming Languages: Python (primary), SQL (expert), Java/Scala (working knowledge)
Data Processing: Apache Spark, Apache Airflow, Apache Kafka, Pandas, NumPy
Workflow Orchestration: Airflow, Prefect, Dagster, or custom scheduling systems
Data Validation: Great Expectations, Deequ, custom validation frameworks
AI/ML & Vector Technologies
Vector Databases: Pinecone, Milvus, Weaviate, Chroma, Qdrant, FAISS
Document Processing: PyPDF2, unstructured, BeautifulSoup, lxml, pdfplumber
OCR & Content Extraction: Tesseract, Amazon Textract, Azure AI Vision, Google Cloud Vision API
RAG Frameworks: LangChain, LlamaIndex, Haystack, custom implementations
Cloud & Infrastructure
Cloud Platforms: AWS (S3, Lambda, ECS, RDS), Azure (Blob Storage, Functions, AKS), GCP (Cloud Storage, Cloud Functions, GKE)
Containerization: Docker, Kubernetes, container registries
Infrastructure as Code: Terraform, CloudFormation, Pulumi
Monitoring: Prometheus, Grafana, CloudWatch, Azure Monitor, Google Cloud Monitoring
Development & Collaboration
Version Control: Git, GitHub, GitLab with branching strategies and code review processes
CI/CD: GitHub Actions, Jenkins, Azure DevOps, GitLab CI
Documentation: Confluence, Notion, Sphinx, MkDocs
Communication: Slack, Microsoft Teams, Jira, Linear
Location
Remote