Data Engineer - Generative AI & RAG Systems

Job Description

We are seeking a specialized Data Engineer to design, build, and optimize data preparation and ingestion pipelines for Retrieval-Augmented Generation (RAG) systems and other Generative AI applications. This role focuses exclusively on the critical data foundation that powers AI systems: transforming raw, unstructured data into AI-ready formats through sophisticated parsing, chunking, embedding, and vector storage workflows.

You will work at the intersection of traditional data engineering and cutting-edge AI technologies, building scalable pipelines that process diverse document types (PDFs, Word docs, HTML, JSON) and prepare them for semantic search and retrieval. This is not a model training or tuning role; instead, it centers on the essential data infrastructure that enables AI systems to access and use private, domain-specific knowledge effectively.

Responsibilities

RAG Data Pipeline Architecture & Development

  • Design and implement end-to-end data pipelines for RAG systems following industry-standard multi-stage processing: Raw Data → Ingestion → Parsing → Enrichment → Chunking → Embedding → Vector Storage → Retrieval

  • Build scalable document processing workflows that handle diverse file formats including PDFs, Word documents, HTML, JSON, XML, and multimedia content

  • Develop robust parsing logic using libraries like PyPDF2, unstructured, BeautifulSoup, and OCR tools (Tesseract, Amazon Textract, Azure AI Vision) for accurate content extraction

  • Implement intelligent chunking strategies including fixed-size, semantic, and overlapping techniques optimized for retrieval performance and context preservation (a minimal sketch follows this list)
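
As a rough illustration of the chunking bullet above, the sketch below implements fixed-size chunking with overlap in plain Python. The word-based tokenization, 200-word chunk size, and 40-word overlap are illustrative assumptions, not a prescribed configuration; production pipelines would more likely use token- or structure-aware splitters from frameworks such as LangChain or LlamaIndex.

```python
# Minimal fixed-size chunking with overlap (parameters are illustrative).
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks of chunk_size words, overlapping by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # final window reached the end of the document
    return chunks

# Overlap preserves context that would otherwise be cut at chunk boundaries.
print(len(chunk_text("word " * 500)))  # 3 chunks for a 500-word document
```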

Vector Database Management & Optimization

  • Design and implement vector storage solutions using modern vector databases (Pinecone, Milvus, Weaviate, Chroma, Qdrant) for efficient similarity search and retrieval

  • Optimize embedding generation workflows using sentence transformers and embedding APIs, ensuring consistent vector representations across document collections

  • Implement hybrid search capabilities combining vector similarity with traditional keyword search for enhanced retrieval accuracy (illustrated in the toy sketch after this list)

  • Monitor and tune vector database performance including indexing strategies, query optimization, and resource utilization
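
As a toy illustration of the hybrid search bullet above, the sketch below blends a dense cosine-similarity score with a simple keyword-overlap score. The 0.7/0.3 weighting and the overlap-based keyword score are placeholder assumptions; a production system would typically pair BM25 (e.g., via Elasticsearch or OpenSearch) with a vector database's native hybrid query.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms present in the document (a crude stand-in for BM25)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.7):
    """Rank documents by a weighted blend of dense and keyword scores."""
    scored = [
        (doc, alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, doc))
        for doc, vec in zip(docs, doc_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```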

Data Quality & Enrichment Systems

  • Develop comprehensive data validation frameworks ensuring high-quality input for AI systems, including schema validation, content quality checks, and automated error detection

  • Implement sophisticated deduplication algorithms at both document and chunk levels to eliminate redundant information and improve retrieval efficiency (see the exact-match sketch after this list)

  • Build metadata extraction and enrichment pipelines that automatically generate document-level, content-based, structural, and contextual metadata for enhanced searchability

  • Create data lineage tracking systems ensuring complete traceability from raw sources to vector storage for compliance and debugging
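
As a minimal sketch of the chunk-level deduplication bullet above, the snippet below performs exact-match dedup via content hashing. The whitespace and case normalization is an illustrative choice; near-duplicate detection would require techniques such as MinHash, SimHash, or embedding-similarity thresholds.

```python
import hashlib

def normalize(chunk: str) -> str:
    """Lowercase and collapse whitespace so trivially different chunks hash alike."""
    return " ".join(chunk.lower().split())

def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized chunk."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

print(dedupe_chunks(["Same  text", "same text", "other text"]))
# ['Same  text', 'other text']
```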

Production Pipeline Operations & Monitoring

  • Build and maintain production-grade data pipelines with 99.9%+ uptime requirements, implementing robust error handling, retry mechanisms, and graceful degradation

  • Develop comprehensive monitoring and alerting systems for data quality, pipeline performance, processing latency, and system health metrics

  • Implement incremental processing capabilities that efficiently handle new data without requiring full reprocessing of existing document collections (a minimal sketch follows this list)

  • Create automated testing frameworks for data pipeline validation, including unit tests, integration tests, and end-to-end workflow verification
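
One common pattern behind the incremental-processing bullet above is to track content hashes between runs and reprocess only what changed. The sketch below uses a local JSON manifest purely for illustration; the file name and storage choice are assumptions, and a production pipeline would more likely persist this state in a database or metadata store.

```python
import hashlib
import json
import pathlib

MANIFEST = pathlib.Path("processed_manifest.json")  # illustrative state location

def file_hash(path: pathlib.Path) -> str:
    """Content hash of a file, used to detect changes between runs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def incremental_batch(source_dir: str) -> list[pathlib.Path]:
    """Return only files that are new or changed since the previous run."""
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    to_process = []
    for path in sorted(pathlib.Path(source_dir).rglob("*")):
        if not path.is_file():
            continue
        digest = file_hash(path)
        if manifest.get(str(path)) != digest:
            to_process.append(path)
            manifest[str(path)] = digest
    MANIFEST.write_text(json.dumps(manifest, indent=2))
    return to_process
```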

Cross-Functional Collaboration & Technical Leadership

  • Partner closely with AI/ML engineers, product managers, and business stakeholders to understand data requirements and optimize pipeline performance for specific use cases

  • Collaborate with infrastructure teams to ensure scalable, cost-effective deployment of data processing workflows across cloud platforms (AWS, Azure, GCP)

  • Provide technical guidance on data architecture decisions, tool selection, and best practices for RAG system implementation

  • Document technical specifications, data schemas, and operational procedures for knowledge sharing and team onboarding

Requirements

Education & Experience

  • Bachelor's degree in Computer Science, Data Engineering, Software Engineering, or related technical field

  • 3+ years of experience in data engineering, data pipeline development, or related roles with production system responsibility

  • 1+ years of hands-on experience with AI/ML systems, particularly RAG implementations, vector databases, or document processing workflows

  • Proven track record of building and maintaining large-scale data processing systems handling millions of documents or records

Core Data Engineering Expertise

  • Programming Proficiency: Expert-level Python skills with experience in data processing libraries (Pandas, NumPy, Scikit-learn), plus working knowledge of SQL, Java, or Scala

  • Data Processing Frameworks: Hands-on experience with Apache Spark, Apache Airflow, Apache Kafka, or similar distributed processing and orchestration tools

  • Database Technologies: Strong experience with both relational databases (PostgreSQL, MySQL) and NoSQL systems (MongoDB, Cassandra, Elasticsearch)

  • Cloud Platform Expertise: Practical experience with at least one major cloud platform (AWS, Azure, GCP) including data services, storage solutions, and compute resources

AI/RAG-Specific Technical Skills

  • Vector Database Experience: Hands-on implementation experience with vector databases such as Pinecone, Milvus, Weaviate, Chroma, or Qdrant

  • Document Processing: Proven experience with document parsing libraries and tools including PyPDF2, unstructured, BeautifulSoup, lxml, and OCR technologies

  • RAG Framework Knowledge: Familiarity with RAG development frameworks such as LangChain, LlamaIndex, or custom pipeline implementations

  • Embedding Technologies: Understanding of sentence transformers, embedding models, and vector similarity concepts for semantic search applications (see the short example below)
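
As a brief example of the embedding skills above, the snippet below encodes a toy corpus with the sentence-transformers library and ranks it against a query by cosine similarity. The model name is an illustrative choice, not a recommendation tied to this role.

```python
# Encode a toy corpus and rank it against a query by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # compact general-purpose encoder
corpus = ["The cat sat on the mat.", "Quarterly revenue grew 12%."]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("How did sales perform?", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)  # 1 x len(corpus) matrix
print(corpus[scores.argmax().item()])  # expected: the revenue sentence
```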

Infrastructure & DevOps Skills

  • Containerization: Experience with Docker and Kubernetes for scalable application deployment and orchestration

  • CI/CD Pipelines: Proficiency with automated testing, deployment, and monitoring using tools like GitHub Actions, Jenkins, or similar platforms

  • Infrastructure as Code: Experience with Terraform, CloudFormation, or similar tools for reproducible infrastructure management

  • Monitoring & Observability: Hands-on experience with logging, monitoring, and alerting systems for production data pipelines

Preferred Qualifications

Advanced Technical Experience

  • Master's degree in Computer Science, Data Science, or related field with focus on distributed systems or machine learning

  • 5+ years of data engineering experience with at least 2 years specifically in AI/ML data pipeline development

  • Experience with real-time data processing and streaming architectures for dynamic RAG systems

  • Knowledge of multimodal data processing including text, images, audio, and video content for comprehensive RAG implementations

Specialized Domain Knowledge

  • Experience with enterprise data governance frameworks, data privacy regulations (GDPR, HIPAA), and compliance requirements

  • Background in natural language processing (NLP) concepts, text preprocessing, and linguistic analysis techniques

  • Familiarity with MLOps practices including model versioning, experiment tracking, and automated model deployment pipelines

  • Experience with federated learning or distributed AI systems across multiple data sources and organizations

Industry & Leadership Experience

  • Previous experience in AI-first companies, technology consulting, or enterprise AI transformation initiatives

  • Track record of mentoring junior engineers and contributing to technical decision-making processes

  • Experience presenting technical concepts to non-technical stakeholders and translating business requirements into technical specifications

  • Contributions to open-source projects related to data engineering, AI/ML, or RAG system development

Success Metrics & Performance Indicators

Technical Excellence

  • Data Quality: Achieve 99%+ accuracy in document processing and content extraction across diverse file formats

  • Pipeline Reliability: Maintain 99.9%+ uptime for critical data processing workflows with minimal manual intervention

  • Processing Performance: Optimize pipeline throughput to handle increasing data volumes while maintaining sub-second retrieval times

  • Cost Efficiency: Implement resource optimization strategies that reduce cloud infrastructure costs by 20%+ while maintaining performance

Business Impact & Innovation

  • RAG System Performance: Contribute to measurable improvements in retrieval accuracy, relevance scores, and user satisfaction metrics

  • Time to Market: Reduce time required to onboard new data sources from weeks to days through automated pipeline development

  • Scalability Achievement: Successfully scale data processing capabilities to support 10x growth in document volume and user queries

  • Knowledge Sharing: Create reusable frameworks, documentation, and best practices that improve team productivity by 25%+

Tools & Technologies

Data Processing & Pipeline Development

  • Programming Languages: Python (primary), SQL (expert), Java/Scala (working knowledge)

  • Data Processing: Apache Spark, Apache Airflow, Apache Kafka, Pandas, NumPy

  • Workflow Orchestration: Airflow, Prefect, Dagster, or custom scheduling systems

  • Data Validation: Great Expectations, Deequ, custom validation frameworks

AI/ML & Vector Technologies

  • Vector Databases: Pinecone, Milvus, Weaviate, Chroma, Qdrant, FAISS

  • Document Processing: PyPDF2, unstructured, BeautifulSoup, lxml, pdfplumber

  • OCR & Content Extraction: Tesseract, Amazon Textract, Azure AI Vision, Google Cloud Vision API

  • RAG Frameworks: LangChain, LlamaIndex, Haystack, custom implementations

Cloud & Infrastructure

  • Cloud Platforms: AWS (S3, Lambda, ECS, RDS), Azure (Blob Storage, Functions, AKS), GCP (Cloud Storage, Cloud Functions, GKE)

  • Containerization: Docker, Kubernetes, container registries

  • Infrastructure as Code: Terraform, CloudFormation, Pulumi

  • Monitoring: Prometheus, Grafana, CloudWatch, Azure Monitor, Google Cloud Monitoring

Development & Collaboration

  • Version Control: Git, GitHub, GitLab with branching strategies and code review processes

  • CI/CD: GitHub Actions, Jenkins, Azure DevOps, GitLab CI

  • Documentation: Confluence, Notion, Sphinx, MkDocs

  • Communication: Slack, Microsoft Teams, Jira, Linear

Category

Engineering

Salary

170k - 280k/year

Posted

4 months ago

Location

United States (Remote)
