LLM Data Engineer | United States | Fully Remote

Halo Media

Florida


Job Details

Full-time


Full Job Description

We are seeking an experienced AI/LLM Data Engineer to build and maintain the data pipeline for our Generative AI platform. The ideal candidate will be well-versed in the latest Large Language Model (LLM) technologies and have a strong background in data engineering, with a focus on Retrieval-Augmented Generation (RAG) and knowledge-base techniques. This role sits in the AI COE within DX Tech & Digital. As an AI/LLM Data Engineer, you will report to the Director, AI Solutions & Development, who oversees the AI COE.

You will work on highly visible strategic projects, collaborating with cross-functional teams to define requirements and deliver high-quality AI solutions.

The ideal candidate will have a passion for Generative AI and LLMs, with a proven track record of delivering innovative AI applications.

Responsibilities 
• Design, implement, and maintain an end-to-end, multi-stage data pipeline for LLMs, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) data processes
• Identify, evaluate, and integrate diverse data sources and domains to support the Generative AI platform 
• Develop and optimize data processing workflows for chunking, indexing, ingestion, and vectorization for both text and non-text data 
• Benchmark and implement various vector stores, embedding techniques, and retrieval methods 
• Create a flexible pipeline supporting multiple embedding algorithms, vector stores, and search types (e.g., vector search, hybrid search) 
• Implement and maintain auto-tagging systems and data preparation processes for LLMs 
• Develop tools for text and image data crawling, cleaning, and refinement 
• Collaborate with cross-functional teams to ensure data quality and relevance for AI/ML models 
• Work with data lakehouse architectures to optimize data storage and processing
• Integrate and optimize workflows using Snowflake and various vector store technologies 

Requirements

• Master's degree in Computer Science, Data Science, or a related field 
• 3-5 years of work experience in data engineering, preferably in AI/ML contexts 
• Proficiency in Python, JSON, HTTP, and related tools 
• Strong understanding of LLM architectures, training processes, and data requirements 
• Experience with RAG systems, knowledge base construction, and vector databases 
• Familiarity with embedding techniques, similarity search algorithms, and information retrieval concepts 
• Hands-on experience with data cleaning, tagging, and annotation processes (both manual and automated) 
• Knowledge of data crawling techniques and associated ethical considerations 
• Strong problem-solving skills and ability to work in a fast-paced, innovative environment 
• Familiarity with Snowflake and its integration in AI/ML pipelines 
• Experience with various vector store technologies and their applications in AI 
• Understanding of data lakehouse concepts and architectures 
• Excellent communication, collaboration, and problem-solving skills. 
• Ability to translate business needs into technical solutions. 
• Passion for innovation and a commitment to ethical AI development. 
• Experience building LLM pipelines using frameworks such as LangChain, LlamaIndex, Semantic Kernel, and OpenAI functions
• Familiarity with LLM sampling parameters such as temperature, top-k, and repeat penalty, and with metrics and methodologies for evaluating LLM outputs

Preferred Skills

  • Experience with popular LLM/RAG frameworks
  • Familiarity with distributed computing platforms (e.g., Apache Spark, Dask) 
  • Knowledge of data versioning and experiment tracking tools 
  • Experience with cloud platforms (AWS, GCP, or Azure) for large-scale data processing 
  • Understanding of data privacy and security best practices 
  • Practical experience implementing data lakehouse solutions 
  • Proficiency in optimizing queries and data processes in Snowflake or Databricks
  • Hands-on experience with different vector store technologies

Benefits

  • US employee benefits package
