Lead Data Engineer
LGND AI, Inc.
San Francisco, California
Job Details
Full-time
Full Job Description
About LGND
LGND is an early-stage startup revolutionizing geospatial AI infrastructure. We bridge the gap between large Earth observation models and specific application developers, enabling intuitive interaction with geospatial data. Our core mission is to empower decision-makers with rapid insights from vast, complex datasets. As part of our small, dynamic team, you will play a foundational role in building tools that have never existed before.
Role Summary
We are seeking a Lead Data Engineer to design, build, and scale our inference pipeline for geospatial embeddings. This pipeline is the backbone of LGND’s technological product, integrating with a point-and-click web application to generate embeddings for geographic areas of interest based on user-defined parameters. These embeddings will populate a custom vector database designed for massive scale and speed.
The ideal candidate is a seasoned engineer who has experience with production-grade data pipelines, thrives under uncertainty, and is eager to collaborate across engineering, DevOps, and science disciplines. AI and geospatial experience are not required if you are willing to learn fast with our help. Over time, this role will evolve into an engineering lead position, overseeing all technological components while focusing on engineering excellence.
This role is remote. We have team members in San Francisco, Philadelphia, and Copenhagen.
Key Responsibilities
- Build the Inference Pipeline:
- Develop a scalable, efficient pipeline to generate geospatial embeddings based on user input, integrating parameters such as geographic area, model type, time range, tiling strategy, and imagery source.
- Balance pre-processed tokens (e.g., cloud-free Sentinel imagery) with on-the-fly inference for optimal performance.
- Ensure the pipeline supports billions of embeddings at scale and leverages advanced compute capabilities for fast inference, mostly on commercial clouds but also on local resources.
- Integration and Collaboration:
- Work closely with front-end engineers to ensure seamless integration of the pipeline into a user-friendly web application.
- Collaborate with leadership to determine which components of the pipeline and storage system should remain proprietary versus open-source.
- Partner with external groups like AWS and Asterisk Labs for open-source contributions and technical integrations.
- Scalability and Professionalism:
- Design a pipeline that other high-level data engineers can immediately inherit and build upon.
- Move large volumes of data reliably, focusing on scale, extensibility, and maintainability.
- Ensure compliance with best practices in data engineering, DevOps, and MLOps.
- Enhance Existing Projects:
- Build upon existing foundational work to increase pipeline speed, scale, and extensibility. Key repositories include:
- embeddings-worker: A Python module that creates vector embeddings of satellite images using the Clay Foundation Model. The system splits geographic regions into smaller chips, processes them in a distributed manner, and manages status tracking in a database.
- embeddings-api: A REST API module that manages the vector database and orchestrates embedding generation tasks. It includes robust endpoints for scheduling geographic regions for processing, retrieving task status, and searching for similar vectors.
- Future Leadership:
- Serve as the lead for the inference pipeline, one of four core technological components at LGND (inference pipeline, fine-tuning and retrieval algorithms, vector search database, and SDK).
- Optionally grow into an engineering manager role, overseeing future hires and cross-functional development efforts.
Scope of Work: First Two Months
- Increase the Speed and Scale of the Pipeline:
- Optimize the inference pipeline to efficiently handle the generation of embeddings at massive scale.
- Focus on performance improvements to support billions of embeddings and reduce inference runtime.
- Tokenize Source Imagery:
- Develop a process to "tokenize" source imagery for a given geographic region and time range.
- Produce image chips according to the large Earth observation model architecture.
- Store these image chips in Amazon S3 for easy recall during subsequent inference runs.
- Run Model Inference:
- Implement the pipeline to run inference on a couple of existing, pre-trained models.
- Output the resulting embeddings and store them in a scalable, performant vector search database.
- Collaborate with external partners, such as AWS, to ensure pipeline compatibility with the vector database infrastructure.
- Nice-to-Have Feature:
- Develop functionality to process source imagery into mosaics to address cloud cover and other image quality issues, improving the quality of inputs for inference.
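As an illustration of the tokenization step described above, here is a minimal sketch of splitting an area of interest into chips and giving each chip a stable storage key. It is not LGND's implementation: the `Chip` structure, the fixed chip size in degrees, and the `chips/<source>/<date>/<row>_<col>.tif` key layout are all assumptions made for the example, and the actual upload to Amazon S3 is left out.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chip:
    """One tile of the area of interest (hypothetical structure)."""
    col: int
    row: int
    west: float
    south: float
    east: float
    north: float


def tile_bbox(west, south, east, north, chip_deg):
    """Split a lon/lat bounding box into a regular grid of chips.

    Chips on the east and north edges are clipped to the bounding box.
    """
    if east <= west or north <= south or chip_deg <= 0:
        raise ValueError("expected west < east, south < north, chip_deg > 0")
    n_cols = math.ceil((east - west) / chip_deg)
    n_rows = math.ceil((north - south) / chip_deg)
    chips = []
    for row in range(n_rows):
        for col in range(n_cols):
            w = west + col * chip_deg
            s = south + row * chip_deg
            chips.append(
                Chip(col, row, w, s, min(w + chip_deg, east), min(s + chip_deg, north))
            )
    return chips


def chip_key(source, day, chip):
    """Build a deterministic object key (assumed layout) so a later
    inference run can look the chip up instead of regenerating it."""
    return f"chips/{source}/{day.isoformat()}/{chip.row:05d}_{chip.col:05d}.tif"


chips = tile_bbox(0.0, 0.0, 1.1, 0.5, chip_deg=0.25)
first_key = chip_key("sentinel-2", date(2024, 1, 15), chips[0])
```

Because the key depends only on the imagery source, the date, and the grid position, a repeated request for the same region and time range maps to the same objects, which is what makes stored chips reusable across inference runs.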
Scope of Work: First Two Months, Expanded
- Operationalize the CLIP-based Retrieval Pipeline:
- Implement and optimize a scalable inference pipeline to generate CLIP embeddings (and embeddings from other pre-trained models) for remote sensing imagery.
- Design the system to tokenize source imagery into manageable image chips for specific geographic areas and time ranges. Store these chips efficiently in Amazon S3 for reuse.
- Ensure flexibility to incorporate additional embedding models in the future.
- Experiment with Multi-Modal Retrieval:
- Enable interaction with both image and text queries in a combined retrieval framework using pre-trained vision-language models (e.g., CLIP).
- Implement functionality to combine multiple embeddings (image-to-image and text-to-image similarity) and experiment with methods like WEICOM for modality control (e.g., weighted combinations of embeddings).
- Database and API Design:
- Collaborate with external partners (e.g., AWS) to design a scalable vector search database capable of handling billions of embeddings.
- Develop APIs to allow efficient storage and retrieval of embeddings based on user-defined queries (geographic area, model, time range, and textual context).
- Pre-Processing for Image Quality (Nice-to-Have):
- Develop a feature to process source imagery into cloud-free mosaics, improving image quality for inference and retrieval.
- Performance Optimization:
- Optimize the pipeline for speed, ensuring embeddings can be generated at scale. Explore trade-offs between pre-processed tokens and on-the-fly inference.
- Focus on building a robust, scalable system that reduces latency while maintaining flexibility.
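To make the idea of a weighted combination of image and text similarity concrete, here is a minimal sketch of one generic way to do it. It is a plain linear mix of cosine similarities, not a reproduction of any published method, and the function names and the `text_weight` parameter are hypothetical. In practice the vectors would come from a vision-language model such as CLIP; the two-dimensional vectors here are toy values.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def combined_scores(image_query, text_query, candidates, text_weight):
    """Score each candidate as a weighted mix of its similarity to the
    image query and to the text query. text_weight=0 uses only the image,
    text_weight=1 uses only the text."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    return [
        (1.0 - text_weight) * cosine(image_query, c)
        + text_weight * cosine(text_query, c)
        for c in candidates
    ]


def rank(scores):
    """Return candidate indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data: candidate 0 matches the image query, candidate 1 the text query.
candidates = [[1.0, 0.0], [0.0, 1.0]]
image_only = rank(combined_scores([1.0, 0.0], [0.0, 1.0], candidates, 0.0))
text_only = rank(combined_scores([1.0, 0.0], [0.0, 1.0], candidates, 1.0))
balanced = combined_scores([1.0, 0.0], [0.0, 1.0], candidates, 0.5)
```

Sliding `text_weight` between 0 and 1 moves the ranking from purely image-driven to purely text-driven, which is the kind of modality control the experiments above would explore.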
Requirements
Required Technical Skills:
- Proficiency in Python and familiarity with Docker.
- Expertise in building production-grade data pipelines at scale (10+ years of experience preferred).
- Familiarity with tools and frameworks like:
- Data and geospatial libraries: numpy, pandas, rasterio, geopandas, xarray.
- Machine learning: PyTorch (torch, torchdata, torchvision), timm, einops.
- Cloud integration: boto3 for AWS.
- Database management: SQLAlchemy, GeoAlchemy2, pgvector, psycopg2.
- Experience with inference pipelines, including pre-processing and real-time inference strategies.
Preferred Experience:
- Familiarity with satellite image formats and protocols (e.g., STAC, Cloud Optimized GeoTIFFs, Zarr).
- Experience with AWS infrastructure (bonus, not required).
- Background in MLOps and geospatial AI applications.
Soft Skills:
- Self-led and able to navigate uncertainty.
- Excited by the opportunity to build tools and systems that have never been built before.
- Collaborative, humble, and eager to learn.
Benefits
Cultural Values
- Humility: You value collaboration and learning from others.
- Integrity: You uphold honesty and transparency in your work.
- Effectiveness: You are results-driven, with a focus on building scalable, impactful solutions.
Compensation and Benefits
- Competitive salary based on experience.
- Equity options in a seed-stage startup.
- Flexible work arrangements.
- Opportunity to play a foundational role in shaping LGND’s technological infrastructure.