Supercomputing Engineer (Network)
Company: Etched
Location: San Jose
Posted on: April 1, 2026
Job Description:
About Etched

Etched is building the world’s first AI inference system purpose-built for transformers, delivering over 10x higher performance, and dramatically lower cost and latency, than a B200. With Etched ASICs, you can build products that would be impossible with GPUs, like real-time video generation models and extremely deep and parallel chain-of-thought reasoning agents. Backed by hundreds of millions of dollars from top-tier investors and staffed by leading engineers, Etched is redefining the infrastructure layer for the fastest-growing industry in history.

Job Summary

We are seeking
highly motivated and skilled Supercomputing Engineers (Network) to join our team. This team plays a critical role in developing, qualifying, and optimizing high-performance networking solutions for large-scale inference workloads. As a Pod Software Engineer, you will focus on developing and qualifying software that drives communication among Sohu inference nodes in multi-rack inference clusters. You will collaborate closely with kernel, platform, and telemetry teams to push the boundaries of peer-to-peer RDMA efficiency.

Key Responsibilities

- Design, develop, and implement
RDMA-based networking peering, supporting high-bandwidth, low-latency communication across PCIe nodes within and across racks. This includes work across the operating system, kernel drivers, embedded software, and system software.
- Develop tests that qualify host processors (x86), NICs, TORs, and device network interfaces for high performance.
- Furnish burn-in teams with tests that represent both real-world use cases and workloads for device-to-device networking, as well as extreme-load stress testing.
- Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads.

Representative Projects

- Analyze performance deviations, optimize network stack
configurations, and propose kernel tuning parameters for low-latency, high-bandwidth inference workloads.
- Design and execute automated qualification tests for RDMA NICs and interconnects across various server configurations.
- Identify and root-cause firmware, driver, and hardware issues that impact RDMA performance and reliability.
- Collaborate with ODMs and silicon vendors to validate new RDMA features and enhancements.
- Implement and validate peer RDMA support for GPU-to-GPU and accelerator-to-accelerator communication.
- Modify kernel drivers and user-space libraries to optimize direct memory access between inference pods.
- Profile and benchmark inter-node RDMA latency and bandwidth to improve inference job scaling.
- Optimize NIC and switch configurations to balance throughput, congestion control, and reliability.

You may be
a good fit if you have:

- Proficiency in C/C++.
- Proficiency in at least one scripting language (e.g., Python, Bash, Go).
- Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE.
- Experience with zero-copy networking, RDMA verbs, and memory registration.
- Familiarity with queue pairs, completion queues, and transport types.
- A strong understanding of operating systems (Linux preferred) and server hardware architectures.
- The ability to analyze complex technical problems and provide effective solutions.
- Excellent communication and collaboration skills.
- The ability to work independently and as part of a team.
- Experience with version control systems (e.g., Git).
- Experience reading and interpreting hardware logs.

Strong
candidates will have (nice-to-have qualifications):

- Experience with networking technologies such as NVLink, InfiniBand, and ML pod interconnects.
- Experience with widely deployed top-of-rack switches (Cisco, Juniper, Arista, etc.).
- Knowledge of server virtualization.
- Experience with tracing tools such as perf, eBPF, and ftrace.
- Experience with performance testing and benchmarking tools (gprof, VTune, Wireshark, etc.).
- Familiarity with hardware diagnostic tools and techniques.
- Experience with containerization technologies (e.g., Docker, Kubernetes).
- Experience with CI/CD pipelines.
- Experience with Rust.
- Experience working on GPU or TPU pods, specifically in the networking domain.
- An understanding of the uptime challenges of very large ML deployments.
- Hands-on debugging of complex network topologies, specifically cases of node dropouts/failures, route-arounds, and pod resiliency at large.
- An understanding of the performance implications of pod networking software.

Benefits

- Medical, dental, and
vision packages with generous premium coverage
- $500 per month credit for waiving medical benefits
- Housing subsidy of $2k per month for those living within walking distance of the office
- Relocation support for those moving to San Jose (Santana Row)
- Various wellness benefits covering fitness, mental health, and more
- Daily lunch and dinner in our office

How we’re different

Etched
believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.

We are a fully in-person team in West San Jose, and we greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.