shauray singh

computers can see, hear and learn. welcome to the future!

twitter github

An idiot admires complexity. A genius admires simplicity. A physicist tries to make it simple. Anything that is gaining complexity, more the idiot will admire it. If you make something more clusterfuck that he can't understand, he will think you are a god because you made it so complicated nobody can understand it.

- Terry Davis


// work

01/2024 - present
ML guy @ ModelsLab
Writing performant kernels for diffusion models, mostly migrated to CuTe DSL now and targeting Blackwell these days. Did experiments with distributed training, distillation, and some SRPO runs. Mostly video models now: making them faster and squeezing better quality out of open-source models. Building tools and launching products.
  • I'm all about PyTorch, so if you're familiar with those ecosystems, we're off to a great start.
  • I've got a love-hate relationship with CUDA and Triton - love the performance, hate the debugging headaches (and the torch profiler only gets you so far). There's a tiny Triton sketch right after this list.
  • I'm always on the lookout for ways to parallelize model inference, so if you've got experience with parallel processing, let's chat.
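To give a flavor (a minimal sketch, not one of the actual ModelsLab kernels): a Triton kernel that fuses the diffusion forward-noising step x_t = sqrt(a)*x0 + sqrt(1-a)*eps into a single elementwise pass instead of three separate PyTorch ops.

```python
import torch
import triton
import triton.language as tl

# toy fused elementwise kernel: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps
@triton.jit
def noising_kernel(x0_ptr, eps_ptr, out_ptr, sqrt_a, sqrt_1ma, n_elements,
                   BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements                      # guard the ragged last block
    x0 = tl.load(x0_ptr + offs, mask=mask)
    eps = tl.load(eps_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, sqrt_a * x0 + sqrt_1ma * eps, mask=mask)

def add_noise(x0: torch.Tensor, eps: torch.Tensor, alpha: float, BLOCK: int = 1024):
    out = torch.empty_like(x0)
    n = x0.numel()
    grid = (triton.cdiv(n, BLOCK),)
    # sqrt is computed on the host so the kernel only does a multiply-add per element
    noising_kernel[grid](x0, eps, out, alpha ** 0.5, (1.0 - alpha) ** 0.5, n, BLOCK=BLOCK)
    return out
```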

updates on twitter

01/2023 - 04/2023
Senior Staff @ Zummit InfoLabs
Helped design client companies' ML pipelines for faster deployment and improved access times.
10/2022 - 01/2023
Jr. Data Scientist @ Zummit InfoLabs
I led a team of 5 interns and made significant progress on demand prediction: it started with statistical methods and was then migrated to graph neural networks. Learned a lot about GNNs and how spatial-temporal graphs can be faster and easier to work with than some traditional methods.
08/2020 - 10/2024
B.Tech @ Manipal University Jaipur
Major in Data Science, minor in Finance. This is where I got really deep into deep learning by browsing the internet and attending random lectures on youtube in my dorm room.

// Blackwell Optims for WAN2.2

This is what I'm working on in my free time right now: mostly gathering free lunches for optims and writing some CuTe DSL kernels. Currently adapting Tri Dao's FA4 for WAN2.2 - since I don't need causal masking or KV-cache optimizations, I'm stripping it down to the bare bones and seeing what %SOL I can achieve with the knowledge I have. I also post a lot of profiles related to these optims on X.
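For the non-kernel folks, %SOL here just means measured throughput as a fraction of the hardware roofline. The arithmetic is nothing fancy; a back-of-the-envelope sketch (the shapes and the peak-TFLOPS number below are placeholders, not real WAN2.2 or Blackwell figures):

```python
def attn_fwd_flops(batch, heads, seqlen, head_dim):
    # plain non-causal forward attention: Q @ K^T and P @ V,
    # each costs 2 * S * S * D FLOPs per head
    return 2 * 2 * batch * heads * seqlen * seqlen * head_dim

def pct_sol(measured_ms, flops, peak_tflops):
    achieved_tflops = flops / (measured_ms * 1e-3) / 1e12
    return 100.0 * achieved_tflops / peak_tflops

# hypothetical numbers, just to show the arithmetic
flops = attn_fwd_flops(batch=1, heads=40, seqlen=32768, head_dim=128)
print(f"{pct_sol(measured_ms=50.0, flops=flops, peak_tflops=1000.0):.1f}% of SOL")
```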

// open source

I have a deep, deep love for open source. It's the lifeblood of the tech world, and I'm all in! I was a regular contributor over at Hugging Face, rolling up my sleeves and getting things done - from Transformers to tokenizers, in the trenches, making things better. Sharing some of the wisdom through blogs and some low-quality tweets! So remember, open source is where it's at.

// project graveyard

GRAT-X for FLUX - Unofficial implementation of GRAT-X for FLUX.1-dev. It's a criss-cross attention mechanism that should be fast, at least on huge images (a rough sketch of the idea follows this list).
Film Grain - some post-processing nodes for videos and images; a simple repo transferring Comfy post-processing nodes into plain Python functions.
Astra Video - Random experiments involving video models, mainly an interpolation experiment with IP-Adapters to generate raw 16 fps video, plus some more multi-frame interpolation stuff.
Distil V0.1 - Random experiments involving flow- and diffusion-based models; most notably it has code for noise injection, DPM adapted to flow matching, and a bunch more.
Regional Prompting SD3.5 - Ported SD3.5 to work with Training-free Regional Prompting for Diffusion Transformers. Results aren't that crazy, but the idea was pretty cool, at least at the time.
Turing - was supposed to be just a simple library built on top of numpy that would calculate gradients for your neural network, with some additional features to help during development. BUT since then my vision for Turing got a little bigger, and now Turing has a C backend for tensor support with custom matmul accelerators, AVX-512 for multi-core CPUs, and a CUDA backend. Basically it does what PyTorch does, except PyTorch is way, way more complex - I wanted something less complex but still powerful enough! (A toy autograd sketch follows this list.)
OnlyGans - currently under development and NSFW. OnlyGans uses a fleet of SOTA generative models to generate naked individuals, working something like thispersondoesnotexist.com. It uses some version of StyleGAN to generate images and may support video in the future. Support this noble cause by contributing to the repo and make this world a better place to live!
Trachea - transforms your speech into text. It uses a simple convolutional model on mel spectrograms generated by a custom library: given an audio sample, it creates a mel spectrogram, and the pixel data gets squished and smashed by the neural network into a text representation of that sample. It was a simple project for a deep learning class I took in college.
Context Tree Weighting - again a college project, in which I implemented different compression algorithms as a baseline and then tried implementing Context Tree Weighting, which was not easy. I would like to get back to it someday; maybe then I'll have a better understanding of how compression works!
Calib-Challenge - a challenge organized by Comma AI. The repo contains the code for predicting the yaw and pitch of a moving car from dashcam video. The project kept going after the contest too, and now it contains some very basic code for a self-driving car - just the software part; there are no bound conditions yet, and it just gives out raw data that has to be processed before going to the car's computer. Pedal is the operating system for the same, and it's UNIX-ish. I wanna drive a fully autonomous car with this code someday (with some improvements, of course!)
Quadruple Inverted Pendulum - my first attempt at writing a research paper, which failed (OF COURSE). It was pretty straightforward actually: four pendulums attached to one another, and the task was to balance each one on top of the previous pendulum without falling. Turns out the Lagrangian for this setup is conceptually straightforward but carries a lot of terms (the general form is written out after this list), and just simulating the pendulums takes hundreds of thousands of calculations every second. Then I found something very useful called GameGAN, but there was no open-source code for it, so I DROPPED THE IDEA!
misc - I built a lot of other random stuff over time. Go check it out on my GitHub, and if you wanna make a change in the world and, as George Hotz says, "wanna win over nature", find a project on GitHub (or maybe I can help you find one): visit https://github.com/shauray8, fork some repositories, and start adding stuff!
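A couple of sketches for the entries above. First, my reading of the criss-cross idea behind GRAT-X (a deliberately naive version to show the access pattern, not the repo's code): every query on the 2D latent grid attends only to the tokens in its own row and column, which cuts the attention cost from O((HW)^2) to roughly O(HW * (H + W)).

```python
import torch

def criss_cross_attention(q, k, v):
    # q, k, v: (B, heads, H, W, D) - tokens laid out on a 2D grid
    B, n, H, W, D = q.shape
    scale = D ** -0.5

    # row branch: each query attends to the W tokens in its own row
    row_logits = torch.einsum("bnhwd,bnhvd->bnhwv", q, k) * scale      # (B,n,H,W,W)
    row_out = torch.einsum("bnhwv,bnhvd->bnhwd", row_logits.softmax(-1), v)

    # column branch: each query attends to the H tokens in its own column
    col_logits = torch.einsum("bnhwd,bngwd->bnhwg", q, k) * scale      # (B,n,H,W,H)
    col_out = torch.einsum("bnhwg,bngwd->bnhwd", col_logits.softmax(-1), v)

    # naive merge of the two branches; a real implementation would fuse
    # row and column into a single softmax
    return 0.5 * (row_out + col_out)

x = torch.randn(1, 4, 32, 32, 64)                 # tiny grid, just to check shapes
print(criss_cross_attention(x, x, x).shape)       # torch.Size([1, 4, 32, 32, 64])
```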
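Second, the flavor of what the original numpy-era Turing was about - a scalar reverse-mode autograd toy in plain Python (the repo works on tensors; this just shows the gradient bookkeeping):

```python
import math

class Value:
    # toy autograd node: stores data, grad, and how to backprop through it
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = backward
        return out

    def backward(self):
        # topological order, then chain rule from the output back
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# d/dx of tanh(x * w) at x = 2, w = 0.5
x, w = Value(2.0), Value(0.5)
y = (x * w).tanh()
y.backward()
print(x.grad)   # 0.5 * (1 - tanh(1)^2) ≈ 0.21
```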
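And for the pendulum entry, the "lots of calculations" part in equation form: the textbook Lagrangian for an n-link planar pendulum (n = 4 in my case), with link lengths l_i, bob masses m_i, and angles theta_i measured from the vertical. The nested sums in the kinetic term are what make the equations of motion explode.

```latex
% L = T - V for an n-link planar pendulum
T = \frac{1}{2} \sum_{i=1}^{n} m_i
    \sum_{j=1}^{i} \sum_{k=1}^{i}
    l_j l_k \, \dot{\theta}_j \dot{\theta}_k \cos(\theta_j - \theta_k),
\qquad
V = -g \sum_{i=1}^{n} m_i \sum_{j=1}^{i} l_j \cos\theta_j,
\qquad
\frac{d}{dt}\frac{\partial L}{\partial \dot{\theta}_m}
  - \frac{\partial L}{\partial \theta_m} = 0 .
```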

// misc unsorted brain dump


we must know, we will know

stem stuff. requires attention span.

Refer to Twitter for microblogs; this is for full-length technical deep dives. Writing on STEM-related topics and experimenting with the Feynman technique.