Stas Bekman
stas
AI & ML interests
Toolmaker. Software creator, optimizer and harmonizer.
Makes things work and fly at Snowflake AI Research
Training LLM/RAG/Generative AI/Machine Learning/Scalability
Recent Activity
replied to their post about 13 hours ago
After many months of intense work, the Snowflake AI Research team is happy to present to you the new open source project: Arctic RL
https://snowflake.com/en/blog/engineering/arctic-rl-open-source-backend/
- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%
updated a model about 14 hours ago
stas/the-art-of-debugging-book
published a model about 14 hours ago
stas/the-art-of-debugging-book
Organizations
replied to their post about 13 hours ago
replied to their post about 14 hours ago
Thank you for the kind words, Dipankar
The lion's share of the speedup comes from prompt deduplication during generation and training.
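As a rough illustration of why prompt deduplication can help (a toy sketch, not Arctic RL's actual implementation): when many sampled responses share one prompt, the prompt tokens only need to be processed once per group. The function name and the token lists below are hypothetical.

```python
from collections import defaultdict

def token_counts(samples):
    """samples: list of (prompt_tokens, response_tokens) pairs.

    Returns (naive, deduplicated) counts of tokens to process.
    """
    # Naive: every sample carries its own copy of the prompt.
    naive = sum(len(prompt) + len(response) for prompt, response in samples)

    # Deduplicated: group responses by prompt, count each prompt once.
    groups = defaultdict(list)
    for prompt, response in samples:
        groups[tuple(prompt)].append(response)
    dedup = sum(
        len(prompt) + sum(len(r) for r in responses)
        for prompt, responses in groups.items()
    )
    return naive, dedup

# Hypothetical rollout: 8 responses of 10 tokens for one 100-token prompt
prompt = list(range(100))
samples = [(prompt, [0] * 10) for _ in range(8)]
naive, dedup = token_counts(samples)
print(f"naive={naive} deduplicated={dedup}")
```

The longer the shared prompt is relative to the responses, the larger the saving in this toy model.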
posted an update 1 day ago
Post
99
In parallel we announce a new open source repo:
https://github.com/Snowflake-AI-Research/Arctic-Platform
This is the framework for very fast RL (with other optimizations to be rolled into it in the future).
It currently has all the code you need to use or integrate Arctic RL into RL frameworks, with SkyRL and Verl available and more framework integrations coming.
Please kindly spread the word! Thank you!
posted an update 1 day ago
Post
1097
After many months of intense work, the Snowflake AI Research team is happy to present to you the new open source project: Arctic RL
https://snowflake.com/en/blog/engineering/arctic-rl-open-source-backend/
- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%
posted an update 14 days ago
Post
136
PSA for DeepSpeed users - a long-standing, critical precision-related bug has been identified and fixed in https://github.com/deepspeedai/DeepSpeed/pull/8066 and a new release has been made.
The issue was that mixed precision mode downcast buffers that had to stay in fp32, massively impacting correctness wherever large static buffers are involved - e.g. RoPE in Qwen3 models when using long sequence lengths of 32K+.
Hopefully this fix brings DeepSpeed close to parity with FSDP2, which has been an issue for a long time.
You can still have the old behavior but you'd now need to manually configure it - by default the model's buffers will now remain in the original precision.
Please install deepspeed==0.19.2 which will do the right thing.
Thanks to Tunji Ruwase and Claude Opus 4.8 via Cursor for identifying and fixing the problem.
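To see why a downcast buffer hurts at long sequence lengths, here is a small pure-Python sketch that emulates bfloat16 rounding on RoPE inverse frequencies. The choice of bf16, the head dimension (128) and the RoPE base (1e6) are assumptions made for illustration, and the helper names are hypothetical; this is not DeepSpeed code.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def rope_inv_freq(head_dim: int = 128, base: float = 1e6) -> list[float]:
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def max_angle_error(position: int) -> float:
    """Largest rotation-angle error (radians) if inv_freq is stored in bf16."""
    return max(abs(position * f - position * to_bf16(f)) for f in rope_inv_freq())

# The error grows linearly with the token position.
for pos in (512, 4096, 32768):
    print(pos, round(max_angle_error(pos), 3))
```

Since the rotation angle is position times frequency, any rounding error in the stored frequency is multiplied by the position, which is why short sequences can look fine while long ones do not.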
posted an update 4 months ago
Post
255
Good news! Ulysses Sequence Parallelism from the Snowflake AI Research and DeepSpeed teams has been integrated into HuggingFace Trainer, Accelerate and TRL.
For extensive details please see this writeup:
https://huggingface.co/blog/ulysses-sp
Thanks a lot to Kashif Rasul for helping make it happen, and to the others on the HF team who helped with the integration.
posted an update over 1 year ago
Post
2390
Do you want ArcticTraining at @SnowflakeDB to add the ability to post-train DeepSeek V3/R1 models with DPO using just a few GPU nodes?
Please vote here and tell others about it: https://github.com/snowflakedb/ArcticTraining/discussions/58
ArcticTraining is an open-source, easy-to-use post-training framework for NVIDIA GPUs built on top of DeepSpeed.
posted an update over 1 year ago
Post
1269
If you remember my work on MAMF - to find the realistically achievable TFLOPS ceiling - the Intel AI team has shared their measurements and they scored ...
an incredible 99.4% TFLOPS efficiency for Gaudi 2!
That's quite amazing! Your ROI on these accelerators will be very high.
The full table is here: https://github.com/stas00/ml-engineering/tree/master/compute/accelerator#maximum-achievable-matmul-flops-comparison-table
As we have seen, the competitors' achievable efficiency gets worse with each new generation, so I'm looking forward to seeing whether Gaudi 3 will keep the bar high!
Thanks to Avi Rubin, Lakshman Chari, Imtiaz Sajwani, Ramy J and Zhiqi Tao for helping to get these numbers to the community.
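For reference, an efficiency figure like this is achieved matmul throughput divided by the theoretical peak. A minimal sketch, with hypothetical function names and made-up numbers in the usage lines (they are not taken from the table):

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOPS of an (m x k) @ (k x n) matmul, counted as 2*m*n*k operations."""
    return 2 * m * n * k / seconds / 1e12

def efficiency_pct(achieved_tflops: float, peak_tflops: float) -> float:
    """Achieved throughput as a percentage of the theoretical peak."""
    return 100.0 * achieved_tflops / peak_tflops

# Hypothetical measurement: a 4096 x 4096 x 4096 matmul timed at 0.5 ms
# on an accelerator with a hypothetical 400 TFLOPS peak.
achieved = matmul_tflops(4096, 4096, 4096, 0.0005)
print(f"{achieved:.1f} TFLOPS, {efficiency_pct(achieved, 400.0):.1f}% of peak")
```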
posted an update almost 2 years ago
Post
1200
The Universal Checkpointing paper is out! https://arxiv.org/abs/2406.18820
If you remember the BigScience BLOOM-176B training, Tunji Ruwase and I co-invented this technology for Megatron-Deepspeed to make it possible to quickly scale the node topology up and down while continuing training.
Since then the DeepSpeed team has continued improving on it, and it has now been fully integrated into DeepSpeed.
The blog post is here: https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ucp/README.md
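The core idea can be pictured as consolidating per-rank shards into one topology-independent tensor and then re-slicing it for the new layout. The toy sketch below uses plain Python lists and hypothetical helper names; it is not DeepSpeed's checkpoint format or API.

```python
def consolidate(shards: list[list[float]]) -> list[float]:
    """Merge per-rank shards (given in rank order) into one flat parameter."""
    return [x for shard in shards for x in shard]

def reshard(flat: list[float], world_size: int) -> list[list[float]]:
    """Split a flat parameter into near-equal contiguous shards."""
    base, extra = divmod(len(flat), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)  # first ranks take the remainder
        shards.append(flat[start:start + size])
        start += size
    return shards

# Saved on 4 ranks, resumed on 3 ranks
saved = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
resumed = reshard(consolidate(saved), world_size=3)
print(resumed)
```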
posted an update over 2 years ago
Post
A combined effort from the IBM + PyTorch teams achieved incredible training performance with ZeRO/FSDP, on par with 3D parallelism on H100s, while having just an 800Gbps inter-node connection.
This is because they achieved almost full overlap between comms and compute and introduced a novel selective activation recomputation method, which recalculates only large but inexpensive activations.
Check out their post here: https://pytorch.org/blog/maximizing-training/
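One way to picture "large but inexpensive": rank activations by memory saved per unit of recompute time and pick the best candidates first. This is a toy heuristic with made-up layer names and numbers, an assumption for illustration only and not the method from the linked post.

```python
from dataclasses import dataclass

@dataclass
class Activation:
    name: str
    size_mb: float       # memory freed if the activation is not stored
    recompute_ms: float  # extra time to recompute it in the backward pass

def pick_for_recompute(acts: list[Activation], budget_ms: float) -> list[str]:
    """Greedily choose activations with the best MB-saved-per-ms ratio."""
    chosen, spent = [], 0.0
    ranked = sorted(acts, key=lambda a: a.size_mb / a.recompute_ms, reverse=True)
    for act in ranked:
        if spent + act.recompute_ms <= budget_ms:
            chosen.append(act.name)
            spent += act.recompute_ms
    return chosen

# Hypothetical per-layer numbers
acts = [
    Activation("gelu_out", size_mb=512, recompute_ms=0.2),
    Activation("attn_scores", size_mb=1024, recompute_ms=4.0),
    Activation("layernorm_out", size_mb=128, recompute_ms=0.1),
]
print(pick_for_recompute(acts, budget_ms=1.0))
```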
posted an update over 2 years ago
Post
If you're trying to run MoE Mixtral-8x7b under DeepSpeed w/ HF Transformers it's likely to hang on the first forward.
The solution is here https://github.com/microsoft/DeepSpeed/pull/4966?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US#issuecomment-1989671378
and you need deepspeed>=0.13.0
Thanks to Masahiro Tanaka for the fix.
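A minimal sketch, assuming the linked fix is DeepSpeed's `set_z3_leaf_modules` helper, which tells ZeRO stage 3 to treat each MoE block as a single unit. The wrapper and version-check function names are hypothetical, and the heavy imports are kept inside the function so the snippet loads without DeepSpeed installed.

```python
import re

def version_at_least(installed: str, required: str = "0.13.0") -> bool:
    """Compare the leading major.minor.patch numbers of two version strings."""
    def parse(v: str) -> tuple[int, ...]:
        return tuple(int(p) for p in re.match(r"(\d+)\.(\d+)\.(\d+)", v).groups())
    return parse(installed) >= parse(required)

def mark_mixtral_moe_as_leaf(model):
    """Assumed fix: register Mixtral's sparse MoE block as a ZeRO-3 leaf module."""
    import deepspeed
    from deepspeed.utils import set_z3_leaf_modules
    from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

    if not version_at_least(deepspeed.__version__):
        raise RuntimeError("deepspeed>=0.13.0 is required")
    set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
    return model
```

Under this assumption, the call would go right after loading the model and before handing it to DeepSpeed.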
replied to their post over 2 years ago
I pinged Elio to see if he wants to join.
posted an update over 2 years ago
Post
Hear, hear, AMD MI300Xs have started to emerge much sooner than expected.
Here is a 2-part benchmark report on performing BLOOM-176B inference using @MSFTDeepSpeed optimized for AMD MI300X.
1. https://www.evp.cloud/post/diving-deeper-insights-from-our-llm-inference-testing
2. https://www.evp.cloud/post/diving-deeper-insights-from-our-llm-inference-testing-part-2
This was published in response to our BLOOM-176B super-fast inference blog post https://huggingface.co/blog/bloom-inference-pytorch-scripts
Note that these have 192GB of HBM!
The NVIDIA monopoly is strong, but it'll have to start sharing the pie and hopefully drive the costs down at least somewhat.
Thanks to https://www.linkedin.com/in/eliovp for sharing this writeup with me.
p.s. at the PyTorch conference in the fall, the AMD representative said we will see MI300X available to us mortals in Q4-2024/Q1-2025.
replied to their post over 2 years ago
Thank you for the kind words, Jeff!
We are still waiting for BLOOM v2.0 from HF!
posted an update over 2 years ago
Post
"The Case for Co-Designing Model Architectures with Hardware"
This is a long overdue paper that we started discussing back when training BLOOM-176B.
Basically this paper tells you how to design your model's dimensions for optimal training throughput.
Fantastic!
Yours truly contributed the SwiGLU section ;)
https://twitter.com/QuentinAnthon15/status/1752393989813375119
https://arxiv.org/abs/2401.14489
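A small illustration of the kind of dimension check this implies. The multiple of 64 and the 8/3 SwiGLU ratio are common rules of thumb used here as assumptions, not numbers quoted from the paper, and the helper name is hypothetical.

```python
import math

def round_up(value: float, multiple: int = 64) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return math.ceil(value / multiple) * multiple

hidden = 4096
swiglu_ffn = round_up(8 * hidden / 3)  # 8/3 * hidden, rounded up to a multiple of 64
vocab = round_up(50257)                # pad the vocabulary to a multiple of 64
print(swiglu_ffn, vocab)
```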
reacted to clem's post with ❤️ over 2 years ago
Post
Google + Hugging Face + Open-Source AI = 🔥🔥🔥
https://huggingface.co/blog/gcp-partnership
https://finance.yahoo.com/video/google-hugging-face-alliance-spur-173016882.html
https://www.theverge.com/2024/1/25/24050445/google-cloud-hugging-face-ai-developer-access
https://www.bloomberg.com/news/articles/2024-01-25/google-to-team-up-with-startup-hugging-face-to-host-ai-software
https://www.reuters.com/technology/google-cloud-partners-with-hugging-face-attract-ai-developers-2024-01-25/
reacted to freddyaboulton's post with 🤯 over 2 years ago
Post
New in Gradio 4.16.0 - Galleries as Input 🖼️
Now your users can upload multiple images as input to your AI application and view them in a slick gallery!
Attached is a demo of how this new feature can be used in a photomaker-type application: TencentARC/PhotoMaker
Shout out to @abidlabs and @akhaliq, who proposed this feature after seeing some of the workarounds Gradio developers were using in the wild to upload multiple images.
The Gradio team works hard to stay up to date with the latest trends in AI! If there's something missing from the library, file an issue on GitHub! https://github.com/gradio-app/gradio/issues
posted an update over 2 years ago
Post
Do you have a hidden massive storage leak thanks to HF hub models and datasets revisions adding up and not getting automatically deleted?
Here is how to delete all old revisions and keep only main in a few quick steps, with no tedious manual editing.

In terminal A:

$ pip install -U "huggingface_hub[cli]"
$ huggingface-cli delete-cache --disable-tui
File to edit: /tmp/tmpundr7lky.txt
0 revisions selected counting for 0.0. Continue ? (y/N)

Do not answer the prompt and proceed with my instructions.
(note your tmp file will have a different path, so adjust it below)

In terminal B:

$ cp /tmp/tmpundr7lky.txt cache.txt
$ perl -pi -e 's|^#(.*detached.*)|$1|' cache.txt
$ cat cache.txt >> /tmp/tmpundr7lky.txt

The perl one-liner uncommented all lines that had (detached) in them, so that those revisions can be wiped out. Then we appended the result to the tmp file that huggingface-cli expects to be edited.

Now go back to terminal A and hit: N, Y, Y, so it looks like:

0 revisions selected counting for 0.0. Continue ? (y/N) n
89 revisions selected counting for 211.7G. Continue ? (y/N) y
89 revisions selected counting for 211.7G. Confirm deletion ? (Y/n) y

Done.

If you messed up answering the prompts, you still have the cache.txt file, which you can feed again to the new tmp file it'll create when you run huggingface-cli delete-cache --disable-tui again.

For more details and additional techniques please see https://github.com/stas00/ml-engineering/tree/master/storage#huggingface-hub-caches
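The same cleanup can also be scripted. The sketch below assumes the `scan_cache_dir()` / `delete_revisions()` Python API from `huggingface_hub` and treats a revision with no refs as detached; the helper names are hypothetical, so try it with `dry_run=True` first.

```python
def detached_hashes(cache_info) -> list[str]:
    """Commit hashes of cached revisions that no branch or tag points to."""
    return [
        rev.commit_hash
        for repo in cache_info.repos
        for rev in repo.revisions
        if not rev.refs  # empty refs means the revision is detached
    ]

def delete_detached(dry_run: bool = True) -> str:
    """Delete detached revisions; returns the expected freed size as text."""
    from huggingface_hub import scan_cache_dir  # imported lazily

    cache_info = scan_cache_dir()
    strategy = cache_info.delete_revisions(*detached_hashes(cache_info))
    if not dry_run:
        strategy.execute()
    return strategy.expected_freed_size_str
```

Calling `delete_detached()` only reports how much space would be freed; pass `dry_run=False` to actually delete.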