From 4c03ca2131946a844ac229093a9f7039422b4990 Mon Sep 17 00:00:00 2001 From: Perry Hanes Date: Mon, 10 Feb 2025 00:01:57 +0800 Subject: [PATCH] Add Understanding DeepSeek R1 --- Understanding-DeepSeek-R1.md | 92 ++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) create mode 100644 Understanding-DeepSeek-R1.md diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..8dd211d --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 particularly exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
+The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them were R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 relies on two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag before responding with a final summary.
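To make that output format concrete, here is a minimal helper that splits such a response into its chain-of-thought and its final answer. The `<think>` tag follows R1's chat template; the function name and example text are purely illustrative.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into (chain-of-thought, final answer).

    Assumes the model wraps its reasoning in <think>...</think> tags and
    places the user-facing summary after the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()  # everything after </think>
    return reasoning, answer

example = "<think>2 + 2 is basic arithmetic, so the sum is 4.</think>The answer is 4."
cot, final = split_reasoning(example)
print(cot)    # 2 + 2 is basic arithmetic, so the sum is 4.
print(final)  # The answer is 4.
```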
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
+R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual approach:
+
The usual training strategy: pretraining on a big dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multi-stage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.
+First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
+Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
+Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
+Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
+The teacher is typically a larger model than the student.
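As a rough sketch of what generating distillation data can look like: a teacher model produces full reasoning traces, which become supervised fine-tuning targets for a smaller student. The model name and prompt below are placeholders (DeepSeek used R1 as the teacher and Qwen/Llama models as students); this is not their actual tooling.

```python
# Sketch of distillation-style data generation: the teacher's reasoning traces
# become SFT targets for a smaller student. The model name is a small stand-in.
from transformers import pipeline

teacher = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

prompts = ["What is 17 * 24? Think step by step."]
sft_examples = []
for prompt in prompts:
    output = teacher(prompt, max_new_tokens=256, do_sample=False,
                     return_full_text=False)
    completion = output[0]["generated_text"]
    # The student is later fine-tuned with a plain SFT loss to imitate this trace.
    sft_examples.append({"prompt": prompt, "completion": completion})

print(sft_examples[0]["completion"][:200])
```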
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
+They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
+Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
+Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected thinking/answer format (reasoning wrapped in `<think>` tags), and if the language of the answer matches that of the prompt.
+Not relying on a learned reward model also means you don't have to spend time and effort training one, and it doesn't take memory and compute away from your main model.
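To make this concrete, here is a minimal sketch of what such rule-based rewards could look like. The tag names, weights, and the crude language check are assumptions for illustration, not DeepSeek's actual reward code.

```python
import re

def rule_based_reward(prompt: str, response: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + format + language consistency.

    Illustrative only; the tags and weights are assumptions, not DeepSeek's
    actual implementation.
    """
    reward = 0.0

    # 1. Accuracy: compare the extracted final answer against a known reference.
    answer = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if answer and answer.group(1).strip() == reference_answer.strip():
        reward += 1.0

    # 2. Format: the chain-of-thought must be wrapped in <think> tags.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.5

    # 3. Language consistency: crude check that the response stays in the
    #    prompt's script (here: "mostly ASCII" as a stand-in for English).
    def mostly_ascii(text: str) -> bool:
        return sum(ch.isascii() for ch in text) / max(len(text), 1) > 0.9

    if mostly_ascii(prompt) == mostly_ascii(response):
        reward += 0.25

    return reward

print(rule_based_reward(
    "What is 2 + 2?",
    "<think>2 + 2 = 4</think><answer>4</answer>",
    reference_answer="4",
))  # 1.75
```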
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses.
+2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
+3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
+4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't stray too far from its original behavior.
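Here is a small numeric sketch of steps 2-4. The group-relative advantage (reward minus group mean, divided by group standard deviation) follows the paper; the reward values, `eps`, and `beta` below are made up for illustration.

```python
import statistics

# Step 2: scalar rewards for a group of G responses sampled for one prompt
# (values are made up for illustration).
rewards = [1.75, 0.5, 1.75, 0.25]

# Step 3: group-relative advantage: A_i = (r_i - mean(r)) / std(r).
mean_r = statistics.mean(rewards)
std_r = statistics.pstdev(rewards) or 1.0
advantages = [(r - mean_r) / std_r for r in rewards]

# Step 4: GRPO maximizes a PPO-style clipped surrogate minus a KL penalty to a
# reference policy; ratio = pi_new(token) / pi_old(token).
def clipped_objective(ratio: float, advantage: float, kl: float,
                      eps: float = 0.2, beta: float = 0.04) -> float:
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage) - beta * kl

# A response with a positive advantage is pushed up, but only within the
# clipping band, so the policy cannot drift far in a single update.
print(advantages)
print(clipped_objective(ratio=1.3, advantage=advantages[0], kl=0.01))
```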
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the thinking-tag syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
+Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
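For reference, here is roughly what GRPO training looks like with TRL's `GRPOTrainer`, loosely following the library's quick-start example; treat the exact arguments as indicative and check the current TRL documentation, since the API may have changed.

```python
# Sketch of GRPO training with TRL, loosely following its quick-start example.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Any callable mapping completions to scalar rewards works; here, a toy rule
# that prefers completions close to 20 characters.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="qwen-grpo", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```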
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
+Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be a fundamental ceiling determined by the underlying model's pretrained knowledge.
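A toy illustration of that "boosting the correct response from TopK" argument, with made-up probabilities: the base model already assigns the correct answer meaningful probability (high pass@k), and RL mostly raises that probability (higher pass@1) rather than adding new answers to the distribution.

```python
# Made-up numbers, purely to illustrate the TopK argument above.
import random

random.seed(0)
answers = ["42 (correct)", "41", "44", "7"]
base_probs = [0.30, 0.40, 0.20, 0.10]   # base model: correct answer not the mode
rl_probs   = [0.75, 0.15, 0.07, 0.03]   # after RL: same support, correct answer boosted

def pass_at_k(probs, k, trials=10_000):
    """Estimate the chance that at least one of k samples is correct."""
    hits = 0
    for _ in range(trials):
        samples = random.choices(answers, weights=probs, k=k)
        hits += any(s.endswith("(correct)") for s in samples)
    return hits / trials

print("base pass@1:", pass_at_k(base_probs, 1))   # ~0.30
print("base pass@8:", pass_at_k(base_probs, 8))   # ~0.94, the ability was already there
print("RL   pass@1:", pass_at_k(rl_probs, 1))     # ~0.75, now surfaced on the first try
```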
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
+The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively benchmark the model's capabilities.
+
671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this setup.
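A comparable setup can be sketched with the llama-cpp-python bindings rather than the llama.cpp CLI; the model path below is a placeholder for the first Unsloth UD-IQ1_S GGUF shard, and the 4-bit KV-cache option is omitted for brevity.

```python
# Rough sketch using llama-cpp-python; the model path is a placeholder for
# the Unsloth UD-IQ1_S GGUF shards downloaded locally.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=29,   # the partial-offload sweet spot mentioned above
    n_ctx=4096,        # context window; raise it if memory allows
)

out = llm("Why is the sky blue? Answer briefly.", max_tokens=256)
print(out["choices"][0]["text"])
```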
+
Performance:
+
A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
+Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't really usable for any serious work, but it's fun to run these big models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher.
+We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
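For reference, a minimal way to query the same model through Ollama's local REST API, assuming the `deepseek-r1:70b` tag from the Ollama library has already been pulled:

```python
# Sketch of a non-streaming request to Ollama's local /api/generate endpoint.
import json
import urllib.request

payload = json.dumps({
    "model": "deepseek-r1:70b",
    "prompt": "Why is the sky blue? Answer briefly.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```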
+
GPU utilization shoots up here, as expected when compared to the mainly CPU-powered run of 671B that I showcased above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
+DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
+The Illustrated DeepSeek-R1 - by Jay Alammar
+Explainer: What's R1 & Everything Else? - Tim Kellogg
+DeepSeek R1 Explained to your grandma - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
+GitHub - deepseek-ai/DeepSeek-R1.
+deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
+DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
+DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
+- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
+- OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file