Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser  Sumith Kulal  Andreas Blattmann  Rahim Entezari  Jonas Müller  Harry Saini  Yam Levi  Dominik Lorenz  Axel Sauer  Frederic Boesel  Dustin Podell  Tim Dockhorn  Zion English  Kyle Lacey  Alex Goodwin  Yannik Marek  Robin Rombach*

Stability AI
*Equal contribution. stability.ai

Figure 1. High-resolution samples from our 8B rectified flow model, showcasing its capabilities in typography, precise prompt following and spatial reasoning, attention to fine details, and high image quality across a wide variety of styles.

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

1. Introduction
Diffusion models create data from noise (Song et al., 2020). They are trained to invert forward paths of data towards random noise and, thus, in conjunction with approximation and generalization properties of neural networks, can be used to generate new data points that are not present in the training data but follow the distribution of the training data (Sohl-Dickstein et al., 2015; Song & Ermon, 2020). This generative modeling technique has proven to be very effective for modeling high-dimensional, perceptual data such as images (Ho et al., 2020). In recent years, diffusion models have become the de-facto approach for generating high-resolution images and videos from natural language inputs with impressive generalization capabilities (Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022; Podell et al., 2023; Dai et al., 2023; Esser et al., 2023; Blattmann et al., 2023b; Betker et al., 2023; Blattmann et al., 2023a; Singer et al., 2022).

Due to their iterative nature and the associated computational costs, as well as the long sampling times during inference, research on formulations for more efficient training and/or faster sampling of these models has increased (Karras et al., 2023; Liu et al., 2022). While specifying a forward path from data to noise leads to efficient training, it also raises the question of which path to choose. This choice can have important implications for sampling. For example, a forward process that fails to remove all noise from the data can lead to a discrepancy in training and test distribution and result in artifacts such as gray image samples (Lin et al., 2024). Importantly, the choice of the forward process also influences the learned backward process and, thus, the sampling efficiency. While curved paths require many integration steps to simulate the process, a straight path could be simulated with a single step and is less prone to error accumulation. Since each step corresponds to an evaluation of the neural network, this has a direct impact on the sampling speed.

A particular choice for the forward path is a so-called Rectified Flow (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023), which connects data and noise on a straight line. Although this model class has better theoretical properties, it has not yet become decisively established in practice. So far, some advantages have been empirically demonstrated in small and medium-sized experiments (Ma et al., 2024), but these are mostly limited to class-conditional models.
In this work, we change this by introducing a re-weighting of the noise scales in rectified flow models, similar to noise-predictive diffusion models (Ho et al., 2020). Through a large-scale study, we compare our new formulation to existing diffusion formulations and demonstrate its benefits. We show that the widely used approach for text-to-image synthesis, where a fixed text representation is fed directly into the model (e.g., via cross-attention (Vaswani et al., 2017; Rombach et al., 2022)), is not ideal, and present a new architecture that incorporates learnable streams for both image and text tokens, which enables a two-way flow of information between them. We combine this with our improved rectified flow formulation and investigate its scalability. We demonstrate a predictable scaling trend in the validation loss and show that a lower validation loss correlates strongly with improved automatic and human evaluations. Our largest models outperform state-of-the-art open models such as SDXL (Podell et al., 2023), SDXL-Turbo (Sauer et al., 2023), PixArt-α (Chen et al., 2023), and closed-source models such as DALL-E 3 (Betker et al., 2023) both in quantitative evaluation (Ghosh et al., 2023) of prompt understanding and human preference ratings.

The core contributions of our work are: (i) We conduct a large-scale, systematic study on different diffusion model and rectified flow formulations to identify the best setting. For this purpose, we introduce new noise samplers for rectified flow models that improve performance over previously known samplers. (ii) We devise a novel, scalable architecture for text-to-image synthesis that allows bi-directional mixing between text and image token streams within the network. We show its benefits compared to established backbones such as UViT (Hoogeboom et al., 2023) and DiT (Peebles & Xie, 2023). Finally, we (iii) perform a scaling study of our model and demonstrate that it follows predictable scaling trends. We show that a lower validation loss correlates strongly with improved text-to-image performance assessed via metrics such as T2I-CompBench (Huang et al., 2023), GenEval (Ghosh et al., 2023) and human ratings. We make results, code, and model weights publicly available.
2. Simulation-Free Training of Flows

We consider generative models that define a mapping between samples x_1 from a noise distribution p_1 to samples x_0 from a data distribution p_0 in terms of an ordinary differential equation (ODE),

dy_t = v_Θ(y_t, t) dt ,    (1)

where the velocity v is parameterized by the weights Θ of a neural network. Prior work by Chen et al. (2018) suggested to directly solve Equation (1) via differentiable ODE solvers. However, this process is computationally expensive, especially for large network architectures that parameterize v_Θ(y_t, t). A more efficient alternative is to directly regress a vector field u_t that generates a probability path between p_0 and p_1. To construct such a u_t, we define a forward process, corresponding to a probability path p_t between p_0 and p_1 = N(0, I), as

z_t = a_t x_0 + b_t ε , where ε ∼ N(0, I) .    (2)

For a_0 = 1, b_0 = 0, a_1 = 0 and b_1 = 1, the marginals

p_t(z_t) = E_{ε∼N(0,I)} p_t(z_t | ε)    (3)

are consistent with the data and noise distribution. To express the relationship between z_t, x_0 and ε, we introduce ψ_t and u_t as

ψ_t(· | ε) : x_0 ↦ a_t x_0 + b_t ε ,    (4)
u_t(z | ε) := ψ_t'(ψ_t^{-1}(z | ε) | ε) .    (5)

Since z_t can be written as a solution to the ODE z_t' = u_t(z_t | ε), with initial value z_0 = x_0, u_t(· | ε) generates p_t(· | ε). Remarkably, one can construct a marginal vector field u_t which generates the marginal probability paths p_t (Lipman et al., 2023) (see B.1), using the conditional vector fields u_t(· | ε):

u_t(z) = E_{ε∼N(0,I)} [ u_t(z | ε) p_t(z | ε) / p_t(z) ] .    (6)

While regressing u_t with the Flow Matching objective

L_FM = E_{t, p_t(z)} || v_Θ(z, t) − u_t(z) ||_2^2    (7)

directly is intractable due to the marginalization in Equation (6), Conditional Flow Matching (see B.1),

L_CFM = E_{t, p_t(z|ε), p(ε)} || v_Θ(z, t) − u_t(z | ε) ||_2^2 ,    (8)

with the conditional vector fields u_t(z | ε) provides an equivalent yet tractable objective.

To convert the loss into an explicit form, we insert ψ_t'(x_0 | ε) = a_t' x_0 + b_t' ε and ψ_t^{-1}(z | ε) = (z − b_t ε) / a_t into (5):

u_t(z | ε) = (a_t'/a_t) z − ε b_t (a_t'/a_t − b_t'/b_t) .    (9)

Now, consider the signal-to-noise ratio λ_t := log(a_t^2 / b_t^2). With λ_t' = 2 (a_t'/a_t − b_t'/b_t), we can rewrite Equation (9) as

u_t(z | ε) = (a_t'/a_t) z − (b_t/2) λ_t' ε .    (10)

Next, we use Equation (10) to reparameterize Equation (8) as a noise-prediction objective:

L_CFM = E_{t, p_t(z|ε), p(ε)} || v_Θ(z, t) − (a_t'/a_t) z + (b_t/2) λ_t' ε ||_2^2    (11)
      = E_{t, p_t(z|ε), p(ε)} ( −(b_t/2) λ_t' )^2 || ε_Θ(z, t) − ε ||_2^2 ,

where we defined ε_Θ := (−2 / (λ_t' b_t)) (v_Θ − (a_t'/a_t) z).

Note that the optimum of the above objective does not change when introducing a time-dependent weighting. Thus, one can derive various weighted loss functions that provide a signal towards the desired solution but might affect the optimization trajectory. For a unified analysis of different approaches, including classic diffusion formulations, we can write the objective in the following form (following Kingma & Gao (2023)):

L_w(x_0) = −(1/2) E_{t∼U(t), ε∼N(0,I)} [ w_t λ_t' || ε_Θ(z_t, t) − ε ||^2 ] ,    (12)

where w_t = −(1/2) λ_t' b_t^2 corresponds to L_CFM.
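As a concrete illustration of the conditional flow matching objective in Equation (8), the following sketch computes the forward process z_t = a_t x_0 + b_t ε for an arbitrary schedule and regresses its conditional vector field u_t(z_t | ε) = a_t' x_0 + b_t' ε with a squared error. This is a minimal sketch under assumed interfaces (the `model(z, t)` velocity predictor and the `schedule(t)` callable are illustrative names, not the implementation used in this work); Section 3 lists concrete choices of a_t and b_t that can be plugged in.

```python
import torch

def cfm_loss(model, x0, schedule, t=None):
    """Conditional flow matching loss (Eq. 8) for z_t = a_t * x0 + b_t * eps.

    `schedule(t)` is assumed to return (a_t, b_t, a_t', b_t') as tensors for t in [0, 1];
    `model(z, t)` is assumed to predict the velocity v_Theta(z, t).
    Example for the straight path of Section 3:
        schedule = lambda t: (1 - t, t, -torch.ones_like(t), torch.ones_like(t))
    """
    bsz = x0.shape[0]
    if t is None:
        t = torch.rand(bsz, device=x0.device)            # uniform timesteps by default
    a, b, da, db = schedule(t)
    view = [-1] + [1] * (x0.dim() - 1)                   # broadcast per-sample scalars over data dims
    a, b, da, db = (s.view(view) for s in (a, b, da, db))
    eps = torch.randn_like(x0)
    z_t = a * x0 + b * eps                               # forward process, Eq. (2)
    target = da * x0 + db * eps                          # conditional vector field u_t(z_t | eps)
    v = model(z_t, t)
    return ((v - target) ** 2).mean()
```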
3. Flow Trajectories

In this work, we consider different variants of the above formalism that we briefly describe in the following.

Rectified Flow
Rectified Flows (RFs) (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023) define the forward process as straight paths between the data distribution and a standard normal distribution, i.e.

z_t = (1 − t) x_0 + t ε ,    (13)

and uses L_CFM, which then corresponds to w_t^RF = t / (1 − t). The network output directly parameterizes the velocity v_Θ.
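For the straight path above, the conditional target of the generic loss sketched earlier reduces to ε − x_0, and Equation (1) can be integrated at inference time with explicit Euler steps from t = 1 (noise) to t = 0 (data). A minimal sketch under the same assumed model interface; step count and function names are illustrative, not prescribed by the paper.

```python
import torch

def rf_loss(model, x0):
    """Rectified flow loss: regress the constant velocity eps - x0 along z_t = (1 - t) x0 + t eps (Eq. 13)."""
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view([-1] + [1] * (x0.dim() - 1))
    eps = torch.randn_like(x0)
    z_t = (1.0 - t_) * x0 + t_ * eps
    target = eps - x0                        # d z_t / dt for the straight path
    return ((model(z_t, t) - target) ** 2).mean()

@torch.no_grad()
def euler_sample(model, shape, n_steps=50, device="cpu"):
    """Integrate dz/dt = v_Theta(z, t) from t = 1 (noise) to t = 0 (data) with explicit Euler steps."""
    z = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, n_steps + 1, device=device)
    for i in range(n_steps):
        t = ts[i].expand(shape[0])
        dt = ts[i + 1] - ts[i]               # negative step size, moving towards the data end
        z = z + dt * model(z, t)
    return z
```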
29、&Gao,2023)bt=exp/7T1(Pm,P)withFjbeingthequantilefunctionofthenormaldistributionwithmeanPmandvarianceP2.NotethatthischoiceSresultsinAN(2Pm,(2A)z)fort-U(0,1)(15)ThenetworkisparameterizedthroughanF-prediction(Kingma&Gao,2023;Karrasetal.,2022)andthelosscanbewrittenas1.weomwithtWqDM=N(42Pm,(2Ps)2)(ef+o.5
Cosine
(Nichol & Dhariwal, 2021) proposed a forward process of the form

z_t = cos(π t / 2) x_0 + sin(π t / 2) ε .    (17)

In combination with an ε-parameterization and loss, this corresponds to a weighting w_t = sech(λ_t / 2). When combined with a v-prediction loss (Kingma & Gao, 2023), the weighting is given by w_t = e^{−λ_t / 2}.

(LDM-)Linear
LDM (Rombach et al., 2022) uses a modification of the DDPM schedule (Ho et al., 2020). Both are variance-preserving schedules, i.e. b_t = sqrt(1 − a_t^2), and define a_t for discrete timesteps t = 0, ..., T−1 in terms of diffusion coefficients β_t as a_t = ( ∏_{s=0}^{t} (1 − β_s) )^{1/2}. For given boundary values β_0 and β_{T−1}, DDPM uses β_t = β_0 + (t / (T−1)) (β_{T−1} − β_0) and LDM uses β_t = ( sqrt(β_0) + (t / (T−1)) (sqrt(β_{T−1}) − sqrt(β_0)) )^2.
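As a reference point for these two discrete schedules, the short sketch below computes a_t and b_t from the β_t coefficients exactly as described above. The boundary values are left as arguments, since this section does not fix them; the values mentioned in the comment are the linear DDPM defaults of Ho et al. (2020), given only as an example.

```python
import torch

def vp_schedule(beta_0, beta_T1, T=1000, ldm=True):
    """Discrete variance-preserving schedule: a_t = sqrt(prod_{s<=t}(1 - beta_s)), b_t = sqrt(1 - a_t^2).

    beta_0, beta_T1 are the boundary values beta_0 and beta_{T-1}
    (e.g. 1e-4 and 2e-2 in Ho et al. (2020)).
    """
    t = torch.arange(T, dtype=torch.float64) / (T - 1)
    if ldm:   # LDM: interpolate linearly in sqrt(beta), then square
        betas = (beta_0 ** 0.5 + t * (beta_T1 ** 0.5 - beta_0 ** 0.5)) ** 2
    else:     # DDPM: interpolate linearly in beta
        betas = beta_0 + t * (beta_T1 - beta_0)
    a = torch.cumprod(1.0 - betas, dim=0).sqrt()
    b = (1.0 - a ** 2).sqrt()
    return a, b
```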
3.1. Tailored SNR Samplers for RF models

The RF loss trains the velocity v_Θ uniformly on all timesteps in [0, 1]. Intuitively, however, the resulting velocity prediction target ε − x_0 is more difficult for t in the middle of [0, 1], since for t = 0, the optimal prediction is the mean of p_1, and for t = 1 the optimal prediction is the mean of p_0. In general, changing the distribution over t from the commonly used uniform distribution U(t) to a distribution with density π(t) is equivalent to a weighted loss L_{w_t^π} with

w_t^π = (t / (1 − t)) π(t) .    (18)

Thus, we aim to give more weight to intermediate timesteps by sampling them more frequently. Next, we describe the timestep densities π(t) that we use to train our models.

Logit-Normal Sampling
One option for a distribution that puts more weight on intermediate steps is the logit-normal distribution (Atchison & Shen, 1980). Its density,

π_ln(t; m, s) = (1 / (s sqrt(2π))) (1 / (t (1 − t))) exp( −(logit(t) − m)^2 / (2 s^2) ) ,    (19)

where logit(t) = log(t / (1 − t)), has a location parameter, m, and a scale parameter, s. The location parameter enables us to bias the training timesteps towards either data p_0 (negative m) or noise p_1 (positive m). As shown in Figure 11, the scale parameter s controls how wide the distribution is. In practice, we sample the random variable u from a normal distribution u ∼ N(u; m, s) and map it through the standard logistic function.
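In code, this sampler is exactly the "draw u from a normal and squash it" recipe from the last sentence. A minimal sketch; the default m = 0, s = 1 is only an illustration, not the setting recommended by the paper.

```python
import torch

def sample_t_logit_normal(batch_size, m=0.0, s=1.0):
    """Draw t ~ pi_ln(t; m, s) by sampling u ~ N(m, s^2) and applying the standard logistic function."""
    u = m + s * torch.randn(batch_size)
    return torch.sigmoid(u)  # logistic function maps the real line to (0, 1)
```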
Mode Sampling with Heavy Tails
The logit-normal density always vanishes at the endpoints 0 and 1. To study whether this has adverse effects on the performance, we also use a timestep sampling distribution with strictly positive density on [0, 1]. For a scale parameter s, we define

f_mode(u; s) = 1 − u − s (cos^2(π u / 2) − 1 + u) .    (20)

For −1 ≤ s ≤ 2/(π − 2), this function is monotonic, and we can use it to sample from the implied density π_mode(t; s) = |d/dt f_mode^{-1}(t)|. As seen in Figure 11, the scale parameter controls the degree to which either the midpoint (positive s) or the endpoints (negative s) are favored during sampling. This formulation also includes a uniform weighting π_mode(t; s = 0) = U(t) for s = 0, which has been used widely in previous works on Rectified Flows (Liu et al., 2022; Ma et al., 2024).
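Because the implied density is defined through the inverse of f_mode, sampling from it amounts to pushing a uniform variable through Equation (20). A minimal sketch:

```python
import math
import torch

def sample_t_mode(batch_size, s=0.0):
    """Draw t from pi_mode(t; s) by pushing u ~ U(0, 1) through f_mode (Eq. 20).

    s = 0 recovers uniform sampling; positive s favours the midpoint, negative s the endpoints.
    Requires -1 <= s <= 2 / (pi - 2) for f_mode to be monotonic.
    """
    u = torch.rand(batch_size)
    return 1.0 - u - s * (torch.cos(math.pi / 2.0 * u) ** 2 - 1.0 + u)
```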
CosMap
Finally, we also consider the cosine schedule (Nichol & Dhariwal, 2021) from Section 3 in the RF setting. In particular, we are looking for a mapping f : u ↦ f(u) = t, u ∈ [0, 1], such that the log-snr matches that of the cosine schedule: 2 log(cos(π u / 2) / sin(π u / 2)) = 2 log((1 − f(u)) / f(u)). Solving for f, we obtain for u ∼ U(u)

t = f(u) = 1 − 1 / (tan(π u / 2) + 1) ,    (21)

from which we obtain the density

π_CosMap(t) = |d/dt f^{-1}(t)| = 2 / (π − 2πt + 2πt^2) .    (22)
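The CosMap density can likewise be sampled by pushing a uniform variable through Equation (21). A minimal sketch:

```python
import math
import torch

def sample_t_cosmap(batch_size):
    """Draw t = f(u) = 1 - 1 / (tan(pi * u / 2) + 1) for u ~ U(0, 1)  (Eq. 21)."""
    u = torch.rand(batch_size)
    return 1.0 - 1.0 / (torch.tan(math.pi / 2.0 * u) + 1.0)
```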
4. Text-to-Image Architecture

For text-conditional sampling of images, our model has to take both modalities, text and images, into account. We use pretrained models to derive suitable representations and then describe the architecture of our diffusion backbone. An overview of this is presented in Figure 2.

[Figure 2. Our model architecture. Concatenation is indicated by ⊙ and element-wise multiplication by ∗. The RMS-Norm for Q and K can be added to stabilize training runs. Best viewed zoomed in. (a) Overview of all components. (b) One MM-DiT block.]

Our general setup follows LDM (Rombach et al., 2022) for training text-to-image models in the latent space of a pretrained autoencoder. Similar to the encoding of images to latent representations, we also follow previous approaches (Saharia et al., 2022b; Balaji et al., 2022) and encode the text conditioning c using pretrained, frozen text models. Details can be found in Appendix B.2.

Multimodal Diffusion Backbone
Our architecture builds upon the DiT (Peebles & Xie, 2023) architecture. DiT only considers class-conditional image generation and uses a modulation mechanism to condition the network on both the timestep of the diffusion process and the class label. Similarly, we use embeddings of the timestep t and c_vec as inputs to the modulation mechanism. However, as the pooled text representation retains only coarse-grained information about the text input (Podell et al., 2023), the network also requires information from the sequence representation c_ctxt.

We construct a sequence consisting of embeddings of the text and image inputs. Specifically, we add positional encodings and flatten 2×2 patches of the latent pixel representation x ∈ R^{h×w×c} to a patch encoding sequence of length (h/2) · (w/2).
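One possible way to build this patch sequence is sketched below, assuming a channels-last latent layout matching the R^{h×w×c} notation above; the actual tensor layout and embedding layers are not specified in this section.

```python
import torch

def patchify(x, p=2):
    """Flatten p x p patches of a latent x with shape (B, h, w, c) into a sequence of length (h/p)*(w/p)."""
    B, h, w, c = x.shape
    x = x.view(B, h // p, p, w // p, p, c)   # split spatial dims into (patch index, within-patch offset)
    x = x.permute(0, 1, 3, 2, 4, 5)          # (B, h/p, w/p, p, p, c)
    return x.reshape(B, (h // p) * (w // p), p * p * c)
```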
After embedding this patch encoding and the text encoding c_ctxt to a common dimensionality, we concatenate the two sequences. We then follow DiT and apply a sequence of modulated attention and MLPs. Since text and image embeddings are conceptually quite different, we use two separate sets of weights for the two modalities. As shown in Figure 2b, this is equivalent to having two independent transformers for each modality, but joining the sequences of the two modalities for the attention operation, such that both representations can work in their own space yet take the other one into account.

For our scaling experiments, we parameterize the size of the model in terms of the model's depth d, i.e. the number of attention blocks, by setting the hidden size to 64·d (expanded to 4·64·d channels in the MLP blocks), and the number of attention heads equal to d.
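To make the "two sets of weights, one joint attention" idea concrete, the sketch below implements a simplified block in this spirit: each modality has its own modulation, QKV/output projections and MLP, and the two token sequences are concatenated only for the attention operation. This is a minimal illustration, not the released implementation; the QK RMS-Norm, positional encodings, zero-initialization of the modulation, and the final output layer are omitted, and all class and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamBlock(nn.Module):
    """Per-modality weights of one joint block: adaLN-style modulation, QKV/output projections, MLP."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.adaLN = nn.Linear(dim, 6 * dim)  # shift/scale/gate for attention and MLP, as in DiT

class MMDiTBlock(nn.Module):
    """Sketch of one MM-DiT-style block: separate weights per modality, a single joint attention."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.txt = StreamBlock(dim, n_heads)
        self.img = StreamBlock(dim, n_heads)

    def _qkv(self, stream, x, y):
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = stream.adaLN(y).chunk(6, dim=-1)
        h = stream.norm1(x) * (1 + scale_a.unsqueeze(1)) + shift_a.unsqueeze(1)
        q, k, v = stream.qkv(h).chunk(3, dim=-1)
        return (q, k, v), (gate_a, shift_m, scale_m, gate_m)

    def forward(self, c_txt, x_img, y):
        # y: conditioning vector built from the timestep embedding and c_vec; it drives the modulation.
        (qt, kt, vt), mod_t = self._qkv(self.txt, c_txt, y)
        (qi, ki, vi), mod_i = self._qkv(self.img, x_img, y)
        B, Lt, D = c_txt.shape
        H = self.txt.n_heads

        def heads(t):  # (B, L, D) -> (B, H, L, D // H)
            return t.view(B, -1, H, D // H).transpose(1, 2)

        # Joint attention: concatenate text and image tokens so each token attends to both modalities.
        q = torch.cat([heads(qt), heads(qi)], dim=2)
        k = torch.cat([heads(kt), heads(ki)], dim=2)
        v = torch.cat([heads(vt), heads(vi)], dim=2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, -1, D)
        out_t, out_i = out[:, :Lt], out[:, Lt:]

        def finish(stream, x, attn_out, mod):
            gate_a, shift_m, scale_m, gate_m = mod
            x = x + gate_a.unsqueeze(1) * stream.proj(attn_out)
            h = stream.norm2(x) * (1 + scale_m.unsqueeze(1)) + shift_m.unsqueeze(1)
            return x + gate_m.unsqueeze(1) * stream.mlp(h)

        return finish(self.txt, c_txt, out_t, mod_t), finish(self.img, x_img, out_i, mod_i)
```

Under the parameterization described above, a full backbone would stack d such blocks with hidden size 64·d and d attention heads; the input embeddings and the projection back to latent patches are left out of this sketch.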
5. Experiments

5.1. Improving Rectified Flows

We aim to understand which of the approaches for simulation-free training of normalizing flows as in Equation (1) is the most efficient. To enable comparisons across different approaches, we control for the optimization algorithm, the model architecture, the dataset and samplers. In addition, the losses of different approaches are incomparable and also do not necessarily correlate with the quality of output samples; hence we need evaluation metrics that allow for a comparison between approaches. We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021), and evaluate both the training and the EMA weights of the models during training using validation losses, CLIP scores (Radford et al., 2021; Hessel et al., 2021), and FID (Heusel et al., 2017) under different sampler settings (different guidance scales and sampling steps). We calc