Deep Latent Dirichlet Allocation - Mingyuan Zhou


Gamma Belief Networks (Deep Latent Dirichlet Allocation)

Gamma Belief Networks (Deep Latent Dirichlet Allocation)
Mingyuan Zhou (joint work with Yulai Cong and Bo Chen at Xidian University)
IROM Department, McCombs School of Business, The University of Texas at Austin
Duke-Tsinghua Machine Learning Summer School, Duke Kunshan University, Kunshan, China
August 1, 2016


Introduction

Deep learning

- Significant recent interest in deep learning, due to its excellent performance in large-scale real applications such as image classification and speech recognition.
- State-of-the-art results in supervised learning when labeled data are abundant.
- Significant potential in unsupervised learning with deep models. Deep generative models for nonlinear distributed representations:
  - SBN, sigmoid belief network
  - DBN, deep belief network (an SBN whose last layer is replaced with a restricted Boltzmann machine, which is undirected)
  - DBM, deep Boltzmann machine (a hierarchy of restricted Boltzmann machines)

Restrictions of previous deep generative models

- The hidden units are often restricted to be binary.
- It is difficult to train a deep network in an unsupervised manner.
- A greedy layer-wise training strategy is often used because of the difficulty of jointly training all hidden layers.
- There is a lack of principled ways to determine the network structure, including the depth (number of layers) of the network and the width (number of units) of each of its hidden layers.
- Commonly used deep learning models are not naturally designed for count data.

Our objectives

- Design a multilayer deep generative model that is well suited for extracting nonlinear distributed representations of high-dimensional sparse count, binary, and nonnegative real vectors.
- Construct the deep network using nonnegative real hidden units rather than binary ones.
- Use nonparametric Bayesian priors to automatically infer the network structure from the data.

Figure: An example directed network of five hidden layers, with $K_0 = 8$ visible units, $[K_1, K_2, K_3, K_4, K_5] = [6, 4, 3, 3, 2]$, and sparse connections between the hidden units of adjacent layers. The joint distribution factorizes as

$P\big(x_j^{(1)}, \{\theta_j^{(t)}\}_t \,\big|\, \{\Phi^{(t)}\}_t\big) = P\big(x_j^{(1)} \,\big|\, \Phi^{(1)}, \theta_j^{(1)}\big) \left[\prod_{t=1}^{T-1} P\big(\theta_j^{(t)} \,\big|\, \Phi^{(t+1)}, \theta_j^{(t+1)}\big)\right] P\big(\theta_j^{(T)}\big).$

Figure: A tree on "religion."

Hierarchical model

The Poisson gamma belief network (PGBN)

- Assume the observations are multivariate count vectors $x_j^{(1)} \in \mathbb{Z}^{K_0}$, where $\mathbb{Z} = \{0, 1, \ldots\}$.
- We construct the PGBN [Zhou, Cong & Chen, 2015] to infer a multilayer deep representation for $\{x_j^{(1)}\}_j$.
- With $\Phi^{(t)} \in \mathbb{R}_+^{K_{t-1} \times K_t}$, the generative model of the PGBN with $T$ hidden layers, from the top to the bottom, is expressed as

  $\theta_j^{(T)} \sim \mathrm{Gam}\big(r,\, 1/c_j^{(T+1)}\big),$
  $\cdots$
  $\theta_j^{(t)} \sim \mathrm{Gam}\big(\Phi^{(t+1)} \theta_j^{(t+1)},\, 1/c_j^{(t+1)}\big),$
  $\cdots$
  $\theta_j^{(1)} \sim \mathrm{Gam}\big(\Phi^{(2)} \theta_j^{(2)},\, p_j^{(2)}/(1 - p_j^{(2)})\big),$
  $x_j^{(1)} \sim \mathrm{Pois}\big(\Phi^{(1)} \theta_j^{(1)}\big).$
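The top-to-bottom generative process above is easy to simulate. Below is a minimal NumPy sketch (not the authors' code; all names are illustrative), assuming column-normalized factor loading matrices and, for simplicity, taking the layer-one gamma scale as $1/c_j^{(2)}$:

```python
import numpy as np

def sample_pgbn(Phi, r, c, rng):
    """Draw one count vector from the PGBN, sampling top to bottom.

    Phi : [Phi^(1), ..., Phi^(T)], Phi^(t) of shape (K_{t-1}, K_t),
          nonnegative with column-normalized entries.
    r   : top-layer gamma shape vector of length K_T.
    c   : [c^(2), ..., c^(T+1)] positive scale parameters.
    """
    T = len(Phi)
    # theta^(T) ~ Gam(r, 1/c^(T+1))
    theta = rng.gamma(shape=r, scale=1.0 / c[-1])
    # theta^(t) ~ Gam(Phi^(t+1) theta^(t+1), 1/c^(t+1)) for t = T-1, ..., 1
    for t in range(T - 1, 0, -1):
        theta = rng.gamma(shape=np.maximum(Phi[t] @ theta, 1e-12),
                          scale=1.0 / c[t - 1])
    # x^(1) ~ Pois(Phi^(1) theta^(1))
    return rng.poisson(Phi[0] @ theta)

rng = np.random.default_rng(0)
K = [8, 6, 4]  # K_0 visible units, hidden widths K_1, K_2
Phi = [rng.dirichlet(np.ones(K[i]), size=K[i + 1]).T for i in range(len(K) - 1)]
x = sample_pgbn(Phi, r=np.ones(K[-1]), c=[1.0, 1.0], rng=rng)
```

Each `Phi[t]` is built so its columns lie on the simplex, matching the Dirichlet-distributed factor loadings used later in the inference slides.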

- The PGBN factorizes the observed count vectors, under the Poisson likelihood, into the product of a factor loading matrix and the gamma-distributed hidden units of layer one.
- The PGBN factorizes the hidden units of each hidden layer, under the gamma likelihood, into the product of a connection weight matrix and the hidden units of the next layer.
- The PGBN with a single hidden layer (i.e., $T = 1$) reduces to Poisson factor analysis:

  $x_j^{(1)} \sim \mathrm{Pois}\big(\Phi^{(1)} \theta_j^{(1)}\big), \quad \theta_j^{(1)} \sim \mathrm{Gam}\big(r,\, p_j^{(2)}/(1 - p_j^{(2)})\big).$

  The gamma-negative binomial process [Zhou & Carin, 2015] can be used to support a potentially infinite number of factors, $K_1 = \infty$.

Deep Latent Dirichlet Allocation (DLDA)

- With $q_j^{(1)} = 1$, $q_j^{(t+1)} := \ln\big(1 + q_j^{(t)}/c_j^{(t+1)}\big)$, and $p_j^{(t)} := 1 - e^{-q_j^{(t)}}$, and with all $\theta_{kj}^{(t)}$ marginalized out, one may re-express the PGBN hierarchical model as deep latent Dirichlet allocation (DLDA):

  $m_{kj}^{(T)(T+1)} \sim \mathrm{SumLog}\big(x_{kj}^{(T+1)}, p_j^{(T+1)}\big), \quad x_{kj}^{(T+1)} \sim \mathrm{Pois}\big(r_k q_j^{(T+1)}\big),$
  $\cdots$
  $m_{vj}^{(t-1)(t)} \sim \mathrm{SumLog}\big(x_{vj}^{(t)}, p_j^{(t)}\big), \quad x_{vj}^{(t)} = \sum_{k=1}^{K_t} x_{vkj}^{(t)}, \quad \big(x_{vkj}^{(t)}\big)_v \sim \mathrm{Mult}\big(m_{kj}^{(t)(t+1)}, \phi_k^{(t)}\big),$
  $\cdots$
  $x_{vj}^{(1)} = \sum_{k=1}^{K_1} x_{vkj}^{(1)}, \quad \big(x_{vkj}^{(1)}\big)_v \sim \mathrm{Mult}\big(m_{kj}^{(1)(2)}, \phi_k^{(1)}\big).$
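A SumLog variable is simply a sum of iid logarithmic draws, so it can be sampled directly. A standard-library sketch (illustrative, not the authors' implementation), using the inverse CDF of the logarithmic distribution $\mathrm{Log}(p)$ with pmf $-p^k/(k \ln(1-p))$:

```python
import math
import random

def sample_log(p, rng=random):
    """One draw from Log(p): P(k) = -p^k / (k * ln(1 - p)), k = 1, 2, ..."""
    u = rng.random()
    norm = -math.log(1.0 - p)
    k, term, cdf = 1, p / norm, 0.0    # term = P(1)
    while cdf + term < u and k < 100_000:
        cdf += term
        k += 1
        term *= p * (k - 1) / k        # P(k) = P(k-1) * p * (k-1)/k
    return k

def sample_sumlog(x, p, rng=random):
    """m ~ SumLog(x, p): the sum of x iid Log(p) draws (m = 0 when x = 0)."""
    return sum(sample_log(p, rng) for _ in range(x))
```

Since every logarithmic draw is at least 1, a SumLog(x, p) draw is always at least x, which is why the augmented counts m never fall below the counts x they are drawn from.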

Deep Latent Dirichlet Allocation (DLDA), continued

- With $\tilde{p} := q_\cdot^{(T+1)}/(c_0 + q_\cdot^{(T+1)})$ and the gamma process weights $r_k$ marginalized out, one may re-express the PGBN hierarchical model as

  $X^{(T+1)} = \sum_{k=1}^{K_T} x_{k\cdot}^{(T+1)} \delta_{\phi_k^{(T)}}, \quad x_{k\cdot}^{(T+1)} \sim \mathrm{Log}(\tilde{p}), \quad K_T \sim \mathrm{Pois}[-\gamma_0 \ln(1 - \tilde{p})],$
  $\big(x_{vj}^{(T+1)}\big)_{j=1,J} \sim \mathrm{Mult}\Big(x_{v\cdot}^{(T+1)}, \big(q_j^{(T+1)}/q_\cdot^{(T+1)}\big)_{j=1,J}\Big), \quad m_{vj}^{(T)(T+1)} \sim \mathrm{SumLog}\big(x_{vj}^{(T+1)}, p_j^{(T+1)}\big),$
  $\cdots$
  $m_{vj}^{(t-1)(t)} \sim \mathrm{SumLog}\big(x_{vj}^{(t)}, p_j^{(t)}\big), \quad x_{vj}^{(t)} = \sum_{k=1}^{K_t} x_{vkj}^{(t)}, \quad \big(x_{vkj}^{(t)}\big)_v \sim \mathrm{Mult}\big(m_{kj}^{(t)(t+1)}, \phi_k^{(t)}\big),$
  $\cdots$
  $x_{vj}^{(1)} = \sum_{k=1}^{K_1} x_{vkj}^{(1)}, \quad \big(x_{vkj}^{(1)}\big)_v \sim \mathrm{Mult}\big(m_{kj}^{(1)(2)}, \phi_k^{(1)}\big).$

- The count matrix $\{m_{vj}^{(T)(T+1)}\}_{j=1,J;\, v=1,K_T}$ is drawn from a gamma-negative binomial process [Zhou, Padilla & Scott, 2016].

Model likelihood for the PGBN

- The joint distribution of the observed counts and gamma hidden units given the network in the PGBN:

  $P\big(x_j^{(1)}, \{\theta_j^{(t)}\}_t \,\big|\, \{\Phi^{(t)}\}_t\big) = P\big(x_j^{(1)} \,\big|\, \Phi^{(1)}, \theta_j^{(1)}\big) \left[\prod_{t=1}^{T-1} P\big(\theta_j^{(t)} \,\big|\, \Phi^{(t+1)}, \theta_j^{(t+1)}\big)\right] P\big(\theta_j^{(T)}\big),$

  where each gamma factor is

  $P\big(\theta_{vj}^{(t)} \,\big|\, \phi_{v:}^{(t+1)}, \theta_j^{(t+1)}, c_j^{(t+1)}\big) = \frac{\big(c_j^{(t+1)}\big)^{\phi_{v:}^{(t+1)} \theta_j^{(t+1)}}}{\Gamma\big(\phi_{v:}^{(t+1)} \theta_j^{(t+1)}\big)} \big(\theta_{vj}^{(t)}\big)^{\phi_{v:}^{(t+1)} \theta_j^{(t+1)} - 1} e^{-c_j^{(t+1)} \theta_{vj}^{(t)}}.$

- For comparison, the joint distribution of the binary visible and hidden units given the network for the sigmoid belief network (SBN):

  $P\big(\theta_{vj}^{(t)} = 1 \,\big|\, \phi_{v:}^{(t+1)}, \theta_j^{(t+1)}, b_v^{(t+1)}\big) = \sigma\big(b_v^{(t+1)} + \phi_{v:}^{(t+1)} \theta_j^{(t+1)}\big),$

  where $\sigma(x) = 1/(1 + e^{-x})$ denotes the sigmoid function.

Model likelihood for DLDA

- The likelihood of $\Phi^{(t)}$ in deep latent Dirichlet allocation (DLDA) can be expressed as

  $P\big(\{x_j^{(t)}\}_t \,\big|\, \{\Phi^{(t)}\}_t, r\big) \propto \left[\prod_{t=1}^{T} \prod_{k=1}^{K_t} \mathrm{Multinomial}\big(\big(x_{vkj}^{(t)}\big)_v ;\, m_{kj}^{(t)(t+1)}, \phi_k^{(t)}\big)\right] \times \prod_{k=1}^{K_T} \mathrm{Pois}\big(x_{kj}^{(T+1)};\, r_k q_j^{(T+1)}\big).$

- The expected Fisher information matrix for the global parameters $\phi_k^{(t)}$ and $r$ is block diagonal.
- Stochastic gradient MCMC for DLDA is simple under this likelihood.
- The likelihood becomes the same as that of latent Dirichlet allocation (LDA) [Blei, Ng & Jordan, 2003] if $T = 1$.

Interpreting the structure of the PGBN

- Using the law of total expectation, we have

  $E\big[x_j^{(1)} \,\big|\, \theta_j^{(t)}, \{\Phi^{(\ell)}, c_j^{(\ell)}\}_{1,t}\big] = \left[\prod_{\ell=1}^{t} \Phi^{(\ell)}\right] \frac{\theta_j^{(t)}}{\prod_{\ell=2}^{t} c_j^{(\ell)}},$

  $E\big[\theta_j^{(t)} \,\big|\, \{\Phi^{(\ell)}, c_j^{(\ell)}\}_{t+1,T}, r\big] = \left[\prod_{\ell=t+1}^{T} \Phi^{(\ell)}\right] \frac{r}{\prod_{\ell=t+1}^{T+1} c_j^{(\ell)}}.$

- Consider $\prod_{\ell=1}^{t} \Phi^{(\ell)}$ as the $K_t$ topics/factors/nodes of layer $t \in \{1, \ldots, T\}$ and use $r^{(t)} := \big[\prod_{\ell=t+1}^{T} \Phi^{(\ell)}\big] r$ to rank them.
- Consider $\phi_{k'k}^{(t)} = \Phi^{(t)}(k', k)$ as the weight that connects node $k$ of layer $t$ and node $k'$ of layer $t - 1$.
- Our intuition is that examining the topics/factors from the top layer to the bottom layer gradually reveals less general and more specific aspects of the data.
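The projected topics and their ranking weights are plain matrix products. A NumPy sketch (helper names are illustrative, not from the paper's code):

```python
import numpy as np

def projected_topics(Phi, t):
    """Columns of Phi^(1) ... Phi^(t): layer-t topics projected onto the vocabulary."""
    P = Phi[0]
    for l in range(1, t):
        P = P @ Phi[l]
    return P

def topic_ranks(Phi, r, t):
    """r^(t) = Phi^(t+1) ... Phi^(T) r: weights used to rank the layer-t topics."""
    w = np.asarray(r, dtype=float)
    for l in range(len(Phi) - 1, t - 1, -1):
        w = Phi[l] @ w
    return w
```

Because each $\Phi^{(\ell)}$ is column-stochastic, every projected topic still sums to one over the vocabulary, and projection preserves the total mass of $r$.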

Figure: Example topics of layer one of the PGBN learned on the 20newsgroups corpus.

Figure: The top 30 topics of layer three of the PGBN learned on the 20newsgroups corpus.

Figure: The top 30 topics of layer five of the PGBN learned on the 20newsgroups corpus.

Visualization

Visualize the inferred deep network

- Visualizing the whole network is challenging when the network is large.
- However, we can easily visualize a tree, or a subnetwork consisting of multiple trees.
- To construct a tree, start from a top-layer node and grow the tree downward by linking each leaf node to all the hidden units at the layer below that are connected to the leaf node with nonnegligible weights.
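The tree construction just described can be sketched as a recursive threshold rule. The code below is an illustrative standard-library sketch; the "nonnegligible weight" cutoff is a hypothetical parameter, not a value from the talk:

```python
def grow_tree(Phi, root_k, root_t, threshold=0.05):
    """Grow a tree downward from node root_k at layer root_t.

    Phi[t-1] holds the K_{t-1} x K_t connection weights Phi^(t) between
    layer t and layer t-1, as nested lists; layer 0 is the visible layer.
    A node k' at the layer below is attached when Phi^(t)(k', k) > threshold.
    Returns nested dicts keyed by (layer, node).
    """
    if root_t == 0:
        return {}
    W = Phi[root_t - 1]
    return {(root_t - 1, kp): grow_tree(Phi, kp, root_t - 1, threshold)
            for kp in range(len(W)) if W[kp][root_k] > threshold}

# Toy network: 3 visible units, 2 layer-1 nodes, 1 layer-2 node.
Phi1 = [[0.9, 0.0],
        [0.1, 0.2],
        [0.0, 0.8]]
Phi2 = [[0.7],
        [0.3]]
tree = grow_tree([Phi1, Phi2], root_k=0, root_t=2, threshold=0.25)
```

A subnetwork is then just the union of the trees grown from several top-layer nodes.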

Figure: A subnetwork on "car & bike."

Figure: A subnetwork on "sports."

Figure: A subnetwork on "gun."

Inference

Upward-downward Gibbs sampling

Lemma (Augment-and-conquer the gamma belief network). With $p_j^{(1)} := 1 - e^{-1}$ and

$p_j^{(t+1)} := \frac{-\ln\big(1 - p_j^{(t)}\big)}{c_j^{(t+1)} - \ln\big(1 - p_j^{(t)}\big)}$

for $t = 1, \ldots, T$, one may connect the observed or latent counts $x_j^{(t)} \in \mathbb{Z}^{K_{t-1}}$ to the product $\Phi^{(t)} \theta_j^{(t)}$ at layer $t$ under the Poisson likelihood as

$x_j^{(t)} \sim \mathrm{Pois}\big[-\Phi^{(t)} \theta_j^{(t)} \ln\big(1 - p_j^{(t)}\big)\big].$

Upward propagate latent counts

Corollary (Propagate the latent counts upward). With $m_{kj}^{(t)(t+1)} := x_{\cdot jk}^{(t)} := \sum_{v=1}^{K_{t-1}} x_{vjk}^{(t)}$ representing the number of times that factor $k \in \{1, \ldots, K_t\}$ of layer $t$ appears in observation $j$, we can propagate the latent counts $x_{vj}^{(t)}$ of layer $t$ upward to layer $t + 1$ as

$\big\{x_{vj1}^{(t)}, \ldots, x_{vjK_t}^{(t)}\big\} \,\big|\, x_{vj}^{(t)}, \phi_{v:}^{(t)}, \theta_j^{(t)} \sim \mathrm{Mult}\left(x_{vj}^{(t)},\; \frac{\phi_{v1}^{(t)} \theta_{1j}^{(t)}}{\sum_{k=1}^{K_t} \phi_{vk}^{(t)} \theta_{kj}^{(t)}}, \ldots, \frac{\phi_{vK_t}^{(t)} \theta_{K_t j}^{(t)}}{\sum_{k=1}^{K_t} \phi_{vk}^{(t)} \theta_{kj}^{(t)}}\right),$

$\big(\phi_k^{(t)} \,\big|\, -\big) \sim \mathrm{Dir}\big(\eta^{(t)} + x_{1 \cdot k}^{(t)}, \ldots, \eta^{(t)} + x_{K_{t-1} \cdot k}^{(t)}\big),$

$\big(x_{kj}^{(t+1)} \,\big|\, m_{kj}^{(t)(t+1)}, \phi_{k:}^{(t+1)}, \theta_j^{(t+1)}\big) \sim \mathrm{CRT}\big(m_{kj}^{(t)(t+1)},\; \phi_{k:}^{(t+1)} \theta_j^{(t+1)}\big).$
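The two augmentation steps above, splitting a count multinomially across factors and drawing a CRT count to send upward, can be sketched with the standard library alone (an illustrative sketch, not the authors' code):

```python
import random

def sample_crt(m, r, rng=random):
    """CRT(m, r) draw: the number of tables occupied by m customers with
    concentration r; the i-th customer opens a new table w.p. r / (r + i - 1)."""
    return sum(1 for i in range(m) if rng.random() < r / (r + i))

def split_count(x, phi_v, theta, rng=random):
    """Multinomially split count x across K factors with probabilities
    proportional to phi_vk * theta_kj (the upward propagation step)."""
    weights = [pv * th for pv, th in zip(phi_v, theta)]
    total = sum(weights)
    counts = [0] * len(weights)
    for _ in range(x):
        u, acc = rng.random() * total, 0.0
        for k, w in enumerate(weights):
            acc += w
            if u <= acc:
                counts[k] += 1
                break
        else:
            counts[-1] += 1
    return counts
```

In a full sampler, `split_count` would be applied to every nonzero count at layer t, and `sample_crt` to each aggregated factor count to produce the layer-(t+1) counts.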

Downward sample the hidden units

Using the latent counts propagated upward and the gamma-Poisson conjugacy, we sample the hidden units downward as

$(r_k \,|\, -) \sim \mathrm{Gam}\Big(\gamma_0/K_T + x_{k\cdot}^{(T+1)},\; \big[c_0 - \textstyle\sum_j \ln\big(1 - p_j^{(T+1)}\big)\big]^{-1}\Big),$
$\big(\theta_j^{(T)} \,\big|\, -\big) \sim \mathrm{Gam}\Big(r + m_j^{(T)(T+1)},\; \big[c_j^{(T+1)} - \ln\big(1 - p_j^{(T)}\big)\big]^{-1}\Big),$
$\big(\theta_j^{(T-1)} \,\big|\, -\big) \sim \mathrm{Gam}\Big(\Phi^{(T)} \theta_j^{(T)} + m_j^{(T-1)(T)},\; \big[c_j^{(T)} - \ln\big(1 - p_j^{(T-1)}\big)\big]^{-1}\Big),$
$\vdots$
$\big(\theta_j^{(t)} \,\big|\, -\big) \sim \mathrm{Gam}\Big(\Phi^{(t+1)} \theta_j^{(t+1)} + m_j^{(t)(t+1)},\; \big[c_j^{(t+1)} - \ln\big(1 - p_j^{(t)}\big)\big]^{-1}\Big),$
$\vdots$
$\big(\theta_j^{(1)} \,\big|\, -\big) \sim \mathrm{Gam}\Big(\Phi^{(2)} \theta_j^{(2)} + m_j^{(1)(2)},\; \big[c_j^{(2)} - \ln\big(1 - p_j^{(1)}\big)\big]^{-1}\Big).$

Figure: Graphical representation of the model and inference scheme.

Modeling overdispersion with a distributed representation

Assuming $\Phi^{(t)} = I$ for all $t \in \{3, \ldots, T\}$, we have

$m_{kj}^{(1)(2)} \sim \mathrm{NB}\big(\theta_{kj}^{(2)}, p_j^{(2)}\big), \;\ldots,\; \theta_{kj}^{(t)} \sim \mathrm{Gam}\big(\theta_{kj}^{(t+1)}, 1/c_j^{(t+1)}\big), \;\ldots,\; \theta_{kj}^{(T)} \sim \mathrm{Gam}\big(r_k, 1/c_j^{(T+1)}\big).$

- In comparison to PFA with $m_{kj}^{(1)(2)} \sim \mathrm{NB}\big(r_k, p_j^{(2)}\big)$, the PGBN increases $\mathrm{VMR}\big(m_{kj}^{(1)(2)} \,\big|\, r_k\big)$ by a factor of

  $1 + p_j^{(2)} \sum_{t=3}^{T+1} \left[\prod_{\ell=3}^{t} c_j^{(\ell)}\right]^{-1},$

  which is equal to $1 + (T - 1)\, p_j^{(2)}$ if we further assume $c_j^{(t)} = 1$ for all $t \geq 3$.
- Therefore, by increasing the depth of the network to distribute the variance into more layers, the multilayer structure can increase its capability to model data variability.
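The variance-to-mean ratio (VMR) inflation factor can be checked numerically. A small standard-library sketch (the function name is illustrative) that evaluates the general factor and confirms it collapses to $1 + (T - 1) p_j^{(2)}$ when all $c_j^{(\ell)} = 1$ for $\ell \geq 3$:

```python
def vmr_factor(p2, c, T):
    """1 + p2 * sum_{t=3}^{T+1} [prod_{l=3}^{t} c[l]]^{-1},
    where c maps layer index l to the scale parameter c_j^(l)."""
    total = 0.0
    prod = 1.0
    for t in range(3, T + 2):
        prod *= c[t]          # running product prod_{l=3}^{t} c[l]
        total += 1.0 / prod
    return 1.0 + p2 * total
```

With unit scales the sum has $T - 1$ terms of value one, recovering the closed form on the slide; larger scales shrink the contribution of the deeper layers geometrically.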

Capturing the correlations between factors

- For the GBN with $T = 1$, given the shared weight vector $r$, we have

  $E\big[x_j^{(1)} \,\big|\, \Phi^{(1)}, r\big] = \Phi^{(1)} r / c_j^{(2)};$

- For the GBN with $T \geq 2$, given the weight vector $\theta_j^{(2)}$, we have

  $E\big[x_j^{(1)} \,\big|\, \Phi^{(1)}, \Phi^{(2)}, \theta_j^{(2)}\big] = \Phi^{(1)} \Phi^{(2)} \theta_j^{(2)} / c_j^{(2)}.$

- Thus in the prior, the co-occurrence patterns of the columns of $\Phi^{(1)}$ are captured in a single vector $r$ when $T = 1$, and are captured in the columns of $\Phi^{(2)}$ when $T \geq 2$.
- Similarly, in the prior, if $T \geq t + 1$, the co-occurrence patterns of the $K_t$ columns of the projected topics $\prod_{\ell=1}^{t} \Phi^{(\ell)}$ will be captured in the columns of the $K_t \times K_{t+1}$ matrix $\Phi^{(t+1)}$.

Figure: The tree rooted at node 30 of layer three on "Turkey & Armenia."

Learning the network structure

- Use greedy layer-wise training together with the gamma-negative binomial process on the top hidden layer.

Modeling binary and nonnegative real observations

- We link binary observations to the latent counts at layer one as $b_{vj}^{(1)} = \mathbf{1}\big(x_{vj}^{(1)} \geq 1\big)$.
- For inference, we sample the latent counts at layer one from the truncated Poisson distribution as

  $\big(x_{vj}^{(1)} \,\big|\, -\big) \sim b_{vj}^{(1)} \cdot \mathrm{Pois}_+\left(\sum_{k=1}^{K_1} \phi_{vk}^{(1)} \theta_{kj}^{(1)}\right).$

- We link nonnegative real observations to the latent counts at layer one using $y_{vj}^{(1)} \sim \mathrm{Gam}\big(x_{vj}^{(1)}, 1/a_j\big)$.
- For inference, we let $x_{vj}^{(1)} = 0$ if $y_{vj}^{(1)} = 0$ and sample $x_{vj}^{(1)}$ from the truncated Bessel distribution as

  $\big(x_{vj}^{(1)} \,\big|\, -\big) \sim \mathrm{Bessel}_{-1}\left(2 \sqrt{a_j\, y_{vj}^{(1)} \sum_{k=1}^{K_1} \phi_{vk}^{(1)} \theta_{kj}^{(1)}}\right)$

  if $y_{vj}^{(1)} > 0$.
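The zero-truncated Poisson draw used for binary observations can be sampled by inverse CDF after removing the zero mass. A standard-library sketch (illustrative, adequate for moderate rates):

```python
import math
import random

def sample_pois_plus(lam, rng=random):
    """Zero-truncated Poisson: k >= 1 with
    P(k) = lam^k e^{-lam} / (k! (1 - e^{-lam}))."""
    u = rng.random()
    # P(1 | k >= 1); then the Poisson recurrence P(k+1) = P(k) * lam / (k+1)
    term = lam * math.exp(-lam) / (1.0 - math.exp(-lam))
    k, cdf = 1, 0.0
    while cdf + term < u and k < 100_000:
        cdf += term
        k += 1
        term *= lam / k
    return k
```

When $b_{vj}^{(1)} = 0$ the latent count is simply set to zero; only the observed ones require a truncated draw, which keeps the binary-data sampler cheap on sparse matrices.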

Example results: count data

Multivariate count data

- Train on the 20newsgroups dataset, in which each document is summarized as a word count vector over the vocabulary.
  - We use all 11,269 training documents to infer a five-layer network.
  - After removing stopwords and terms that appear fewer than five times, we obtain a vocabulary with $K_0 = 33{,}420$.
- With $\eta^{(t)} = 0.05$ for all $t$, the network widths inferred by the PGBN are $[K_1, K_2, K_3, K_4, K_5] =$
  - [50, 50, 50, 50, 50] for $K_{1\max} = 50$
  - [100, 99, 99, 94, 87] for $K_{1\max} = 100$
  - [200, 161, 130, 94, 63] for $K_{1\max} = 200$
  - [396, 109, 99, 82, 68] for $K_{1\max} = 400$
  - [528, 129, 109, 98, 91] for $K_{1\max} = 600$
  - [608, 100, 99, 96, 89] for $K_{1\max} = 800$
- Test on the 7,505 documents in the testing set.

Feature learning for multi-class classification

Figure: Classification accuracy (%) of the PGBNs with Algorithm 1 for 20newsgroups multi-class classification (a) as a function of the depth T with various K1 max and (b) as a function of K1 max with various depths, with η (t) = 0.05 for all layers.

Prediction of heldout words

Figure: (a) Per-heldout-word perplexity (lower is better) for the NIPS12 corpus (using the 2000 most frequent terms) as a function of the upper bound of the first-layer width $K_{1\max}$ and network depth $T$, with 30% of the word tokens in each document used for training and $\eta^{(t)} = 0.05$ for all $t$. (b) For visualization, each curve in (a) is reproduced by subtracting its values from the average perplexity of the single-layer network.

Perplexity using stochastic gradient MCMC

Figure: Point-perplexity results versus time. (a) RCV1. (b) Wiki.
- Gibbs: Gibbs sampling, batch learning
- SGREM: stochastic gradient Riemannian (expanded mean) [Patterson & Teh, 2013; Chen, Fox & Guestrin, 2014]
- SGRRM: stochastic gradient Riemannian (reduced mean) [Cong, Bo & Zhou, 2016]
- MLSGR: multilayer stochastic gradient Riemannian (reduced mean) [Cong, Bo & Zhou, 2016]

Simulate documents

- Draw $\theta^{(T)}, \theta^{(T-1)}, \ldots, \theta^{(1)}$ using

  $\theta_{j'}^{(T)} \sim \mathrm{Gam}\big(r,\, [c_{j'}^{(T+1)}]^{-1}\big),$
  $\theta_{j'}^{(T-1)} \sim \mathrm{Gam}\big(\Phi^{(T)} \theta_{j'}^{(T)},\, [c_{j'}^{(T)}]^{-1}\big),$
  $\vdots$
  $\theta_{j'}^{(t)} \sim \mathrm{Gam}\big(\Phi^{(t+1)} \theta_{j'}^{(t+1)},\, [c_{j'}^{(t+1)}]^{-1}\big),$
  $\vdots$
  $\theta_{j'}^{(1)} \sim \mathrm{Gam}\big(\Phi^{(2)} \theta_{j'}^{(2)},\, [c_{j'}^{(2)}]^{-1}\big).$

- Calculate $E[x_{j'}^{(1)}] = \Phi^{(1)} \theta_{j'}^{(1)}$.
- Display the top 100 words.

I

mac apple bit mhz ram simms mb like memory just don cpu people chip chips think color board ibm speed does know se video time machines motherboard hardware lc cache meg ns simm need upgrade built vram good quadra want centris price dx run way processor card clock slots make fpu internal did macs cards ve pin power really machine say faster said software intel macintosh right week writes slot going sx performance things edu years nubus possible thing monitor work point expansion rom iisi ll add dram better little slow let sure pc ii didn ethernet lciii case kind

I

image jpeg gif file color files images format bit display convert quality formats colors programs program tiff picture viewer graphics bmp bits xv screen pixel read compression conversion zip shareware scale view jpg original save quicktime jfif free version best pcx viewing bitmap gifs simtel viewers don mac usenet resolution animation menu scanner pixels sites gray quantization displays better try msdos tga want current black faq converting white setting mirror xloadimage section ppm fractal amiga write algorithm mpeg pict targa arithmetic export scodal archive converted grasp lossless let space human grey directory pictures rgb demo scanned old choice grayscale compress

I

medical health disease doctor pain patients treatment medicine cancer edu hiv blood use years patient writes cause skin don like just aids symptoms number article help diseases drug com effects information doctors infection physician normal chronic think taking care volume condition drugs page says cure people tobacco hicnet know newsletter effective therapy problem common time women prevent surgery children center immune research called april control effect weeks low syndrome hospital physicians states clinical diagnosed day med age good make caused severe reported public safety child said cdc usually diet national studies tissue months way cases causing migraine smokeless infections does

I

men homosexual sex gay sexual homosexuality male don people partners promiscuous number just bi like study homosexuals percent cramer heterosexual think did dramatically numbers straight church reform population know report pythagorean life man good accept time said considered kinsey posted general optilink irrational social gays behavior way children make published johnson survey table new activity showing million statistics american sexuality shows want right women ve article ago exclusively eating virginia masters repent really say purpose member clayton apparent kolodny writes going press society evil function engaged relationships ryan evolutionary different does compared person join figure community edu chose interesting things

I

nissan electronics wagon altima delcoelect kocrsv station gm subaru sumax delco spiros hughes wax pathfinder legacy kokomo wagons smorris scott toyota seattleu don just like strong silver software luxury derek proof stanza seattle cisco morris cymbal triantafyllopoulos sportscar think people know near fool ugly proud claims flat statistics lincoln sedans bullet karl lee perth puzzled miata sentra maxima acura infiniti corolla mgb untruth verbatim good time consider way based make stand guys writes noticed want ve heavy suggestion eat steven horrible uunet studies armor fisher lust designs study definately lexus remove conversion embodied aesthetic elvis attached honey stole designing wd

Gamma Belief Networks (Deep Latent Dirichlet Allocation) Example results Binary data

Modeling high-dimensional sparse binary vectors

Figure panels: (a) classification accuracy (%) versus the number of layers T for K1max ∈ {50, 100, 200, 400, 600, 800}; (b) classification accuracy (%) versus K1max for T = 1, . . . , 5.

Figure: Results of the BerPo-GBNs on the binarized 20newsgroups term-document count matrix. The widths of the hidden layers are automatically inferred. In a random trial with Algorithm 2, the inferred network widths [K1, . . . , K5] for K1max = 50, 100, 200, 400, 600, and 800 are [50, 50, 50, 50, 50], [100, 97, 95, 90, 82], [178, 145, 122, 97, 72], [184, 139, 119, 101, 75], [172, 165, 158, 138, 110], and [156, 151, 147, 134, 117], respectively.
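The BerPo-GBNs tie binary observations to the PGBN's Poisson rates via the Bernoulli-Poisson link used in the authors' related work: b = 1(x ≥ 1) with x ∼ Pois(λ), so p(b = 1 | λ) = 1 − e^{−λ}. A minimal sketch assuming this link (function names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def berpo_prob(lam):
    """Bernoulli-Poisson link: b = 1(x >= 1) with x ~ Pois(lam),
    so p(b = 1 | lam) = 1 - exp(-lam)."""
    return 1.0 - np.exp(-lam)

def sample_latent_count(lam, b, rng):
    """Sample the latent count x given the binary observation b:
    x = 0 when b = 0; x ~ zero-truncated Poisson (x >= 1) when b = 1,
    drawn here by simple rejection (fine for moderate lam)."""
    if b == 0:
        return 0
    x = 0
    while x == 0:
        x = rng.poisson(lam)
    return x
```

Augmenting each binary entry with such a latent count turns the binary likelihood back into a Poisson one, so the count-based PGBN machinery applies unchanged.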

Gamma Belief Networks (Deep Latent Dirichlet Allocation) Example results Nonnegative real data

Modeling high-dimensional sparse nonnegative real vectors

I Train on the 60,000 MNIST digits in the training set; each digit is represented as a 784-dimensional nonnegative real vector.
I We use all 60,000 digits to infer a five-layer network. With η(t) = 0.05 for all t, the network widths inferred by the gamma-PGBN are [K1, K2, K3, K4, K5] =
  I [50, 50, 50, 50, 50] for K1max = 50
  I [100, 100, 100, 100, 100] for K1max = 100
  I [200, 200, 200, 200, 200] for K1max = 200
  I [400, 400, 399, 385, 321] for K1max = 400
I Test on the 10,000 MNIST digits in the test set.


Figure: Visualization of the inferred {Φ(1), . . . , Φ(T)} with K1max = 100. All Φ's are projected to the first layer. First row, from left to right: Φ(1), Φ(1)Φ(2), Φ(1)Φ(2)Φ(3). Second row, from left to right: Φ(1)Φ(2)Φ(3)Φ(4), Φ(1)Φ(2)Φ(3)Φ(4)Φ(5).

Figure: Visualization of the inferred {Φ(1) , · · ·, Φ(T ) } with K1max = 400. All Φ’s are projected to the first layer. From (a) to (e): Φ(1) , Φ(1) Φ(2) , Φ(1) Φ(2) Φ(3) , Φ(1) Φ(2) Φ(3) Φ(4) , Φ(1) Φ(2) Φ(3) Φ(4) Φ(5) .
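Projecting every layer's loadings to the first layer, as in these figures, amounts to taking cumulative matrix products of the Φ's. A minimal sketch (for MNIST, V = 784 and each projected column can be reshaped to a 28 × 28 image; the toy sizes below are placeholders):

```python
import numpy as np

def project_to_first_layer(Phis):
    """Cumulative products Phi(1), Phi(1)Phi(2), ..., Phi(1)...Phi(T),
    projecting each layer's factor loadings down to the data layer."""
    projections = [Phis[0]]
    for Phi in Phis[1:]:
        projections.append(projections[-1] @ Phi)
    return projections

# toy example with V = 6, K1 = 3, K2 = 2
rng = np.random.default_rng(0)
Phi1 = rng.dirichlet(np.ones(6), size=3).T  # (6, 3), columns on the simplex
Phi2 = rng.dirichlet(np.ones(3), size=2).T  # (3, 2), columns on the simplex
projs = project_to_first_layer([Phi1, Phi2])
```

Since each Φ has simplex-constrained columns, every projected column also sums to one, so deeper-layer factors remain interpretable as distributions over the observed dimensions.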

Figure: Left: layers 5 to 4;

Right: layers 4 to 3

Figure: Left: layers 3 to 2;

Right: layers 2 to 1

Figure panels: (a) classification accuracy (%) versus the number of layers T for K1max ∈ {25, 50, 100, 200, 400}; (b) classification accuracy (%) versus K1max for T = 1, . . . , 5.

Figure: Left: classification accuracy (%) of the PRG-GBNs with various K1max as a function of the depth T. Right: classification accuracy (%) of the PRG-GBNs with various depths as a function of K1max (T ∈ {1, . . . , 5}).

Conclusions

I The Poisson gamma belief network is proposed to extract a multilayer representation for high-dimensional count vectors.
I Upward-downward Gibbs sampling jointly trains all layers.
I A layer-wise training strategy infers the network structure.
I Deep latent Dirichlet allocation and multilayer stochastic gradient Riemannian MCMC.
I Extension to modeling binary and nonnegative real data.
I The data can be understood by examining the features of different layers and their relationships through the structure of the network.
I For big-data problems, one rarely has a sufficient computational budget to let the first-layer width grow without bound; it is therefore natural to consider a deep network that uses a multilayer representation to better allocate its resources and increase its representational power under limited computation.
I Our algorithm provides a natural way to reach a good compromise between the widths and the depth of the network.

Gamma Belief Networks (Deep Latent Dirichlet Allocation) References

Main References

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. arXiv:1402.4102, 2014.
S. Patterson and Y. W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, pages 3102–3110, 2013.
M. Zhou, Y. Cong, and B. Chen. The Poisson gamma belief network. In NIPS, 2015.
M. Zhou, Y. Cong, and B. Chen. Gamma belief networks. arXiv:1512.03081, Dec. 2015.
M. Zhou, O. Padilla, and J. G. Scott. Priors for random count matrices derived from a family of negative binomial processes. To appear in JASA, 2016.
Y. Cong, B. Chen, and M. Zhou. Learning deep latent Dirichlet allocation via multilayer stochastic gradient Riemannian MCMC. Preprint, June 2016.

