Quick Upload

Loading...
Flash Player 9 (or above) is needed to view slideshows. We have detected that you do not have it on your computer.To install it, go here
Post to Twitter Post to Twitter
Share on Facebook
Myspace Hi5 Friendster Xanga LiveJournal Facebook Blogger Tagged Typepad Freewebs BlackPlanet gigya icons
« Prev Comments 0 - 0 of 0 Next »
    Add a comment If you have a SlideShare account, login to comment; otherwise comment as a guest.
    • We are currently the 5th biggest website on the internetWe recently got 100M active user accounts. We are still growing at a speed of doubling every 12 months. Yahoo has the largest active registered user base in the world, which is around 400M.That’s more than 2 billion page views a day.All these numbers put a huge load on our data center and software. The next slide I will show you the simplified architecture of Facebook.
    • In Facebook, we have 3 service tiers. The first tier, web tier, consists of more than 10K machines, and runs apache http service.The second tier, memcached tier, has around 1K machines. They are mainly used to cache the data in memory.The third tier, mysql tier, has around 2K machines. All user data are stored on these machines.
    • This is the architecture of our backend data warehouing system. This system provides important information on the usage of our website, including but not limited to the number page views of each page, the number of active users in each country, etc.Because of the over 2B page views, we generate 1TB of compressed log data every day. All these data are stored and processed by the hadoop cluster which consists of over 300 machines. The summary of the log data is then copied to Oracle and MySQL databases, to make sure it is easy for people to access.
    • The majority of the software that we showed on the last 2 pages are actually open source software. At Facebook, we rely heavily on open source software. This allows us to leverage the existing efforts from all the great engineer all over the world, and helps us to reduce the development cost.On the other hand, Facebook is also a significant contributor to the open source community. We want to make sure our efforts can help others as well. We will show some of our contributions on the next page.
    • Facebook Open Platform is a snapshot of the infrastructure that runs Facebook Platform, which runs more than 50K applications from third parties.Thrift is a cross-language library for object serialization/deserialization. It was open sourced in April 2007, and only more than 1 year later, in July 2008, Google open sourced a similar library called protocol buffer.Cassandra is a distributed database system that supports row-level consistency. It can be considered as the open source version of Google's BigTable project.Hadoop is an open-source distributed file system and map-reduce framework. Hive is a database/data warehousing layer on top of that.
    • 13
    • We assume there are only 2 mappers and 2 reducers. Each machine runs 1 mapper and 1 reducer.15
    • 23
    • 24
    • 25
    SlideShare is now available on LinkedIn. Add it to your LinkedIn profile.

    Hadoop and Hive

    From zshao, 1 month ago Add as contact

    Facebook and Open Source talk for University of Illinois at Urbana-Champaign by Zheng Shao

    474 views | 0 comments | 1 favorites | 21 downloads | 0 embeds (Stats)

    Embed in your blog options close
    Embed (wordpress.com) Exclude related slideshows Embed in your blog

    More Info

    This slideshow is Public
    Total Views: 474 on Slideshare: 474 from embeds: 0
    Flagged as inappropriate Flag as inappropriate

    Flag as inappropriate

    Select your reason for flagging this slideshow as inappropriate.

    If needed, use the feedback form to let us know more details.

    Slideshow Transcript

    1. Slide 2: F a c e book a nd Ope n S ourc e Unive rs ity of Illinois , Urb a na -Ch a mpa ig n Zh e ng S h a o F a c e b ook Da ta Te a m 10/ 2008 3/
    2. Slide 3: Curre nt S ta tus of F a c e book ▪ #5 s ite o n the Inte rne t ▪ 100+ millio n us e rs ▪ 65 billio n mo nthly pag e vie ws ▪ > 2^32 ph otos • 400,000 de ve lope rs a round th e world • G oog le / s e a rc h q ue rie s 7 • Mo re h ourly ne ws s torie s th a n e ve ry oth e r m e dia s ourc e in h is tory c om b ine d
    3. Slide 4: Arc h ite c ture ▪ 3 s e rvic e tie rs : ▪ We b/Apa c h e ▪ Me mc a c h e d ▪ MyS QL ▪ Us e s : ▪ Linux ▪ P HP
    4. Slide 5: Arc h ite c ture (c ontinue d) ▪ B a c ke nd Da ta Wa re h ous ing S ys te m : Web Servers Scribe Servers Network Storage Oracle RAC Hive on Hadoop Cluster MySQL
    5. Slide 6: Ope n S ourc e S tra te g y ▪ F a c e b ook re lie s h e a vily on ope n s ourc e s oftwa re . ▪ F a c e b ook m a ke s b ig c ontrib utions to th e ope n s o urc e c om m unity. ▪ Wh a t’s th e b e ne fit of ope n s ourc e for you? ▪ Eve rybo dy c an us e it, inc luding YOU ▪ Eve rybo dy c an wo rk o n it, inc luding YOU
    6. Slide 7: Ope n S ourc e a t F a c e book ▪ Th rift ▪ C a s s a ndra ▪ F a c e b ook Ope n P la tform ▪ Me m C a c h e D e nh a nc e m e nts ▪ P HP e nh a nc e m e nts ▪ Hive o n to p o f Hado o p ▪ Ma ny m ore a t h ttp:/ de ve lope rs .fa c e b ook.c om / ns ourc e .ph p / ope
    7. Slide 8: Ha doop a nd Hive Distributed F System, Map-R ile educe, and SQL
    8. Slide 9: Ha doop ▪ Mim ic s G oog le F ile S ys te m a nd Ma p R e duc e ▪ S ta rte d b y Y a h oo in F e b 20 06 ▪ S upporte d/ e d b y 30+ c om pa nie s / us unive rs itie s now, inc luding G oog le : h ttp:/ wiki.a pa c h e .org / a doop/ owe re dB y / h P
    9. Slide 10: Wh y Ha doop? ▪ We Wa nt ▪ E ffic ie n c y ▪ S c a la b ility ▪ R e dunda nc y ▪ Loa d B a la n c e ▪ G oog le F ile S ys te m a nd Ma p R e d uc e is nic e ▪ B ut not ope n s ourc e
    10. Slide 11: Hive ▪ A da ta b a s e / ta wa re h ous e on top of Ha doop da ▪ R ic h da ta type s (s truc ts , lis ts a nd m a ps ) ▪ Effic ie nt imple me ntatio ns o f S QL filte rs , jo ins and g ro up-by’s o n to p o f map re duc e ▪ E a s y in te ra c tions with diffe re nt prog ra m m in g la ng ua g e s
    11. Slide 12: Wh y Hive ? ▪ We wa nt ▪ S QL s upport ▪ E ffic ie n c y ▪ S c a la b ility ▪ R e dunda nc y ▪ E a s y to a dd ne w fe a ture s ▪ Alte rna tive s : ▪ Tra ditiona l da ta b a s e s ys te m s ▪ P roprie ta ry da ta wa re h ous ing s ys te m s ▪ S a wz a ll / P ig
    12. Slide 13: Hive Arc h ite c ture Map Reduce HD S F We b UI Hive C LI Mg mt, e tc B rows ing Que rie s DDL Me ta S tore Hive QL P a rs e r P la nne r E xe c ution S e rDe Th rift J ute J S ON Th rift AP I
    13. Slide 14: (S implifie d) Ma p R e duc e R e vie w Ma c h ine 1 <k1, v1> <nk1, nv1> <nk1, nv1> <nk1, nv1> <nk1, 2> <k2, v2> <nk2, nv2> <nk3, nv3> <nk1, nv6> <nk3, 1> <k3, v3> <nk3, nv3> <nk1, nv6> <nk3, nv3> Loc a l G lob a l Loc a l Loc a l Ma p S h uffle S ort R e duc e Ma c h ine 2 <k4, v4> <nk2, nv4> <nk2, nv4> <nk2, nv4> <k5, v5> <nk2, nv5> <nk2, nv5> <nk2, nv5> <nk2, 3> <k6, v6> <nk1, nv6> <nk2, nv2> <nk2, nv2>
    14. Slide 15: Hive QL – J oin pa g e _vie w pv_us e rs us e r pa g e id us e rid time pa g e id a ge us e rid a ge g e nde r 1 111 9:08:01 1 25 X 111 25 fe ma le = 2 111 9:08:13 2 25 222 32 ma le 1 222 9:08:14 1 32 • S QL: INS E R T INTO TAB LE pv_us e rs S E LE CT pv.pa g e id, u.a g e F R OM pa g e _vie w pv J OIN us e r u ON (pv.us e rid = u.us e rid);
    15. Slide 16: Hive QL – J oin in Ma p R e duc e pa g e _vie w pa g e id us e rid time ke y va lue ke y va lue 1 111 9:08:01 111 <1,1> 111 <1,1> 2 111 9:08:13 111 <1,2> 111 <1,2> 1 222 9:08:14 222 <1,1> 111 <2,25> S h u ffle us e r Ma p S o rt R e duc e us e rid a ge g e nde r ke y va lue ke y va lue 111 25 fe ma le 111 <2,25> 222 <1,1> 222 32 ma le 222 <2,32> 222 <2,32>
    16. Slide 17: Hive QL – G roup B y pv_us e rs pa g e id_a g e _s um pa g e id a ge pa g e id a ge Count 1 25 1 25 1 2 25 2 25 2 1 32 1 32 1 2 25 • S QL: ▪ INS E R T INTO TAB LE pa g e id_a g e _s um ▪ S E LE CT pa g e id, a g e , c ount(1) ▪ F R OM pv_us e rs ▪ G R OUP B Y pa g e id, a g e ;
    17. Slide 18: Hive QL – G roup B y in Ma p R e duc e pv_us e rs p pa g e id a ge ke y va lue ke y va lue pa 1 25 <1,25> 1 <1,25> 1 2 25 <2,25> 1 <1,32> 1 S h u ffle R e duc e Ma p S o rt pa g e id a ge ke y va lue ke y va lue pa 1 32 <1,32> 1 <2,25> 1 2 25 <2,25> 1 <2,25> 1
    18. Slide 19: Hive QL – G roup B y with Dis tinc t pa g e _vie w pa g e id us e rid time re s ult 1 111 9:08:01 pa g e id c ount_dis tinc t_us e rid 2 111 9:08:13 1 2 1 222 9:08:14 2 1 2 111 9:08:20 ▪ S QL ▪ S E LE CT pa g e id, COUNT(DIS TINCT us e rid) ▪ F R OM pa g e _vie w G R OUP B Y pa g e id
    19. Slide 20: Hive QL – G rou p B y with Dis tinc t in Ma p R e duc e ▪ s e nd your th inking s to z s h a o@fa c e book.c om
    20. Slide 21: Hive Optim iz a tions Efficient execution of SQL on Map Reduce
    21. Slide 22: (S implifie d) Ma p R e duc e R e vis it Ma c h ine 1 <k1, v1> <nk1, nv1> <nk1, nv1> <nk1, nv1> <nk1, 2> <k2, v2> <nk2, nv2> <nk3, nv3> <nk1, nv6> <nk3, 1> <k3, v3> <nk3, nv3> <nk1, nv6> <nk3, nv3> Loc a l Glo bal Lo c al Loc a l Ma p S huffle S o rt R e duc e Ma c h ine 2 <k4, v4> <nk2, nv4> <nk2, nv4> <nk2, nv4> <k5, v5> <nk2, nv5> <nk2, nv5> <nk2, nv5> <nk2, 3> <k6, v6> <nk1, nv6> <nk2, nv2> <nk2, nv2>
    22. Slide 23: Hive Optim iz a tions – Me rg e S e q ue ntia l Ma p R e duc e J ob s A ke y av AB 1 111 ke y av bv B 1 111 222 AB C ke y bv Ma p R e duc e ke y av bv cv 1 222 C Ma p R e duc e 1 111 222 333 ke y cv 1 333 ▪ S QL: ▪ F R OM (a join b on a .ke y = b .ke y) join c on a .ke y = c .ke y S E LE C T …
    23. Slide 24: Hive Optim iz a tions – S h a re C om m o n R e a d Ope ra tions • E xte nde d S QL pa g e id a ge pa g e id c ount ▪ FROM pv_us e rs 1 25 Ma p R e duc e 1 1 ▪ INSERT INTO TAB LE pv_pa g e id_s um 2 32 2 1 ▪ S E LE CT pa g e id, c ount(1) ▪ G R OUP B Y pa g e id ▪ INSERT INTO TAB LE pv_a g e _s um pa g e id a ge a ge c ount ▪ S E LE CT a g e , c ount(1) 1 25 Ma p R e duc e 25 1 ▪ G R OUP B Y a g e ; 2 32 32 1
    24. Slide 25: Hive Optim iz a tions – Loa d B a la nc e P rob le m pv_us e rs pa g e id a g e 1 25 pa g e idg e id_a g a rtiaum um pa _a g e _p e _s l_s pa g e id aaggee cc ount ount 1 25 Ma p-R e duc e 1 25 25 24 1 25 2 32 32 11 2 32 1 25 2 1 25
    25. Slide 26: F uture Works ▪ Da ta b a s e Ma na g e m e nt F unc tiona litie s ▪ Optim iz a tions us ing C om b ine rs ▪ Optim iz a tions b a s e d on Ta b le S ta tis tic s (Le a rn m ore a b ou t th a t in C S 5 11) ▪ Eve rybo dy c an us e it, inc luding YOU ▪ Eve rybo dy c an wo rk o n it, inc luding YOU
    26. Slide 27: Que s tions ? z s h a o@fa c e book.c om
    27. Slide 28: Cre dits Da ta Ma n a g e m e n t a t F a c e b o o k, J Hammerbacher eff Hive – Da ta Wa re h o us in g & An a lytic s on Ha d o o p , Joydeep Sen Sarma, Ashish T husoo P e ople S ure s h An th o n y Zh e n g S h a o P ra s a d C h a kka P e te Wyc koff Na m it J a in R a g h u Mu rth y J o yd e e p S e n S a rm a As h is h Th u s o o
    28. Slide 29: (c ) 2 008 F a c e b o ok, Inc . or its lic e ns o rs .  \"F a c e b oo k\" is a re g is te re d tra de m a rk of F a c e b o ok, Inc .. All rig h ts re s e rve d. 1.0
    29. Slide 30: Appe ndix P a g e s
    30. Slide 31: De a ling with S truc ture d Da ta ▪ Type s ys te m ▪ P rimitive type s ▪ R e c urs ive ly build up us ing Compos ition/ ps / ts Ma Lis ▪ G e ne ric (De )S e ria liz a tion Inte rfa c e (S e rDe ) ▪ To re c urs ive ly lis t s c h e ma ▪ To re c urs ive ly a c c e s s fie lds with in a row obje c t ▪ S e ria liz a tion fa m ilie s im ple m e nt inte rfa c e ▪ Th rift DDL ba s e d S e rDe ▪ De limite d te xt ba s e d S e rDe ▪ You c a n write your own S e rDe ▪ S c h e m a E volution
    31. Slide 32: Me ta S tore ▪ S tore s Ta b le / a rtition prope rtie s : P ▪ Ta b le s c h e m a a nd S e rDe lib ra ry ▪ Ta b le Loc a tion on HDF S ▪ Log ic a l P a rtitioning ke ys a nd type s ▪ Oth e r inform a tion ▪ Th rift AP I ▪ C urre nt c lie nts in P h p (We b Inte rfa c e ), P yth on (old C LI), J a va (Que ry E ng ine a nd C LI), P e rl (Te s ts ) ▪ Me ta da ta c a n b e s tore d a s te xt file s or e ve n in a S QL b a c ke nd
    32. Slide 33: Hive CLI ▪ DDL: ▪ c re a te ta b le /drop ta b le / na m e ta b le re ▪ a lte r ta b le a dd c olum n ▪ B rows ing : ▪ s h ow ta b le s ▪ de s c rib e ta b le ▪ c a t ta b le ▪ Loa ding Da ta ▪ Que rie s
    33. Slide 34: We b UI for Hive ▪ Me ta S tore UI: ▪ B rows e a nd na vig a te a ll ta b le s in th e s ys te m ▪ C om m e n t on e a c h ta b le a nd e a c h c olu m n ▪ Als o c a ptu re s da ta de pe n de n c ie s ▪ HiP a l: ▪ Inte ra c tive ly c ons truc t S QL q ue rie s b y m ou s e c lic ks ▪ S upport proje c tion, filte rin g , g roup b y a nd joining ▪ Als o s u ppo rt
    34. Slide 35: Hive Que ry La ng ua g e ▪ P h ilos oph y ▪ S QL ▪ Ma p-R e duc e with c us to m s c ripts (h a doop s tre a m ing ) ▪ Que ry Ope ra tors ▪ P roje c tions ▪ E q ui-joins ▪ G roup b y ▪ S a m pling ▪ Orde r B y
    35. Slide 36: Hive QL – Cus tom Ma p/ e duc e S c ripts R • E xte nde d S QL: • F R OM ( • F R OM pv_us e rs • S E LE CT TRANSFORM(pv_us e rs .us e rid, pv_us e rs .da te ) • US ING 'ma p_s c ript' AS (dt, uid) • CLUSTER BY dt) ma p • INS E R T INTO TAB LE pv_us e rs _re duc e d • S E LE CT TRANSFORM(ma p.dt, ma p.uid) • US ING 're duc e _s c ript' AS (da te , c ount); • Ma p-R e duc e : s im ila r to h a doop s tre a m ing