我对facebook的运转着迷。这是一个很独特的环境,不容易被复制(他们的体系并不适合所有的公司,即使他们努力尝试过)。下面是我和facebook的朋友们关于他们如何开发和管理项目的记录。
现在距离我收集的这些信息又过去6个月了,我相信facebook肯定又对他们的项目开发实践进行了改进。所以这些记录可能会有点过时。同时facebook的工程师驱动文化也越来越为大众所知。非常感谢那些帮助我整理这篇文章的facebook的朋友们。
记录:
* 截止到2010年6月,facebook有将近2000名员工,10个月前只有1100名,一年之间差不多翻了一番。
* 两个最大的部门是工程师和运维,每个部门大概都是400-500人。这两个部门人数大约占了公司的一半。
* 产品经理与工程师的比例大约为1-7到1-10。
* 每个工程师入职时,都要接收4-6周的培训,通过修补bugs和听高级开发工程师的课程来熟悉facebook。
* 培训结束后,每个工程师都可以接触线上的数据库(更大的权力意味着更大的责任,也有一份"勿做清单",不然可能会被开,比如共享用户的隐私数据)。
* 有非常牢靠的安全体系,以免有人不小心/故意做了些不好的事。
* 每个工程师可以修改facebook的任何代码,随时可以迁入。
* 浓厚的工程师驱动文化。"产品经理基本可以被忽略",这是facebook一名员工的话。工程师可以修改流程的细节,重新安排工作任务,随时植入自己的想法。
* 在每月的跨部门会议上,由工程师来汇报工作进度,市场部和产品经理会出席会议,也可以做些简短的发言,但如果说得太多,很可能就会被打小报告。他们确实想让工程师来主导产品的开发,对自己的产品负责。
* 项目需要的资源都是自愿的
o 一个产品经理把工程师们召集到一起,让他们对他的想法产生兴趣。
o 工程师们决定开发那些让他们感兴趣的特性。
o 工程师跟他们的经理说:"我下周想开发这5个新特性"。
o 经理会让工程师独立开发,可能有时会让他优先完成一些特性。
o 工程师独立完成所有的特性——前端/后端/数据库,等等所有相关的部分。如果需要得到设计人员的帮助,需要先让设计人员对你的想法产生兴趣。其他如架构之类的也一样。但总体来说,工程师要独立完成所有的任务。
* 对于某个特性是否值得开发的争论,通常是这么解决的:花一个星期的时间完成他,并在小部分人群中(如1%)进行测试。
* 工程师常常希望解决难题,这能获得声望和尊敬。他们很难对前端项目或UI设计产生太大的兴趣。这跟其他公司可能正好相反。在facebook,后端任务,比如新的feed算法,广告投放算法,memcache优化等等,是工程师真正感兴趣的。
* 所有的代码修改都要进行审核(通过一个或多个工程师),但News Feed是个例外,因为太重要了,Zuckerberg会亲自review。
* 所有的修改至少要被一个人审核,而且这个系统可以让任何人很方便地审核其他人的代码,即使你没有邀请他
* 工程师负责测试,代码修复,和维护自己的项目。
* 每个办公室或通过VPN连接的员工会使用下一版的facebook,这个版本的facebook会经常更新,通常比公开的早1-12小时。所有的员工被强烈建议提交bugs,而且通常会很快被修复。
* 很奇怪只有很少的QA或自动测试——"大部分工程师都能写出基本没有bug的代码,只是在其他公司他们不需要这么做。如果有QA部门,他们只要把代码写完,扔给他们就行了"
* [针对上一条]我们有自动测试,代码发布前必须要通过测试。我们不相信"所有的工程师都能写出没有bug的代码",毕竟这是一个商业公司。
* 很奇怪,缺少产品经理的影响和控制——产品经理是很独立的和自由的。产生影响力的关键是与工程师和工程师的领导们们搞好关系。需要大致了解技术,不要提一些愚蠢的想法。
* 所有提交的代码每周二打包一次。
* 只要多一分努力,终于一天会发生改变。
* 星期二的代码发布,需要所有的提交过代码的工程师在场。
* 代码打包前,工程师必须在一个特殊的IRC channel上。
* 运维执行打包过程
o facebook有大约60000台服务器
o 有9个代码发布级别
o 最小的级别只有6台服务器
o 星期二的代码发布会先发布到6台服务器上,运维组会检测这6台服务器的反应,保证代码正常工作,然后再提交到下一级
o 如果发布出现了一些问题(如报错等等),那么就停止下一级的部署,提交出错代码的工程师负责修复问题,然后从头继续发布。
o 所以一次发布可能会经历几次重复:1-2-3-fix. 回到1. 1-2-3-4-5-fix. 回到1. 1-2-3-4-5-6-7-8-9
* 运维组是受过严格训练,倍受尊敬,而且有商业意识的。他们的工作包括分析错误日志,负载和内存状态等等。还包括用户行为。
* 代码发布期间,运维组使用IRC-based页面系统,可以通过facebook/email/irc/im/sms ping每一个工程师,如果需要他们注意的话。对运维组不做回应是一件很羞愧的事。
* 代码一旦发布到第9级,并且稳定运行,就算发布成功了。
* 如果一个特性没有按时完成,也没什么大不了的,下次完成时一并发布即可。
* 如果被svn-blamed,public shamed或工作经常疏忽就很可能被开除。"这是一个高效的文化"。不够高效或者不够聪明的员工会被剔除。管理层会在6个月的时间里观察你表现,如果不合格,只能说再见。每一级都是这个待遇,即使是C级别和VP级别,如果不够高效,也会被开除。
* 被责骂不会导致解雇。我们特别尊重别人,原谅别人。大部分高级工程师都或多或少犯过一些严重的错误,包括我。但没有人因此被解雇。
* 我也没有遇到过因为上面提到过的犯错误而被解雇。有些人犯了错,他们会非常努力地去修复,也让其他人得到了学习。
原文 How Facebook Ships Code
Posted: January 17, 2011 | Author: yeeguy | Filed under: business management, product management, social networks, startups | 233 Comments »
I’m fascinated by the way Facebook operates. It’s a very unique environment, not easily replicated (nor would their system work for all companies, even if they tried). These are notes gathered from talking with many friends at Facebook about how the company develops and releases software.
eems like others are also interested in Facebook… The company’s developer-driven culture is coming under greater public scrutiny and other companies are grappling with if/how to implement developer-driven culture. The company is pretty secretive about its internal processes, though. Facebook’s Engineering team releases public Notes on new features and some internal systems, but these are mostly “what” kinds of articles, not “how”… So it’s not easy for outsiders to see how Facebook is able to innovate and optimize their service so much more effectively than other companies. In my own attempt as an outsider to understand more about how Facebook operates, I assembled these observations over a period of months. Out of respect for the privacy of my sources, I’ve removed all names and mention of specific features/products. And I’ve also waited for over six months to publish these notes, so they’re surely a bit out-of-date. I hope that releasing these notes will help shed some light on how Facebook has managed to push decision-making “down” in its organization without descending into chaos… It’s hard to argue with Facebook’s results or the coherence of Facebook’s product offerings. I think and hope that many consumer internet companies can learn from Facebook’s example.
HUGE thanks to the many folks who helped put together this view inside of Facebook. Thanks are also due to folks likeepriest and fryfrog who have written up corrections and edits.
Notes:
* as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago. Nearly doubling staff in under a year!
* the two largest teams are Engineering and Ops, with roughly 400-500 team members each. Between the two they make up about 50% of the company.
* product manager to engineer ratio is roughly 1-to-7 or 1-to-10
* all engineers go through 4 to 6 week “Boot Camp” training where they learn the Facebook system by fixing bugs and listening to lectures given by more senior/tenured engineers. estimate 10% of each boot camp’s trainee class don’t make it and are counseled out of the organization.
* after boot camp, all engineers get access to live DB (comes with standard lecture about “with great power comes great responsibility” and a clear list of “fire-able offenses”, e.g., sharing private user data)
* [EDIT thx fryfrog] “There are also very good safe guards in place to prevent anyone at the company from doing the horrible sorts of things you can imagine people have the power to do being on the inside. If you have to “become” someone who is asking for support, this is logged along with a reason and closely reviewed. Straying here is not tolerated, period.”
* any engineer can modify any part of FB’s code base and check-in at-will
* very engineering driven culture. ”product managers are essentially useless here.” is a quote from an engineer. engineers can modify specs mid-process, re-order work projects, and inject new feature ideas anytime. [EDITORIAL] The author of this blog post is a product manager, so this sentiment really caught my attention. As you’ll see in the rest of these notes, though, it’s apparent that Facebook’s culture has really embraced product management practices so it’s not as though the role of product management is somehow ignored or omitted. Rather, the culture of the company seems to be set so that *everyone* feels responsibility for the product.
* during monthly cross-team meetings, the engineers are the ones who present progress reports. product marketing and product management attend these meetings, but if they are particularly outspoken, there is actually feedback to the leadership that “product spoke too much at the last meeting.” they really want engineers to publicly own products and be the main point of contact for the things they built.
* resourcing for projects is purely voluntary.
o a PM lobbies group of engineers, tries to get them excited about their ideas.
o Engineers decide which ones sound interesting to work on.
o Engineer talks to their manager, says “I’d like to work on these 5 things this week.”
o Engineering Manager mostly leaves engineers’ preferences alone, may sometimes ask that certain tasks get done first.
o Engineers handle entire feature themselves — front end javascript, backend database code, and everything in between. If they want help from a Designer (there are a limited staff of dedicated designers available), they need to get a Designer interested enough in their project to take it on. Same for Architect help. But in general, expectation is that engineers will handle everything they need themselves.
* arguments about whether or not a feature idea is worth doing or not generally get resolved by just spending a week implementing it and then testing it on a sample of users, e.g., 1% of Nevada users.
* engineers generally want to work on infrastructure, scalability and “hard problems” — that’s where all the prestige is. can be hard to get engineers excited about working on front-end projects and user interfaces. this is the opposite of what you find in some consumer businesses where everyone wants to work on stuff that customers touch so you can point to a particular user experience and say “I built that.” At facebook, the back-end stuff like news feed algorithms, ad-targeting algorithms, memcache optimizations, etc. are the juicy projects that engineers want.
* commits that affect certain high-priority features (e.g., news feed) get code reviewed before merge. News Feed is important enough that Zuckerberg reviews any changes to it, but that’s an exceptional case.
* [CORRECTION -- thx epriest] “There is mandatory code review for all changes (i.e., by one or more engineers). I think the article is just saying that Zuck doesn’t look at every change personally.”
* [CORRECTION thx fryfrog] “All changes are reviewed by at least one person, and the system is easy for anyone else to look at and review your code even if you don’t invite them to. It would take intentionally malicious behavior to get un-reviewed code in.”
* no QA at all, zero. engineers responsible for testing, bug fixes, and post-launch maintenance of their own work. there are some unit-testing and integration-testing frameworks available, but only sporadically used.
* [CORRECTION thx fryfrog] “I would also add that we do have QA, just not an official QA group. Every employee at an office or connected via VPN is using a version of the site that includes all the changes that are next in line to go out. This version is updated frequently and is usually 1-12 hours ahead of what the world sees. All employees are strongly encouraged to report any bugs they see and these are very quickly actioned upon.”
* re: surprise at lack of QA or automated unit tests — “most engineers are capable of writing bug-free code. it’s just that they don’t have an incentive to do so at most companies. when there’s a QA department, it’s easy to just throw it over to them to find the errors.” [EDIT: please note that this was subjective opinion, I chose to include it in this post because of the stark contrast that this draws with standard development practice at other companies]
* [CORRECTION thx epriest] “We have automated testing, including “push-blocking” tests which must pass before the release goes out. We absolutely do not believe “most engineers are capable of writing bug-free code”, much less that this is a reasonable notion to base a business upon.”
* re: surprise at lack of PM influence/control — product managers have a lot of independence and freedom. The key to being influential is to have really good relationships with engineering managers. Need to be technical enough not to suggest stupid ideas. Aside from that, there’s no need to ask for any permission or pass any reviews when establishing roadmaps/backlogs. There are relatively few PMs, but they all feel like they have responsibility for a really important and personally-interesting area of the company.
* by default all code commits get packaged into weekly releases (tuesdays)
* with extra effort, changes can go out same day
* tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site
* engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”
* ops team runs code releases by gradually rolling code out
o facebook has around 60,000 servers
o there are 9 concentric levels for rolling out new code
o [CORRECTION thx epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
o the smallest level is only 6 servers
o e.g., new tuesday release is rolled out to 6 servers (level 1), ops team then observes those 6 servers and make sure that they are behaving correctly before rolling forward to the next level.
o if a release is causing any issues (e.g., throwing errors, etc.) then push is halted. the engineer who committed the offending changeset is paged to fix the problem. and then the release starts over again at level 1.
o so a release may go thru levels repeatedly: 1-2-3-fix. back to 1. 1-2-3-4-5-fix. back to 1. 1-2-3-4-5-6-7-8-9.
* ops team is really well-trained, well-respected, and very business-aware. their server metrics go beyond the usual error logs, load & memory utilization stats — also include user behavior. E.g., if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop a release for that reason so they can investigate.
* during the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention. not responding to ops team results in public shaming.
* once code has rolled out to level 9 and is stable, then done with weekly push.
* if a feature doesn’t get coded in time for a particular weekly push, it’s not that big a deal (unless there are hard external dependencies) — features will just generally get shipped whenever they’re completed.
* getting svn-blamed, publicly shamed, or slipping projects too often will result in an engineer getting fired. ”it’s a very high performance culture”. people that aren’t productive or aren’t super talented really stick out. Managers will literally take poor performers aside within 6 months of hiring and say “this just isn’t working out, you’re not a good culture fit”. this actually applies at every level of the company, even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.
* [CORRECTION, thx epriest] “People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for you).”
* [CORRECTION, thx epriest] “Getting blamed will NOT get you fired. We are extremely forgiving in this respect, and most of the senior engineers have pushed at least one horrible thing, myself included. As far as I know, no one has ever been fired for making mistakes of this nature.”
* [CORRECTION, thx fryfrog] “I also don’t know of anyone who has been fired for making mistakes like are mentioned in the article. I know of people who have inadvertently taken down the site. They work hard to fix what ever caused the problem and everyone learns from it. The public shaming is far more effective than fear of being fired, in my opinion.”
It’ll be super interesting to see how Facebook’s development culture evolves over time — and especially to see if the culture can continue scaling as the company grows into the thousands-of-employees.