Google三大论文之BigTable
Google三大论文之Bigtable
1 摘要
2 介绍
- 适用性广泛
- 可扩展
- 高性能和高可用性
3 数据模型
(row:string, column:string, time:int64)->string

3.1 行
3.2 列族
3.3 时间戳
- 用户可以指定只保存最后 n 个版本的数据
- 只保存“足够新”的版本的数据(比如,只保存最近 7 天的内容写入的数据)
4 API
// Open the table
Table *T = OpenOrDie(“/bigtable/web/webtable”);
// Write a new anchor and delete an old anchor
RowMutation r1(T, “com.cnn.www”);
r1.Set(“anchor:www.c-span.org”, “CNN”);
r1.Delete(“anchor:www.abc.com”);
Operation op;
Apply(&op, &r1);
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily(“anchor”);
stream->SetReturnAllVersions();
scanner.Lookup(“com.cnn.www”);
for (; !stream->Done(); stream->Next()) {
printf(“%s %s %lld %s\n”,
scanner.RowName(),
stream->ColumnName(),
stream->MicroTimestamp(),
stream->Value());
}
5 BigTable构件
- 确保在任何给定的时间内最多只有一个活动的 Master 副本
- 存储 BigTable 数据的自引导指令的位置
- 查找 Tablet 服务器,以及在 Tablet 服务器失效时进行善后
- 存储 BigTable 的模式信息
- 以及存储访问控制列表
6 介绍
- 为 Tablet 服务器分配 Tablets
- 检测新加入的或者过期失效的 Table 服务器
- 对 Tablet 服务器进行负载均衡
- 以及对保存在 GFS 上的文件进行垃圾收集
- 处理对模式的相关修改操作,例如建立表和列族
6.1 Tablet的位置

6.2 Table分配
- Master 服务器从 Chubby 获取一个唯一的 Master 锁,用来阻止创建其它的 Master 服务器实例
- Master 服务器扫描 Chubby 的服务器文件锁存储目录,获取当前正在运行的服务器列表
- Master 服务器和所有的正在运行的 Tablet 表服务器通信,获取每个 Tablet 服务器上 Tablet 的分配信息
- Master 服务器扫描 METADATA 表获取所有的 Tablet 的集合
6.3 Tablet服务

6.4 空间收缩
7 优化
7.1 局部性群组
7.2 压缩
7.3 通过缓存提高读操作的性能
7.3.1 Bloom 过滤器
7.3.2 Commit 日志的实现
7.3.3 Tablet恢复提速
7.3.4 利用不变性
8 性能评估

8.1 单个 Tablet 服务器的性能
8.2 性能提升
9 实际应用


9.1 Google Analytics
- Row Click 表(大约有 200TB 数据)的每一行存放了一个最终用户的会话。行的名字是一个包含 Web 站点名字以及用户会话创建时间的元组。这种模式保证了对同一个 Web 站点的访问会话是顺序的,会话按时间顺序存储。这个表可以压缩到原来尺寸的 14%
- Summary 表(大约有 20TB 的数据)包含了关于每个 Web 站点的、各种类型的预定义汇总信息。一个周期性运行的 MapReduce 任务根据 Raw Click 表的数据生成 Summary 表的数据,能够压缩到原有尺寸的 29%
9.2 Google Earth
9.3 个性化查询
10 经验教训
- 很多类型的错误都会导致大型分布式系统受损,这些错误不仅仅是通常的网络中断、或者很多分布式协议中设想的 fail-stop 类型的错误。内存数据损坏、网络中断、时钟偏差、机器挂起、扩展的和非对称的网络分区、使用的其它系统的Bug(比如 Chubby)、GFS 配额溢出、计划内和计划外的硬件维护。通过修改协议来解决这些问题, 比如RPC 机制中加入了 Checksum
- 在彻底了解一个新特性会被如何使用之后,再决定是否添加这个新特性是非常重要的。
- 系统级的监控对 Bigtable 非常重要。这个特性允许我们检测和修正很多的问题,比如 Tablet 数据结构上的锁的内容、在修改操作提交时对 GFS 的写入非常慢的问题、以及在 METADATA 表的 Tablet 不可用时,对 METADATA 表的访问挂起的问题。还可以帮助我们跟踪所有的集群状态、监控它们的大小、检查集群运行的我们软件的版本、监控集群流入数据的流量,以及检查是否有引发集群高延时的潜在因素。
- 简洁的设计和编码给维护和调试带来的巨大好处。
11 相关工作
- Boxwood:提供了诸如分布式协议、锁、分布式 Chunk 存储以及分布式 B-tree 存储,目的是提供创建类似文件系统、数据库等高级服务的基础构件。
- 分布式的hash表: CAN、Chord、Tapestry和 Pastry,应对各种不同的传输带宽、不可信的协作者、频繁的更改配置等。
- 并行的数据库系统,能够存储海量的数据:RAC, DB2
- 基于列的存储方案在压缩和磁盘读取方面具有的性能:C-Store、 Sybase IQ、SenSage、KDB+, MonetDB/X100,Daytona
- C-Store: 操作更像关系型数据库
12 结论
13 感谢
- David Nagle
- Brad Calder
- Dan Aguayo
- Sameer Ajmani
- Zhifeng Chen
- Bill Coughran
- Mike Epstein
- Healfdene Goguen
- Robert Griesemer
- Jeremy Hylton
- Josh Hyman
- Alex Khesin
- Joanna Kulik
- Alberto Lerner
- Sherry Listgarten
- Mike Maloney
- Eduardo Pinheiro
- Kathy Polizzi
- Frank Yellin
- Arthur Zwiegincew.
参考文献
-
[1] ABADI, D. J., MADDEN, S. R., AND FERREIRA, M. C. Integrating compression and execution in columnoriented database systems. Proc. of SIGMOD (2006).
-
[2] AILAMAKI, A., DEWITT, D. J., HILL, M. D., AND SKOUNAKIS, M. Weaving relations for cache performance.In The VLDB Journal (2001), pp. 169-180.
-
[3] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource containers: A new facility for resource management in server systems. In Proc. of the 3rd OSDI (Feb. 1999), pp. 45-58.
-
[4] BARU, C. K., FECTEAU, G., GOYAL, A., HSIAO, H., JHINGRAN, A., PADMANABHAN, S., COPELAND,G. P., AND WILSON, W. G. DB2 parallel edition. IBM Systems Journal 34, 2 (1995), 292-322.
-
[5] BAVIER, A., BOWMAN, M., CHUN, B., CULLER, D., KARLIN, S., PETERSON, L., ROSCOE, T., SPALINK, T., AND WAWRZONIAK, M. Operating system support for planetary-scale network services. In Proc. of the 1st NSDI(Mar. 2004), pp. 253-266.
-
[6] BENTLEY, J. L., AND MCILROY, M. D. Data compression using long common strings. In Data Compression Conference (1999), pp. 287-295.
-
[7] BLOOM, B. H. Space/time trade-offs in hash coding with allowable errors. CACM 13, 7 (1970), 422-426.
-
[8] BURROWS, M. The Chubby lock service for looselycoupled distributed systems. In Proc. of the 7th OSDI (Nov. 2006).
-
[9] CHANDRA, T., GRIESEMER, R., AND REDSTONE, J.Paxos made live ? An engineering perspective. In Proc. of PODC (2007).
-
[10] COMER, D. Ubiquitous B-tree. Computing Surveys 11, 2 (June 1979), 121-137.
-
[11] COPELAND, G. P., ALEXANDER, W., BOUGHTER, E. E., AND KELLER, T. W. Data placement in Bubba. In Proc. of SIGMOD (1988), pp. 99-108.
-
[12] DEAN, J., AND GHEMAWAT, S. MapReduce: Simplified data processing on large clusters. In Proc. of the 6th OSDI (Dec. 2004), pp. 137-150.
-
[13] DEWITT, D., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D. Implementation techniques for main memory database systems. In Proc. of SIGMOD (June 1984), pp. 1-8.
-
[14] DEWITT, D. J., AND GRAY, J. Parallel database systems: The future of high performance database systems. CACM 35, 6 (June 1992), 85-98.
-
[15] FRENCH, C. D. One size ts all database architectures do not work for DSS. In Proc. of SIGMOD (May 1995), pp. 449-450.
-
[16] GAWLICK, D., AND KINKADE, D. Varieties of concurrency control in IMS/VS fast path. Database Engineering Bulletin 8, 2 (1985), 3-10.
-
[17] GHEMAWAT, S., GOBIOFF, H., AND LEUNG, S.-T. The Google file system. In Proc. of the 19th ACM SOSP (Dec.2003), pp. 29-43.
-
[18] GRAY, J. Notes on database operating systems. In Operating Systems ? An Advanced Course, vol. 60 of Lecture Notes in Computer Science. Springer-Verlag, 1978.
-
[19] GREER, R. Daytona and the fourth-generation language Cymbal. In Proc. of SIGMOD (1999), pp. 525-526.
-
[20] HAGMANN, R. Reimplementing the Cedar file system using logging and group commit. In Proc. of the 11th SOSP (Dec. 1987), pp. 155-162.
-
[21] HARTMAN, J. H., AND OUSTERHOUT, J. K. The Zebra striped network file system. In Proc. of the 14th SOSP(Asheville, NC, 1993), pp. 29-43.
-
[22] KX.COM. kx.com/products/database.php. Product page.
-
[23] LAMPORT, L. The part-time parliament. ACM TOCS 16,2 (1998), 133-169.
-
[24] MACCORMICK, J., MURPHY, N., NAJORK, M., THEKKATH, C. A., AND ZHOU, L. Boxwood: Abstractions as the foundation for storage infrastructure. In Proc. of the 6th OSDI (Dec. 2004), pp. 105-120.
-
[25] MCCARTHY, J. Recursive functions of symbolic expressions and their computation by machine. CACM 3, 4 (Apr. 1960), 184-195.
-
[26] O’NEIL, P., CHENG, E., GAWLICK, D., AND O’NEIL, E. The log-structured merge-tree (LSM-tree). Acta Inf. 33, 4 (1996), 351-385.
-
[27] ORACLE.COM. www.oracle.com/technology/products/database/clustering/index.html. Product page.
-
[28] PIKE, R., DORWARD, S., GRIESEMER, R., AND QUINLAN, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal 13, 4 (2005), 227-298.
-
[29] RATNASAMY, S., FRANCIS, P., HANDLEY, M., KARP, R., AND SHENKER, S. A scalable content-addressable network. In Proc. of SIGCOMM (Aug. 2001), pp. 161-172.
-
[30] ROWSTRON, A., AND DRUSCHEL, P. Pastry: Scalable, distributed object location and routing for largescale peer-to-peer systems. In Proc. of Middleware 2001(Nov. 2001), pp. 329-350.
-
[31] SENSAGE.COM. sensage.com/products-sensage.htm. Product page.
-
[32] STOICA, I., MORRIS, R., KARGER, D., KAASHOEK, M. F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proc. of SIGCOMM (Aug. 2001), pp. 149-160.
-
[33] STONEBRAKER, M. The case for shared nothing. Database Engineering Bulletin 9, 1 (Mar. 1986), 4-9.
-
[34] STONEBRAKER,M., ABADI, D. J., BATKIN, A., CHEN, X., CHERNIACK, M., FERREIRA, M., LAU, E., LIN, A., MADDEN, S., O’NEIL, E., O’NEIL, P., RASIN, A., TRAN, N., AND ZDONIK, S. C-Store: A columnoriented DBMS. In Proc. of VLDB (Aug. 2005), pp. 553-564.
-
[35] STONEBRAKER, M., AOKI, P. M., DEVINE, R., LITWIN, W., AND OLSON, M. A. Mariposa: A new architecture for distributed data. In Proc. of the Tenth ICDE(1994), IEEE Computer Society, pp. 54-65.
-
[36] SYBASE.COM. www.sybase.com/products/databaseservers/sybaseiq. Product page.
-
[37] ZHAO, B. Y., KUBIATOWICZ, J., AND JOSEPH, A. D. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSD-01-1141, CS Division, UC Berkeley, Apr. 2001.
-
[38] ZUKOWSKI, M., BONCZ, P. A., NES, N., AND HEMAN, S. MonetDB/X100 ?A DBMS in the CPU cache. IEEE Data Eng. Bull. 28, 2 (2005), 17-22.
作者: UncleLLD 发表日期:2023 年 3 月 15 日