Project of the Week: Dat


This week's featured project is Dat, a grant-funded, open source, decentralized tool for distributing data sets. Dat is built and maintained by a geodistributed team, many of whom helped write this post.


A screenshot of the main view of dat-desktop, showing a few rows of shared dats

First off, what is Dat?

We wanted to bring the best parts of peer-to-peer and distributed systems to data sharing. We started with scientific data sharing and then began branching out into research institutions, government, public service, and open source teams as well.

Another way to think about it is as a sync and upload app like Dropbox or BitTorrent Sync, except Dat is open source. Our goal is to be powerful, open source, non-profit data sharing software for big, small, medium, small-batch and big-batch data.

To use the dat CLI tool, all you have to type is:

dat share path/to/my/folder

dat will then create a link that you can use to send that folder to someone else. No central servers or third parties get access to your data. Unlike BitTorrent, it's also impossible to sniff who is sharing what (see the Dat Paper draft for more details).

Now we know what Dat is. How does Dat Desktop fit in?

Dat Desktop is a way to make Dat accessible to people who can't or don't want to use the command line. You can host multiple dats on your machine and serve the data over your network.

Can you share some cool use cases?

DataRefuge + Project Svalbard

We're working on a thing codenamed Project Svalbard that is related to DataRefuge, a group working to back up government climate data at risk of disappearing. Svalbard is named after the Svalbard Global Seed Vault in the Arctic, which has a big underground backup library of plant DNA. Our version of it is a big version-controlled collection of public scientific datasets. Once we know and can trust the metadata, we can build other cool projects like a distributed volunteer data storage network.

California Civic Data Coalition

CACivicData is an open-source archive serving up daily downloads from CAL-ACCESS, California's database tracking money in politics. They do daily releases, which means hosting a lot of duplicate data across their zip files. We're working on hosting their data as a Dat repository, which will reduce the amount of hassle and bandwidth needed to refer to a specific version or update to a newer version.
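To make the duplicate-data point concrete, here is a minimal Node.js sketch that hashes file contents in two daily snapshots and counts how many bytes are actually new. It is an illustration only, not Dat's implementation: the snapshot objects, file names, and the `bytesToFetch` helper are all made up for this example, and it deduplicates at whole-file granularity for simplicity.

```javascript
const crypto = require('crypto')

// Hash file contents so identical data always maps to the same key.
function contentHash (buf) {
  return crypto.createHash('sha256').update(buf).digest('hex')
}

// Given two snapshots (objects mapping filename -> Buffer), return the
// number of bytes a client would need to fetch for `next` if it already
// holds every file from `previous`.
function bytesToFetch (previous, next) {
  const known = new Set(Object.values(previous).map(contentHash))
  let total = 0
  for (const buf of Object.values(next)) {
    const hash = contentHash(buf)
    if (!known.has(hash)) {
      total += buf.length
      known.add(hash) // count repeated new content only once
    }
  }
  return total
}

// Hypothetical snapshots: only one file changed between the two days.
const monday = {
  'filers.tsv': Buffer.from('id\tname\n1\tA\n'),
  'receipts.tsv': Buffer.from('id\tamount\n1\t100\n')
}
const tuesday = {
  'filers.tsv': Buffer.from('id\tname\n1\tA\n'),
  'receipts.tsv': Buffer.from('id\tamount\n1\t100\n2\t250\n')
}

const fullSize = Object.values(tuesday).reduce((n, b) => n + b.length, 0)
console.log(`full download: ${fullSize} bytes`)
console.log(`after deduplication: ${bytesToFetch(monday, tuesday)} bytes`)
```

With full zip releases a client re-downloads everything each day. With content-addressed storage it only needs the files whose hashes it has not seen before.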

Electron Updates

This one isn't concrete yet, but we think a fun use case would be putting a compiled Electron app in a Dat repository, then using a Dat client in Electron to pull the latest deltas of the built app binary. That would save on download time and also reduce bandwidth costs for the server.

Who should be using Dat Desktop?

Anyone who wants to share and update data over a p2p network: data scientists, open data hackers, researchers, developers. We're super receptive to feedback if anyone has a cool use case we haven't thought of yet. You can drop by our Gitter Chat and ask us anything!

What's coming next in Dat and Dat Desktop?

User accounts and metadata publishing. We are working on a Dat registry web app to be deployed at datproject.org, which will basically be an 'NPM for datasets', with the caveat that we are just going to be a metadata directory and the data can live anywhere online (as opposed to NPM or GitHub, where all the data is centrally hosted because source code is small enough to fit in one system). Since many datasets are huge, we need a federated registry (similar to how BitTorrent trackers work). We want to make it easy for people to find or publish datasets with the registry from Dat Desktop, to make the data sharing process frictionless.

Another feature is multi-writer/collaborative folders. We have big plans to do collaborative workflows, maybe with branches, similar to git, except designed around dataset collaboration. But we're still working on overall stability and standardizing our protocols right now!

Why did you choose to build Dat Desktop on Electron?

Dat is built using Node.js, so it was a natural fit for our integration. Beyond this, our users use a variety of machines, since scientists, researchers and government officials may be forced to use certain setups for their institutions. This means we need to be able to target Windows and Linux as well as Mac. Dat Desktop gives us that quite easily.

What are some challenges you've faced while building Dat and Dat Desktop?

Figuring out what people want. We started with tabular datasets, but we realized that it was a bit of a complicated problem to solve and that most people don't use databases. So halfway through the project, we redesigned everything from scratch to use a filesystem and haven't looked back.

We also ran into some general Electron infrastructure problems, including:

  • Telemetry - how to capture anonymous usage statistics

  • Updates - setting up automatic updates is kind of piecemeal and magic

  • Releases - Xcode signing, building releases on Travis, and doing beta builds were all challenges

We also use Browserify and some cool Browserify Transforms on the 'front end' code in Dat Desktop (which is kind of weird because we still bundle even though we have native require, but it's because we want the Transforms). To better manage our CSS, we switched from Sass to sheetify. It's greatly helped us modularize our CSS and made it easier to move our UI to a component-oriented architecture with shared dependencies. For example, dat-colors contains all of our colors and is shared between all our projects.

We've always been big fans of standards and minimal abstractions. Our whole interface is built using regular DOM nodes with just a few helper libraries. We've started to move some of these components into base-elements, a library of low-level reusable components. As with most of our technology, we keep iterating on it until we get it right, but as a team we have a feeling we're heading in the right direction here.

In what areas should Electron be improved?

We think the biggest pain point is native modules. Having to rebuild your modules for Electron with npm adds complexity to the workflow. Our team developed a module called prebuild which handles pre-built binaries. It worked well for Node, but Electron workflows still required a custom step after installing, usually npm run rebuild. It was annoying. To address this, we recently switched to a strategy where we bundle all compiled binary versions for all platforms inside the npm tarball. This means tarballs get larger (though this can be optimized with .so files, i.e. shared libraries), but the approach avoids having to run post-install scripts and also avoids the npm run rebuild pattern completely. It means npm install does the right thing for Electron the first time.
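The bundle-everything strategy can be pictured as a small lookup that picks the right precompiled binary when the module is loaded. The sketch below is an assumption-laden illustration, not the team's actual tooling: the `prebuilds/<platform>-<arch>/<runtime>.node` layout, the `resolvePrebuild` helper, and the `example-addon` package name are all hypothetical.

```javascript
const path = require('path')

// Hypothetical layout inside the published npm tarball:
//   prebuilds/<platform>-<arch>/<runtime>.node
// `opts` lets callers override the detected values (useful for testing).
function resolvePrebuild (pkgDir, opts = {}) {
  const platform = opts.platform || process.platform
  const arch = opts.arch || process.arch
  // process.versions.electron is only defined when running inside Electron.
  const runtime = opts.runtime || (process.versions.electron ? 'electron' : 'node')
  return path.join(pkgDir, 'prebuilds', `${platform}-${arch}`, `${runtime}.node`)
}

// A package entry point could then load the binary directly, with no
// post-install or rebuild step:
//   module.exports = require(resolvePrebuild(__dirname))
console.log(resolvePrebuild('node_modules/example-addon', {
  platform: 'darwin',
  arch: 'x64',
  runtime: 'electron'
}))
```

Because every binary ships in the tarball, the choice happens at load time rather than at install time. A real implementation would also have to account for ABI differences between Node and Electron versions, which this sketch ignores.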

What are your favorite things about Electron?

The APIs seem fairly well thought out, it's relatively stable, and it does a pretty good job at keeping up to date with upstream Node releases. There's not much else we can ask for!

Any Electron tips that might be useful to other developers?

If you use native modules, give prebuild a shot!

What's the best way to follow Dat developments?

Follow @dat_project on Twitter, or subscribe to our email newsletter.