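Data Race Detector

Introduction

Data races are among the most common and hardest-to-debug types of bugs in concurrent systems. A data race occurs when two goroutines access the same variable concurrently and at least one of the accesses is a write. See The Go Memory Model for details.

Here is an example of a data race that can lead to crashes and memory corruption: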

package main

import "fmt"

func main() {
	c := make(chan bool)
	m := make(map[string]string)
	go func() {
		m["1"] = "a" // First conflicting access.
		c <- true
	}()
	m["2"] = "b" // Second conflicting access.
	<-c
	for k, v := range m {
		fmt.Println(k, v)
	}
}
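
For comparison, here is a minimal sketch of one way to remove the race from the program above. It assumes the main goroutine may simply wait for the other goroutine before writing; the helper name buildMap is invented for this illustration. Because a channel send is ordered before the completion of the matching receive, the two map writes no longer overlap.

```go
package main

import (
	"fmt"
	"sort"
)

// buildMap (a hypothetical helper) fills a map from two goroutines without
// a data race: the main goroutine receives from c before it touches m, so
// the two writes are ordered by the channel operation.
func buildMap() map[string]string {
	c := make(chan bool)
	m := make(map[string]string)
	go func() {
		m["1"] = "a" // Only this goroutine touches m until the send below.
		c <- true
	}()
	<-c          // Completes only after the send, which follows the write above.
	m["2"] = "b" // Safe: ordered after the goroutine's write.
	return m
}

func main() {
	m := buildMap()
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys) // Map iteration order is unspecified, so sort the keys.
	for _, k := range keys {
		fmt.Println(k, m[k])
	}
}
```

Running this version with go run -race should produce no report, since every access to m is ordered by the receive.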

Usage

To help diagnose such bugs, Go includes a built-in data race detector. To use it, add the -race flag to the go command:

$ go test -race mypkg    // to test the package
$ go run -race mysrc.go  // to run the source file
$ go build -race mycmd   // to build the command
$ go install -race mypkg // to install the package

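Report Format

When the race detector finds a data race in the program, it prints a report. The report contains stack traces for the conflicting accesses, as well as the stacks where the involved goroutines were created. Here is an example: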

WARNING: DATA RACE
Read by goroutine 185:
  net.(*pollServer).AddFD()
      src/net/fd_unix.go:89 +0x398
  net.(*pollServer).WaitWrite()
      src/net/fd_unix.go:247 +0x45
  net.(*netFD).Write()
      src/net/fd_unix.go:540 +0x4d4
  net.(*conn).Write()
      src/net/net.go:129 +0x101
  net.func·060()
      src/net/timeout_test.go:603 +0xaf

Previous write by goroutine 184:
  net.setWriteDeadline()
      src/net/sockopt_posix.go:135 +0xdf
  net.setDeadline()
      src/net/sockopt_posix.go:144 +0x9c
  net.(*conn).SetDeadline()
      src/net/net.go:161 +0xe3
  net.func·061()
      src/net/timeout_test.go:616 +0x3ed

Goroutine 185 (running) created at:
  net.func·061()
      src/net/timeout_test.go:609 +0x288

Goroutine 184 (running) created at:
  net.TestProlongTimeout()
      src/net/timeout_test.go:618 +0x298
  testing.tRunner()
      src/testing/testing.go:301 +0xe8

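Options

The GORACE environment variable sets race detector options. The format is:

GORACE="option1=val1 option2=val2"

The options are:

log_path (default stderr): The race detector writes its report to a file named log_path.pid. The special names stdout and stderr cause reports to be written to standard output and standard error, respectively.
exitcode (default 66): The exit status to use when exiting after a detected race.
strip_path_prefix (default ""): Strip this prefix from all reported file paths, to make reports more concise.
history_size (default 1): The per-goroutine memory access history is 32K * 2**history_size elements. Increasing this value can avoid a "failed to restore the stack" error in reports, at the cost of increased memory usage.
halt_on_error (default 0): Controls whether the program exits after reporting the first data race.
atexit_sleep_ms (default 1000): The number of milliseconds the main goroutine sleeps before exiting.

Example: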

$ GORACE="log_path=/tmp/race/report strip_path_prefix=/my/go/sources/" go test -race

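Excluding Tests

When you build with the -race flag, the go command defines the additional build tag race. You can use this tag to exclude some code and tests when running the race detector. Some examples: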

//go:build !race

package foo

import "testing"

// The test contains a data race. See issue 123.
func TestFoo(t *testing.T) {
	// ...
}

// The test fails under the race detector due to timeouts.
func TestBar(t *testing.T) {
	// ...
}

// The test takes too long under the race detector.
func TestBaz(t *testing.T) {
	// ...
}

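How To Use

To start, run your tests using the race detector (go test -race). The race detector only finds races that happen at runtime, so it cannot find races in code paths that are not executed. If your tests have incomplete coverage, you may find more races by running a binary built with -race under a realistic workload.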
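
Because the detector only sees code that actually runs, it can help to drive the concurrent paths deliberately. The following is a minimal sketch of that idea, not a prescribed pattern: Counter and hammer are hypothetical names invented for this illustration. Calling hammer from a test, or running the program with go run -race, exercises Inc from many goroutines at once so the detector has something to observe.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a hypothetical mutex-protected counter used only for illustration.
type Counter struct {
	mu sync.Mutex
	n  int
}

func (c *Counter) Inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// hammer calls Inc from the given number of goroutines at once, so the
// concurrent code path really executes under the race detector.
func hammer(c *Counter, goroutines int) {
	var wg sync.WaitGroup
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Inc()
		}()
	}
	wg.Wait()
}

func main() {
	var c Counter
	hammer(&c, 100)
	fmt.Println(c.Value()) // 100
}
```

If the mutex were removed from Inc, the same driver would make the unsynchronized increments overlap, which is the kind of access the detector reports.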

Typical Data Races

Here are some typical data races. All of them can be detected with the race detector.

Race on loop counter

func main() {
	var wg sync.WaitGroup
	wg.Add(5)
	var i int
	for i = 0; i < 5; i++ {
		go func() {
			fmt.Println(i) // Not the 'i' you are looking for.
			wg.Done()
		}()
	}
	wg.Wait()
}

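The variable i in the function literal is the same variable used by the loop, so the read in the goroutine races with the loop increment. (This program typically prints 55555, not 01234.) The program can be fixed by making a copy of the variable: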

func main() {
	var wg sync.WaitGroup
	wg.Add(5)
	var i int
	for i = 0; i < 5; i++ {
		go func(j int) {
			fmt.Println(j) // Good. Read local copy of the loop counter.
			wg.Done()
		}(i)
	}
	wg.Wait()
}
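
Another common way to make the copy is to shadow the counter inside the loop body instead of passing it as an argument. The sketch below illustrates that alternative; the function name collect and the result slice are assumptions added so the outcome can be checked, and they are not part of the original program.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect (a hypothetical helper) starts n goroutines, each of which records
// its own copy of the loop counter. The mutex protects the shared slice.
func collect(n int) []int {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		out []int
	)
	var i int
	for i = 0; i < n; i++ {
		i := i // New variable per iteration; it shadows the shared counter.
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			out = append(out, i) // Reads the per-iteration copy.
			mu.Unlock()
		}()
	}
	wg.Wait()
	return out
}

func main() {
	got := collect(5)
	sort.Ints(got) // Goroutines finish in no particular order.
	fmt.Println(got)
}
```

Both forms give each goroutine a private value; the shadowing form keeps the function literal's signature unchanged.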

Accidentally shared variable

// ParallelWrite writes data to file1 and file2, returns the errors.
func ParallelWrite(data []byte) chan error {
	res := make(chan error, 2)
	f1, err := os.Create("file1")
	if err != nil {
		res <- err
	} else {
		go func() {
			// This err is shared with the main goroutine,
			// so the write races with the write below.
			_, err = f1.Write(data)
			res <- err
			f1.Close()
		}()
	}
	f2, err := os.Create("file2") // The second conflicting write to err.
	if err != nil {
		res <- err
	} else {
		go func() {
			_, err = f2.Write(data)
			res <- err
			f2.Close()
		}()
	}
	return res
}

The fix is to introduce new variables in the goroutines (note the use of :=):

			...
			_, err := f1.Write(data)
			...
			_, err := f2.Write(data)
			...
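
For readers who want to see the fix in a complete program, here is a sketch under some assumptions: it generalizes the function to a list of paths, closes each file before sending the result, and writes into a temporary directory. The names parallelWrite and demo are invented for this illustration and the structure is not the original ParallelWrite.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// parallelWrite (hypothetical variant) writes data to each path in its own
// goroutine and sends exactly one result, nil or an error, per path.
func parallelWrite(data []byte, paths ...string) chan error {
	res := make(chan error, len(paths))
	for _, path := range paths {
		f, err := os.Create(path)
		if err != nil {
			res <- err
			continue
		}
		go func() {
			// := declares an err that is local to this goroutine, so it
			// does not share the err variable of the enclosing function.
			_, err := f.Write(data)
			if cerr := f.Close(); err == nil {
				err = cerr
			}
			res <- err
		}()
	}
	return res
}

// demo writes two files in a temporary directory and returns the first error.
func demo() error {
	dir, err := os.MkdirTemp("", "parallelwrite")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	paths := []string{filepath.Join(dir, "file1"), filepath.Join(dir, "file2")}
	res := parallelWrite([]byte("hello"), paths...)
	for range paths {
		if err := <-res; err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("ok")
}
```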

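Unprotected global variable

If the following code is called from several goroutines, it leads to races on the service map. Concurrent reads and writes of the same map are not safe: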

var service = make(map[string]net.Addr) // Initialized: writing to a nil map panics.

func RegisterService(name string, addr net.Addr) {
	service[name] = addr
}

func LookupService(name string) net.Addr {
	return service[name]
}

To make the code safe, protect the accesses with a mutex:

var (
	service   = make(map[string]net.Addr) // Initialized: writing to a nil map panics.
	serviceMu sync.Mutex
)

func RegisterService(name string, addr net.Addr) {
	serviceMu.Lock()
	defer serviceMu.Unlock()
	service[name] = addr
}

func LookupService(name string) net.Addr {
	serviceMu.Lock()
	defer serviceMu.Unlock()
	return service[name]
}
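
When lookups greatly outnumber registrations, a read-write mutex is one possible refinement. This is a sketch of that variation rather than a recommendation from the text above: sync.RWMutex lets multiple readers hold the lock at once while writers remain exclusive. The service names and address in main are made up for the demonstration.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

var (
	service   = make(map[string]net.Addr)
	serviceMu sync.RWMutex
)

func RegisterService(name string, addr net.Addr) {
	serviceMu.Lock() // Exclusive lock for writers.
	defer serviceMu.Unlock()
	service[name] = addr
}

func LookupService(name string) net.Addr {
	serviceMu.RLock() // Shared lock: lookups may run concurrently.
	defer serviceMu.RUnlock()
	return service[name]
}

func main() {
	// Hypothetical address used only for the demonstration.
	addr := &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 8080}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			RegisterService(fmt.Sprintf("svc%d", i), addr)
			_ = LookupService("svc0") // Concurrent read while others write.
		}(i)
	}
	wg.Wait()
	fmt.Println(LookupService("svc3"))
}
```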

Primitive unprotected variable

Data races can happen on variables of primitive types as well (bool, int, int64, etc.), as in this example:

type Watchdog struct{ last int64 }

func (w *Watchdog) KeepAlive() {
	w.last = time.Now().UnixNano() // First conflicting access.
}

func (w *Watchdog) Start() {
	go func() {
		for {
			time.Sleep(time.Second)
			// Second conflicting access.
			if w.last < time.Now().Add(-10*time.Second).UnixNano() {
				fmt.Println("No keepalives for 10 seconds. Dying.")
				os.Exit(1)
			}
		}
	}()
}

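Even such "innocent" data races can lead to hard-to-debug problems caused by non-atomicity of the memory accesses, interference with compiler optimizations, or reordering issues accessing processor memory.

A typical fix for this race is to use a channel or a mutex. To preserve the lock-free behavior, one can also use the sync/atomic package.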

type Watchdog struct{ last int64 }

func (w *Watchdog) KeepAlive() {
	atomic.StoreInt64(&w.last, time.Now().UnixNano())
}

func (w *Watchdog) Start() {
	go func() {
		for {
			time.Sleep(time.Second)
			if atomic.LoadInt64(&w.last) < time.Now().Add(-10*time.Second).UnixNano() {
				fmt.Println("No keepalives for 10 seconds. Dying.")
				os.Exit(1)
			}
		}
	}()
}
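
The mutex-based fix mentioned above could look roughly like the following sketch. The Expired method is a hypothetical helper introduced here so that the locked read lives in one place; the rest mirrors the Watchdog from the example.

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

type Watchdog struct {
	mu   sync.Mutex
	last int64 // Guarded by mu.
}

func (w *Watchdog) KeepAlive() {
	w.mu.Lock()
	w.last = time.Now().UnixNano()
	w.mu.Unlock()
}

// Expired (hypothetical helper) reports whether the last keepalive is
// older than timeout. The read of w.last happens under the mutex.
func (w *Watchdog) Expired(timeout time.Duration) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.last < time.Now().Add(-timeout).UnixNano()
}

func (w *Watchdog) Start() {
	go func() {
		for {
			time.Sleep(time.Second)
			if w.Expired(10 * time.Second) {
				fmt.Println("No keepalives for 10 seconds. Dying.")
				os.Exit(1)
			}
		}
	}()
}

func main() {
	w := &Watchdog{}
	w.KeepAlive()
	w.Start()
	fmt.Println("expired:", w.Expired(10*time.Second))
}
```

Compared with the sync/atomic version, the mutex costs a lock per access but extends naturally if more fields later need to be updated together.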

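Unsynchronized send and close operations

As this example demonstrates, unsynchronized send and close operations on the same channel can also be a race condition: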

c := make(chan struct{}) // or buffered channel

// The race detector cannot derive the happens before relation
// for the following send and close operations. These two operations
// are unsynchronized and happen concurrently.
go func() { c <- struct{}{} }()
close(c)

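According to the Go memory model, a send on a channel happens before the corresponding receive from that channel completes. To synchronize send and close operations, use a receive operation that guarantees the send is done before the close: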

c := make(chan struct{}) // or buffered channel

go func() { c <- struct{}{} }()
<-c
close(c)
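
With several senders, a single receive is not enough to order every send before the close. One common arrangement, sketched here under the assumption that a sync.WaitGroup is acceptable, is to close the channel from a goroutine that first waits for all senders to finish. The name fanIn is invented for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// fanIn (a hypothetical helper) starts n senders and closes the channel only
// after every send has completed, so close never runs concurrently with a send.
func fanIn(n int) <-chan int {
	c := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done() // Runs after the send below has completed.
			c <- v
		}(i)
	}
	go func() {
		wg.Wait() // Returns only after all senders have called Done.
		close(c)
	}()
	return c
}

func main() {
	sum := 0
	for v := range fanIn(5) { // The loop ends when the channel is closed.
		sum += v
	}
	fmt.Println(sum) // 0+1+2+3+4 = 10
}
```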

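Requirements

The race detector requires cgo to be enabled, and on non-Darwin systems requires an installed C compiler. The race detector supports linux/amd64, linux/ppc64le, linux/arm64, linux/s390x, linux/loong64, freebsd/amd64, netbsd/amd64, darwin/amd64, darwin/arm64, and windows/amd64.

On Windows, the race detector runtime is sensitive to the version of the C compiler installed; as of Go 1.21, building a program with -race requires a C compiler that incorporates version 8 or later of the mingw-w64 runtime libraries. You can test your C compiler by invoking it with the arguments --print-file-name libsynchronization.a. A newer compliant C compiler will print a full path for this library, while older C compilers will just echo the argument.

Runtime Overhead

The cost of race detection varies by program, but for a typical program, memory usage may increase by 5-10x and execution time by 2-20x.

The race detector currently allocates an extra 8 bytes per defer and recover statement. Those extra allocations are not recovered until the goroutine exits. This means that if you have a long-running goroutine that periodically issues defer and recover calls, the program's memory usage may grow without bound. These memory allocations will not show up in the output of runtime.ReadMemStats or runtime/pprof.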