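Data Race Detector

Introduction

Data races are among the most common and hardest-to-debug types of bugs in concurrent systems. A data race occurs when two goroutines access the same variable concurrently and at least one of the accesses is a write. See The Go Memory Model for details.

Here is an example of a data race that can lead to crashes and memory corruption: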

package main

import "fmt"

func main() {
	c := make(chan bool)
	m := make(map[string]string)
	go func() {
		m["1"] = "a" // First conflicting access.
		c <- true
	}()
	m["2"] = "b" // Second conflicting access.
	<-c
	for k, v := range m {
		fmt.Println(k, v)
	}
}
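
One possible way to remove the race in the example above, sketched here on the assumption that the goroutine's write only needs to finish before the main goroutine touches the map, is to receive from the channel before the second write. The helper name fillMap is hypothetical and exists only to make the sketch easy to call.

```go
package main

import (
	"fmt"
	"sort"
)

// fillMap is a hypothetical helper. It performs the same two map writes as
// the racy example, but orders them with the channel.
func fillMap() map[string]string {
	c := make(chan bool)
	m := make(map[string]string)
	go func() {
		m["1"] = "a" // Write in the new goroutine.
		c <- true    // Signal that the write is done.
	}()
	<-c          // Wait for the goroutine's write before touching m.
	m["2"] = "b" // No longer concurrent with the write above.
	return m
}

func main() {
	m := fillMap()
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys) // Map iteration order is unspecified; sort for stable output.
	for _, k := range keys {
		fmt.Println(k, m[k])
	}
}
```

Because the send on c happens before the corresponding receive completes, the two writes are ordered and no longer conflict.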

Usage

To help diagnose such bugs, Go includes a built-in data race detector. To use it, add the -race flag to the go command:

$ go test -race mypkg    // to test the package
$ go run -race mysrc.go  // to run the source file
$ go build -race mycmd   // to build the command
$ go install -race mypkg // to install the package

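Report Format

When the race detector finds a data race in the program, it prints a report. The report contains stack traces for the conflicting accesses, as well as the stacks where the involved goroutines were created. Here is an example: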

WARNING: DATA RACE
Read by goroutine 185:
  net.(*pollServer).AddFD()
      src/net/fd_unix.go:89 +0x398
  net.(*pollServer).WaitWrite()
      src/net/fd_unix.go:247 +0x45
  net.(*netFD).Write()
      src/net/fd_unix.go:540 +0x4d4
  net.(*conn).Write()
      src/net/net.go:129 +0x101
  net.func·060()
      src/net/timeout_test.go:603 +0xaf

Previous write by goroutine 184:
  net.setWriteDeadline()
      src/net/sockopt_posix.go:135 +0xdf
  net.setDeadline()
      src/net/sockopt_posix.go:144 +0x9c
  net.(*conn).SetDeadline()
      src/net/net.go:161 +0xe3
  net.func·061()
      src/net/timeout_test.go:616 +0x3ed

Goroutine 185 (running) created at:
  net.func·061()
      src/net/timeout_test.go:609 +0x288

Goroutine 184 (running) created at:
  net.TestProlongTimeout()
      src/net/timeout_test.go:618 +0x298
  testing.tRunner()
      src/testing/testing.go:301 +0xe8

Options

The GORACE environment variable sets race detector options. The format is:

GORACE="option1=val1 option2=val2"

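The options are:

log_path (default stderr): The race detector writes its report to a file named log_path.pid. The special names stdout and stderr cause reports to be written to standard output and standard error, respectively.
exitcode (default 66): The exit status to use when exiting after a detected race.
strip_path_prefix (default ""): Strip this prefix from all reported file paths, to make reports more concise.
history_size (default 1): The per-goroutine memory access history is 32K * 2**history_size elements. Increasing this value can avoid a "failed to restore the stack" error in reports, at the cost of increased memory usage.
halt_on_error (default 0): Controls whether the program exits after reporting the first data race.
atexit_sleep_ms (default 1000): The number of milliseconds to sleep in the main goroutine before exiting.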
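
To make the history_size arithmetic concrete, the following sketch evaluates 32K * 2**history_size for a few values. The function historyElements is hypothetical and is not part of the race detector; it only reproduces the formula from the option description.

```go
package main

import "fmt"

// historyElements reproduces the formula from the history_size description:
// 32K * 2**historySize elements per goroutine. Hypothetical helper.
func historyElements(historySize uint) int {
	return int(32*1024) << historySize
}

func main() {
	for _, size := range []uint{0, 1, 2, 3} {
		fmt.Printf("history_size=%d -> %d elements\n", size, historyElements(size))
	}
}
```

Each increment of history_size doubles the per-goroutine history, which is why larger values trade memory for fewer unrecoverable stacks.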

Example:

$ GORACE="log_path=/tmp/race/report strip_path_prefix=/my/go/sources/" go test -race

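Excluding Tests

When you build with the -race flag, the go command defines the additional build tag race. You can use the tag to exclude some code and tests when running the race detector. Some examples: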

//go:build !race

package foo

// The test contains a data race. See issue 123.
func TestFoo(t *testing.T) {
	// ...
}

// The test fails under the race detector due to timeouts.
func TestBar(t *testing.T) {
	// ...
}

// The test takes too long under the race detector.
func TestBaz(t *testing.T) {
	// ...
}

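How To Use

To start, run your tests using the race detector (go test -race). The race detector only finds races that happen at run time, so it cannot find races in code paths that are not executed. If your tests have incomplete coverage, you may find more races by running a binary built with -race under a realistic workload.

Typical Data Races

Here are some typical data races. All of them can be detected with the race detector.

Race on loop counter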

func main() {
	var wg sync.WaitGroup
	wg.Add(5)
	var i int
	for i = 0; i < 5; i++ {
		go func() {
			fmt.Println(i) // Not the 'i' you are looking for.
			wg.Done()
		}()
	}
	wg.Wait()
}

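The variable i in the function literal is the same variable used by the loop, so the read in the goroutine races with the loop increment. (This program typically prints 55555, not 01234.) The program can be fixed by making a copy of the variable: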

func main() {
	var wg sync.WaitGroup
	wg.Add(5)
	var i int
	for i = 0; i < 5; i++ {
		go func(j int) {
			fmt.Println(j) // Good. Read local copy of the loop counter.
			wg.Done()
		}(i)
	}
	wg.Wait()
}

Accidentally shared variable

// ParallelWrite writes data to file1 and file2, returns the errors.
func ParallelWrite(data []byte) chan error {
	res := make(chan error, 2)
	f1, err := os.Create("file1")
	if err != nil {
		res <- err
	} else {
		go func() {
			// This err is shared with the main goroutine,
			// so the write races with the write below.
			_, err = f1.Write(data)
			res <- err
			f1.Close()
		}()
	}
	f2, err := os.Create("file2") // The second conflicting write to err.
	if err != nil {
		res <- err
	} else {
		go func() {
			_, err = f2.Write(data)
			res <- err
			f2.Close()
		}()
	}
	return res
}

The fix is to introduce new variables in the goroutines (note the use of :=):

			...
			_, err := f1.Write(data)
			...
			_, err := f2.Write(data)
			...

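Unprotected global variable

If the following code is called from several goroutines, it leads to races on the service map. Concurrent reads and writes of the same map are not safe: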

var service map[string]net.Addr

func RegisterService(name string, addr net.Addr) {
	service[name] = addr
}

func LookupService(name string) net.Addr {
	return service[name]
}

To make the code safe, protect the accesses with a mutex:

var (
	service   map[string]net.Addr
	serviceMu sync.Mutex
)

func RegisterService(name string, addr net.Addr) {
	serviceMu.Lock()
	defer serviceMu.Unlock()
	service[name] = addr
}

func LookupService(name string) net.Addr {
	serviceMu.Lock()
	defer serviceMu.Unlock()
	return service[name]
}

Primitive unprotected variable

Data races can happen on variables of primitive types as well (bool, int, int64, etc.), as in this example:

type Watchdog struct{ last int64 }

func (w *Watchdog) KeepAlive() {
	w.last = time.Now().UnixNano() // First conflicting access.
}

func (w *Watchdog) Start() {
	go func() {
		for {
			time.Sleep(time.Second)
			// Second conflicting access.
			if w.last < time.Now().Add(-10*time.Second).UnixNano() {
				fmt.Println("No keepalives for 10 seconds. Dying.")
				os.Exit(1)
			}
		}
	}()
}

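Even such "innocent" data races can lead to hard-to-debug problems caused by non-atomicity of the memory accesses, interference with compiler optimizations, or reordering issues when accessing processor memory.

A typical fix for this race is to use a channel or a mutex. To preserve the lock-free behavior, one can also use the sync/atomic package.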

type Watchdog struct{ last int64 }

func (w *Watchdog) KeepAlive() {
	atomic.StoreInt64(&w.last, time.Now().UnixNano())
}

func (w *Watchdog) Start() {
	go func() {
		for {
			time.Sleep(time.Second)
			if atomic.LoadInt64(&w.last) < time.Now().Add(-10*time.Second).UnixNano() {
				fmt.Println("No keepalives for 10 seconds. Dying.")
				os.Exit(1)
			}
		}
	}()
}

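Unsynchronized send and close operations

As this example demonstrates, unsynchronized send and close operations on the same channel can also be a race condition: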

c := make(chan struct{}) // or buffered channel

// The race detector cannot derive the happens before relation
// for the following send and close operations. These two operations
// are unsynchronized and happen concurrently.
go func() { c <- struct{}{} }()
close(c)

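According to the Go memory model, a send on a channel happens before the corresponding receive from that channel completes. To synchronize send and close operations, use a receive operation that guarantees the send is done before the close: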

c := make(chan struct{}) // or buffered channel

go func() { c <- struct{}{} }()
<-c
close(c)

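Requirements

The race detector requires cgo to be enabled, and on non-Darwin systems requires an installed C compiler. The race detector supports linux/amd64, linux/ppc64le, linux/arm64, linux/s390x, freebsd/amd64, netbsd/amd64, darwin/amd64, darwin/arm64, and windows/amd64.

On Windows, the race detector runtime is sensitive to the version of the C compiler installed; as of Go 1.21, building a program with -race requires a C compiler that incorporates version 8 or later of the mingw-w64 runtime libraries. You can test your C compiler by invoking it with the arguments --print-file-name libsynchronization.a. A newer, compliant C compiler will print a full path for this library, while older C compilers will just echo the argument.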

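Runtime Overhead

The cost of race detection varies by program, but for a typical program, memory usage may increase by 5-10x and execution time by 2-20x.

The race detector currently allocates an extra 8 bytes per defer and recover statement. Those extra allocations are not recovered until the goroutine exits. This means that if you have a long-running goroutine that periodically issues defer and recover calls, the program memory usage may grow without bound. These memory allocations will not show up in the output of runtime.ReadMemStats or runtime/pprof.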