[elastic/elasticsearch] Log stack traces on data nodes before they are cleared for transport

2025-10-29

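https://github.com/elastic/elasticsearch/pull/118266 cleared stack traces on data nodes before transport back to the coordinating node when error_trace=false. However, all logging of exceptions happens on the coordinating node. As a result, that change made it impossible to debug errors via stack traces when error_trace=false.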

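Here, I log the exception on the data node right before the stack trace is cleared. The log is prefixed with [nodeId][indexName][shard] to match the rest.suppressed shard failures log on the coordinating node, which makes it easy to trace a failure from the coordinating node to the responsible data node.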
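To make the ordering concrete, here is a minimal sketch of "log first, then clear". It is not the code from this PR: the class and method names (`ShardFailureLogging`, `shardPrefix`, `logThenStrip`) are hypothetical, it uses `java.util.logging` rather than Elasticsearch's own logger, and it only clears the top-level stack trace, not the traces of nested causes.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration only; names are assumptions, not Elasticsearch APIs.
final class ShardFailureLogging {
    private static final Logger logger = Logger.getLogger("SearchService");

    // Builds the "[nodeId][indexName][shard]" prefix described above.
    static String shardPrefix(String nodeId, String indexName, int shardId) {
        return "[" + nodeId + "][" + indexName + "][" + shardId + "]";
    }

    // When error_trace=false, log the full exception on the data node,
    // then clear its stack trace before it would be sent back.
    static Throwable logThenStrip(Throwable e, String nodeId, String indexName,
                                  int shardId, boolean errorTrace) {
        if (errorTrace == false) {
            String message = shardPrefix(nodeId, indexName, shardId)
                + ": failed to execute search request";
            // The logger still receives the exception with its full stack trace.
            logger.log(Level.FINE, message, e);
            // Only after logging is the trace dropped (top level only in this sketch).
            e.setStackTrace(new StackTraceElement[0]);
        }
        return e;
    }

    public static void main(String[] args) {
        Throwable stripped = logThenStrip(
            new NullPointerException("testing123"), "CrhugeEAQNGHtZ14Y6Apjg", "test", 2, false);
        System.out.println("frames after strip: " + stripped.getStackTrace().length);
    }
}
```

The point of the sketch is only the order of the two steps: once the trace is cleared, nothing downstream (including the coordinating node) can recover it, so the log call has to come first.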

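Might this flood the (debug level) logs? This change has the potential to log [# shards] times for each index in each search. However, this log on the coordinating node: https://github.com/elastic/elasticsearch/blob/937bcd9f51d68b7d1b20fe1a0a5ac56a3dc57f67/server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java#L405 already logs [# shards]*[# replicas] times at the debug level. And before https://github.com/elastic/elasticsearch/pull/118266, each of those logs contained a stack trace. So on a per-node basis, this change does not significantly increase log volume.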

Answers


Pinging @elastic/es-search-foundations (Team:Search Foundations)


Hi @benchaplin, I've created a changelog YAML file for you.


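I think the main problem we have is that the logs users provide to us (or that we provide to ourselves) are missing the useful backtrace. So I wonder whether this patch really solves that problem. Will users be able to see the missing piece in the data node logs? Will they know how to find it and provide it to us?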


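For reference, here's a brief rundown of the logs we have today plus the log I've added. I threw a NullPointerException (NPE) in SearchService to trigger the logging. Setup: 3 nodes, 3 primary shards, 3 replicas.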

(Coordinating node: we get 6 of these, one for each shard)

[2025-03-28T08:17:51,928][DEBUG][o.e.a.s.TransportSearchAction] [runTask-0] [meJUNXYoT1iBSTnJgI6Unw][test][2]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[test], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true, allow_selectors=true, include_failure_indices=false], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={}}] lastShard [false]
org.elasticsearch.transport.RemoteTransportException: [runTask-0][127.0.0.1:9300][indices:data/read/search[phase/query]]
Caused by: java.lang.NullPointerException: testing123
[ no stack trace ]

(Coordinating node: after the 6 failures above)

[2025-03-28T08:17:51,946][DEBUG][o.e.a.s.TransportSearchAction] [runTask-0] All shards failed for phase: [query]
org.elasticsearch.ElasticsearchException$1: testing123
[ long stack trace for ElasticsearchException ]
Caused by: java.lang.NullPointerException: testing123
[ no stack trace ]

(Coordinating node: logged at WARN level when status >= 500, at DEBUG otherwise)

[2025-03-28T08:17:51,946][WARN ][r.suppressed ] [runTask-0] path: /test/_search, params: {index=test}, status: 500
Failed to execute phase [query], all shards failed; shardFailures {[CrhugeEAQNGHtZ14Y6Apjg][test][0]: org.elasticsearch.transport.RemoteTransportException: [runTask-1][127.0.0.1:9301][indices:data/read/search[phase/query]]
Caused by: java.lang.NullPointerException: testing123
}{[meJUNXYoT1iBSTnJgI6Unw][test][1]: org.elasticsearch.transport.RemoteTransportException: [runTask-0][127.0.0.1:9300][indices:data/read/search[phase/query]]
Caused by: java.lang.NullPointerException: testing123
}{[CrhugeEAQNGHtZ14Y6Apjg][test][2]: org.elasticsearch.transport.RemoteTransportException: [runTask-1][127.0.0.1:9301][indices:data/read/search[phase/query]]
Caused by: java.lang.NullPointerException: testing123
}
[ long stack trace ]
Caused by: java.lang.NullPointerException: testing123
[ no stack trace ]

(Data node: the new log from this PR - we get 6 of these across the nodes, one for each shard)

[2025-03-28T08:17:51,944][DEBUG][o.e.s.SearchService ] [runTask-1] [CrhugeEAQNGHtZ14Y6Apjg][test][2]: failed to execute search request
java.lang.NullPointerException: testing123
    at org.elasticsearch.server@9.1.0-SNAPSHOT/org.elasticsearch.search.SearchService.throwException(SearchService.java:768)
    at org.elasticsearch.server@9.1.0-SNAPSHOT/org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:802)
    at org.elasticsearch.server@9.1.0-SNAPSHOT/org.elasticsearch.search.SearchService.lambda$executeQueryPhase$6(SearchService.java:648)
    ...
[ full stack trace ]
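Because both the r.suppressed log and the new data node log carry the same [nodeId][indexName][shard] prefix, the two can be matched up mechanically. The following is a rough sketch of pulling that prefix out of a log line. The class `ShardPrefixMatcher` and its regex are my own assumptions for illustration, based only on the log samples in this thread, and are not part of Elasticsearch.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for correlating log lines; not an Elasticsearch API.
final class ShardPrefixMatcher {
    // Matches "[nodeId][indexName][shard]:" where the shard is numeric.
    // Assumes node IDs and index names contain no square brackets.
    private static final Pattern PREFIX =
        Pattern.compile("\\[([^\\[\\]]+)\\]\\[([^\\[\\]]+)\\]\\[(\\d+)\\]:");

    // Returns the first shard prefix found in the line, if any.
    static Optional<String> firstShardPrefix(String logLine) {
        Matcher m = PREFIX.matcher(logLine);
        if (m.find() == false) {
            return Optional.empty();
        }
        return Optional.of("[" + m.group(1) + "][" + m.group(2) + "][" + m.group(3) + "]");
    }

    public static void main(String[] args) {
        String dataNodeLine = "[2025-03-28T08:17:51,944][DEBUG][o.e.s.SearchService ] [runTask-1] "
            + "[CrhugeEAQNGHtZ14Y6Apjg][test][2]: failed to execute search request";
        System.out.println(firstShardPrefix(dataNodeLine).orElse("no prefix found"));
    }
}
```

With a prefix extracted from a shardFailures entry on the coordinating node, searching the data node logs for the same string should lead to the entry that still has the full stack trace.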

Edit - after b34afc1, the new log matches the level of the r.suppressed log, so it would be WARN in this example.
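The level rule described above (WARN when the status is >= 500, DEBUG otherwise) can be sketched as follows. The `LogLevel` enum and `levelFor` method are hypothetical names used only for illustration and are not taken from the commit.

```java
// Hypothetical illustration of the level rule; names are assumptions.
final class FailureLogLevel {
    enum LogLevel { DEBUG, WARN }

    // Server-side failures (status >= 500) are logged at WARN,
    // everything else at DEBUG.
    static LogLevel levelFor(int restStatus) {
        return restStatus >= 500 ? LogLevel.WARN : LogLevel.DEBUG;
    }

    public static void main(String[] args) {
        System.out.println(levelFor(500)); // the NPE example above returns status 500
        System.out.println(levelFor(400));
    }
}
```

Keeping the data node log at the same level as r.suppressed means that whenever the coordinating node shows a failure at the configured level, the matching data node entry with the stack trace is emitted too.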