PG (Placement groups) States

Documentation
Name: PG (Placement groups) States
Description: Information about PG states
Modification date: 25/07/2019
Owner: dodger
Notify changes to: Owner
Tags: ceph, object storage
Escalate to: The_fucking_bofh

States

Creating

  • Ceph is still creating the placement group.

Active

  • Ceph will process requests to the placement group. Active Placement Groups will serve data.

Clean

  • Ceph has replicated all objects in the placement group the correct number of times.
  • active+clean is the ideal PG state.

Down

  • A replica with the necessary data is down, so the placement group is offline.
  • A PG with fewer than min_size replicas will be marked as down. Use ceph health detail to understand the backing OSD state (example below).
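  • For example, to quickly see which PGs and OSDs are reported down (both commands are just a starting point):
# ceph health detail | grep -i down
# ceph osd tree | grep -i down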

Replay

The placement group is waiting for clients to replay operations after an OSD crashed.

Splitting

Ceph is splitting the placement group into multiple placement groups. (functional?)

Scrubbing

Ceph is checking the placement group for inconsistencies.
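
Scrubs normally run on a schedule, but one can also be requested manually on a single PG (the pg.id below is a placeholder):
# ceph pg scrub <pg.id>
# ceph pg deep-scrub <pg.id>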

Degraded

Ceph has not replicated some objects in the placement group the correct number of times yet.
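
To list the PGs stuck in this state, the same dump_stuck command described at the end of this page can be used:
# ceph pg dump_stuck degraded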

Inconsistent

Ceph detects inconsistencies in one or more replicas of an object in the placement group (e.g. objects are the wrong size, or objects are missing from one replica after recovery finished).
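
To see which objects of a given PG are inconsistent (pg.id is a placeholder), the rados tool can be queried:
# rados list-inconsistent-obj <pg.id> --format=json-pretty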

Peering (peering)

  • The placement group is undergoing the peering process.
  • Peering should complete quickly; if it does not, and the number of PGs in a peering state does not decrease, the peering may be stuck.
  • To understand why a PG is stuck in peering, query the placement group and check if it is waiting on any other OSDs. To query a PG, use:
# ceph pg <pg.id> query

If the PG is waiting on another OSD for the peering to finish, bringing up that OSD should solve this.
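
As a sketch, assuming the blocking OSD is osd.12 and it is managed by systemd (both assumptions), bringing it back up and re-checking could look like:
# systemctl start ceph-osd@12
# ceph pg <pg.id> query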

Repair

Ceph is checking the placement group and repairing any inconsistencies it finds (if possible).
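
A repair can also be requested manually on a single inconsistent PG (pg.id is a placeholder):
# ceph pg repair <pg.id>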

Recovering

Ceph is migrating/synchronising objects and their replicas.

Backfill

Ceph is scanning and synchronising the entire contents of a placement group instead of inferring what contents need to be synchronised from the logs of recent operations. Backfill is a special case of recovery.

Wait-backfill

The placement group is waiting in line to start backfill.

Backfill-toofull (backfill_toofull)

  • A backfill operation is waiting because the destination OSD is over its full ratio.
  • Placement Groups which are in a backfill_toofull state will have the backing OSDs hitting the osd_backfill_full_ratio (0.85 by default).
  • Any OSD hitting this threshold will prevent data backfilling from other OSDs to itself.
  • NOTE: Any PG hitting osd_backfill_full_ratio will still serve reads and writes, and will also rebalance. Only the backfill is blocked, to prevent the OSD from hitting the full_ratio faster.
  • To understand the osd_backfill_full_ratio of the OSDs, use:
   # ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep backfill_full_ratio
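  • To see how full each OSD actually is (and spot which one is blocking the backfill), the per-OSD utilisation view can be checked:
   # ceph osd df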

Incomplete

Ceph detects that a placement group is missing information about writes that may have occurred, or does not have any healthy copies. If any of the Placement Groups are in this state, try starting any failed OSDs that may contain the needed information or temporarily adjust min_size to allow recovery.
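
For the temporary min_size adjustment mentioned above, the pool-level commands look roughly like this (pool name and value are placeholders; remember to restore the original value once recovery finishes):
# ceph osd pool get <pool> min_size
# ceph osd pool set <pool> min_size 1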

Remapped

The placement group is temporarily mapped to a different set of OSDs from what CRUSH specified.
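
The CRUSH-calculated (up) and currently serving (acting) OSD sets of a PG can be compared with (pg.id is a placeholder):
# ceph pg map <pg.id>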

Undersized

The placement group has fewer copies than the configured pool replication level.

Peered

The placement group has peered but cannot serve client IO because it does not have enough copies to reach the pool's configured min_size parameter. Recovery may occur in this state, so the PG may eventually heal up to min_size.

IMPORTANT

A placement group can be in any of the above states without necessarily indicating a problem just because it is not active+clean. It should ultimately reach an active+clean state automatically, but manual intervention may sometimes be needed. Placement Groups in active+<some-state-other-than-clean> should still serve data, since the PG is still active.

Usually, Ceph tries to repair the Placement Groups and bring them back to active+clean, but PGs can end up in a stuck state in certain cases. The stuck states include:

Inactive

Placement groups in the Inactive state won't accept any I/O. They are usually waiting for an OSD with the most up-to-date data to come back up. If the UP set and ACTING set are the same, and the OSDs are not blocked on any other OSDs, this can be a problem with peering. Manually marking the primary OSD down forces the peering process to restart, since Ceph will automatically bring the primary OSD back up, and peering is kickstarted once an OSD comes up (see the example below).
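
As a sketch, assuming the stuck PG is 1.2f and its primary turns out to be osd.7 (both are placeholders):
# ceph pg map 1.2f
# ceph osd down osd.7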

Stale

The placement group is in an unknown state because the OSDs that host it have not reported to the monitor cluster in a while (configured by mon_osd_report_timeout).
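
The timeout can be checked on a monitor with the same admin-socket trick used above for the backfill ratio (the socket path may differ on your installation):
# ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep mon_osd_report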

Unclean

Placement groups contain objects that are not replicated the desired number of times. A very common reason for this is OSDs that are down, or OSDs with a 0 crush weight, which prevents the PGs from replicating data onto those OSDs and thus from reaching a clean state.
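
Down OSDs and OSDs with a 0 crush weight are easy to spot in the tree view:
# ceph osd tree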

Snaptrim

The following two PG states were added in the Jewel release for the snapshot trimming feature (a quick check follows the list):

  • snaptrim: The PGs are currently being trimmed
  • snaptrim_wait: The PGs are waiting to be trimmed
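
A rough way to check whether any PG is currently in one of these states (just a grep over the brief PG dump):
# ceph pg dump pgs_brief | grep snaptrim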

Identifying stuck placement groups

To identify stuck placement groups, execute the following:

# ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]

Note: For a more detailed explanation of placement group states, please check monitoring_placement_group_states.
