Most music has the bulk of its energy in the low and mid frequencies. There is not much energy near the top of the frequency range, where aliasing would be a concern.

Historically, in the early days of Delta-Sigma, many were concerned mainly with the frequency domain: high-frequency aliasing and phase distortion.
Thus, they used symmetric FIR filters. However, over time many realized that we also need to take care of the time domain, specifically pre-ringing (the red circle above), to keep the original punch and depth of the bass and mids.

If there is not much energy in the high-frequency region, we don't need to worry much about aliasing. Then, looking only at the time-domain filter coefficients, the best filter would be the one that looks most like an impulse and has no pre-ringing.

Imagine that you knock on a desk and make an impulse-like sound.
Imagine how it sounds and what its waveform looks like.
Then imagine how that impulse-like waveform changes as it passes through each filter, following the filter's impulse response.

The impulse spreads over time following the filter response. That means the energy and the thickness of the sound are divided across time. Depending on the time-domain shape of the filter coefficients, the energy is either concentrated in a shorter duration or spread out over time.
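To make the knock-on-the-desk picture concrete, here is a minimal sketch in plain Python. It passes a single-sample impulse through two toy filters and measures how much of the output energy arrives before the main peak, which is the pre-ringing. Everything in it is an assumption for illustration and is not taken from any real DAC: the 31-tap Hann-windowed sinc stands in for a symmetric (linear-phase) filter, the truncated exponential decay stands in for a one-sided filter with no pre-ringing, and the function names are made up.

```python
import math

def windowed_sinc(num_taps, cutoff):
    """Symmetric low-pass taps: Hann-windowed sinc (cutoff as a fraction of the sample rate)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC

def decaying_taps(num_taps, decay):
    """One-sided taps: truncated exponential decay, largest coefficient first."""
    taps = [decay ** n for n in range(num_taps)]
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC

def convolve(signal, taps):
    """Direct-form FIR filtering (full convolution)."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out

def energy_before_peak(response):
    """Fraction of total energy that arrives before the largest sample."""
    peak = max(range(len(response)), key=lambda i: abs(response[i]))
    total = sum(v * v for v in response)
    return sum(v * v for v in response[:peak]) / total

# The "knock": silence, one sample of impulse, silence.
impulse = [0.0] * 8 + [1.0] + [0.0] * 8

# Hypothetical filter parameters, chosen only for illustration.
pre_linear = energy_before_peak(convolve(impulse, windowed_sinc(31, 0.25)))
pre_minimum = energy_before_peak(convolve(impulse, decaying_taps(31, 0.5)))

print(f"energy before peak, symmetric filter: {pre_linear:.3f}")
print(f"energy before peak, one-sided filter: {pre_minimum:.3f}")
```

With the symmetric taps, part of the energy shows up ahead of the peak, while the one-sided taps put all of it at or after the peak. The two toy filters do not have matching frequency responses, so this only illustrates the time-domain shape, not a fair comparison of real filter options.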

Please note that makers have started to provide LPF options that remove pre-ringing and offer more impulse-like time-domain filter coefficients.

Filter phase, whether linear phase or minimum phase, won't make any significant difference. The difference in filter delay is extremely small, so we don't need to care about the delay.
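As a rough check on how small the delay is, a symmetric FIR filter with N taps delays the signal by (N - 1) / 2 samples. The sketch below converts that to milliseconds. The tap count and sample rate are hypothetical values picked for illustration, not figures from any specific DAC.

```python
def linear_phase_delay_ms(num_taps, sample_rate_hz):
    """Group delay of a symmetric FIR filter, (N - 1) / 2 samples, in milliseconds."""
    return (num_taps - 1) / 2 / sample_rate_hz * 1000.0

# Hypothetical example: 127 taps running at 44.1 kHz.
delay_ms = linear_phase_delay_ms(127, 44100)
print(f"{delay_ms:.2f} ms")
```

For these assumed values the delay is on the order of a millisecond, and a minimum-phase filter of similar length can only differ from it by about that much.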

The rule of thumb:
Choose what you prefer.

It depends on the music you usually listen to, the volume level, the characteristics of your IEMs and headphones, and your preference.
That's why makers provide filter selection as an option.