Interactive extraction of diverse vocal units from a planar embedding without the need for prior sound segmentation