Clustering task

Clustering is the grouping of a particular set of objects based on their descriptors, aggregating them according to their similarities.

A clustering task requires attribute_ids. To cluster by face attributes, set the descriptor type to “face” and use faces from Luna Faces or events from Luna Events. To cluster by bodies, set the descriptor type to “body” and get the attributes using events from Luna Events. In both cases, only objects with descriptors are processed.

An account_id is also required to create the task. A clustering threshold can optionally be specified.

The “save_images” flag is available: when it is set, existing images are placed in an images subfolder of the result archive. The clustering parameter “use_track_info” can also be specified optionally. In that case, objects with the same “track_id” are put in the same cluster.

Clustering process

Clustering is done in several steps:

  • collect objects having attribute ids using provided filters

  • match every object with all other objects

  • download objects’ track ids if required

  • create clusters as groups of “connected components” of the similarity graph.

    Here “connected” means that the similarity is greater than the provided threshold or, if no threshold is provided, the default “DEFAULT_CLUSTERING_THRESHOLD” from the config.
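The grouping step above can be sketched in a few lines of Python. This is only an illustration of connected components over a thresholded similarity graph, not the service's actual implementation: the function name `cluster`, its arguments, and the way track ids are merged are assumptions made for the example.

```python
from collections import defaultdict


def cluster(similarities, n_objects, threshold, track_ids=None):
    """Group objects into connected components of the similarity graph.

    similarities: dict mapping (i, j) index pairs to a similarity score.
    track_ids:    optional list with one track id (or None) per object;
                  objects sharing a track id are forced into one cluster
                  (hypothetical stand-in for "use_track_info").
    """
    parent = list(range(n_objects))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    # Two objects are "connected" when their similarity exceeds the threshold.
    for (i, j), score in similarities.items():
        if score > threshold:
            union(i, j)

    # Optionally merge objects that share a track id.
    if track_ids is not None:
        first_seen = {}
        for idx, track in enumerate(track_ids):
            if track is None:
                continue
            if track in first_seen:
                union(first_seen[track], idx)
            else:
                first_seen[track] = idx

    groups = defaultdict(list)
    for idx in range(n_objects):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


example_similarities = {(0, 1): 0.95, (1, 2): 0.85, (2, 3): 0.40, (3, 4): 0.60}
print(cluster(example_similarities, 5, threshold=0.81))
print(cluster(example_similarities, 5, threshold=0.81,
              track_ids=[None, None, None, "t1", "t1"]))
```

Note that objects 0 and 2 end up in the same cluster even though they were never matched above the threshold directly: being linked through object 1 is enough for a connected component.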

For large amounts of data (more than 10,000 descriptors), the following configuration parameters need to be increased:

  • “LUNA_TASKS_TO_MATCHER_TIMEOUTS”

  • “LUNA_PYTHON_MATCHER_TIMEOUTS”

  • “LUNA_PYTHON_MATCHER_PROXY_TIMEOUTS”

To increase performance (reduce execution time), the number of workers for the matcher can be increased or the matcher proxy can be avoided.

Additionally, the “TASKS_TO_MATCHER_CONCURRENCY” configuration parameter can be increased, which raises the limit on concurrent requests to the matcher.

IMPORTANT: The number of matcher workers and the “connection_pool_size” parameter must both be greater than or equal to the “TASKS_TO_MATCHER_CONCURRENCY” parameter.
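The constraint above is easy to check before changing the settings. The following sketch is a standalone helper, not part of the product: the function name `check_matcher_capacity` and its arguments are hypothetical, and the values have to be read from your own configuration.

```python
def check_matcher_capacity(matcher_workers, connection_pool_size, concurrency):
    """Return a list of problems; an empty list means the constraint holds.

    concurrency stands for the TASKS_TO_MATCHER_CONCURRENCY value.
    """
    problems = []
    if matcher_workers < concurrency:
        problems.append(
            f"matcher workers ({matcher_workers}) < "
            f"TASKS_TO_MATCHER_CONCURRENCY ({concurrency})"
        )
    if connection_pool_size < concurrency:
        problems.append(
            f"connection_pool_size ({connection_pool_size}) < "
            f"TASKS_TO_MATCHER_CONCURRENCY ({concurrency})"
        )
    return problems


# Example values only: 5 workers, a pool of 10 connections, concurrency 4.
for problem in check_matcher_capacity(5, 10, 4) or ["configuration is consistent"]:
    print(problem)
```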

Examples of memory usage and execution time with the matcher proxy enabled, 5 matcher workers, and TASKS_TO_MATCHER_CONCURRENCY = 4 (measured on a database of real, unique descriptors):

  • threshold=0.5

    Descriptors   RAM      Time           Non-zero elements
    10K           < 1 GB   < 10 sec       71587
    100K          < 1 GB   < 4 minutes    953096
    1M            < 3 GB   < 6 hours      33338128
    2M            < 9 GB   < 30 hours     133352512

  • threshold=0.81

    Descriptors   RAM      Time           Non-zero elements
    10K           < 1 GB   < 10 sec       64416
    100K          < 1 GB   < 3 minutes    648656
    1M            < 2 GB   < 4 hours      6911916
    2M            < 4 GB   < 22 hours     27647664

  • threshold=0.9

    Descriptors   RAM      Time           Non-zero elements
    10K           < 1 GB   < 10 sec       60354
    100K          < 1 GB   < 3 minutes    604842
    1M            < 2 GB   < 4 hours      6241064
    2M            < 4 GB   < 20 hours     24964256

Note: the results in the tables above are provided for informational purposes only and may differ significantly from your results depending on the data in the database and the number of workers.
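Because every object is matched with all other objects, the number of pairs grows quadratically with the number of descriptors, which is consistent with the steep growth of execution time. The sketch below computes the pair count and, assuming that “Non-zero elements” counts entries of an N x N similarity matrix (an interpretation, not something stated above), the share of the matrix that is non-zero. The descriptor counts are treated as exact round numbers for illustration, and the non-zero counts are the threshold=0.5 figures.

```python
def pair_count(n_descriptors):
    """Number of unordered pairs when each object is matched with all others."""
    return n_descriptors * (n_descriptors - 1) // 2


def matrix_density(non_zero_elements, n_descriptors):
    """Share of non-zero entries, assuming an N x N similarity matrix."""
    return non_zero_elements / (n_descriptors ** 2)


# (descriptors, non-zero elements) for threshold=0.5
rows = [
    (10_000, 71_587),
    (100_000, 953_096),
    (1_000_000, 33_338_128),
    (2_000_000, 133_352_512),
]

for n, non_zero in rows:
    print(f"{n:>9} descriptors: {pair_count(n):>16} pairs, "
          f"density {matrix_density(non_zero, n):.2e}")
```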

For details, see the clustering task.