[digikam] [Bug 369051] New: Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

            Bug ID: 369051
           Summary: Too low similarity threshold in fuzzy/duplicate search
                    bloats the results with potentially unwished
                    high-similarity results
           Product: digikam
           Version: 5.1.0
          Platform: Archlinux Packages
                OS: Linux
            Status: UNCONFIRMED
          Severity: wishlist
          Priority: NOR
         Component: Searches-Fuzzy
          Assignee: [hidden email]
          Reporter: [hidden email]

When you have many pictures, including variants of one picture in different
quality (e.g. due to resizing, conversion, or collage creation), the
lower-quality pictures may only be found with a low similarity threshold
(e.g. 45 %). But the result set will then contain all pictures with a
similarity between 45 % and 100 %. This can make the search for low-quality
variants frustrating. The ability to specify a maximum similarity may solve
the problem.
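
As a rough illustration of the request (not code from the patch), a maximum
similarity turns the single threshold into a closed interval. The sketch below
assumes a hypothetical Match record and a hypothetical
filterBySimilarityRange() helper; neither name exists in digiKam.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record: an image id and its similarity (in percent)
// to the reference image. Not a digiKam type.
struct Match
{
    long long imageId;
    double    similarity;
};

// Keep only the matches whose similarity lies in [minPercent, maxPercent].
std::vector<Match> filterBySimilarityRange(const std::vector<Match>& matches,
                                           double minPercent,
                                           double maxPercent)
{
    assert(minPercent <= maxPercent);

    std::vector<Match> kept;

    for (const Match& m : matches)
    {
        if (m.similarity >= minPercent && m.similarity <= maxPercent)
        {
            kept.push_back(m);
        }
    }

    return kept;
}
```

With a range of 45 % to 60 %, exact copies (100 %) are dropped from the
result while the resized or converted variants remain.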

Reproducible: Always

Steps to Reproduce:
1. Have many picture series you want to keep and some lower-quality variants
you want to get rid of.
2. Start a duplicate search with, let's say, 40 %.

Actual Results:  
You will get all pictures with a similarity above 40 %

Expected Results:  
It is designed to do that. But having an option to specify a maximum similarity
could be more convenient.

I implemented and tested that. Also, I can provide a patch file against the
master branch.
Here is the local commit message describing the implementation:
"Extended the findduplicatesview and fuzzysearchview with an
 additional QSpinBox which denotes the maximum similarity. The new QSpinBox
 has a minimum value that is the current value of the minimal similarity
 threshold. When the minimum threshold is altered, the range of the new
 QSpinBox is updated. If the minimum threshold is increased beyond the current
 value of the new QSpinBox, the value of the new QSpinBox is increased
 automatically. In the fuzzysearchview, altering the maximum similarity also
 triggers the rebuild of the similar images album. The extension can be highly
 valuable if you knowingly want to ignore almost identical images but want to
 find images that have a similarity of, let's say 50-60%, due to resizing,
 cropping or something similar, without bloating your image pane."
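
The coupling between the two spin boxes described in the commit message can
be sketched without Qt as follows. SimilarityRange and its members are
hypothetical names, and the 0-100 bounds and default values are assumptions;
the actual patch works on QSpinBox widgets instead.

```cpp
#include <algorithm>
#include <cassert>

// Plain C++ stand-in for the two coupled spin boxes (no Qt dependency).
struct SimilarityRange
{
    int minimum = 40;   // value of the existing "minimum similarity" box (assumed default)
    int maximum = 100;  // value of the new "maximum similarity" box (assumed default)

    // Lower bound of the range allowed for the maximum box.
    int maximumLowerBound() const
    {
        return minimum;
    }

    // Called when the user changes the minimum threshold.
    void setMinimum(int value)
    {
        minimum = std::max(0, std::min(value, 100));

        // If the minimum moved past the maximum, raise the maximum with it.
        maximum = std::max(maximum, minimum);

        assert(minimum <= maximum);
    }

    // Called when the user changes the maximum similarity;
    // values below the current minimum are clamped to it.
    void setMaximum(int value)
    {
        maximum = std::max(minimum, std::min(value, 100));

        assert(minimum <= maximum);
    }
};
```

In a Qt implementation the same effect would typically come from updating
the range of the maximum box whenever the minimum box changes.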

--
You are receiving this mail because:
You are the assignee for the bug.

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

--- Comment #1 from Mario Frank <[hidden email]> ---
Created attachment 101176
  --> https://bugs.kde.org/attachment.cgi?id=101176&action=edit
The patch for introducing a similarity interval

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

[hidden email] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]
            Summary|Too low similarity          |Too low similarity
                   |threshold in                |threshold in
                   |fuzzy/duplicate search      |fuzzy/duplicate search
                   |bloats the results with     |bloats the results with
                   |potentially unwished        |potentially unwished
                   |high-similarity results     |high-similarity results
                   |                            |[patch]

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

--- Comment #2 from [hidden email] ---
Mario,

The patch is very interesting and well implemented.

I plan to introduce your code after 5.3.0.

Q: currently, the icon view of fuzzy search results is not sorted by
similarity level. All items found are mixed. It could be a good idea to sort
the items in this view; this would increase usability. Your viewpoint?

Best

Gilles Caulier

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

--- Comment #3 from Mario Frank <[hidden email]> ---
Hey Gilles,
that is good news. I agree with you that usability would improve by ordering
the list of results in the left pane (as I understand it), where the
reference image and the count of similar images are shown.
But introducing an order here means changing the signature of the functions in
haariface. Since QMap automatically sorts by key, we could use this to
introduce an order to the result set. One quite easy way would be to wrap the
QMap<qlonglong,QList<qlonglong>> as the value of an avg-similarity map.
This would surely increase the memory consumption during the search. But the
automatic ordering by similarity would avoid a significant increase in
runtime.
After a quick look at the source code with grep, I found no possible
conflicts with other files concerning the definition of the result set.
Changing the return value types in haariface should most likely be safe. Should
I propose another patch for this issue?
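
A minimal sketch of the wrapping idea above, using std::map as a stand-in for
QMap (both iterate in ascending key order). The type aliases and helper names
are hypothetical and not taken from haariface, and the percent scale for the
average similarity is an assumption as well.

```cpp
#include <cassert>
#include <map>
#include <vector>

// reference image id -> ids of similar images (mirrors the existing result shape)
using DuplicateGroups = std::map<long long, std::vector<long long>>;

// average similarity -> all groups with that average; sorted by key automatically
using GroupsBySimilarity = std::map<double, DuplicateGroups>;

// Store one result group under its average similarity (assumed to be in percent).
void addGroup(GroupsBySimilarity& result,
              double avgSimilarity,
              long long referenceId,
              const std::vector<long long>& similarIds)
{
    assert(avgSimilarity >= 0.0 && avgSimilarity <= 100.0);

    result[avgSimilarity][referenceId] = similarIds;
}

// Reference ids ordered from highest to lowest average similarity.
// No explicit sort is needed; the map is simply walked backwards.
std::vector<long long> referencesBestFirst(const GroupsBySimilarity& result)
{
    std::vector<long long> ordered;

    for (auto it = result.rbegin(); it != result.rend(); ++it)
    {
        for (const auto& group : it->second)
        {
            ordered.push_back(group.first);
        }
    }

    return ordered;
}
```

Groups that share the same average similarity land in the same inner map,
which is why the existing map is wrapped as a value instead of being replaced.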

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

--- Comment #4 from [hidden email] ---
Yes, another patch in a separate report, please.
Thanks in advance

Gilles

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

[hidden email] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Version Fixed In|                            |5.4.0
      Latest Commit|                            |http://commits.kde.org/digi
                   |                            |kam/afe577f0b297a343ab412ce
                   |                            |95c1f75303edfb18b
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #5 from [hidden email] ---
Git commit afe577f0b297a343ab412ce95c1f75303edfb18b by Gilles Caulier.
Committed on 10/11/2016 at 04:48.
Pushed by cgilles into branch 'master'.

Apply big patch #101176 from Mario Frank

This one extended the findduplicatesview and fuzzysearchview with an
additional QSpinBox which denotes the maximum similarity. The new QSpinBox
has a minimum value that is the current value of the minimal similarity
threshold. When the minimum threshold is altered, the range of the new
QSpinBox is updated. If the minimum threshold is increased beyond the current
value of the new QSpinBox, the value of the new QSpinBox is increased
automatically. In the fuzzysearchview, altering the maximum similarity also
triggers the rebuild of the similar images album. The extension can be highly
valuable if you knowingly want to ignore almost identical images but want to
find images that have a similarity of, let's say 50-60%, due to resizing,
cropping or something similar, without bloating your image pane.
FIXED-IN: 5.4.0
CCMAIL: [hidden email]

M  +2    -0    app/utils/searchmodificationhelper.cpp
M  +1    -0    app/utils/searchmodificationhelper.h
M  +4    -3    libs/database/dbjobs/dbjob.cpp
M  +16   -5    libs/database/dbjobs/dbjobinfo.cpp
M  +7    -3    libs/database/dbjobs/dbjobinfo.h
M  +27   -16   libs/database/haar/haariface.cpp
M  +9    -8    libs/database/haar/haariface.h
M  +9    -2    libs/database/item/imagelister.cpp
M  +53   -25   utilities/fuzzysearch/findduplicatesview.cpp
M  +1    -0    utilities/fuzzysearch/findduplicatesview.h
M  +58   -11   utilities/fuzzysearch/fuzzysearchview.cpp
M  +2    -1    utilities/fuzzysearch/fuzzysearchview.h
M  +16   -10   utilities/maintenance/duplicatesfinder.cpp
M  +2    -2    utilities/maintenance/duplicatesfinder.h

http://commits.kde.org/digikam/afe577f0b297a343ab412ce95c1f75303edfb18b

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

--- Comment #6 from [hidden email] ---
Mario,

Your patch is now applied to the current implementation and will be available
in the next 5.4.0 release.

The next step for me is to review your new patch from bug #372217. Note that
your next patch must certainly close bug #302923 (please confirm).

In parallel, can you check what can be done to further improve the duplicate
search tool regarding:

- bug #261417: the search album counter is not updated.
- bug #353331: this one can most likely be closed, as we can limit the
search to a specific physical or virtual album. Please just review to confirm.
- bug #207188: as I remember, the algorithm that computes fingerprints from
images takes color content into account (otherwise it would make no sense...).
So I'm not sure if this report is valid...
- bug #274360: I cannot figure out why some kinds of image types are ignored.
All image formats supported by digiKam will be processed during fingerprint
computation and searches.

Again, thanks for your contributions. I appreciate the quality of your patches,
which are a pleasure to review.

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

--- Comment #7 from [hidden email] ---
>The next step for me is to review your new patch from bug #372217. Note that
>your next patch must certainly close bug #302923 (please confirm).

I will answer myself:

Your patch from bug #372217 cannot solve bug #302923, because the patch is
dedicated to sorting the search albums in the left sidebar, not the icon view
in the center.

I would appreciate a patch for the icon-view model/view to be able to sort by
similarity level. Thanks in advance.

Gilles Caulier

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

--- Comment #8 from Mario Frank <[hidden email]> ---
Hey Gilles,
Many thanks for the judgement about the quality of my patches.
I will try to fix what I can. Some of the "bugs" do not seem to be hard to fix.
Some others could be more complex.

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

--- Comment #9 from Mario Frank <[hidden email]> ---
By the way: the CCMAIL is incorrect. The correct one is
[hidden email]. If the dot should be a problem, just use
[hidden email].

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

Wolfgang Scheffner <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

--- Comment #10 from Wolfgang Scheffner <[hidden email]> ---
Before I update the doc accordingly: shouldn't the labeling be changed now to
"Similarity range" or at least "Thresholds"?

[digikam] [Bug 369051] Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]

bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=369051

--- Comment #11 from Mario Frank <[hidden email]> ---
I agree, Wolfgang. Similarity range is a better description here.
Moreover, I just realised that it is not possible to set a range in the
maintenance dialog. I will open a new report for both parts and submit a patch.
