[digikam] [Bug 376661] New: When importing ~200,000 video files Digikam crashes in about 2-5 seconds of starting.

bugzilla_noreply

https://bugs.kde.org/show_bug.cgi?id=376661

Bug ID: 376661
Summary: When importing ~200,000 video files Digikam crashes in
about 2-5 seconds of starting.
Product: digikam
Version: 5.4.0
Platform: MS Windows
OS: MS Windows
Status: UNCONFIRMED
Severity: crash
Priority: NOR
Component: Import-Scanner
Assignee: [hidden email]
Reporter: [hidden email]
Target Milestone: ---

I added a bunch of folders that contain ~200,000 video files and hit refresh to
scan them into the database. Digikam crashes after about 2-5 seconds. This is
repeatable. Digikam will not add 200,000 video files.

--
You are receiving this mail because:
You are the assignee for the bug.

[hidden email] changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[hidden email]

--- Comment #1 from [hidden email] ---
Is this reproducible with the 5.5.0 pre-release?

https://drive.google.com/drive/folders/0BzeiVr-byqt5Y0tIRWVWelRJenM

Gilles Caulier

--- Comment #2 from Poz <[hidden email]> ---
Yes, same thing happens with 5.5.0pre release.

[hidden email] changed:

What |Removed |Added
----------------------------------------------------------------------------
Version|5.4.0 |5.5.0

--- Comment #3 from [hidden email] ---
Maik,

Which solution can we apply to fix this entry:

1/ Disable autocompletion in the tree search field, and ask the Qt team to
open up the QCompleter API so that we can use methods that are currently private.
2/ Reuse KCompletion by backporting its classes into digiKam core, with an API
adjusted for digiKam.

Gilles

Maik Qualmann <[hidden email]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[hidden email]

--- Comment #4 from Maik Qualmann <[hidden email]> ---
Gilles,

I think you mean Bug 368468. This bug here has a different cause, possibly a
crash in Exiv2.

To Bug 368468:
The QCompleter is not the performance problem; that is fixed by a QTimer. The
main problem is that adding items to the QTreeView gets slower and slower.

Maik

--- Comment #5 from Maik Qualmann <[hidden email]> ---
An edit function for the first minutes after the comment would not be bad...

Maik

--- Comment #6 from [hidden email] ---
Poz,

We need a debugger backtrace to investigate in detail.

See this page for details:

https://www.digikam.org/contrib

Gilles Caulier

--- Comment #7 from [hidden email] ---
Maik,

In comment #4 you talk about items being added to the QTreeView more and more
slowly.

Where exactly is the problem located? Did you profile execution time with
Valgrind? Is it in the digiKam treeviewitem widget implementation? In the
digiKam model populated by the DB? In the DB interface that fetches the data to
host in the widget? In the Qt5 implementation?

Gilles

--- Comment #8 from [hidden email] ---
Maik,

At my office I wrote a fast shared-memory mapping viewer in Qt5 using the
QTreeView/item classes. I create the items in the tree view with no data, and I
populate all items in a separate thread because it takes a lot of time.
At the end I call a tree view update in the main thread (X11 is not
re-entrant). It's very fast. The number of items in the tree view is very large
(more than 1000 entries).

Can we do the same in digiKam ?

Gilles

--- Comment #9 from Poz <[hidden email]> ---
Still running the 5.5.0pre.
Okay, so I went to https://www.digikam.org/contrib and tried a few things
with limited success; I will try more tomorrow.
First, gdb on Windows is not working well. I type in 'catch throw' and get
back 'Catchpoint 1 (throw)', which seems good. Then I type in 'run' and get back:
-
Starting program:
No executable specified, use `target exec'.
-
Not sure what to do here??

The second thing I tried is the third-party debug tool from Sysinternals:
https://technet.microsoft.com/en-us/sysinternals/bb896647.aspx
Looks like some bad stuff happening for about 10.2 seconds before it crashes:
00000009 1.02899146 [17040] digikam.general: Trying to load Embedded preview with libraw
00000010 1.02921200 [17040] digikam.rawengine: Failed to load embedded RAW preview
00000011 1.02923596 [17040] digikam.general: Trying to load half preview with libraw
00000012 1.02927971 [17040] digikam.general: Trying to load Embedded preview with Exiv2
00000013 1.04443121 [17040] digikam.dimg: "Removed file path and name" : QIMAGE file identified
00000014 1.04464126 [17040] digikam.dimg.qimage: Can not load " "Removed file path and name" " using DImg::QImageLoader!
00000015 1.04492271 [17040] digikam.general: mimetype = "" ext = "MOV"
00000016 1.04507148 [17040] digikam.general: Cannot create thumbnail for "Removed file path and name"
00000017 1.04512084 [17040] digikam.general: Thumbnail is null for "Removed file path and name"

I removed the file path and name for privacy reasons.
This repeats for various videos until the crash, and takes about 2/10ths of a
second per loop? (Judging from the snippet I gave you.) The video file types
vary: avi, flv, mov, mp4, and more; the example above is just mov.

This happens before the loops start when I hit refresh:
00000005 0.91890234 [17040] digikam.general: Using 8 CPU core to run threads
00000006 0.91933465 [17040] digikam.general: Action Thread run 1 new jobs
00000007 0.93396312 [17040] digikam.general: Cancel Main Thread
00000008 0.93400776 [17040] digikam.general: One job is done

I will try to get more info tomorrow.

Also two other questions. I turned off the album sync at startup because it
was crashing. How do I start it manually? I thought that is what refresh
does, but apparently refresh only updates the thumbnails.
Also, is it possible to do the FUZZY search on the thumbnails to find potential
duplicates? This is my real intent. I want to cut those 200,000 videos down to
100,000.
If not, is this a future feature? Can it be one? High demand I think.

--- Comment #10 from Poz <[hidden email]> ---
Spent some more time trying to figure out how to provide more data. While
running the debugger I also found this line:
[11624] digikam.metaengine: Exiv2 ( 3 ) : Xmp.video.Metadata dataLength was
found to be larger than 5000 entries considered invalid; not read.

If there is anything else I can do to help debug this, let me know! Thank you.

--- Comment #11 from [hidden email] ---
The xmp warning is not the problem.

But it's known that Exiv2 has many problems with video files.

I recommend not trying to scan your huge collection in one go.

Start with a fresh database and add video files in chunks, step by step, until
the crash appears. The goal is to isolate the file which introduces the
dysfunction.
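
The chunk-by-chunk isolation described above can be treated as a bisection. What follows is only a minimal Python sketch and not part of digiKam: `find_crashing_file` and the `crashes` callback are hypothetical names. It assumes that a single file is enough to trigger the crash and that a chunk crashes if and only if it contains that file. In practice `crashes` would stand for "scan this chunk into a fresh database and check whether digiKam survives", whether done by hand or by a script.

```python
from typing import Callable, Optional, Sequence


def find_crashing_file(
    files: Sequence[str],
    crashes: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Bisect `files` down to one file for which `crashes` reports True.

    Assumption (hypothetical): one single file triggers the crash, and a
    chunk crashes exactly when it contains that file.
    """
    if not files or not crashes(files):
        return None  # nothing to isolate
    lo, hi = 0, len(files)  # invariant: files[lo:hi] contains the bad file
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(files[lo:mid]):
            hi = mid  # the bad file is in the first half
        else:
            lo = mid  # otherwise it must be in the second half
    return files[lo]
```

Under these assumptions a collection of about 200,000 files needs roughly 18 halvings, so fewer than 20 test scans in total.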

After that, report the problem to the Exiv2 bugzilla with the identified video
file for investigation. As the DK Windows installer includes the current Exiv2
source code, we can rebuild DK for Windows with the latest fix from Exiv2.

For your problem with GDB under Windows, if the command-line version won't
start digiKam (even though it works on my VM with Windows 7), you need to open
a console and go to the directory where the gdb and digikam executables are
installed (it's the same dir).

After that it's simple. See this generic page for details:

http://stackoverflow.com/questions/4671900/how-do-i-use-the-mingw-gdb-debugger-to-debug-a-c-program-in-windows

Gilles Caulier

--- Comment #12 from [hidden email] ---
> Also, is it possible to do the FUZZY search on the thumbnails to find
> potential duplicates? This is my real intent. I want to cut those 200,000
> videos down to 100,000.
> If not, is this a future feature? Can it be one? High demand I think.

Poz,

The Fuzzy Search currently works only with still images.

A similar function for video would need an algorithm that creates a
fingerprint of the first frame of the video, in order to compare it later with
the DB.

This is how the fuzzy tool currently works: a simplified wavelet matrix is
computed from the still image, and we compare the matrices with each other to
find similarities.

For video we need a new matrix with the spatial information of the video. Not
impossible, but complex to write and test.

Gilles Caulier

--- Comment #13 from Poz <[hidden email]> ---
Are the thumbnails not readily available to do the fuzzy search on? I know they
are not the biggest but I think they are big enough, or if there is a setting
to render them a slightly higher resolution... That is how I imagined it would
work anyways, since the thumbnails would already be generated, half the work is
already done to fuzzy search videos...

Mario Frank <[hidden email]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[hidden email]

--- Comment #14 from Mario Frank <[hidden email]> ---
(In reply to Poz from comment #13)
> Are the thumbnails not readily available to do the fuzzy search on? I know
> they are not the biggest but I think they are big enough, or if there is a
> setting to render them a slightly higher resolution... That is how I
> imagined it would work anyways, since the thumbnails would already be
> generated, half the work is already done to fuzzy search videos...

Hey Poz,

Sadly, it is not this easy.
The fuzzy search creates a signature from images. This does not hold for
videos. Videos are quite more complex as the signature creation must be
uniformly done for all videos. But if videos have black frames in the
beginning, the search would lead to results which are, let's say, rubbish. The
most stable way I see is to take the first frame from every video that is not
plain, i.e. single-coloured. But this means we would have to generate images
until we find the first appropriate frame. This would slow down the
fingerprints generation significantly.
A stable implementation is not trivial here. I will think about a way more
closely over the weekend.
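
The "first frame that is not plain" idea can be sketched as follows. This is only an illustration in Python, not digiKam code: frames are assumed to be already decoded into non-empty grayscale pixel grids, and the names `is_plain`, `first_non_plain_index` and the `tolerance` threshold are hypothetical choices made for the example.

```python
from typing import Iterable, List, Optional

Frame = List[List[int]]  # assumed format: grayscale pixels 0-255, row by row


def is_plain(frame: Frame, tolerance: int = 8) -> bool:
    """Treat a frame as single-coloured if all pixels lie within `tolerance`."""
    pixels = [pixel for row in frame for pixel in row]
    return max(pixels) - min(pixels) <= tolerance


def first_non_plain_index(
    frames: Iterable[Frame], tolerance: int = 8
) -> Optional[int]:
    """Return the index of the first frame that is not plain, or None."""
    for index, frame in enumerate(frames):
        if not is_plain(frame, tolerance):
            return index
    return None
```

Because `frames` may be a lazy iterator, decoding can stop as soon as a usable frame is found, but every leading black frame still has to be decoded first, which is the slowdown mentioned above.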

Best,
Mario

--- Comment #15 from Mario Frank <[hidden email]> ---
Hi again,

This will be a quite long text - sorry. But I want to make the problems as
clear as possible.

I thought about the fuzzy search for videos a bit more during my train travel.
In fact, even the first non-plain frame is worthless. If a user really wants to
use digiKam as catalog for videos (which is not the scope of digiKam in first
place IMHO), he will potentially have videos that have the same beginning, i.e.
intro but are different videos. Thus, also the first non-plain frame will
potentially lead to rubbish. I remember that I found some tools to find video
duplicates. The process they applied was to take the first n images of a video
and compare it to all others. A quite bad process IMO as with m videos you
generate n*m images and then have to make a comparison. This is awfully bad
from the view of complexity theory. And in practice, this process is, as can be
expected, awfully slow.

Nevertheless, this process is probably the best way to really recognise
duplicate videos. So, a way could be to generate a fingerprint over the first
or last n images (slows down fingerprint generation extremely). This still is
not robust, as many videos may have the same intro (at least the first m
seconds, e.g. about m*25 frames). Usual intros take many seconds. So a *rather*
stable approach would be to take 1000 frames. As you can imagine, this is a big
amount of data to compute fingerprints for. Just imagine your 200,000 videos.
Fingerprinting them would mean generating 200,000,000 images. Every image must
be generated, which is not a constant-time process but at least linear time.
So, even with 1000 videos, I would expect computation time to be measured in
hours, not minutes.
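
The arithmetic behind this estimate can be written down as a small Python sketch. The function names are hypothetical, and the decode rate used in the note below is a placeholder assumption, not a measurement of digiKam or of any decoder.

```python
def frames_to_decode(videos: int, frames_per_video: int) -> int:
    """Total number of frames that must be generated for fingerprinting."""
    return videos * frames_per_video


def estimated_hours(
    videos: int, frames_per_video: int, decoded_frames_per_second: float
) -> float:
    """Rough wall-clock estimate, given an ASSUMED sustained decode rate."""
    total = frames_to_decode(videos, frames_per_video)
    return total / decoded_frames_per_second / 3600.0
```

`frames_to_decode(200_000, 1000)` gives the 200,000,000 images mentioned above. With an assumed rate of 100 frames per second, 1000 videos would take about 2.8 hours and the full 200,000 videos about 556 hours; a different rate scales both figures proportionally.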

Let's take a look from the other side, outros are far more distinct than
intros. So, a lower number n can be taken, e.g. 100. This reduces the time
quite a lot. But is probably still not satisfying.

If there are no intros/outros, or only short ones, a few images should be
sufficient and the process could work quite well.

But we cannot predict how the videos are structured. The FPS count may/will
differ from video to video. So, working on frames explicitly may again lead to
low-quality results. So, the best way would be to take the first/last n seconds,
and then the complexity cannot really be estimated.
Also, I think users should decide for themselves how many seconds are taken
(configuration) and whether the beginning or the ending should be taken
(configuration again).
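
Working in seconds instead of frames means the frame range depends on each video's FPS. A minimal Python sketch of that conversion follows; `frame_window` and its parameters are hypothetical names for illustration, and a constant frame rate per video is assumed.

```python
def frame_window(
    total_frames: int, fps: float, seconds: float, from_end: bool = False
) -> range:
    """Frame indices covering the first (or last) `seconds` of a video.

    Assumes a constant frame rate; the window is clamped to the video length.
    """
    count = min(total_frames, max(0, round(seconds * fps)))
    if from_end:
        return range(total_frames - count, total_frames)
    return range(0, count)
```

For the same 4 seconds, a 25 FPS video yields 100 frames and a 30 FPS video yields 120, which is why a fixed frame count would cover different amounts of content per video.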

So, *if* this feature should be implemented, I see the following options for
users:
1) Take the first non-plain frame for fingerprinting (fast, probably not
suitable for e.g. cinema movies)
2) Take the n first seconds for fingerprinting (probably awfully slow, may be
suitable for e.g. cinema movies, overkill for self-produced movies)
3) Take the n last seconds for fingerprinting (probably slow, probably suitable
for e.g. cinema movies, less overkill for self-produced movies)

In a more precise algorithmic way, we would need an adaptation of the
fingerprints maintenance stage:
Option 1: take the first non-plain frame for video fingerprints
Option 2: take the Option(number n) Option(first,last) seconds for video
fingerprinting.
Changing the current options *must* trigger deletion of the current video
fingerprints, as otherwise fingerprints generated in different ways would
coexist, which leads to wrong results - unless rebuilding all fingerprints is
chosen.
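
The rule that changing the options must invalidate existing video fingerprints could look like the following Python sketch. Everything here is hypothetical and for illustration only: `VideoFingerprintOptions`, `FingerprintStore` and their fields are not digiKam classes, and the in-memory dictionary stands in for whatever storage would really be used.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class VideoFingerprintOptions:
    """Hypothetical settings mirroring Option 1 and Option 2 above."""
    mode: str = "first_non_plain_frame"  # or "seconds"
    seconds: int = 0                     # only meaningful when mode == "seconds"
    from_end: bool = False               # False: first n seconds, True: last n


class FingerprintStore:
    """Keeps video fingerprints that were all built with the same options."""

    def __init__(self, options: VideoFingerprintOptions) -> None:
        self.options = options
        self.fingerprints: Dict[str, bytes] = {}

    def set_options(self, options: VideoFingerprintOptions) -> bool:
        """Apply new options; return True if old fingerprints were dropped."""
        if options == self.options:
            return False
        self.options = options
        self.fingerprints.clear()  # mixed fingerprint kinds must never coexist
        return True
```

Comparing the frozen dataclass instances by value keeps the check simple: any change to mode, duration, or direction discards the old fingerprints, and re-applying identical options leaves them alone.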

Then, the fuzzy search could probably work without adaptations - but I am not
completely sure if it would work out of the box.

Best,
Mario

--- Comment #16 from [hidden email] ---
Mario,

At my office we capture infrared shot sequences of events in a Tokamak to
catch physical dysfunctions during experiments.

A video can take more than 2 minutes in HD, but not much more. More than 20
experiments can be done in a day. All videos are stored losslessly in a
database.

There are no camera movements. Only the plasma inside the machine changes the
content. Depending on the experiment parameters, the video content will be
different.

We have a process to recognize similar videos in the database. It is written
in Matlab. As far as I know, the process cuts the first frames where there is
nothing (black hole) until the light begins. After that, a wavelet fingerprint
is computed from a flat image taken from some frames inside the video. The
whole video is not analyzed, but the algorithm tries to detect the edge of
change and adjusts the fingerprint by parsing a section of the movie. This is
how the spatial (temporal) dimension is processed.

For each file, the fingerprint can give the average similarity of the video
compared to the others. When physicists want to look into experiments, they
just take a video made with given Tokamak settings and look whether another one
is similar. The goal is to see if physical events are similar even if the
parameters are different.

Of course, it's a special use case, as the videos are static shots with
changing content, but I think the process is not too bad if we want to apply it
to a small section of DSC movies.

Note: I know just the theory. The code is not available, of course.

--- Comment #17 from Poz <[hidden email]> ---
Wow the discussion here is fantastic. Thank you for the time and thought!

So yes, the approach I suggested of just using the thumbnails is clearly not
robust enough given the wide array of video content out there.
I think a lot of the problems come from very uniform videos, for example
standard intros or outros. My case has very non uniform videos (without any
intro or outros) where I can run through windows explorer and find duplicates
myself from simply looking at the thumbnails so I know at least 20% are
duplicates just from simple observation. The problem is that it is too much to
go through that many files and click each one individually. I have used Digikam
before on photos for duplicates and was amazed at how well it worked so
naturally I thought, 'man, I wish I could get digikam to access these
thumbnails for me, I could get rid of +95% of these duplicates in a day'. I
know there could be false positives, but I could live with 1% or something like
that. To further get rid of false positives there could be a video length
option of +-X seconds (default at 2 or something).
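
The video length option of +-X seconds can be sketched as a pre-filter that only lets pairs of similar duration through to the expensive comparison. This is a minimal Python illustration with hypothetical names (`length_compatible_pairs`, `tolerance`); it assumes the durations in seconds are already known, for example from the database.

```python
from typing import Dict, List, Tuple


def length_compatible_pairs(
    durations: Dict[str, float], tolerance: float = 2.0
) -> List[Tuple[str, str]]:
    """Return pairs of videos whose durations differ by at most `tolerance`."""
    ordered = sorted(durations.items(), key=lambda item: item[1])
    pairs = []
    for i, (name_a, length_a) in enumerate(ordered):
        for name_b, length_b in ordered[i + 1:]:
            if length_b - length_a > tolerance:
                break  # sorted by length, so later entries are even further away
            pairs.append((name_a, name_b))
    return pairs
```

Sorting first means each video is only compared against its near neighbours in length rather than against all other videos.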

I currently use http://www.alldup.de/alldup_help/alldup.php
The content method works very well, I would say less than 0.001% false
positives. But it misses so very very much. It can take up to 48 hours to run,
but builds a database so it only compares new files added into the search. I
even use the file size method, for large files, this works very well. Smaller
files (<10 mb?) tend to have more false positives. Unfortunately due to
different compression and file types this does not catch them all either.
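
The file size method mentioned above amounts to grouping files by their exact byte size and keeping only the groups with more than one member. The sketch below is a generic Python illustration, not how AllDup or digiKam implement it; `group_by_size` is a hypothetical name.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_size(paths: Iterable[str]) -> Dict[int, List[str]]:
    """Map each file size shared by two or more files to those files."""
    by_size: Dict[int, List[str]] = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    # Sizes that occur only once cannot be exact duplicates of anything.
    return {size: group for size, group in by_size.items() if len(group) > 1}
```

As noted above, identical size is only a hint: small files collide more often, and the same video re-encoded with a different compression or container will have a different size and be missed.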

I think in the end, until computer hardware is faster, video duplicate searches
will require a number of different methods and some user input. Until then that
is what we have to work with/ around. I was just hoping for another way to slim
down on this video database. Thumbnail seemed like low hanging fruit.

--- Comment #18 from [hidden email] ---
Ok,

I disabled video metadata support in the Exiv2 shared library used with the
Windows installer. The new version can be downloaded from the GDrive repository
in a few minutes:

https://drive.google.com/drive/folders/0BzeiVr-byqt5Y0tIRWVWelRJenM

Can you reproduce the problem with this version ?

Typically, the video file will be registered in database, but video metadata
will not be parsed to populate the database.

Thanks in advance for your feedback

Gilles Caulier

--- Comment #19 from Poz <[hidden email]> ---
I tried the version with disabled video metadata support in Exiv2 shared
library that you just posted.
It allows me to import all of the video files! Success! However they all appear
to be gray boxes with no thumbnails. Perhaps this is a separate issue?
