Hi,
one important issue with digital images is the question of backup
(e.g. CD/DVD or other optical media, several separate hard disks,
off-site hard disks, ...). Another (maybe often overlooked?) aspect is
whether the data (both on the master disk and on the backups) are
still correct. It may well happen that files simply get corrupted on
the hard disk. (I recently had such an experience, where fortunately
an old backup on CD allowed me to recover the few affected files.)

So my question is: how do you ensure the correctness of your data?
What methods are usable, and could one maybe integrate/provide part
of the needed tools inside digikam?

One approach might be to use a hash value (e.g. md5):

- Digikam could compute a hash value for every image and store it
  inside digikam's database. This would allow an additional tool to
  periodically check for any possible changes (= corruption) of images.
- Of course, if an image gets changed (e.g. by adding comments,
  ratings, tags or other metadata), the hash needs to be recomputed by
  the photo management application. ((Another possibility is to only
  compute the hash of the image data itself, but I think that a hash
  of the full file is better.))
- Also, one might even think of checking the hash before editing an
  image, to ensure that it did not get corrupted. ((And maybe, for the
  paranoid: even after saving a file one could compare it with the
  data in memory?))

For backups one could add a file with all the hash values for the
files, or each image file could be supplemented by a *.hash file.
Again, with a (simple) tool these hash values could be recomputed and
compared.

While maybe not yet fully sophisticated, this would already be better
than blindly believing that all files on the hard disk are still ok ;-).

Are there any other important aspects digikam would need in order to
enable checks of data integrity?

Note that this is to some extent related to
- "Md5 Checksums to identify pictures"
  http://bugs.kde.org/show_bug.cgi?id=110066
- "Uniquely identifying each image in a collection of images"
  http://bugs.kde.org/show_bug.cgi?id=125736
- "backup on dvd (and maybe sync with dvd-ram?)"
  http://bugs.kde.org/show_bug.cgi?id=113715

Any comments, thoughts, suggestions are very welcome!

Best, Arnd
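As an illustration of the hash-and-store idea above, a minimal sketch
assuming Python and a standalone SQLite file (the database path, table
and column names are made up for illustration; digikam's real database
is not touched here):

    import hashlib
    import os
    import sqlite3

    ROOT = os.path.expanduser("~/Pictures")          # hypothetical album root
    DB = os.path.expanduser("~/.image-hashes.db")    # hypothetical hash database

    def md5_of_file(path, chunk_size=1 << 20):
        """Compute the md5 of the full file, reading it in chunks."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    con = sqlite3.connect(DB)
    con.execute("""CREATE TABLE IF NOT EXISTS filehashes
                   (path TEXT PRIMARY KEY, md5 TEXT, mtime REAL)""")

    for dirpath, _dirs, files in os.walk(ROOT):
        for name in files:
            path = os.path.join(dirpath, name)
            con.execute("INSERT OR REPLACE INTO filehashes VALUES (?, ?, ?)",
                        (path, md5_of_file(path), os.path.getmtime(path)))

    con.commit()
    con.close()

A separate check pass would re-run md5_of_file() for every stored row
and flag files whose hash changed although the recorded modification
time did not.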
On Wednesday 16 January 2008, Arnd Baecker wrote:
> Hi,
>
> one important issue with digital images is the question of backup
> (e.g. CD/DVD or other optical media, several separate hard disks,
> off-site hard disks, ...).

I use rsync to an off-site server. My mother uses rsync to a removable
disk. I work for a company that will be providing affordable remote
backup as a service to end users.

> Another (maybe often overlooked?) aspect is whether the data (both on
> the master disk and on the backups) are still correct. It may well
> happen that files simply get corrupted on the hard disk.

Sure, that's unlikely but not impossible.

> (I recently had such an experience, where fortunately an old backup
> on CD allowed me to recover the few affected files.)
>
> So my question is: how do you ensure the correctness of your data?

Use proper hardware with ECC memory. No overclocked/non-ECC systems.

> What methods are usable, and could one maybe integrate/provide part
> of the needed tools inside digikam?
>
> One approach might be to use a hash value (e.g. md5):
>
> - Digikam could compute a hash value for every image and store it
>   inside digikam's database. This would allow an additional tool to
>   periodically check for any possible changes (= corruption) of
>   images.

But the application should not do this.

The reason why we use a modern OS on modern hardware is that the OS
and the hardware work together to provide us with everything from
hardware abstraction to error correction.

ZFS provides the extra layer of error detection, along with error
correction, that you seek. I believe time would be better spent
lobbying for a port of that technology to Linux (one way or another -
yes, I am aware that there are complications), rather than trying to
patch a little bit of that functionality into a single application
(which will not help the general case).

Anyway, that's my take on the issue. I believe any improvement that is
needed, is needed on a general, system-wide scale, not on a
single-application scale.

--
Jakob Østergaard Hegelund
We're wandering off topic a little here...
* [hidden email] [2008-01-17 09:29:59]
> Still, how would ZFS allow to check the integrity of backed-up files
> in comparison with those on the master hard-disk?

From the "above the filesystem" viewpoint ZFS doesn't provide anything
special here. You can use fingerprinting, checksums, etc.

The benefit of ZFS is that you can create a filesystem that has
redundant storage, and ZFS ensures the validity of the data in each
copy that it keeps (using checksums).

In essence, with ZFS you don't worry about the validity of the data at
the application layer; the filesystem does it for you.

dme.
--
David Edmondson, http://dme.org
On Thu, 17 Jan 2008, David Edmondson wrote:
> We're wandering off topic a little here... ;-)
>
> * [hidden email] [2008-01-17 09:29:59]
> > Still, how would ZFS allow to check the integrity of backed-up
> > files in comparison with those on the master hard-disk?
>
> From the "above the filesystem" viewpoint ZFS doesn't provide
> anything special here. You can use fingerprinting, checksums, etc.
>
> The benefit of ZFS is that you can create a filesystem that has
> redundant storage, and ZFS ensures the validity of the data in each
> copy that it keeps (using checksums).
>
> In essence, with ZFS you don't worry about the validity of the data
> at the application layer; the filesystem does it for you.

So ZFS sounds really good!

Now (trying to get back to the main topic): would some checksum
system, integrated into digikam, be useful for ensuring data integrity
of backups? I think it wouldn't be too difficult to implement
something like this (I briefly discussed it with Marcel on IRC, and
with digikam >= 0.10 such additions to the database will be easy).

Note that it might come with a bit of a speed penalty when
images/metadata get changed; however, this could be made configurable.

Any further ideas/opinions?

Best, Arnd
On Thu, 17 Jan 2008, Arnd Baecker wrote:
[...]
> Would some checksum system, integrated into digikam, be useful for
> ensuring data integrity of backups? I think it wouldn't be too
> difficult to implement something like this (I briefly discussed it
> with Marcel on IRC, and with digikam >= 0.10 such additions to the
> database will be easy). Note that it might come with a bit of a
> speed penalty when images/metadata get changed; however, this could
> be made configurable.

So, in order not to just talk about stuff but to try it out, I set up
two python scripts which

A) generate a recursive tree which contains, for each file below
   digikam's root (e.g. ~/Pictures), a corresponding md5sum *.hash
   file;

B) check, for each file in the backup, whether the checksum matches.

Interestingly, in my case this already revealed around 500 files which
did not match. (In this particular case it was essentially a user
error, because I changed the metadata (GPS info) for those files, but
without changing the file date. As I used rsync such that it would not
copy over these files, the backup went out of sync.)

So without a hash comparison I would never have noticed the
inconsistency!

Well, in my opinion we should get some tools to enable checks of data
integrity into digikam itself ...

Any thoughts/comments/suggestions/... are welcome, to flesh out the
ideas of what would be necessary/what makes sense/...!

Best, Arnd
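Roughly, scripts A) and B) could look like the sketch below (assuming
Python; the *.hash convention, the ~/Pictures root and the
command-line backup path are illustrative, not the actual script that
was later sent around off-list):

    import hashlib
    import os
    import sys

    def md5_of_file(path):
        """md5 of the complete file contents, read in chunks."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def generate_hashes(root):
        """A) Write a <name>.hash file next to every file below root."""
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if name.endswith(".hash"):
                    continue
                path = os.path.join(dirpath, name)
                with open(path + ".hash", "w") as out:
                    out.write(md5_of_file(path) + "\n")

    def verify_backup(root, backup_root):
        """B) Compare each backup copy against the hash recorded for the master."""
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if name.endswith(".hash"):
                    continue
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                with open(os.path.join(root, rel + ".hash")) as f:
                    recorded = f.read().strip()
                if md5_of_file(os.path.join(backup_root, rel)) != recorded:
                    print("MISMATCH:", rel)

    if __name__ == "__main__":
        # usage: python check_backup.py /path/to/backup
        generate_hashes(os.path.expanduser("~/Pictures"))
        verify_backup(os.path.expanduser("~/Pictures"), sys.argv[1])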
On Jan 21, 2008 1:44 PM, Arnd Baecker <[hidden email]> wrote:
> On Thu, 17 Jan 2008, Arnd Baecker wrote:
> [...]

Hello Arnd,

What options are you passing to rsync? If you give it the '-c' option,
rsync will skip based on a checksum instead of mod-time and size. This
would at least make your backup consistent with your master. However,
it would not avoid the original-corrupted-then-backed-up issue you
brought up earlier.

As I think about this, it sounds like implementing an SCM. Basically,
you want to know whether a file has changed on disk, with or, in your
case, without intention. In theory, when you have a new file you would
'check it in' to the picture repository. If you make changes, you
'check in' the new version of the file. In your case a "check-in"
would be to create a check-sum of the file.

This leads me to thinking about the "Versioned image" request that is
already in digikam. Perhaps a single solution would handle both cases?

Best Regards,
Gerry
On Monday 21 January 2008, Arnd Baecker wrote:
> [...]
> So, in order not to just talk about stuff but to try it out, I set up
> two python scripts which
>
> A) generate a recursive tree which contains, for each file below
>    digikam's root (e.g. ~/Pictures), a corresponding md5sum *.hash
>    file;
>
> B) check, for each file in the backup, whether the checksum matches.
>
> Interestingly, in my case this already revealed around 500 files
> which did not match.
> [...]

Arnd, can you send me the script? I'd like to try too.

I just read that strigi is doing exactly what we want, comparing files
with sha1. Maybe sha1 is faster than md5?

Strigi creates a sha1 of every file and stores it in its DB. Then it
checks for file date changes and, if yes, runs sha1 to see if the file
really has changed before grepping it thoroughly.

Gerhard
--
><((((º> ¸.·´¯`·... ><((((º> ¸.·´¯`·...¸ ><((((º>
http://www.gerhard.fr
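The strigi strategy Gerhard describes (cheap file-date check first,
the expensive rehash only when the date changed) could be sketched
roughly like this, assuming Python and an in-memory dict standing in
for strigi's database:

    import hashlib
    import os

    def sha1_of_file(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def files_to_reindex(paths, known):
        """Yield files whose content really changed.

        `known` maps path -> (mtime, sha1) from the previous run; the
        cheap mtime comparison gates the expensive sha1 recomputation.
        """
        for path in paths:
            mtime = os.path.getmtime(path)
            old = known.get(path)
            if old is not None and old[0] == mtime:
                continue                  # date unchanged: assume content unchanged
            digest = sha1_of_file(path)
            if old is None or old[1] != digest:
                yield path                # new file, or content really changed
            known[path] = (mtime, digest)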
On Mon, 21 Jan 2008, Gerhard Kulzer wrote:
[...]
> Arnd, can you send me the script? I'd like to try too.

Done (off-list, it is really not meant for general consumption ... ;-)

> I just read that strigi is doing exactly what we want, comparing
> files with sha1. Maybe sha1 is faster than md5?

No idea. Maybe we should do a speed test at some point ;-)

> Strigi creates a sha1 of every file and stores it in its DB. Then it
> checks for file date changes and, if yes, runs sha1 to see if the
> file really has changed before grepping it thoroughly.

Looking at
  http://strigi.sourceforge.net/?q=features
it does not seem to support images?

I don't yet fully see how strigi will finally fit into "the" solution;
this is definitely something to look at in more detail! Thanks for the
pointer.

Best, Arnd
On Mon, 21 Jan 2008, Gerry Patterson wrote:
> On Jan 21, 2008 1:44 PM, Arnd Baecker <[hidden email]> wrote:
> [...]
>
> Hello Arnd,
>
> What options are you passing to rsync? If you give it the '-c'
> option, rsync will skip based on a checksum instead of mod-time and
> size. This would at least make your backup consistent with your
> master.

Yes, I should have used that. I did not do so because I feared that
this would take much longer, but I never verified this belief ...

> However, it would not avoid the original-corrupted-then-backed-up
> issue you brought up earlier.

It seems to be something which happens more often than one thinks.
At least Gerhard told me that he has this problem frequently ...

> As I think about this, it sounds like implementing an SCM. Basically,
> you want to know whether a file has changed on disk, with or, in
> your case, without intention. In theory, when you have a new file
> you would 'check it in' to the picture repository. If you make
> changes, you 'check in' the new version of the file. In your case a
> "check-in" would be to create a check-sum of the file.

Yes, this sounds like what we will need!

> This leads me to thinking about the "Versioned image" request that
> is already in digikam. Perhaps a single solution would handle both
> cases?

It depends a lot on how the versioning of images will be realized.
But this should definitely be kept in mind!

Thanks a lot for your comments!

Best, Arnd
On Tuesday 22 January 2008, Arnd Baecker wrote:
> On Mon, 21 Jan 2008, Gerhard Kulzer wrote:
> [...]
> > I just read that strigi is doing exactly what we want, comparing
> > files with sha1. Maybe sha1 is faster than md5?
>
> No idea. Maybe we should do a speed test at some point ;-)

Both sha1 and md5 are designed to make it difficult to create a file
with a specific checksum. This is necessary for applications like
digital signatures, but it usually comes at a significant performance
(and complexity) premium.

CRCs, on the other hand, were meant to catch what you're trying to
catch, and will usually be a lot faster.

A CRC64 should be more than sufficient to catch any of the mismatches
you're looking for (a CRC32, such as reported by the cksum command,
would probably be good enough for most purposes as well). And it will
definitely be much, much faster than the cryptographically secure
hashes.

--
Jakob Østergaard Hegelund
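The speed test mentioned above is easy to run; a rough sketch assuming
Python's hashlib and zlib (which algorithm wins, and by how much,
depends on the implementation and on whether the file is already in
the page cache):

    import hashlib
    import sys
    import time
    import zlib

    def crc32_of_file(path):
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                crc = zlib.crc32(chunk, crc)
        return crc & 0xFFFFFFFF

    def hash_of_file(path, algorithm):
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def benchmark(path):
        for label, fn in [("crc32", crc32_of_file),
                          ("md5", lambda p: hash_of_file(p, "md5")),
                          ("sha1", lambda p: hash_of_file(p, "sha1"))]:
            start = time.time()
            fn(path)
            print("%-6s %.3f s" % (label, time.time() - start))

    if __name__ == "__main__":
        benchmark(sys.argv[1])

On a first run the timings are usually dominated by disk I/O;
repeating the run (so the file is served from the page cache) gives a
better picture of the pure checksum cost.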
On Wednesday 23 January 2008, Jakob Østergaard Hegelund wrote:
> [...]
> CRCs, on the other hand, were meant to catch what you're trying to
> catch, and will usually be a lot faster.
>
> A CRC64 should be more than sufficient to catch any of the mismatches
> you're looking for (a CRC32, such as reported by the cksum command,
> would probably be good enough for most purposes as well). And it
> will definitely be much, much faster than the cryptographically
> secure hashes.

I just came across an article by Martin Petersen from Oracle
(http://linux.sys-con.com/read/480659_1.htm, 3 pages). They implement
an end-to-end data protection mechanism using checksum metadata.

Citation: "This CRC is quite expensive to calculate compared to other
commonly used checksums. To alleviate the impact on system performance
the TCP/IP checksum algorithm is used instead. This results in an
almost negligible impact on system performance."

To me that sounds logical. I think we can stop searching here; it's
just a matter of finding out how best to implement the TCP checksum.

Gerhard
--
><((((º> ¸.·´¯`·... ><((((º> ¸.·´¯`·...¸ ><((((º>
http://www.gerhard.fr
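For reference, the checksum used by TCP/IP is the 16-bit
one's-complement Internet checksum (RFC 1071); a minimal sketch,
assuming Python 3:

    import sys

    def internet_checksum(data):
        """RFC 1071: one's-complement sum of 16-bit words, complemented."""
        if len(data) % 2:
            data += b"\x00"               # pad odd-length input with a zero byte
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
        return (~total) & 0xFFFF

    if __name__ == "__main__":
        # reads the whole file into memory; fine for a quick experiment
        with open(sys.argv[1], "rb") as f:
            print("%04x" % internet_checksum(f.read()))

Being only 16 bits wide, it detects fewer corruption patterns than a
CRC32 or CRC64; that caveat is exactly what the follow-up message
below comes back to.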
On Wednesday 23 January 2008, Jakob Østergaard Hegelund wrote:
> [...]
> A CRC64 should be more than sufficient to catch any of the mismatches
> you're looking for (a CRC32, such as reported by the cksum command,
> would probably be good enough for most purposes as well). And it
> will definitely be much, much faster than the cryptographically
> secure hashes.

I have to halfway backtrack:

Q: The TCP/IP checksum algorithm is notoriously bad at detecting
   single-bit errors. Why didn't you pick a stronger algorithm?

A: Other options were contemplated, including Fletcher and XOR. The IP
   checksum was chosen because it was already implemented. Also, the
   purpose of the checksum isn't necessarily to detect bit errors.
   Server-class systems feature error checking and correcting memory
   and buses. The main intent of the checksum is to allow verification
   that the data buffer matches the integrity metadata. And the IP
   checksum handles that fine.

Gerhard
On Tuesday 22 January 2008, Arnd Baecker wrote:
> [...]
> I don't yet fully see how strigi will finally fit into "the"
> solution; this is definitely something to look at in more detail!
> Thanks for the pointer.

Hi Arnd,
I try to summarize what we said last night on IRC, just as a public
memo.

Aim is to
a) prevent corrupt images from being saved onto disk, and to
b) detect existing corrupt files on disk
   (to prevent overwriting of potentially good backups).

Strategies like DIF and HARD are not available in the consumer market
for another couple of years, but given the increase in size, speed and
complexity of systems, consumer systems will implement some kind of
ECC (horizon ~ 3 y).

Protection on the file system level, as provided by zfs and btrfs, is
good but insufficient, as it protects the disk only and not the
transmission chain appl - OS - I/O controller - fs.

So we have to do it 'by hand' (meaning digikam).

While saving a file after modification, for a):
1. keep it in memory
2. save it to disk
3. flush disk to clear cache
   (3a. make sure all disk-internal buffers are cleared by reading
   other data the size of the disk buffer) = optional
5. run CRC checksum on file on disk and file in memory
   5a. alternative: store checksum already in metadata and save it
   with the file.
6. if mismatch, re-write the file and repeat the procedure

For problem b):
7. if 5a was used, a simple scrubbing scan can be launched, manually
   or programmed at frequency X
   7a. try to open files and look for errors produced (but this method
   is not reliable; I have images that show only the upper part, are
   corrupt and produce no error message. However, the more severe
   errors can be found)
8. generate a user alert so that one can manually check between backup
   and original.

This method may seem tedious, but has the advantage of being
independent of OS and file system; it works on nfs as well.

Gerhard
--
><((((º> ¸.·´¯`·... ><((((º> ¸.·´¯`·...¸ ><((((º>
http://www.gerhard.fr
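A minimal sketch of the save-and-verify steps 1.-6. above, assuming
Python (zlib.crc32 stands in for "CRC checksum"; the optional step 3a
and the metadata variant 5a are left out):

    import os
    import zlib

    def crc32(data):
        return zlib.crc32(data) & 0xFFFFFFFF

    def save_and_verify(path, data, max_retries=3):
        """Write `data` to `path`, flush it to disk, re-read and compare."""
        expected = crc32(data)                   # checksum of the in-memory image
        for _ in range(max_retries):
            with open(path, "wb") as f:          # step 2: save to disk
                f.write(data)
                f.flush()
                os.fsync(f.fileno())             # step 3: push past the OS cache
            with open(path, "rb") as f:
                on_disk = crc32(f.read())        # step 5: checksum the file on disk
            if on_disk == expected:
                return True
            # step 6: mismatch, re-write and repeat
        return False

Without something like the optional step 3a, the re-read is usually
served from the OS page cache rather than from the platters, so this
mainly guards the application-to-filesystem path.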
On Wed, 23 Jan 2008, Gerhard Kulzer wrote:
[... previous discussion about checksum algorithms snipped ...]

> Hi Arnd,
> I try to summarize what we said last night on IRC, just as a public
> memo.
>
> Aim is to
> a) prevent corrupt images from being saved onto disk, and to
> b) detect existing corrupt files on disk
>    (to prevent overwriting of potentially good backups).
>
> Strategies like DIF and HARD are not available in the consumer market
> for another couple of years, but given the increase in size, speed
> and complexity of systems, consumer systems will implement some kind
> of ECC (horizon ~ 3 y).
>
> Protection on the file system level, as provided by zfs and btrfs, is
> good but insufficient, as it protects the disk only and not the
> transmission chain appl - OS - I/O controller - fs.
>
> So we have to do it 'by hand' (meaning digikam).

Yes, full agreement!

> While saving a file after modification, for a):
> 1. keep it in memory
> 2. save it to disk
> 3. flush disk to clear cache
>    (3a. make sure all disk-internal buffers are cleared by reading
>    other data the size of the disk buffer) = optional
> 5. run CRC checksum on file on disk and file in memory
>    5a. alternative: store checksum already in metadata and save it
>    with the file.

Does this work? I mean: you compute the checksum, based on the file
contents. Then you add the checksum to the file, but then the file
contents changes and thus its checksum. So there is no way to embed
the correct checksum of a file in the file itself.

> 6. if mismatch, re-write the file and repeat the procedure
>
> For problem b):
> 7. if 5a was used, a simple scrubbing scan can be launched, manually
>    or programmed at frequency X
>    7a. try to open files and look for errors produced (but this
>    method is not reliable; I have images that show only the upper
>    part, are corrupt and produce no error message. However, the more
>    severe errors can be found)
> 8. generate a user alert so that one can manually check between
>    backup and original.
>
> This method may seem tedious, but has the advantage of being
> independent of OS and file system; it works on nfs as well.

OK, the next thing is a proposal for the more technical side of how to
integrate all this into digikam:

A) For every new image/file getting under digikam's control:
   compute the checksum/hash and add
   (hash, date of the hash computation, modification time of the file
   on disk)
   to the database.

B) When editing images, use the above-described procedure to ensure
   that the file is correctly written to disk.
   a) before editing: verify the hash
   b) after editing: the corresponding (hash, date of hash, mod-time)
      are stored in the database.

C) What about files which get modified/added by external tools?
   i) when digikam is running:
      All such changes are detected by KDirWatch.
      ((Is this statement correct? E.g. even if the file date is not
      changed?))
      a) addition of a new file: see A)
      b) modification of a file already in the database:
         Here a warning should be given (but not much can be done,
         right?). Apart from this: see A)
   ii) when digikam is not running:
      a) addition of a new file: see A)
      b) modification of a file already in the database:
         If the file modification time is different from the one in
         the database, this *could* be detected. However, this might
         take some additional time during the initial scanning.
         ((not sure how much time ...))
         - if such a change is detected: see i)b) above
         - if such a change is not detected: possible problem.
           This can only be detected in a full check, see D).

D) New check tool for data integrity:
   Visual side:
   - will display: oldest non-checked file
   - maybe a visual overview of files not checked (in a given
     time-window) (could look similar to the timeline ... ;-)
   - reminder on startup of digikam to perform a check at regular
     intervals (user-specified).
   Actual check:
   - just loop over all images, recompute the hash value and update
     the date of the last check in the database.
   - a quick version could just check the modification times.
   This tool should be stoppable/restartable at any time, and run in
   the background, while one can do all the normal stuff with digikam.

E) Backup
   Here we have to ensure that no "good" copies in the backup get
   destroyed by corrupted images in the main repository.
   Using just rsync does not seem possible:
   a) rsync --checksum takes a long time once the number of files is
      large
   b) it does not know about the hashes stored inside digikam's
      database.
   This is of course a pity, because normally using unix tools is
   always the best option, instead of re-inventing the wheel.
   So we have to think about this point ...

   Note that this is related to
   - "Image backup with thumbs and metadata database for fast
     searching"
     http://bugs.kde.org/show_bug.cgi?id=133638
   - "backup on dvd (and maybe sync with dvd-ram?)"
     http://bugs.kde.org/show_bug.cgi?id=113715
   - "Sync Plugin: New Syncronisation Framework KIPI Plugin"
     http://bugs.kde.org/show_bug.cgi?id=143978
   and to some extent also to
   - "Wish: Offline manager for Digikam"
     http://bugs.kde.org/show_bug.cgi?id=114539
   - "Wish: easy transport of albums, including tags, comments, etc."
     http://bugs.kde.org/show_bug.cgi?id=103201

   For the moment I think we should postpone the details for this
   point until A) - C) are implemented and tested. External tools
   could then use the information in the database to test the right
   approach for E).

Comments are very much appreciated!
(And: should we turn this into a BKO wish?)

Best, Arnd
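A sketch of the "actual check" loop of D), assuming Python and a
hypothetical table filehashes(path, hash, mtime, last_check);
digikam's real schema would of course look different:

    import hashlib
    import os
    import sqlite3
    import time

    def md5_of_file(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def full_check(db_path, quick=False):
        """Recompute hashes for all known files, oldest-checked first."""
        con = sqlite3.connect(db_path)
        rows = con.execute("SELECT path, hash, mtime FROM filehashes "
                           "ORDER BY last_check").fetchall()
        for path, stored_hash, stored_mtime in rows:
            if quick:
                # quick version: only compare modification times
                ok = (os.path.getmtime(path) == stored_mtime)
            else:
                ok = (md5_of_file(path) == stored_hash)
            if not ok:
                print("POSSIBLE CORRUPTION OR UNTRACKED CHANGE:", path)
            con.execute("UPDATE filehashes SET last_check = ? WHERE path = ?",
                        (time.time(), path))
            con.commit()    # commit per file, so the scan can be stopped/resumed
        con.close()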