[prev in list] [next in list] [prev in thread] [next in thread]
List: sleuthkit-users
Subject: Re: [sleuthkit-users] Slow Add Image Process Cause
From: Luís_Filipe_Nassif <lfcnassif () gmail ! com>
Date: 2014-05-15 18:07:31
Message-ID: CACknrh2RusFbiYzS2ug2=r+0HHfdhx0PNppEqtegY+LuVwuBTw () mail ! gmail ! com
[Download RAW message or body]
[Attachment #2 (multipart/alternative)]
Hi Brian,
Pushed latest version of develop. Tested with the same 3 ntfs images and 1
hfs w/ 3million entries. No misses found and loadDd finished very fast.
Thanks,
Luis Nassif
2014-05-14 16:20 GMT-03:00 Brian Carrier <carrier@sleuthkit.org>:
> Hi Luis,
>
> Pull the latest version of develop. I'm not seeing any cache misses on
> any images and the regression tests are all good.
>
> For reference on what has gone on in this thread:
> - NTFS has a sequence value that it increments each time that an MFT entry
> is re-used.
> - TSK (and other tools) used to ignore this value, but that changed about
> a year ago when a law enforcement person showed me some data where half of
> the tools showed an important file in one folder and another half showed it
> in a different folder. The difference was because some tools were ignoring
> the sequence and putting the file in a less than accurate place.
> - The tsk_loaddb tool (which is what Autopsy basically uses to make the
> SQLite Database) has a caching mechanism to find information about
> previously analyzed files so that it doesn't have to query the database
> each time, which is slow.
> - This exercise brought to light two bugs. One was that the wrong ID was
> being used in the cache lookup and it never hit. The other is even after
> that was fixed, some deleted files were still not being found in the cache
> because the wrong sequence value was being used.
> - In both cases, we weren't getting wrong results. We were simply being
> inefficient and that is why the ingest took longer.
>
> thanks,
> brian
>
>
> On May 13, 2014, at 6:08 PM, Brian Carrier <carrier@sleuthkit.org> wrote:
>
> > Hi Luis,
> >
> > Yea, A couple of my regression tests found some strange results with
> that fix. I pushed a new one up that is less strict on the filtering of
> the deleted files. There is still one issue I need to work out though.
> >
> > thanks,
> > brian
> >
> >
> >
> > On May 13, 2014, at 5:37 PM, Luís Filipe Nassif <lfcnassif@gmail.com>
> wrote:
> >
> >> Forgot to mention, now loaddb is adding a different number of entries
> compared to previous version. Is it ok?
> >>
> >> Thank you
> >> Nassif
> >>
> >> Em 13/05/2014 18:19, "Luís Filipe Nassif" <lfcnassif@gmail.com>
> escreveu:
> >> Excelent, Brian! All misses were resolved. Tested in the 3 ntfs images
> above and 1 hfs with 3 million entries.
> >>
> >> Also, images taking hours or days to finish now takes minutes with both
> patches!
> >>
> >> Thank you very much for addressing this issue.
> >> Nassif
> >>
> >> Nope, this one from earlier today:
> >>
> >>
> https://github.com/sleuthkit/sleuthkit/commit/f0672805c18a634ffffeff1f39f793501ddb7702
> >>
> >>
> >> On May 12, 2014, at 3:23 PM, Luís Filipe Nassif <lfcnassif@gmail.com>
> wrote:
> >>
> >>> Hi Brian,
> >>>
> >>> Do you mean this fix?
> https://github.com/sleuthkit/sleuthkit/commit/7b257b6c8252f9e9a7202990710e3a0ef31bf6b7
> >>>
> >>> It resolved a lot of the misses as i described before (05/02). But i
> am still getting thousands of misses. So I took a look at 2 generated
> sqlites and discovered that the remaining misses were from deleted folders.
> I think that deleted folders are not being stored into local cache...
> >>>
> >>>
> >>>
> >>>
> >>> 2014-05-12 13:55 GMT-03:00 Brian Carrier <carrier@sleuthkit.org>:
> >>> Hi Luis,
> >>>
> >>> The develop branch on github has a fix that removed the remaining
> misses on my test image, but I had far fewer than you did. Can you pull
> it and try that one?
> >>>
> >>> thanks,
> >>> brian
> >>>
> >>>
> >>>
> >>>
> >>> On May 8, 2014, at 9:57 AM, Luís Filipe Nassif <lfcnassif@gmail.com>
> wrote:
> >>>
> >>>> Hi Brian and Simson,
> >>>>
> >>>> I have done a number of tests since yesterday. I have not restarted
> the computer between the tests, because i think it would be better to use
> the OS IO cache to focus on CPU processing time, without io interference,
> except for the first run. I used another computer, faster than before.
> Results below:
> >>>>
> >>>> ntfs image w/ 127.408 files 1 2 3 4
> >>>> no patch (only Brian's fix) 3min 27s 3min 3s 3min 3s 3min 2s
> >>>> disabling database parent_id look up 35s 11s 11s 12s
> >>>> index on meta_addr,fs_obj_id and commit for each 5000 files 4min
> 11s 3min 48s 3min 48s 3min 47s
> >>>>
> >>>>
> >>>> ntfs image w/ 216.693 files 1 2 3 4
> >>>> no patch (only Brian's fix) 5min 53s 4min 59s 4min
> 58s 5min 2s
> >>>> disabling database parent_id look up 2min 8s 1min 21s 1min
> 21s 1min 21s
> >>>> index on meta_addr,fs_obj_id and commit for each 5000 files 6min
> 38s 5min 46s 5min 43s 5min 43s
> >>>>
> >>>>
> >>>> ntfs image w/ 433.321 files 1 2 3 4
> >>>> no patch (only Brian's fix) 21min 38s 21min 40s 21min
> 10s 21min 10s
> >>>> disabling database parent_id look up 3min 59s 2min 47s
> 2min 47s 2min 47s
> >>>> index on meta_addr,fs_obj_id and commit for each 5000 files (not
> run based on previous results)
> >>>>
> >>>> So, Brian was right, the commits increased processing time. And as
> you can see, it would be great if it is possible to eliminate the misses
> and remove the database parent_id look up.
> >>>>
> >>>> With that in mind, I took a look at one of the sqlites, and I think I
> discovered the cause of a lot of the misses (maybe all of them). The misses
> happened with deleted files inside deleted folders. I think deleted folders
> can not being stored into local cache.
> >>>>
> >>>> Regards,
> >>>> Nassif
> >>>>
> >>>>
> >>>> 2014-05-07 10:53 GMT-03:00 Brian Carrier <carrier@sleuthkit.org>:
> >>>> Hi Nassif,
> >>>>
> >>>> As Simson mentioned, the current setup was intended to be the
> fastest. Doing frequent commits takes longer and more indexes makes
> commits take longer. This is the only process that we know about that does
> this type of lookup and would use those indexes.
> >>>>
> >>>> The bigger question for me is why we are getting these cache misses
> and I need to spend some more time with some more images to find out. The
> lookup is to find the ID of the parent and we process from the root
> directory down. So, in theory, we have already processed the parent folder
> before the children and it should be in the cache. We need to figure out
> why the parent isn't in the cache...
> >>>>
> >>>> brian
> >>>>
> >>>>
> >>>>
> >>>> On May 7, 2014, at 8:24 AM, Luís Filipe Nassif <lfcnassif@gmail.com>
> wrote:
> >>>>
> >>>>> I have done one last test, because it was very strange to me that
> indexing meta_addr and fs_obj_id had not improved the parent_id lookup. We
> suspected that the indexes were not being used by sqlite, maybe because the
> whole data is not commited before add image process finishes (i am not a
> sqlite expert, is it possible?). So we inserted a commit for each 5.000
> files added to database. The add image process time decreased from 1hour to
> 30min, so we think that the indexes were not being used.
> >>>>>
> >>>>> Why add image process do not commit the data while it is being added
> to database?
> >>>>>
> >>>>> Nassif
> >>>>>
> >>>>>
> >>>>> 2014-05-02 13:37 GMT-03:00 Luís Filipe Nassif <lfcnassif@gmail.com>:
> >>>>> Fixing my last email, the test was run with the indexes AND Brian's
> fix. Then I removed the index patch and loadDb took the same 1 hour to
> finish with only Brian's fix. So the index patch did not help improving
> database look up for parent_id.
> >>>>>
> >>>>> Sorry for mistake,
> >>>>> Nassif
> >>>>>
> >>>>>
> >>>>> 2014-05-02 10:54 GMT-03:00 Luís Filipe Nassif <lfcnassif@gmail.com>:
> >>>>>
> >>>>> I tested loadDb with a create index on meta_addr and fs_obj_id
> patch. The image with 433.321 files, previously taking 2h45min to load, now
> takes 1h to finish loadDb with the indexes. That is a good speed up, but
> completely disabling the database parent_id look up, it only takes 7min to
> finish. Is there another thing we can do to improve the parent_id database
> look up?
> >>>>>
> >>>>> Regards,
> >>>>> Nassif
> >>>>>
> >>>>>
> >>>>> 2014-05-02 9:35 GMT-03:00 Luís Filipe Nassif <lfcnassif@gmail.com>:
> >>>>>
> >>>>> Ok, tested in 2 images. Fix resolved a lot of misses:
> >>>>>
> >>>>> ntfs image w/ 127.408 files: from 19.558 to 6.511 misses
> >>>>> ntfs image w/ 433.321 files: from 182.256 to 19.908 misses
> >>>>>
> >>>>> I also think creating an index on tsk_files(meta_addr) and
> tsk_files(fs_obj_id) could help improving the database look up for those
> deleted files not found in local cache, what do you think? The database
> look up seems too slow, as described in my first email.
> >>>>>
> >>>>> Thank you for taking a look so quickly.
> >>>>> Nassif
> >>>>>
> >>>>>
> >>>>> 2014-05-01 23:47 GMT-03:00 Brian Carrier <carrier@sleuthkit.org>:
> >>>>>
> >>>>> Well that was an easy and embarrassing fix:
> >>>>>
> >>>>> if (TSK_FS_TYPE_ISNTFS(fs_file->fs_info->ftype)) {
> >>>>> - seq = fs_file->name->meta_seq;
> >>>>> + seq = fs_file->name->par_seq;
> >>>>> }
> >>>>>
> >>>>> Turns out we've been having a lot of cache misses because of this
> stupid bug. Can you replace that line and see if it helps. It certainly
> did on my test image.
> >>>>>
> >>>>> thanks,
> >>>>> brian
> >>>>>
> >>>>>
> >>>>> On May 1, 2014, at 10:24 PM, Brian Carrier <carrier@sleuthkit.org>
> wrote:
> >>>>>
> >>>>>> Thanks for the tests. I wonder if it has to do with an incorrect
> sequence number. NTFS increments the sequence number each time a file is
> re-allocated. Deleted orphan files could be getting misses. I'll add some
> logging on my system and see what kind of misses I get.
> >>>>>>
> >>>>>> brian
> >>>>>>
> >>>>>> On May 1, 2014, at 8:39 PM, Luís Filipe Nassif <lfcnassif@gmail.com>
> wrote:
> >>>>>>
> >>>>>>> Ok, tests 1 and 3 done. I do not have sleuthkit code inside an
> ide, so did not use breakpoints. Instead, I changed
> TskDbSqlite::findParObjId() to return the parent_meta_addr when it is not
> found and return 1 when it is found in the cache map.
> >>>>>>>
> >>>>>>> Performing queries on the generated sqlite, there were 19.558
> cache misses from an image with 3 ntfs partitions and 127.408 files. I
> confirmed that many parent_meta_addr missed from cache (now stored in
> tsk_objects.par_obj_id) are into tsk_files.meta_addr. The complete paths
> corresponding to these meta_addr are parents of those files whose
> processing have not found them in cache.
> >>>>>>>
> >>>>>>> Other tests resulted in:
> >>>>>>> 182.256 cache misses from 433.321 files (ntfs)
> >>>>>>> 892.359 misses from 1.811.393 files (ntfs)
> >>>>>>> 169.819 misses from 3.177.917 files (hfs)
> >>>>>>>
> >>>>>>> Luis Nassif
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> 2014-05-01 16:14 GMT-03:00 Luís Filipe Nassif <lfcnassif@gmail.com
> >:
> >>>>>>> Forgot to mention: we are using sleuthkit 4.1.3
> >>>>>>>
> >>>>>>> Em 01/05/2014 16:09, "Luís Filipe Nassif" <lfcnassif@gmail.com>
> escreveu:
> >>>>>>>
> >>>>>>> Hi Brian,
> >>>>>>>
> >>>>>>> The 3 cases above were ntfs. I also tested with hfs and canceled
> loaddb after 1 day. The modified version finished after 8hours and added
> about 3 million entries. We will try to do the tests you have suggested.
> >>>>>>>
> >>>>>>> Em 01/05/2014 15:48, "Brian Carrier" <carrier@sleuthkit.org>
> escreveu:
> >>>>>>> Hi Luis,
> >>>>>>>
> >>>>>>> What kind of file system was it? I fixed a bug a little while ago
> in that code for HFS file systems that resulted in a lot of cache misses.
> >>>>>>>
> >>>>>>> In theory, everything should be cached. It sounds like a bug if
> you are getting so many misses. The basic idea of this code is that
> everything in the DB gets assigned a unique object ID and we make
> associations between files and their parent folder's unique ID.
> >>>>>>>
> >>>>>>> Since you seem to be comfortable with a debugger in the code, can
> you set a breakpoint for when the miss happens and:
> >>>>>>> 1) Determine the path of the file that was being added to the DB
> and the parent address that was trying to be found.
> >>>>>>> 2) Use the 'ffind' TSK tool to then map that parent address to a
> path. Is it a subset of the path from #1?
> >>>>>>> 3) Open the DB in a SQLite tool and do something like this:
> >>>>>>>
> >>>>>>> SELECT * from tsk_files where meta_addr == META_ADDR_FROM_ABOVE
> >>>>>>>
> >>>>>>> Is it in the DB?
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> brian
> >>>>>>>
> >>>>>>>
> >>>>>>> On May 1, 2014, at 11:58 AM, Luís Filipe Nassif <
> lfcnassif@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> We have investigated a bit why the add image process is too slow
> in some cases. The add image process time seems to be quadratic with the
> number of files in the image.
> >>>>>>>>
> >>>>>>>> We detected that the function TskDbSqlite::findParObjId(), in
> db_sqlite.cpp, is not finding the parent_meta_addr -> parent_file_id
> mapping in the local cache for a lot of files, causing it to search for the
> mapping in the database (not sure if it is an non-indexed search?)
> >>>>>>>>
> >>>>>>>> For testing purposes, we added a "return 1;" line right after the
> cache look up, disabling the database look up, and this resulted in great
> speed ups:
> >>>>>>>>
> >>>>>>>> number of files / default load_db time / patched load_db time
> >>>>>>>> ~80.000 / 20min / 2min
> >>>>>>>> ~300.000 / 3h / 7min
> >>>>>>>> ~700.000 / 48h / 27min
> >>>>>>>>
> >>>>>>>> We wonder if it is possible to store all par_meta_addr -> par_id
> mappings into local cache (better) or doing an improved (indexed?) search
> for the mapping in the database. We think that someone with more knowledge
> of load_db code could help a lot here.
> >>>>>>>>
> ------------------------------------------------------------------------------
> >>>>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For
> FREE
> >>>>>>>> Instantly run your Selenium tests across 300+ browser/OS combos.
> Get
> >>>>>>>> unparalleled scalability from the best Selenium testing platform
> available.
> >>>>>>>> Simple to use. Nothing to install. Get started now for free."
> >>>>>>>>
> http://p.sf.net/sfu/SauceLabs_______________________________________________
> >>>>>>>> sleuthkit-users mailing list
> >>>>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>>>>>> http://www.sleuthkit.org
> >>>>>>>
> >>>>>>>
> >>>>>>>
> ------------------------------------------------------------------------------
> >>>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For
> FREE
> >>>>>>> Instantly run your Selenium tests across 300+ browser/OS combos.
> Get
> >>>>>>> unparalleled scalability from the best Selenium testing platform
> available.
> >>>>>>> Simple to use. Nothing to install. Get started now for free."
> >>>>>>>
> http://p.sf.net/sfu/SauceLabs_______________________________________________
> >>>>>>> sleuthkit-users mailing list
> >>>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>>>>> http://www.sleuthkit.org
> >>>>>>
> >>>>>>
> >>>>>>
> ------------------------------------------------------------------------------
> >>>>>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For
> FREE
> >>>>>> Instantly run your Selenium tests across 300+ browser/OS combos.
> Get
> >>>>>> unparalleled scalability from the best Selenium testing platform
> available.
> >>>>>> Simple to use. Nothing to install. Get started now for free."
> >>>>>> http://p.sf.net/sfu/SauceLabs
> >>>>>> _______________________________________________
> >>>>>> sleuthkit-users mailing list
> >>>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>>>> http://www.sleuthkit.org
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> ------------------------------------------------------------------------------
> >>>>> Is your legacy SCM system holding you back? Join Perforce May 7 to
> find out:
> >>>>> • 3 signs your SCM is hindering your productivity
> >>>>> • Requirements for releasing software faster
> >>>>> • Expert tips and advice for migrating your SCM now
> >>>>>
> http://p.sf.net/sfu/perforce_______________________________________________
> >>>>> sleuthkit-users mailing list
> >>>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>>> http://www.sleuthkit.org
> >>>>
> >>>>
> >>>>
> ------------------------------------------------------------------------------
> >>>> Is your legacy SCM system holding you back? Join Perforce May 7 to
> find out:
> >>>> • 3 signs your SCM is hindering your productivity
> >>>> • Requirements for releasing software faster
> >>>> • Expert tips and advice for migrating your SCM now
> >>>>
> http://p.sf.net/sfu/perforce_______________________________________________
> >>>> sleuthkit-users mailing list
> >>>> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >>>> http://www.sleuthkit.org
> >>>
> >>>
> >>
> >>
> ------------------------------------------------------------------------------
> >> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
> >> Instantly run your Selenium tests across 300+ browser/OS combos.
> >> Get unparalleled scalability from the best Selenium testing platform
> available
> >> Simple to use. Nothing to install. Get started now for free."
> >>
> http://p.sf.net/sfu/SauceLabs_______________________________________________
> >> sleuthkit-users mailing list
> >> https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> >> http://www.sleuthkit.org
> >
> >
> >
> ------------------------------------------------------------------------------
> > "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
> > Instantly run your Selenium tests across 300+ browser/OS combos.
> > Get unparalleled scalability from the best Selenium testing platform
> available
> > Simple to use. Nothing to install. Get started now for free."
> > http://p.sf.net/sfu/SauceLabs
> > _______________________________________________
> > sleuthkit-users mailing list
> > https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> > http://www.sleuthkit.org
>
>
[Attachment #5 (text/html)]
<div dir="ltr">Hi Brian,<div><br></div><div>Pushed latest version of develop. Tested \
with the same 3 ntfs images and 1 hfs w/ 3million entries. No misses found and loadDd \
finished very fast.</div><div><br></div><div>Thanks,</div> <div>Luis \
Nassif</div></div><div class="gmail_extra"><br><br><div \
class="gmail_quote">2014-05-14 16:20 GMT-03:00 Brian Carrier <span dir="ltr"><<a \
href="mailto:carrier@sleuthkit.org" \
target="_blank">carrier@sleuthkit.org</a>></span>:<br> <blockquote \
class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc \
solid;padding-left:1ex">Hi Luis,<br> <br>
Pull the latest version of develop. I'm not seeing any cache misses on any \
images and the regression tests are all good.<br> <br>
For reference on what has gone on in this thread:<br>
- NTFS has a sequence value that it increments each time that an MFT entry is \
re-used.<br>
- TSK (and other tools) used to ignore this value, but that changed about a year ago \
when a law enforcement person showed me some data where half of the tools showed an \
important file in one folder and another half showed it in a different folder. The \
difference was because some tools were ignoring the sequence and putting the file in \
a less than accurate place.<br>
- The tsk_loaddb tool (which is what Autopsy basically uses to make the SQLite \
Database) has a caching mechanism to find information about previously analyzed files \
so that it doesn't have to query the database each time, which is slow.<br>
- This exercise brought to light two bugs. One was that the wrong ID was being used \
in the cache lookup and it never hit. The other is even after that was fixed, some \
deleted files were still not being found in the cache because the wrong sequence \
value was being used.<br>
- In both cases, we weren't getting wrong results. We were simply being \
inefficient and that is why the ingest took longer.<br> <br>
thanks,<br>
brian<br>
<div class="HOEnZb"><div class="h5"><br>
<br>
On May 13, 2014, at 6:08 PM, Brian Carrier <<a \
href="mailto:carrier@sleuthkit.org">carrier@sleuthkit.org</a>> wrote:<br> <br>
> Hi Luis,<br>
><br>
> Yea, A couple of my regression tests found some strange results with that fix. \
I pushed a new one up that is less strict on the filtering of the deleted files. \
There is still one issue I need to work out though.<br>
><br>
> thanks,<br>
> brian<br>
><br>
><br>
><br>
> On May 13, 2014, at 5:37 PM, Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>> wrote:<br> ><br>
>> Forgot to mention, now loaddb is adding a different number of entries \
compared to previous version. Is it ok?<br> >><br>
>> Thank you<br>
>> Nassif<br>
>><br>
>> Em 13/05/2014 18:19, "Luís Filipe Nassif" <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>> escreveu:<br> >> \
Excelent, Brian! All misses were resolved. Tested in the 3 ntfs images above and 1 \
hfs with 3 million entries.<br> >><br>
>> Also, images taking hours or days to finish now takes minutes with both \
patches!<br> >><br>
>> Thank you very much for addressing this issue.<br>
>> Nassif<br>
>><br>
>> Nope, this one from earlier today:<br>
>><br>
>> <a href="https://github.com/sleuthkit/sleuthkit/commit/f0672805c18a634ffffeff1f39f793501ddb7702" \
target="_blank">https://github.com/sleuthkit/sleuthkit/commit/f0672805c18a634ffffeff1f39f793501ddb7702</a><br>
>><br>
>><br>
>> On May 12, 2014, at 3:23 PM, Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>> wrote:<br> >><br>
>>> Hi Brian,<br>
>>><br>
>>> Do you mean this fix? <a \
href="https://github.com/sleuthkit/sleuthkit/commit/7b257b6c8252f9e9a7202990710e3a0ef31bf6b7" \
target="_blank">https://github.com/sleuthkit/sleuthkit/commit/7b257b6c8252f9e9a7202990710e3a0ef31bf6b7</a><br>
>>><br>
>>> It resolved a lot of the misses as i described before (05/02). But i am \
still getting thousands of misses. So I took a look at 2 generated sqlites and \
discovered that the remaining misses were from deleted folders. I think that deleted \
folders are not being stored into local cache...<br>
>>><br>
>>><br>
>>><br>
>>><br>
>>> 2014-05-12 13:55 GMT-03:00 Brian Carrier <<a \
href="mailto:carrier@sleuthkit.org">carrier@sleuthkit.org</a>>:<br> >>> \
Hi Luis,<br> >>><br>
>>> The develop branch on github has a fix that removed the remaining misses \
on my test image, but I had far fewer than you did. Can you pull it and try that \
one?<br> >>><br>
>>> thanks,<br>
>>> brian<br>
>>><br>
>>><br>
>>><br>
>>><br>
>>> On May 8, 2014, at 9:57 AM, Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>> wrote:<br> \
>>><br> >>>> Hi Brian and Simson,<br>
>>>><br>
>>>> I have done a number of tests since yesterday. I have not restarted \
the computer between the tests, because i think it would be better to use the OS IO \
cache to focus on CPU processing time, without io interference, except for the first \
run. I used another computer, faster than before. Results below:<br>
>>>><br>
>>>> ntfs image w/ 127.408 files 1 2 3 \
4<br> >>>> no patch (only Brian's fix) 3min 27s 3min 3s \
3min 3s 3min 2s<br> >>>> disabling database parent_id look up 35s \
11s 11s 12s<br> >>>> index on meta_addr,fs_obj_id and commit \
for each 5000 files 4min 11s 3min 48s 3min 48s \
3min 47s<br> >>>><br>
>>>><br>
>>>> ntfs image w/ 216.693 files 1 2 3 \
4<br> >>>> no patch (only Brian's fix) 5min 53s 4min \
59s 4min 58s 5min 2s<br> >>>> disabling database \
parent_id look up 2min 8s 1min 21s 1min 21s 1min 21s<br> \
>>>> index on meta_addr,fs_obj_id and commit for each 5000 files 6min \
38s 5min 46s 5min 43s 5min 43s<br> \
>>>><br> >>>><br>
>>>> ntfs image w/ 433.321 files 1 2 3 \
4<br> >>>> no patch (only Brian's fix) 21min 38s 21min \
40s 21min 10s 21min 10s<br> >>>> disabling database \
parent_id look up 3min 59s 2min 47s 2min 47s 2min \
47s<br> >>>> index on meta_addr,fs_obj_id and commit for each 5000 files \
(not run based on previous results)<br> >>>><br>
>>>> So, Brian was right, the commits increased processing time. And as \
you can see, it would be great if it is possible to eliminate the misses and remove \
the database parent_id look up.<br> >>>><br>
>>>> With that in mind, I took a look at one of the sqlites, and I think \
I discovered the cause of a lot of the misses (maybe all of them). The misses \
happened with deleted files inside deleted folders. I think deleted folders can not \
being stored into local cache.<br>
>>>><br>
>>>> Regards,<br>
>>>> Nassif<br>
>>>><br>
>>>><br>
>>>> 2014-05-07 10:53 GMT-03:00 Brian Carrier <<a \
href="mailto:carrier@sleuthkit.org">carrier@sleuthkit.org</a>>:<br> \
>>>> Hi Nassif,<br> >>>><br>
>>>> As Simson mentioned, the current setup was intended to be the \
fastest. Doing frequent commits takes longer and more indexes makes commits take \
longer. This is the only process that we know about that does this type of lookup and \
would use those indexes.<br>
>>>><br>
>>>> The bigger question for me is why we are getting these cache misses \
and I need to spend some more time with some more images to find out. The lookup is \
to find the ID of the parent and we process from the root directory down. So, in \
theory, we have already processed the parent folder before the children and it should \
be in the cache. We need to figure out why the parent isn't in the cache...<br>
>>>><br>
>>>> brian<br>
>>>><br>
>>>><br>
>>>><br>
>>>> On May 7, 2014, at 8:24 AM, Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>> wrote:<br> \
>>>><br> >>>>> I have done one last test, because it was \
very strange to me that indexing meta_addr and fs_obj_id had not improved the \
parent_id lookup. We suspected that the indexes were not being used by sqlite, maybe \
because the whole data is not commited before add image process finishes (i am not a \
sqlite expert, is it possible?). So we inserted a commit for each 5.000 files added \
to database. The add image process time decreased from 1hour to 30min, so we think \
that the indexes were not being used.<br>
>>>>><br>
>>>>> Why add image process do not commit the data while it is being \
added to database?<br> >>>>><br>
>>>>> Nassif<br>
>>>>><br>
>>>>><br>
>>>>> 2014-05-02 13:37 GMT-03:00 Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>>:<br> \
>>>>> Fixing my last email, the test was run with the indexes AND \
Brian's fix. Then I removed the index patch and loadDb took the same 1 hour to \
finish with only Brian's fix. So the index patch did not help improving database \
look up for parent_id.<br>
>>>>><br>
>>>>> Sorry for mistake,<br>
>>>>> Nassif<br>
>>>>><br>
>>>>><br>
>>>>> 2014-05-02 10:54 GMT-03:00 Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>>:<br> \
>>>>><br> >>>>> I tested loadDb with a create index on \
meta_addr and fs_obj_id patch. The image with 433.321 files, previously taking \
2h45min to load, now takes 1h to finish loadDb with the indexes. That is a good speed \
up, but completely disabling the database parent_id look up, it only takes 7min to \
finish. Is there another thing we can do to improve the parent_id database look \
up?<br>
>>>>><br>
>>>>> Regards,<br>
>>>>> Nassif<br>
>>>>><br>
>>>>><br>
>>>>> 2014-05-02 9:35 GMT-03:00 Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>>:<br> \
>>>>><br> >>>>> Ok, tested in 2 images. Fix resolved a \
lot of misses:<br> >>>>><br>
>>>>> ntfs image w/ 127.408 files: from 19.558 to 6.511 misses<br>
>>>>> ntfs image w/ 433.321 files: from 182.256 to 19.908 misses<br>
>>>>><br>
>>>>> I also think creating an index on tsk_files(meta_addr) and \
tsk_files(fs_obj_id) could help improving the database look up for those deleted \
files not found in local cache, what do you think? The database look up seems too \
slow, as described in my first email.<br>
>>>>><br>
>>>>> Thank you for taking a look so quickly.<br>
>>>>> Nassif<br>
>>>>><br>
>>>>><br>
>>>>> 2014-05-01 23:47 GMT-03:00 Brian Carrier <<a \
href="mailto:carrier@sleuthkit.org">carrier@sleuthkit.org</a>>:<br> \
>>>>><br> >>>>> Well that was an easy and embarrassing \
fix:<br> >>>>><br>
>>>>> if (TSK_FS_TYPE_ISNTFS(fs_file->fs_info->ftype)) \
{<br> >>>>> - seq = fs_file->name->meta_seq;<br>
>>>>> + seq = fs_file->name->par_seq;<br>
>>>>> }<br>
>>>>><br>
>>>>> Turns out we've been having a lot of cache misses because \
of this stupid bug. Can you replace that line and see if it helps. It certainly did \
on my test image.<br> >>>>><br>
>>>>> thanks,<br>
>>>>> brian<br>
>>>>><br>
>>>>><br>
>>>>> On May 1, 2014, at 10:24 PM, Brian Carrier <<a \
href="mailto:carrier@sleuthkit.org">carrier@sleuthkit.org</a>> wrote:<br> \
>>>>><br> >>>>>> Thanks for the tests. I wonder if \
it has to do with an incorrect sequence number. NTFS increments the sequence number \
each time a file is re-allocated. Deleted orphan files could be getting misses. \
I'll add some logging on my system and see what kind of misses I get.<br>
>>>>>><br>
>>>>>> brian<br>
>>>>>><br>
>>>>>> On May 1, 2014, at 8:39 PM, Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>> wrote:<br> \
>>>>>><br> >>>>>>> Ok, tests 1 and 3 done. I \
do not have sleuthkit code inside an ide, so did not use breakpoints. Instead, I \
changed TskDbSqlite::findParObjId() to return the parent_meta_addr when it is not \
found and return 1 when it is found in the cache map.<br>
>>>>>>><br>
>>>>>>> Performing queries on the generated sqlite, there were \
19.558 cache misses from an image with 3 ntfs partitions and 127.408 files. I \
confirmed that many parent_meta_addr missed from cache (now stored in \
tsk_objects.par_obj_id) are into tsk_files.meta_addr. The complete paths \
corresponding to these meta_addr are parents of those files whose processing have not \
found them in cache.<br>
>>>>>>><br>
>>>>>>> Other tests resulted in:<br>
>>>>>>> 182.256 cache misses from 433.321 files (ntfs)<br>
>>>>>>> 892.359 misses from 1.811.393 files (ntfs)<br>
>>>>>>> 169.819 misses from 3.177.917 files (hfs)<br>
>>>>>>><br>
>>>>>>> Luis Nassif<br>
>>>>>>><br>
>>>>>>><br>
>>>>>>><br>
>>>>>>> 2014-05-01 16:14 GMT-03:00 Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>>:<br> \
>>>>>>> Forgot to mention: we are using sleuthkit 4.1.3<br> \
>>>>>>><br> >>>>>>> Em 01/05/2014 16:09, \
"Luís Filipe Nassif" <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>> escreveu:<br> \
>>>>>>><br> >>>>>>> Hi Brian,<br>
>>>>>>><br>
>>>>>>> The 3 cases above were ntfs. I also tested with hfs and \
canceled loaddb after 1 day. The modified version finished after 8hours and added \
about 3 million entries. We will try to do the tests you have suggested.<br>
>>>>>>><br>
>>>>>>> Em 01/05/2014 15:48, "Brian Carrier" <<a \
href="mailto:carrier@sleuthkit.org">carrier@sleuthkit.org</a>> escreveu:<br> \
>>>>>>> Hi Luis,<br> >>>>>>><br>
>>>>>>> What kind of file system was it? I fixed a bug a little \
while ago in that code for HFS file systems that resulted in a lot of cache \
misses.<br> >>>>>>><br>
>>>>>>> In theory, everything should be cached. It sounds like \
a bug if you are getting so many misses. The basic idea of this code is that \
everything in the DB gets assigned a unique object ID and we make associations \
between files and their parent folder's unique ID.<br>
>>>>>>><br>
>>>>>>> Since you seem to be comfortable with a debugger in the \
code, can you set a breakpoint for when the miss happens and:<br> \
>>>>>>> 1) Determine the path of the file that was being added \
to the DB and the parent address that was trying to be found.<br> \
>>>>>>> 2) Use the 'ffind' TSK tool to then map that \
parent address to a path. Is it a subset of the path from #1?<br> \
>>>>>>> 3) Open the DB in a SQLite tool and do something like \
this:<br> >>>>>>><br>
>>>>>>> SELECT * from tsk_files where meta_addr == \
META_ADDR_FROM_ABOVE<br> >>>>>>><br>
>>>>>>> Is it in the DB?<br>
>>>>>>><br>
>>>>>>> Thanks!<br>
>>>>>>><br>
>>>>>>> brian<br>
>>>>>>><br>
>>>>>>><br>
>>>>>>> On May 1, 2014, at 11:58 AM, Luís Filipe Nassif <<a \
href="mailto:lfcnassif@gmail.com">lfcnassif@gmail.com</a>> wrote:<br> \
>>>>>>><br> >>>>>>>> Hi,<br>
>>>>>>>><br>
>>>>>>>> We have investigated a bit why the add image process \
is too slow in some cases. The add image process time seems to be quadratic with the \
number of files in the image.<br> >>>>>>>><br>
>>>>>>>> We detected that the function \
TskDbSqlite::findParObjId(), in db_sqlite.cpp, is not finding the parent_meta_addr \
-> parent_file_id mapping in the local cache for a lot of files, causing it to \
search for the mapping in the database (not sure if it is an non-indexed search?)<br>
>>>>>>>><br>
>>>>>>>> For testing purposes, we added a "return \
1;" line right after the cache look up, disabling the database look up, and this \
resulted in great speed ups:<br> >>>>>>>><br>
>>>>>>>> number of files / default load_db time / patched \
load_db time<br> >>>>>>>> ~80.000 / 20min / 2min<br>
>>>>>>>> ~300.000 / 3h / 7min<br>
>>>>>>>> ~700.000 / 48h / 27min<br>
>>>>>>>><br>
>>>>>>>> We wonder if it is possible to store all \
par_meta_addr -> par_id mappings into local cache (better) or doing an improved \
(indexed?) search for the mapping in the database. We think that someone with more \
knowledge of load_db code could help a lot here.<br>
>>>>>>>> \
------------------------------------------------------------------------------<br> \
>>>>>>>> "Accelerate Dev Cycles with Automated \
Cross-Browser Testing - For FREE<br> >>>>>>>> Instantly run \
your Selenium tests across 300+ browser/OS combos. Get<br> \
>>>>>>>> unparalleled scalability from the best Selenium \
testing platform available.<br> >>>>>>>> Simple to use. \
Nothing to install. Get started now for free."<br> \
>>>>>>>> <a \
href="http://p.sf.net/sfu/SauceLabs_______________________________________________" \
target="_blank">http://p.sf.net/sfu/SauceLabs_______________________________________________</a><br>
>>>>>>>> sleuthkit-users mailing list<br>
>>>>>>>> <a \
href="https://lists.sourceforge.net/lists/listinfo/sleuthkit-users" \
target="_blank">https://lists.sourceforge.net/lists/listinfo/sleuthkit-users</a><br> \
>>>>>>>> <a href="http://www.sleuthkit.org" \
target="_blank">http://www.sleuthkit.org</a><br> >>>>>>><br>
>>>>>>><br>
>>>>>>> \
------------------------------------------------------------------------------<br> \
>>>>>>> "Accelerate Dev Cycles with Automated Cross-Browser \
Testing - For FREE<br> >>>>>>> Instantly run your Selenium tests \
across 300+ browser/OS combos. Get<br> >>>>>>> unparalleled \
scalability from the best Selenium testing platform available.<br> \
>>>>>>> Simple to use. Nothing to install. Get started now for \
free."<br> >>>>>>> <a \
href="http://p.sf.net/sfu/SauceLabs_______________________________________________" \
target="_blank">http://p.sf.net/sfu/SauceLabs_______________________________________________</a><br>
>>>>>>> sleuthkit-users mailing list<br>
>>>>>>> <a \
href="https://lists.sourceforge.net/lists/listinfo/sleuthkit-users" \
target="_blank">https://lists.sourceforge.net/lists/listinfo/sleuthkit-users</a><br> \
>>>>>>> <a href="http://www.sleuthkit.org" \
target="_blank">http://www.sleuthkit.org</a><br> >>>>>><br>
>>>>>><br>
>>>>>> \
------------------------------------------------------------------------------<br> \
>>>>>> "Accelerate Dev Cycles with Automated Cross-Browser \
Testing - For FREE<br> >>>>>> Instantly run your Selenium tests \
across 300+ browser/OS combos. Get<br> >>>>>> unparalleled \
scalability from the best Selenium testing platform available.<br> \
>>>>>> Simple to use. Nothing to install. Get started now for \
free."<br> >>>>>> <a href="http://p.sf.net/sfu/SauceLabs" \
target="_blank">http://p.sf.net/sfu/SauceLabs</a><br> >>>>>> \
_______________________________________________<br> >>>>>> \
sleuthkit-users mailing list<br> >>>>>> <a \
href="https://lists.sourceforge.net/lists/listinfo/sleuthkit-users" \
target="_blank">https://lists.sourceforge.net/lists/listinfo/sleuthkit-users</a><br> \
>>>>>> <a href="http://www.sleuthkit.org" \
target="_blank">http://www.sleuthkit.org</a><br> >>>>><br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>> \
------------------------------------------------------------------------------<br> \
>>>>> Is your legacy SCM system holding you back? Join Perforce May 7 \
to find out:<br> >>>>> &#149; 3 signs your SCM is hindering your \
productivity<br> >>>>> &#149; Requirements for releasing software \
faster<br> >>>>> &#149; Expert tips and advice for migrating your \
SCM now<br> >>>>> <a \
href="http://p.sf.net/sfu/perforce_______________________________________________" \
target="_blank">http://p.sf.net/sfu/perforce_______________________________________________</a><br>
>>>>> sleuthkit-users mailing list<br>
>>>>> <a \
href="https://lists.sourceforge.net/lists/listinfo/sleuthkit-users" \
target="_blank">https://lists.sourceforge.net/lists/listinfo/sleuthkit-users</a><br> \
>>>>> <a href="http://www.sleuthkit.org" \
target="_blank">http://www.sleuthkit.org</a><br> >>>><br>
>>>><br>
>>>> ------------------------------------------------------------------------------<br>
>>>> Is your legacy SCM system holding you back? Join Perforce May 7 to \
find out:<br> >>>> &#149; 3 signs your SCM is hindering your \
productivity<br> >>>> &#149; Requirements for releasing software \
faster<br> >>>> &#149; Expert tips and advice for migrating your SCM \
now<br> >>>> <a \
href="http://p.sf.net/sfu/perforce_______________________________________________" \
target="_blank">http://p.sf.net/sfu/perforce_______________________________________________</a><br>
>>>> sleuthkit-users mailing list<br>
>>>> <a href="https://lists.sourceforge.net/lists/listinfo/sleuthkit-users" \
target="_blank">https://lists.sourceforge.net/lists/listinfo/sleuthkit-users</a><br> \
>>>> <a href="http://www.sleuthkit.org" \
target="_blank">http://www.sleuthkit.org</a><br> >>><br>
>>><br>
>><br>
>> ------------------------------------------------------------------------------<br>
>> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For \
FREE<br> >> Instantly run your Selenium tests across 300+ browser/OS \
combos.<br> >> Get unparalleled scalability from the best Selenium testing \
platform available<br> >> Simple to use. Nothing to install. Get started now \
for free."<br> >> <a \
href="http://p.sf.net/sfu/SauceLabs_______________________________________________" \
target="_blank">http://p.sf.net/sfu/SauceLabs_______________________________________________</a><br>
>> sleuthkit-users mailing list<br>
>> <a href="https://lists.sourceforge.net/lists/listinfo/sleuthkit-users" \
target="_blank">https://lists.sourceforge.net/lists/listinfo/sleuthkit-users</a><br> \
>> <a href="http://www.sleuthkit.org" \
target="_blank">http://www.sleuthkit.org</a><br> ><br>
><br>
> ------------------------------------------------------------------------------<br>
> "Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE<br>
> Instantly run your Selenium tests across 300+ browser/OS combos.<br>
> Get unparalleled scalability from the best Selenium testing platform \
available<br> > Simple to use. Nothing to install. Get started now for \
free."<br> > <a href="http://p.sf.net/sfu/SauceLabs" \
target="_blank">http://p.sf.net/sfu/SauceLabs</a><br> > \
_______________________________________________<br> > sleuthkit-users mailing \
list<br> > <a href="https://lists.sourceforge.net/lists/listinfo/sleuthkit-users" \
target="_blank">https://lists.sourceforge.net/lists/listinfo/sleuthkit-users</a><br> \
> <a href="http://www.sleuthkit.org" \
target="_blank">http://www.sleuthkit.org</a><br> <br>
</div></div></blockquote></div><br></div>
------------------------------------------------------------------------------
"Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
Instantly run your Selenium tests across 300+ browser/OS combos.
Get unparalleled scalability from the best Selenium testing platform available
Simple to use. Nothing to install. Get started now for free."
http://p.sf.net/sfu/SauceLabs
_______________________________________________
sleuthkit-users mailing list
https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
http://www.sleuthkit.org
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic