List: tortoisesvn-dev
Subject: Statsdlg - first patch: data gathering upgrade
From: "Andreas Nicolai" <Andreas.Nicolai () gmx ! net>
Date: 2007-10-07 15:37:36
Message-ID: op.tzt20ycjt8lo91 () helium

Hi there,

while I'm hacking away on the stats dialog, I created (and attached) a first patch that reworks the stats data gathering algorithm. The patch only affects the files StatGraphDlg.h and StatGraphDlg.cpp and is created against revision 10908.

Here's a brief review of the code changes:

1. week count:

old: The previous implementation took the first date and the last date in the array as the time span.

new: The new implementation searches for the min and max dates, aligns the earliest date with the beginning of its week, and stores that date in a new member variable m_minDate.

2. data gathering:

old: The previous implementation executed a lot of binary searches (using lower_bound) for _each_ commit. This caused the noticeable delay when opening the stats dialog for a large number of revisions (e.g. try "Show all" in the TSVN repository and open the stats dialog). Also, recurring weeks due to a later import of revision histories would be treated as new weeks, and thus the stats would not be correct.

new: The new implementation loops over all weeks in the intervals determined in GetWeekCount() and stores for each week/interval the number of commits and file changes per author. It also keeps track of the total commit count and total file change count. At the same time the commits for each author are stored in a mapping. Then a list of author names is created and sorted by commit count. For that purpose I wrote a binary predicate class MoreCommitsThan that compares authors based on their commit count.
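To illustrate the approach described so far, here is a minimal sketch. It is not the code from the attached patch: the Commit and Stats structs, the use of Unix timestamps, the Monday week start and the helper names AlignToWeekStart and GatherStats are all assumptions made for the example, and it buckets commits in a single pass instead of looping week by week. Only m_minDate (here minDate), MoreCommitsThan and the map layout are taken from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical commit record, used only for this sketch.
struct Commit {
    long long time;       // seconds since the Unix epoch (UTC), assumed
    std::string author;
    long filesChanged;
};

const long long kSecondsPerDay = 86400;
const long long kSecondsPerWeek = 7 * kSecondsPerDay;

// Round a timestamp down to Monday 00:00 UTC of its week.
// 1970-01-01 was a Thursday, hence the offset of 3 days.
long long AlignToWeekStart(long long t) {
    assert(t >= 0);  // the sketch only handles dates after the epoch
    long long days = t / kSecondsPerDay;
    long long weekday = (days + 3) % 7;  // Monday == 0
    return (days - weekday) * kSecondsPerDay;
}

// Binary predicate: orders author names by descending commit count.
class MoreCommitsThan {
public:
    explicit MoreCommitsThan(const std::map<std::string, long>& commits)
        : m_commits(commits) {}
    bool operator()(const std::string& lhs, const std::string& rhs) const {
        return m_commits.find(lhs)->second > m_commits.find(rhs)->second;
    }
private:
    const std::map<std::string, long>& m_commits;
};

struct Stats {
    long long minDate = 0;  // start of the week containing the first commit
    std::map<int, std::map<std::string, long>> commitsPerAuthorAndWeek;
    std::map<int, std::map<std::string, long>> filesPerAuthorAndWeek;
    std::map<std::string, long> commitsPerAuthor;
    std::vector<std::string> authors;  // sorted, most commits first
    long totalCommits = 0;
    long totalFileChanges = 0;
};

Stats GatherStats(const std::vector<Commit>& commits) {
    Stats s;
    if (commits.empty()) return s;

    // Search for the earliest date instead of trusting the array order.
    long long minTime = commits.front().time;
    for (const Commit& c : commits) minTime = std::min(minTime, c.time);
    s.minDate = AlignToWeekStart(minTime);

    // One pass: the week index follows directly from the aligned start
    // date, so no per-commit binary search is needed.
    for (const Commit& c : commits) {
        int week = static_cast<int>((c.time - s.minDate) / kSecondsPerWeek);
        ++s.commitsPerAuthorAndWeek[week][c.author];
        s.filesPerAuthorAndWeek[week][c.author] += c.filesChanged;
        ++s.commitsPerAuthor[c.author];
        ++s.totalCommits;
        s.totalFileChanges += c.filesChanged;
    }

    // Build the author list once and sort it by commit count.
    for (const auto& entry : s.commitsPerAuthor) s.authors.push_back(entry.first);
    std::sort(s.authors.begin(), s.authors.end(),
              MoreCommitsThan(s.commitsPerAuthor));
    return s;
}
```

Because the week index is computed from the date, commits that arrive out of order (for example from a later import of older history) land in the week they belong to rather than opening a new one.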
As a result, all the sorting during the data gathering is no longer necessary and the time-expensive CountCommits() function can be removed altogether. Further, the required stats are obtained for the min/max author (first and last in the sorted list) and the dialog's statistics can be shown.

I documented the new code in fair detail, so it shouldn't be too hard to follow (I hope).

Just one thing I noted... Because of the aligning to the beginning/end of the week, a revision interval that starts in the middle of a week and ends in the middle of a week may be reported as one week longer than the time span actually is. However, if I don't align the interval with the start of the week, the weekly interval may actually start on a Wednesday and last until next week's Tuesday. For a different revision range (maybe including the previous 200 revs) the interval may run from Friday to next week's Thursday. This, however, results in different min/max commit and file change counts. So I guess I can't get around the aligning part, and for the improved data gathering algorithm I need the m_minDate.

Design questions:

1. The data structures created in the ShowStats() dialog need to be used in the other statistics functions as well. Re-gathering the data would be a waste of time, so I would propose making these variables member variables of the dialog that get populated when the dialog is first shown. All other statistics views can then use the information and obtain/calculate other specific data. Would it make sense to have these mappings and lists as member variables?

2. The maps for the commit and file change data are currently of type:

    map<int, map<stdstring, LONG> >

so that data can be accessed by:

    LONG commits = commitsPerAuthorAndWeek[week_nr][author_name];

However, the memory needed for storing the data could be reduced if the authors were identified by a number instead of a string, with the name/number connection made via yet another mapping.
So, the statement above would look like:

    LONG commits = commitsPerAuthorAndWeek[week_nr][authorNumber[author_name]];

Since the memory footprint of the statistics dialog is rather low compared to the log dialog, I would probably postpone this upgrade until later. Also, it would hurt the readability of the code, so I'd prefer the way the data is stored now. What are your thoughts on this?

Bye,
Andreas

-- 
Andreas Nicolai
anicolai@syr.edu
PhD Candidate, M.A.M.E
(315) 443-2641
Syracuse University
151 Link Hall
Syracuse, NY, 13244

["statsdlg_improvement_1.zip" (statsdlg_improvement_1.zip)]