deja vu: code, culture, and QA

Some years ago I had the privilege of making some suggestions for Brian Marick's book Everyday Scripting based on the first article I ever wrote for Better Software magazine. That article appeared in 2004, and I just recently ran into a similar situation at work.

Wikipedia is localized for well over 100 languages. I had only been working at Wikimedia Foundation a couple of weeks when I heard that discrepancies between the localized message files from version to version could cause problems when upgrading. I didn't know what kind of problems, but since we're upgrading all the Wikipedia wikis to version 1.19, that sounded like sort of a big deal, so I followed up.

It turns out that changes to the localization files are essentially undocumented, no tools exist to monitor such changes, and we simply did not know anything about discrepancies in those files. So I decided it would be useful to look into that.

You can find the Wikipedia localization files for version 1.19 here and for version 1.18 here if you want to follow along.

Since there are well over 100 files in each directory and each file has 1000s of lines, checking for discrepancies manually is impossible. From one of the senior people on the Wikimedia dev staff I got a few examples of certain places in these files where discrepancies would cause big problems. (See technical note at the end.) Although I've cleaned the code up quite a bit (one-off scripts don't have to be DRY, right?) here's what I did to cite discrepancies for one of the examples:

In a directory called 'mediawiki' I have one directory 'lang118' and another 'lang119'. In those directories are all of the Messages*.php files for each version. What I want to do is read each file in each version, identify the contents of the $namespaceNames array, and compare those contents for every file in each directory.

path119 = 'mediawiki/lang119/'
path118 = 'mediawiki/lang118/'

r119namespaceNames_array = []
r118namespaceNames_array = []

def get_values(path, array_name)
  Dir.foreach(path) do |name|
    # skip '.', '..' and any subdirectories; only files are compared
    unless File.directory?("#{path}#{name}")
      text = File.read("#{path}#{name}")
      # scan leaves its last match in $~
      text.scan(/namespaceNames.+?\)/m)
      array_name << name + $~.to_s
    end
  end
end

get_values(path119, r119namespaceNames_array)
get_values(path118, r118namespaceNames_array)

mismatch = r119namespaceNames_array - r118namespaceNames_array
disc = mismatch.length.to_s
puts "number of files with discrepancies in $namespaceNames array is #{disc}"

mismatch.each do |string|
  file = string.split(".php")
  puts file[0]
end

This script runs from the directory above 'mediawiki'. It defines the paths to where the localization files live, and defines two arrays to hold the values to be compared. For each directory it calls the 'get_values' method, which puts the name of each file together with the contents of that file's $namespaceNames array into the appropriate array. Subtracting the 1.18 array from the 1.19 array yields every 1.19 entry that has no exact match in 1.18, and with that the script knows how many files have mismatches, and what the names of those files are.
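One property of array subtraction is worth keeping in mind: it only works in one direction. The sketch below uses made-up file names and array contents (not real data from the localization files), built in the same "file name plus array text" shape the script produces, to show what each direction of the subtraction reports.

```ruby
# Hypothetical entries, shaped like the strings the script builds.
new_entries = [
  "MessagesAa.phpnamespaceNames = array( NS_TALK => 'Talk' )",
  "MessagesBb.phpnamespaceNames = array( NS_TALK => 'Chat' )",
  "MessagesCc.phpnamespaceNames = array( NS_TALK => 'Talk' )"
]
old_entries = [
  "MessagesAa.phpnamespaceNames = array( NS_TALK => 'Talk' )",
  "MessagesBb.phpnamespaceNames = array( NS_TALK => 'Talk' )",
  "MessagesDd.phpnamespaceNames = array( NS_TALK => 'Talk' )"
]

# Entries in the new version with no exact match in the old one:
# changed files plus files that exist only in the new version.
changed_or_added = new_entries - old_entries

# The reverse direction: changed files plus files that were removed.
changed_or_removed = old_entries - new_entries

# Recover the bare file name the same way the script does.
names = ->(entries) { entries.map { |e| e.split(".php").first } }

puts names.call(changed_or_added).inspect
puts names.call(changed_or_removed).inspect
```

In this toy data, MessagesBb shows up in both directions because its contents changed, while MessagesCc and MessagesDd each show up in only one. A file that exists only in the older version would not appear in the script's output unless the subtraction is also run the other way.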

Reading this script should be fairly straightforward for anyone who knows a little bit of Ruby. Note a few things, though:

* 'unless' is equivalent to 'if not', and the script needs to not check directories, only files
* File.read is the same as Perl's "slurp": it puts the entire contents of the file into the variable 'text'
* the 'scan' method takes a regular expression for an argument. Here the regular expression is saying "give me all the text that begins with the string 'namespaceNames' and ends with the string ')'". I had forgotten that '.+' is 'greedy' and will match past the first terminating string all the way to the last one, so doing '.+?' prevents that (thanks to Charley Baker for the reminder). The 'm' at the end of the regex lets '.' match newlines too, which is necessary because each value of the $namespaceNames array is on its own line and I want to match all of them in one fell swoop.
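To make the greedy versus non-greedy point and the effect of the 'm' flag concrete, here is a small sketch. The text is invented and only loosely modeled on a Messages*.php file, and it uses String#[] with a regex (which returns the first match) rather than 'scan', purely to keep the example short.

```ruby
# Made-up PHP-like text; single-quoted heredoc so nothing is interpolated.
text = <<~'PHP'
  $namespaceNames = array(
    NS_TALK => 'Talk',
    NS_USER => 'User',
  );
  $namespaceAliases = array(
    'Discussion' => NS_TALK,
  );
PHP

# Greedy: runs on to the LAST ')' in the text, swallowing the next array too.
greedy = text[/namespaceNames.+\)/m]

# Non-greedy: stops at the FIRST ')' after 'namespaceNames'.
lazy = text[/namespaceNames.+?\)/m]

# Without 'm', '.' will not cross a newline, so nothing matches here.
no_m = text[/namespaceNames.+?\)/]

puts greedy.lines.length
puts lazy.lines.length
puts no_m.inspect
```

The greedy match spans both arrays, the non-greedy match covers only the first, and the version without 'm' finds nothing because there is no ')' on the same line as 'namespaceNames'. One caveat of stopping at the first ')': if a value itself contained a ')', the match would end early.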

The output from this script looks like

number of files with discrepancies in $namespaceNames array is 16
MessagesEn_ca
MessagesEn_rtl
MessagesFrp
MessagesIg
MessagesMk
MessagesMzn
MessagesNb
MessagesNds_nl
MessagesNo
MessagesOr
MessagesOs
MessagesQug
MessagesSa
MessagesSr_ec
MessagesWar
MessagesYue

At this point it made sense to just look at the problem files with my eyeballs and see what was in their $namespaceNames arrays. With a little help from diff(), that's what I did. I reported the discrepancies I found on a public mail list for Wikimedia tech issues.

A couple of interesting things happened because of that. Again, keep in mind that I am a total n00b with these systems. While I have a little more information now, I had no idea of what the consequences of such discrepancies would be.

I got an answer on the mail list from a senior Wikimedia dev person who analyzed the discrepancies I reported and said in effect "everything's fine, we are good to upgrade based on these examples". And while there are several other areas in these localization files that could cause issues, my example demonstrates that the technical risk for upgrading to 1.19 seems low.

But then some days later in a conversation on IRC, a different senior Wikimedia dev person said in effect "whoa, whoa, whoa, if we release these changes without at least some review from the language communities affected, we are going to be in for big trouble".

As I write this I do not know if the localization files for Wikipedia will be upgraded next week or not; that decision is not in my hands. However, I am immensely pleased that as a total n00b I was able to provide true concrete examples of the data in question to inform that decision.

I decided to write about this for a number of reasons:

To my mind, nothing in this story has anything to do with "testing". For some time now I have been saying that "QA is not evil", and to me, this was an exercise in pure Software Quality Assurance. Since my official title at Wikimedia is "QA Lead", this makes me happier than you would imagine.

One of the great neglected areas of software projects is the state of the actual data in applications, be it held in files or databases or whatever. One of the most important skills QA/testing people can bring to bear on a software project is the ability to isolate critical chunks of data from enormous data stores. That was true when I wrote "Is Your Haystack Missing a Needle" in 2004, it was true when Brian published "Everyday Scripting" in 2007 and it remains true today. If as a QA/testing person you don't know how to read a bunch of files and do regular expressions (and for that matter do SQL queries too), you owe it to yourself and to your projects to learn. (Frankly, I hadn't done this kind of thing in a long, long time, and it felt great to get back on that horse.)

Finally, I wrote this because all of the data and all of the conversations we had were completely open and public. I could give you a link to the email thread where I published the detailed discrepancies and got the reply, I could publish a link to the IRC log where people discussed the cultural risks of upgrading the localization files. The only reason I don't is because they're not germane to the story. I so enjoy working in an open culture.

Technical notes:

My original script checked for discrepancies among four arrays: $namespaceNames, $namespaceAliases, $magicWords, and $specialPageAliases. The $magicWords array was trickier, and I had to do this:

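text = File.read("#{@@path118}#{name}")
text.scan(/magicWords.+?\);/m)
# some files have no $magicWords array; skip those
if $~.to_s.length > 0
  array = $~.to_s
  # strip all whitespace so formatting-only differences are ignored
  array_no_space = array.gsub(/\s+/,"")
  @@nsn118magicWords_array << name + array_no_space
end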

For one thing, $magicWords is an array-of-arrays, so I check for a terminating string of ');' instead of just ')'. For another, some of the files didn't contain the $magicWords array at all, hence the length check. Finally, I found some random differences in whitespace between versions in a great many files, so I eliminated all the whitespace in the strings in question by doing 'array.gsub(/\s+/,"")'. The comparison only became valid once all three of those things were handled.
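The three adjustments can be seen working together in a small sketch. The snippets and the helper name 'normalized_magic_words' are invented for illustration and are not part of the original script.

```ruby
# Two made-up snippets that differ only in whitespace (tab vs. spaces).
old_text = "$magicWords = array(\n\t'redirect' => array( '0', '#REDIRECT' ),\n);"
new_text = "$magicWords = array(\n    'redirect'  => array( '0', '#REDIRECT' ),\n);"

# Hypothetical helper: returns the $magicWords block with all whitespace
# removed, or nil when the text has no such block.
def normalized_magic_words(text)
  # ');' as the terminator skips the ')' that closes each inner array
  match = text[/magicWords.+?\);/m]
  match&.gsub(/\s+/, "")
end

raw_old = old_text[/magicWords.+?\);/m]
raw_new = new_text[/magicWords.+?\);/m]

puts raw_old == raw_new
puts normalized_magic_words(old_text) == normalized_magic_words(new_text)
puts normalized_magic_words("$namespaceNames = array();").inspect
```

The raw matches differ, the normalized ones are equal, and a text without $magicWords yields nil. The trade-off is that stripping every whitespace character also removes spaces inside quoted values, so a change that consists only of a space within a value would go unnoticed.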

Chris McMahon's Blog
