Archive for December, 2008

levenshtein too slow, how to speed it up

Tuesday, December 30th, 2008
for a little project i need to compare a string against a large set of strings. it should not only match the exact strings, it should also match strings which are similar. to find out how similar two strings are, there is an algorithm called “levenshtein distance”. it takes two strings as arguments and returns the distance between them. if the distance is zero the strings are equal; the bigger the distance, the more the strings differ.
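for illustration, here is a minimal sketch of the textbook dynamic programming version of the algorithm (not an optimized implementation, just enough to show the idea):

// computes the levenshtein distance between a and b,
// O(n*m) in time and space
static int levenshtein(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
    for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(Math.min(
                    d[i - 1][j] + 1,           // deletion
                    d[i][j - 1] + 1),          // insertion
                    d[i - 1][j - 1] + cost);   // substitution
        }
    }
    return d[a.length()][b.length()];
}

for example, levenshtein("kitten", "sitting") returns 3: two substitutions and one insertion.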

too slow

i use it to compare strings which are about 200 chars long, and there are at the moment 40′000 strings. to compare one string to the existing set i need to call the levenshtein algorithm 40′000 times. because the algorithm itself is not super fast, the comparison takes a long time. for testing i took the implementation from Levenshtein: Java Implementation. i might make it a bit faster by writing an optimized version, but i doubt that would gain much. the real problem is the 40′000 calls.

you don’t need 40′000 levenshtein calls

one way to make it faster is to rule out some strings by comparing lengths first. if the maximum distance for two strings to match is 10, you can discard every string whose character count differs from the test string’s by more than 10, because the distance can never be smaller than the difference in length. this is already much faster, for the 40′000 strings and for my usage almost fast enough, but the faster the better. with this approach i still have to check the string’s length against the lengths of all 40′000 other strings. this can be sped up by putting strings of the same length into buckets in an array, where the index is the length of the strings in the bucket. then i don’t have to compare against all strings: if l is the length of the string i want to test, i only need to look at the buckets for the lengths l-10 to l+10 (see the sketch below). this technique could probably rule out even more strings, for example by counting the number of words in the strings instead of the number of characters, or the number of vowels. these approaches could be combined and would probably speed it up another bit, but statistically the result would likely be about the same as with the total length alone.
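a sketch of the bucket idea (the class and method names are made up, and it assumes a levenshtein(String, String) method like the one sketched above):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LengthBuckets {
    // strings grouped by their length
    private final Map<Integer, List<String>> buckets =
            new HashMap<Integer, List<String>>();

    void add(String s) {
        List<String> bucket = buckets.get(s.length());
        if (bucket == null) {
            bucket = new ArrayList<String>();
            buckets.put(s.length(), bucket);
        }
        bucket.add(s);
    }

    // returns all stored strings with a distance of at most maxDistance,
    // looking only at the buckets for lengths l-maxDistance to l+maxDistance
    List<String> findSimilar(String query, int maxDistance) {
        List<String> result = new ArrayList<String>();
        int l = query.length();
        for (int len = l - maxDistance; len <= l + maxDistance; len++) {
            List<String> bucket = buckets.get(len);
            if (bucket == null) {
                continue;
            }
            for (String candidate : bucket) {
                if (levenshtein(query, candidate) <= maxDistance) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }
}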

how much faster is it?

i didn’t measure exact timings, i only judged by feel. with the string length approach i could rule out about 85% of the strings, which still leaves roughly 6′000 comparisons. this is faster, but if the set of strings to test against gets bigger i’ll soon have the same problem again. if i add the other two approaches or similar ones i might get to 90% or even 95%, but that still won’t help if the set grows ten or a hundred times bigger.

a better approach?

a better approach would probably be to do the test against all strings in one go. for that the levenshtein algorithm would have to be rewritten, and i don’t know if that’s even possible. since i’m not a mathematician i might have problems with this approach, but i’ll try it and post the result here if it works.

install ubuntu from usb stick

Sunday, December 28th, 2008
i bought a “Gigabyte GA-GC230D, Atom 230” motherboard some weeks ago and now i finally had time to install linux on it. i was a bit surprised how much of a torture this was. there are lots of quite specific howtos, but it still took me hours of trying. in the end it was quite easy, but you have to know how. here is a list of sources i used, though none of them worked on its own:

  • live usb pendrive persistent
  • installation from usb stick
  • how to install ubuntu on usb bar

preparing the usb stick

at first i created a single partition; there is usually already one on a usb stick. it has to be at least the size of two cds. then i formatted it like this:

sudo mkfs -t vfat /dev/sdx1

where /dev/sdx1 is the partition of the usb stick. be careful not to accidentally format another partition if you have serial-ata or scsi disks. i accidentally formatted my swap space :-) you can find out which device your usb stick is by typing:

sudo fdisk -l

copy the files

you need an iso image of an install cd. i got the ubuntu 8.10 server image. after downloading it i created a directory, mounted the iso into this directory and copied all the files to the usb stick. it is probably not necessary to copy all files to the stick, but i was too lazy to test what exactly is needed. in my case the path of the usb stick was /media/disk.

mkdir ubuntuImage
mount -o loop /path/to/iso-image ubuntuImage
cd ubuntuImage
cp -Rf * /media/disk
cp -Rf .disk /media/disk
cp -Rf isolinux /media/disk/syslinux
cd /media/disk/syslinux
mv isolinux.cfg syslinux.cfg

that’s it, the files are on the stick. during the installation there was a problem copying files from the stick. i solved it by making a copy of /media/disk/dists/intrepid named stable. on the cd, stable was a symbolic link, which is not possible on a fat filesystem.

cp -R /media/disk/dists/intrepid /media/disk/dists/stable

to “fix” another problem occurring later, copy the whole iso image to the stick too.

make the drive bootable

to install the bootloader you need a command called syslinux. it does some magic to the usb stick. to install it, type:

sudo apt-get install syslinux mtools

if your usb stick is mounted, unmount it. then install the bootloader with:

sudo syslinux /dev/sdx1

to make sure your stick has a proper master boot record, use:

install-mbr /dev/sdx

booting from the stick

in the bios i had to activate an option called “legacy USB storage detect” and select USB-ZIP as the boot device. after that ubuntu booted and the installer started. the first problem occurred when the installer tried to load the cd; it just wasn’t able to. with alt-f2 you can switch to the console and mount the “cdrom” manually:

mount -t vfat /dev/sdx1 /cdrom

go back to the installer with alt-f1, retry the failed step and it should work now. after setting up network and disk another error will occur: when trying to install the base system, the message “Failed to determine codename for the release” appears. go back to the install menu and select “load installer components from cd”. choose the iso option; it will find the image and the installation should continue without problems.

formatter for debugging json, xml and more

Wednesday, December 17th, 2008
very often when i am developing applications i have to work with external services. sometimes the communication is in xml, sometimes in json and sometimes in a very curious nonstandard format. xml and json both have the advantage of being human readable, but very often this advantage is destroyed by horribly formatted xml and json streams, which makes debugging much harder. often i end up with a large chunk of xml or json without any line breaks in it. instead of installing an application that can properly format such a chunk you can also use online tools. it is very convenient to just copy/paste the data instead of creating a new file, pasting it in and opening it. here is a list of a few:

json formatting

  • JSON Formatter: very powerful, great user interface, also validates input, customizable
  • JSON Format: gets a bit slow with really large json strings

xml formatter

haven’t found a good one yet.

urlencoder/decoder

if you pass your data as a url parameter, it is helpful to be able to decode it easily.
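if you don’t want to paste the data into a website, java can do this out of the box; a minimal sketch:

import java.net.URLDecoder;

public class Decode {
    public static void main(String[] args) throws Exception {
        // %3D is the encoded "=", %20 the encoded space
        System.out.println(URLDecoder.decode("key%3Da%20b", "UTF-8"));
        // prints: key=a b
    }
}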

html formatter

usually not used for communication, but can be handy if you need to extract content from a website.

SQL Formatter

found one, so here it is.

mysql auto_increment with multi valued key

Sunday, December 14th, 2008
a few weeks ago i had to create a mysql database where a table had a primary key consisting of a foreign key and an index. my first try looked like this:

CREATE TABLE testTable (
    ref bigint NOT NULL,
    idx bigint NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (ref, idx)
) ENGINE=MyISAM;

ref was a reference to another table, so that was given, and idx was there to differentiate between versions. one requirement was that idx should start at 1 for each distinct ref and count up from there, but i forgot about that requirement and did it like above. i remembered the forgotten requirement a few days later, when the database was already in use, and was quite surprised to see that everything worked exactly as it should. i had expected idx to be different for every single row, but it was only different among rows with the same ref. quite an interesting behavior for an auto_increment column, i thought, and consulted the mysql manual. it turns out that if an auto_increment column is a secondary column of a multi-column index, myisam generates the value per group: the new idx is the highest existing idx for that ref plus one. in this mode, if the last row of a group is deleted, its value is reused later. but it only works if the auto_increment column is not the first column of another index; in that case that index would generate the auto_increment values and they would be unique over all rows.
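to illustrate the behavior (assuming the table above is empty):

INSERT INTO testTable (ref) VALUES (1);  -- stored as (ref=1, idx=1)
INSERT INTO testTable (ref) VALUES (1);  -- stored as (ref=1, idx=2)
INSERT INTO testTable (ref) VALUES (2);  -- idx restarts per ref: (ref=2, idx=1)
INSERT INTO testTable (ref) VALUES (1);  -- continues for ref=1: (ref=1, idx=3)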

md5 hash function

Sunday, December 7th, 2008
md5 is a cryptographic hash funtion. the input is a some data, a text string or whatever. the output is a fixed length byte array. it is impossible to reconstruct the original data from the output data. if the input gets changed just a tiny bit the output is completely different. java supports multible hash algorithms, one of the not so secure but often used is md5. to create an md5 hash of a string you need about following code: import java.security.MessageDigest; .... // here we store the output as a hex encoded string StringBuilder builder = new StringBuilder(); MessageDigest md5 = MessageDigest.getInstance("MD5"); // we use the string "HELLO" as input md5.update("HELLO".getBytes()); // now we have to iterate over the output bytearray for (byte b : md5.digest()) { // make an int out of the byte. the // & 0xff is to remove the sign bit int i = b & 0xff; if (i < 0x10) { // if it's less than 16, add a 0 for padding builder.append("0"); } builder.append(Integer.toHexString(i)); } // print the finished hex string System.out.println(builder.toString()); the output should be eb61eead90e3b899c6bcbe27ac581660 if you need only a single hash you can use an online hash calculator. this one supports not only md5, he also supports older, less secure algorithms and newer more secure algorithms.