R Fork Bomb

So maybe I’m a strange guy, but I think fork bombs are really funny.  What’s a fork bomb?  The basic premise is that you spawn a process that spawns a process that spawns a process…, ad infinitum.

The most beautiful example of a fork bomb, and really one of the most beautiful lines of code ever, was created by Denis Roio:

:(){ :|:& };:

Aside from looking like the gnarliest smiley face ever, running the above in a Unix terminal will spawn processes forever, unless you’ve limited the number of processes that can be run (I think OS X does this by default, so any brave Mac users are encouraged to try).  After a whole lot have been spawned, you’ve basically locked up your system and will have to reboot.

There’s a thorough explanation of how and why the above code works over at the Wikipedia fork bomb page, but I almost feel like knowing how it works ruins the beauty somewhat.  But that page also has lots of examples of fork bombs in various languages.  Unfortunately, R is absent from that list.  Well I think that’s just a shame.

We could invoke the above (with some modifications due to a weird quirk) with the system() function.  But that’s basically just letting the shell have all the fun.  However, we can do one better using the multicore package:

library(multicore)

while (TRUE) fork()
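
For what it’s worth, on newer R installs where the multicore package has been folded into parallel, a roughly equivalent (and equally inadvisable) sketch would presumably use parallel’s unexported fork; consider this an untested assumption rather than a recommendation:

library(parallel)

# Same idea via parallel:::mcfork(), which exists on Unix-alikes;
# running this will lock up your machine just as thoroughly.
while (TRUE) parallel:::mcfork()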

So why would you want to do this?  Well, I think the whole point is that you wouldn’t want to.  It’s just neat.

One final remark on this.  I have a good friend with whom I regularly converse about programming.  He’s a hardcore C programmer, and as such, he finds my struggles with efficiency in R hilarious.  When I told him that I was going to make a blog post about an R forkbomb, he had a simple, two-word response:

“Slowest forkbomb.”

Who says programmers don’t have a sense of humor?


How Much of R is Written in R Part 2: Contributed Packages

So that mean old boss of mine is at it again.  This morning I came in beaming about how many people had read my post How Much of R is Written in R (thanks, by the way!).  He then asked me about one little line in that post: the one about how, if you looked at the contributed packages, you’d overwhelmingly see them written in R, and that this is what really matters.

He asked if I had bothered to verify this.  I had not.  So off he sent me to do this too.  Can you believe the nerve of this character?  Making me do work.  I’m paid by a university; I’m not supposed to do work!

So think of this post as the quick cash-in sequel that probably no one wanted–the “Transformers 2” of R-bloggers posts, if you will.  I don’t mind admitting that I had to use every ounce of restraint to not call this “… Part 2:  Electric Boogaloo”.

This task is a little more cumbersome than the last one.  Unfortunately this means I have to actually be slightly careful in my shellscript, rather than just playing it fast and loose and making all the hardcore programmers angry by being so sloppy (spoiler alert:  they will still be angry).  I won’t go into the boring details about how I generated the data.  Let’s just have a look at the results; though, as always, all code and generated output will be attached.

Oh, but before we go on, if you thought Part 1 was interesting, you should go check out http://www.ohloh.net/p/rproject.  Someone linked it in the comments of Part 1, and it’s definitely impressive.  It has most of the information from that blog post and then some, in a much cooler graph.  From the drop-down menu, select “Languages”.  The graph is interactive, and if you (for instance) click on C (to hide it), then the graph rescales.  Pretty cool!

Ok, so the results then.

We will again first look at the language breakdown by number of total files:

Here R is the clear victor by an overwhelming majority.  But the real challenge is looking at the language breakdown by percentage contribution to total lines of code (for our purposes here, we are looking at all ~3000 contributed packages and lumping them together as one huge mass of sourcecode).  The following graph provides that, but first a word of caution.  It’s a bit difficult to tell whether a .h, or even a .c, file is C or C++ (in fact, you can fool the Unix/Linux tool “file” into believing these are all kinds of crazy things).  As such, some choices had to be made (see the script if you’re interested), and errors likely occurred in determining what is C code and what is C++ code.  That said, having made our choices, we have the following breakdown:

Here R is the clear victor, though in light of the doubt cast above, it is perhaps not as strong of a victory as we would hope.  To assuage all doubt, we can lump C and C++ together into one category:

And look at that; R is still winning.  Now, you probably see where this is headed:

Which is pretty darn good!  In short, the people who are outside of the core R team but who are still developing incredibly cool things with R…are making those things in R.  And why shouldn’t they?  R rocks!

Here’s all the boring crap nobody cares about:

First, just a quick heads up; all the sourcecode to follow will look much better on my blog than on R-bloggers, because I have everything formatted (with highlighted syntax) specially in WordPress.  So if you want to be able to read this, then I would recommend reading it directly from my blog.

The shell script to grab all the packages and generate the “csv” (output is tab delimited, because some screwy files have commas in them; changed to .xls because WordPress hates freedom; split into two pieces because xls is a dumb format:  piece 1    piece 2):

#!/bin/sh

outdir="wherever/"

wget -e robots=off -r -l1 --no-parent -A.tar.gz http://cran.r-project.org/src/contrib/

srcdir="cran.r-project.org/src/contrib"
cd $srcdir

echo -e "\n\n          This will take a LONG FREAKING TIME.\n          Please wait patiently.\n\n"

touch $outdir/r_contrib_loc_by_lang.csv

#echo "Language,File.Name,loc,Proj.Name,Proj.ID.Nmbr" >> $outdir/r_contrib_loc_by_lang.csv
echo -e "Language\tFile.Name\tloc\tProj.Name" >> $outdir/r_contrib_loc_by_lang.csv

for archive in `ls *.tar.gz`;do
  tar -zxf $archive
done

for found in `find . -name \*.r -or -name \*.R -or -name \*.pl -or -name \*.c -or -name \*.C -or -name \*.cpp -or -name \*.cc -or -name \*.h -or -name \*.hpp -or -name \*.f`; do
  loc=`wc -l $found | awk '{ print $1 }'`
  filename=`echo $found | sed -n 's:^\(.*/\)\([^/]*$\):\2:p'`
  proj=`echo $found | sed -e 's/.\///' -e 's/\/.*//'`
  lang=`echo $filename | sed 's/.*[.]//'`

  if [ $lang = "r" ]; then
    lang="R"
  elif [ $lang = "pl" ]; then
    lang="Perl"
  elif [ $lang = "C" ]; then
    lang="c"
  elif [ $lang = "cpp" ]; then
    lang="c++"
  elif [ $lang = "h" ]; then
    lang="c"
  elif [ $lang = "hpp" ]; then
    lang="c++"
  elif [ $lang = "f" ]; then
    lang="Fortran"
  elif [ $lang = "cc" ]; then
    # Use file for best guess; bad guesses we revert to c
    lang=`file $found | awk '{ print $3 }'`
    if [ $lang = "English" ] || [ $lang = "Unicode" ]; then
      lang="c"
    fi
  fi
  
  echo -e "$lang\t$filename\t$loc\t$proj"  >> $outdir/r_contrib_loc_by_lang.csv
done

echo -e "\n\n          ALL DONE\n\n"

And here’s the R script used to analyze everything and generate the barplots:

r.loc <- read.delim("r_contrib_loc_by_lang.csv", header=TRUE, stringsAsFactors=FALSE)

a <- r.loc[which(r.loc[1] == "R"), ][3]
b <- r.loc[which(r.loc[1] == "c"), ][3]
c <- r.loc[which(r.loc[1] == "c++"), ][3]
d <- r.loc[which(r.loc[1] == "Fortran"), ][3]
e <- r.loc[which(r.loc[1] == "Perl"), ][3]

lena <- length(a[, 1])
lenb <- length(b[, 1])
lenc <- length(c[, 1])
lend <- length(d[, 1])
lene <- length(e[, 1])

files.total <- lena + lenb + lenc + lend + lene
loc.total <- sum(a) + sum(b) + sum(c) + sum(d) + sum(e)

cat(sprintf("\nNumber R files:\t\t\t %d\nNumber C files:\t\t\t %d\nNumber C++ files:\t\t %d\nNumber Fortran files:\t\t %d\nNumber Perl files:\t\t %d\n", lena, lenb, lenc, lend, lene))
cat(sprintf("--------------------------------------"))
cat(sprintf("\nTotal source files examined:\t %d\n\n", files.total))

cat(sprintf("\nLines R code:\t\t %d\nLines C code:\t\t %d\nLines C++ code:\t\t %d\nLines Fortran code:\t %d\nLines Perl code:\t %d\n", sum(a), sum(b), sum(c), sum(d), sum(e)))
cat(sprintf("--------------------------------"))
cat(sprintf("\nTotal lines of code:\t %d\n\n", loc.total))

cat(sprintf("%% code in R:\t\t %f\n%% code in C:\t\t %f\n%% code in C++:\t\t %f\n%% code in Fortran:\t %f\n%% code in Perl:\t\t %f\n", 100*sum(a)/loc.total, 100*sum(b)/loc.total, 100*sum(c)/loc.total, 100*sum(d)/loc.total, 100*sum(e)/loc.total))

png("1_pct_contrib_source_files.png")
barplot(c(100*lena/files.total, 100*lenb/files.total, 100*lenc/files.total, 100*lend/files.total, 100*lene/files.total), main="Percent of Contrib Sourcecode Files By Language", names.arg=c("R","C","C++","Fortran","Perl"), ylim=c(0,70))
dev.off()

png("2_pct_contrib_code.png")
barplot(c(100*sum(a)/loc.total, 100*sum(b)/loc.total, 100*sum(c)/loc.total, 100*sum(d)/loc.total, 100*sum(e)/loc.total), main="Percent Contribution of Language to Contrib", names.arg=c("R","C","C++","Fortran","Perl"), ylim=c(0,50))
dev.off()

png("3_pct_contrib_code_c.cpp.combined.png")
barplot(c(100*sum(a)/loc.total, 100*(sum(b)+sum(c))/loc.total, 100*sum(d)/loc.total, 100*sum(e)/loc.total), main="Percent Contribution of Language to Contrib", names.arg=c("R","C/C++","Fortran","Perl"), ylim=c(0,50))
dev.off()

png("4_pct_contrib_code_r.vs.all.png")
barplot(c(100*sum(a)/loc.total, 100*(sum(b)+sum(c)+sum(d)+sum(e))/loc.total), main="Percent Contribution of Language to Contrib", names.arg=c("R","Everything Else"), ylim=c(0,60))
dev.off()

With output:

Number R files:			 46054
Number C files:			 9149
Number C++ files:		 9387
Number Fortran files:		 1684
Number Perl files:		 44
--------------------------------------
Total source files examined:	 66318

Lines R code:		 6009966
Lines C code:		 3519637
Lines C++ code:		 2214267
Lines Fortran code:	 645226
Lines Perl code:	 6910
--------------------------------
Total lines of code:	 12396006

% code in R:		 48.483084
% code in C:		 28.393315
% code in C++:		 17.862745
% code in Fortran:	 5.205112
% code in Perl:		 0.055744

How Much of R is Written in R?

My boss sent me an email (on my day off!) asking me just how much of R is written in the R language.  This is very simple if you use R and a Unix-like system.  It also gives me a good excuse to defend the title of this blog.  It’s librestats, not projecteulerstats, after all.

So I grabbed the R-2.13.1 source package from the cran and wrote up a little script that would look at all .R, .c, and .f files in the archive, record the language (R, C, or Fortran), number of lines of code, and the file the code came from; then it’s just a matter of dumping all that to a csv (converted to .xls (in LibreOffice) because WordPress hates freedom).

We’ll talk in a minute about just how you would generate that csv–but first let’s address the original question.

By a respectable majority, most of the source code files of core R are written in R:

At first glance, it seems like Fortran doesn’t give much of a contribution. However, when we look at the proportion of lines of code, we see something more reasonable:

So there you have it. Roughly 22% of R is written in R. I know some people want R to be written in R for some crazy reason; but really, if anything, that 22% is too high. Trust me, you really want C and Fortran to be doing all the heavy lifting so that things stay nice and peppy.

Besides, this is a fairly irrelevant issue, in my opinion.  What matters is that people outside of Core R are writing in R.  Look at the extra packages repo and you’ll see a very different story from the above graphic.  That’s something SAS certainly can’t say, since people who want to do anything other than call some cookie-cutter SAS proc have to use IML or that ridiculous SAS macro language–each of which is somehow even more of a hilarious mess than base SAS.

Ok, so how do we get that data?  I actually have a much better script than the one I’m about to describe.  The new one automatically grabs every source package from the cran that you don’t already have and starts digging in on them, dumping everything out into one big csv so you can watch trending.  It’s interesting to watch R go from being written almost entirely in C (92%) to C slowly dropping down to ~52%.  But that’s a different post for a different day, because I have a few kinks to work out with that script before I would feel comfortable releasing it.

So here’s how this system works.  It’s basically the dumbest possible solution; I’m pretty good at those, if I may say so myself.  Basically the shell script hops across the R-version/src/ folder and gets a line count of each .R, .c, and .f file.  That’s it; here it is:

#!/bin/sh

outdir="/path/to/where/you/want/the/csv/dumped"

rdir="/path/to/R/source/root/directory/to/be/examined" #eg, ~/R-2.13.1/
cd $rdir/src

for rfile in `find . -name '*.R'`
do
	loc=`wc -l $rfile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'`
	echo "R,$loc"  >> $outdir/r_source_loc.csv
done

for cfile in `find . -name '*.c'`
do
	loc=`wc -l $cfile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'`
	echo "C,$loc"  >> $outdir/r_source_loc.csv
done

for ffile in `find . -name '*.f'`
do
	loc=`wc -l $ffile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'`
	echo "Fortran,$loc"  >> $outdir/r_source_loc.csv
done

Then the R script just does exactly what you’d think, given the data (take a look at the “csv” for examples).

r.loc <- read.csv("r_source_loc.csv",header=FALSE)

a <-r.loc[which(r.loc[1] == "R"),][2]
b <-r.loc[which(r.loc[1] == "C"),][2]
c <-r.loc[which(r.loc[1] == "Fortran"),][2]

files.total <- length(a[,1])+length(b[,1])+length(c[,1])
loc.total <- sum(a)+sum(b)+sum(c)

cat(sprintf("\nNumber .R source files:\t\t %d\nNumber .c source files:\t\t %d\nNumber .f source files:\t\t %d\n",length(a[,1]),length(b[,1]),length(c[,1])))
cat(sprintf("-------------------------------------"))
cat(sprintf("\nTotal source files examined:\t %d\n\n",length(a[,1])+length(b[,1])+length(c[,1])))

cat(sprintf("\nLines of R code:\t %d\nLines of C code:\t %d\nLines of Fortran code:\t %d\n",sum(a),sum(b),sum(c)))
cat(sprintf("-------------------------------"))
cat(sprintf("\nTotal lines of code:\t %d\n\n",loc.total))

cat(sprintf("\nAmong all lines of code being either R, C, or Fortran:\n"))
cat(sprintf("%% code in R:\t\t %f\n%% code in C:\t\t %f\n%% code in Fortran:\t %f\n",100*sum(a)/loc.total,100*sum(b)/loc.total,100*sum(c)/loc.total))

png("pct_r_source_files.png")
barplot(c(100*length(a[,1])/files.total,100*length(b[,1])/files.total,100*length(c[,1])/files.total),main="Percent of Core R Sourcecode Files",names.arg=c("R","C","Fortran"))
dev.off()

png("pct_r_code.png")
barplot(c(100*sum(a)/loc.total,100*sum(b)/loc.total,100*sum(c)/loc.total),main="Percent of Core R Lines of Code",names.arg=c("R","C","Fortran"))
dev.off()

From the R script, we can get precise figures, which I prefer to pictures any day. But I seem to be an outlier in this regard…

Number .R source files:		 729
Number .c source files:		 586
Number .f source files:		 45
-------------------------------------
Total source files examined:	 1360

Lines of R code:	 149520
Lines of C code:	 346778
Lines of Fortran code:	 175409
-------------------------------
Total lines of code:	 671707

Among all lines of code being either R, C, or Fortran:
% code in R:		 22.259705
% code in C:		 51.626379
% code in Fortran:	 26.113916


Project Euler in R: Problem 10

Problem:  Find the sum of all the primes below two million.

Commentary:  If you’re not careful, this problem can really get you in R.  In poking around the internet, there are a few solutions to this problem, but all the ones I’ve tested are slow.  Some, even from people who are better at R than me, are VERY slow.  That linked one in particular took 6 minutes and 8.728000 seconds to complete.  My solution runs in under 2 seconds.

So we’ll be using that old prime algorithm I’ve already discussed to death.  Other than a few tricks in there that I never see get implemented, there’s a crucial trick to this solution.  Since we only want the sum, let’s not be slow about finding the primes we want.  I only have to find all primes below \sqrt{n}, where n=2000000.  It takes basically no time to find them.  From there, I start evaluating my sum, and like a good little R user, I vectorize by getting the list of possible primes above \sqrt{n}.  Then I check if any of my known primes divide into those possible primes.  This is actually much faster than generating primes and storing them.

R Code:

ElapsedTime <- system.time({
##########################
n <- 2000000

### Function to get primes below specified n; option to get only those below sqrt(n)
PrimesBelow <- function(n, below.sqrt=FALSE){
  if (below.sqrt == TRUE){
    m <- ceiling(sqrt(n))
  } else {
    m <- n
  }

  primes <- c(2,3)
  i <- 3
  while (i < m-1){
    flag <- 0
    i <- i+2
    if (i%%6 == 3){
      flag <- 1
    }
    if (flag == 0){
      s <- sqrt(i)+1
      possible.primes <- primes[primes<s]
      for (prime in possible.primes){
        if ((i%%prime == 0)){
          flag <- 1
          break
        }
      }
      if (flag == 0){
        primes <- c(primes, i)
      }
    }
  }
  primes
}
###

primes <- PrimesBelow(n, below.sqrt=TRUE)

biglist <- ceiling(sqrt(n)):n
biglist <- biglist[-which(biglist%%2 == 0)]
biglist <- biglist[-which(biglist%%6 == 3)]

for (prime in primes){
  biglist <- biglist[which(biglist%%prime != 0)]
}

answer <- sum(primes)+sum(as.numeric(biglist))
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %f\nTotal elapsed time:  %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))

Output:
The answer is:  142913828922.000000
Total elapsed time:  0 minutes and 1.915000 seconds


Project Euler in R: Problem 9

Problem:  There exists exactly one Pythagorean triplet for which a + b + c = 1000.  Find the product abc.

Commentary:  I’m not proud of this solution.  Every time I would look at this solution, I just knew there was something really obvious I was missing–that I was doing this in the most bone-headed way possible.  So then I finally just gave up and checked the writeup at Project Euler, and yep, I forgot that Pythagorean Triplets have a very nice parametrization.

I won’t even bother to reproduce the idea here, because the writeup over at Project Euler is actually very well done.  But it is a bit embarrassing to have a master’s in number theory and to forget that simple fact.  Oh well.

R Code:

ElapsedTime <- system.time({
##########################
for (a in c(1:333)){
  for (b in c(334:998)){
    test <- a+b+sqrt(a^2+b^2)
    if (test==1000){
      win.a <- a
      win.b <- b
      break
    }
  }
}

answer <- win.a*win.b*sqrt(win.a^2+win.b^2)
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))

Output:

The answer is:  31875000
Total elapsed time:  0 minutes and 0.460000 seconds


Prime testing function in R

I was hoping to begin tinkering a bit with the multicore package in R beyond some extremely trivial examples.  Thanks to a combination of R’s dumb quirkiness (for example, being worthless on loops), my poor planning, and general bad programming, my Saturday afternoon tinkering project is ultimately worthless in fulfilling that purpose.

I was really hoping to take the prime searcher I had been using to solve Project Euler problems and use it to make a prime testing function which would utilize the 4 cores on my adorable little core i5.  I succeeded in creating it, but it wasn’t until I had finished (which thankfully was only about an hour of time wasted) that I realized that my plan was stupid and I was going to have to loop over entries (or use sapply(), which is the same thing; don’t kid yourself).  This, as we all know, is where R is fantastically bad.

A very hasty test showed that, running on a single core, the original way I was doing things is a little over 2 times faster than this new implementation.  This doesn’t bode well for the possible speedup when scaling up to 4 cores, and since I have more pressing projects at the moment, I’m going to have to dump this one barring some illumination.

However, it is a workable prime tester.  And for testing against single primes, it’s honestly not that slow.  But, as pointed out above, if you’re using it to test primality of a (reasonably large) range of (reasonably large) values, then you’re going to have to loop and R is going to catch fire in the process.

The syntax is simple enough; simply conjure your integer of choice n and ask R

IsPrime(n)

which is like the old Maple syntax back when I used it extensively (which was ~ Maple 8 or 9, I believe).  Functions and sample output below:

IsPrime <- function(n){ # n=Integer you want to know if is/not prime
  if ((n-floor(n)) > 0){
    cat(sprintf("Error: function only accepts natural number inputs\n"))
  } else if (n < 1){
    cat(sprintf("Error: function only accepts natural number inputs\n"))
  } else
  # Prime list exists
  if (try(is.vector(primes), silent=TRUE) == TRUE){

    # Prime list is already big enough
    if (n %in% primes){
      TRUE
    } else
    if (n < tail(primes,1)){
      FALSE
    } else
    if (n <= (tail(primes,1))^2){
      flag <- 0
      for (prime in primes){
        if (n%%prime == 0){
          flag <- 1
          break
        }
      }
    
      if (flag == 0){
        TRUE
      }
      else {
        FALSE
      }
    }

    # Prime list is too small; get more primes
    else {
      last.known <- tail(primes,1)
      while ((last.known)^2 < n){
        assign("primes", c(primes,GetNextPrime(primes)), envir=.GlobalEnv)
        last.known <- tail(primes,1)
      }
      IsPrime(n)
    }
  } else {
    # Prime list does not exist
    assign("primes", PrimesBelow(n,below.sqrt=TRUE), envir=.GlobalEnv)
    IsPrime(n)
  }
}

# Get next prime
GetNextPrime <- function(primes){ # primes=Known prime list
  i <- tail(primes,1)
  while (TRUE){
    flag <- 0
    i <- i+2
    if (i%%6 == 3){
      flag <- 1
    }
    if (flag == 0){
      s <- sqrt(i)+1
      possible.primes <- primes[primes<s]
      for (prime in possible.primes){
        if ((i%%prime == 0)){
          flag <- 1
          break
        }
      }
      if (flag == 0){
        break
      }
    }
  }
  i
}

# Primes below specified integer n; optionally only those below sqrt(n)
PrimesBelow <- function(n, below.sqrt=FALSE){
  if (below.sqrt == TRUE){
    m <- ceiling(sqrt(n))
  } else {
    m <- n
  }
  
  primes <- c(2,3)
  i <- 3
  while (i < m-1){
    flag <- 0
    i <- i+2
    if (i%%6 == 3){
      flag <- 1
    }
    if (flag == 0){
      s <- sqrt(i)+1
      possible.primes <- primes[primes<s]
      for (prime in possible.primes){
        if ((i%%prime == 0)){
          flag <- 1
          break
        }
      }
      if (flag == 0){
        primes <- c(primes, i)
      }
    }
  }
  primes
}

Some sample output:

 > primes
Error: object 'primes' not found
> IsPrime(100)
[1] FALSE
> IsPrime(101)
[1] TRUE
> primes
[1]  2  3  5  7 11
>

After screwing around with the prime tester for a few minutes, I was able to find this adorable little gem

1> IsPrime(1234567891)
[1] TRUE

So 1234567891 is prime, but none of 1, 12, 123, 1234, 12345, 123456, 1234567, 12345678, 123456789, 12345678910, 123456789101, or 1234567891011 are (of course, nearly half of those are even, but the point stands).


Project Euler in R: Problem 8

Problem:  Find the greatest product of five consecutive digits in the 1000-digit number.

Commentary:  So this solution is probably a little non-standard.  I’ve actually got a huge blog post coming up about the little “trick” I use here, so I won’t go into it at length just yet. But before we go into the “trick”, I want to explain why it is that I’m doing the weird crap that I’m doing–afterwards we’ll talk a little about the “what”.

My philosophy on Project Euler problems is that they should be taken “as is”.  If I’m given a text file (as is the case in some problems), then I’d better figure out how to read in that text file.  Likewise, if I’m given a text dump (as is the case here), then no changes can be made by hand.  None at all.  I suspect this is how most people proceed.  However, after solving a problem in Project Euler, I enjoy looking at other people’s solutions, and in doing so, I have found some people who admit to doing some truly crazy things.  Things like adding commas after characters by hand in a huge text dump.  Don’t do that!

So in this problem, the goal is to copy/paste the text dump in a script and sandwich it somewhere between some usable code.  I do this by turning it into a vector in R using Unix sorcery.  Once you buy that this is what I’ve done, the rest is trivial.  Just check every 5 consecutive digits, but keep only the biggest.

So how did I get that thing into a vector?  You might argue that it was by cheating.  I couldn’t disagree more (more on this at the conclusion of the commentary).  I used the function system(), which on a Linux (and presumably Mac OS X, *bsd, …) system is a very powerful tool.  It’s essentially a wrapper for the very powerful Unix (or Unix-like) shell which lives under your operating system and makes everything awesome.  I don’t know what it does on Windows because I don’t give a shit.

So here, I pass off to the shell an echo which contains some unwanted (additional) line breaks, together with the text dump from Project Euler.  Next, I pipe off to tr and sed to do their voodoo.  In short, this is how they work here:

  • tr -d '\n' removes line breaks
  • sed 's/\\([0-9]\\)/\\1\\n/g' adds a line break after every numeral
  • sed 's/.$//' removes the last line break

Two additional things:  first, the -e’s allow me to chain together a bunch of sed calls at once.  This is handy because sed is the Stream EDitor, which is exactly what it sounds like–and chaining them like that (as opposed to piping the output to another sed call) is faster because (here) you only have to read in the stream once.  Second, you might notice that in the tr call, I use \n for line breaks, but in the sed call I use \\n (and \\( instead of \(, \\1 instead of \1, etc).  The reason is that the whole command is an R string literal, so R strips one layer of backslashes before the shell (and sed) ever sees the command; sed’s special characters therefore need to be escaped twice.  (The tr call gets away with a single \n because R turns it into an actual newline character, which is exactly what tr wants to delete.)

Ok, so that unpacks most of it.  The last bit is that the intern=TRUE argument (default FALSE) captures the output and stores it in a character vector, one element per line (as indicated in the help file).  So the last little leg of sorcery is to call as.numeric so I can use the numbers for my intended purpose.  That’s it; show’s over.  Go home.
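
Before the full thing, here’s the same trick in miniature on a toy digit string (my own sketch, not from the original post; it assumes GNU sed on a Linux box, same as the real code below):

# Split '12345' into one digit per line, then let intern=TRUE capture
# each line as an element of a character vector.  The \\( \\1 \\n become
# \( \1 \n once R has chewed through the string literal.
digits <- as.numeric(system(
  "echo '12345' | sed -e 's/\\([0-9]\\)/\\1\\n/g' -e 's/.$//'",
  intern=TRUE))
digits
# [1] 1 2 3 4 5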

So why wasn’t this cheating, in my opinion?  It’s a wrapper that’s built into core R–it’s not even an extra package I have to download.  In my mind, this is very clearly in the “not cheating” category, but I have some other solutions that, whenever I get the time to post the writeups, blur the line a bit–especially when I start invoking bc.

If you want to argue that there’s a “more R-y” way to do it, then I’m willing to concede (even though I don’t know how–I just drop down to the shell to do my dirty work and have never been left wanting).  If you think it violates the spirit of Project Euler, then I suggest you take a look at some of the Python kids who practically import entire solutions.

Besides, sed is really, really cool.  Deal with it, nerd.

R Code:

ElapsedTime <- system.time({
##########################
big <- as.numeric(system("echo '
73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450
' | tr -d '\n' | sed -e 's/\\([0-9]\\)/\\1\\n/g' -e 's/.$//'",intern=TRUE))

big.prod <- 1

for (i in c(1:(length(big)-4))){  # windows are big[i:(i+4)], so the last start index is length(big)-4
  test <- prod(big[i:(i+4)])
  if (test > big.prod){
    big.prod <- test
  }
}

answer <- big.prod
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n",
answer, ElapsedMins, ElapsedSecs))

Output:

The answer is:  40824
Total elapsed time:  0 minutes and 0.014000 seconds


Project Euler in R: Problem 7

Problem:  What is the 10001st prime number?

Commentary:  So it’s finally time to discuss my prime-searching algorithm, as promised back in Problem 5.  We will find primes in the following way:

  1. 2 and 3 are prime
  2. We examine the odd integers (having already discovered the only even prime) starting with i=5.
  3. First, check if i is of the form 6k+3 for some integer k>0.  If so, then i can’t be prime, because then i=6k+3=3(2k+1), and since k>0, 2k+1 > 1 (so i\neq 3).  In this case, don’t even bother proceeding; replace i with i+2 and start over.
  4. Calculate s(i)=\sqrt{i}.  Among all known primes below s(i)  (going no further for the same reason as that outlined in Problem 3), check if these primes divide into i.  If we find one that does, then there is no reason to proceed; i is not prime.  Go back to 3. and try again with i replaced with i+2.  If none of the known primes below s(i) divide i, then stop; i must be prime.  Add it to the list of known primes.
  5. Repeat this process by replacing i with i+2 until we have 10001 primes.

It’s nothing fancy, but it works.  As for the implementation, believe it or not, I actually don’t construct a list of length 10001 and insert into it as I gather my primes like a good little R user.  See the post-code commentary for an explanation.

R Code:

ElapsedTime <- system.time({
##########################
primes <- c(2, 3)
i <- 3
while (length(primes) < 10001){
  flag <- 0
  i <- i+2
  if (i%%6 == 3){
    flag <- 1
  }
  if (flag == 0){
    s <- sqrt(i)+1
    for (prime in primes){
      if ((i%%prime == 0)){
        flag <- 1
        break
      }
      if (prime > s){
        break
      }
    }
    if (flag == 0){
      primes <- c(primes, i)
    }
  }
}

answer <- tail(primes, 1)
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n",
answer, ElapsedMins, ElapsedSecs))

Output:
The answer is:  104743
Total elapsed time:  0 minutes and 1.191000 seconds

Post-Code Commentary:  You may notice that I’m not pre-allocating the prime list and then storing new prime values as I go, like a good little R user.  The reason is that it’s actually slower to do it this way.  Now maybe I’m an idiot and I don’t really understand how R does things, but if I take this algorithm and only modify the prime storage part of it to include preallocating the list and then injecting into that list (replacing a 0) as I go, then the time is nearly a full order of magnitude slower.  Here’s the R code that you would produce if you were to do what I just described:

primes <- numeric(10001)
primes[1:2] <- c(2, 3)
i <- 3
prime.position <- 3

while (primes[10001] == 0){
  flag <- 0
  i <- i+2
  if (i%%6 == 3){
    flag <- 1
  }
  if (flag == 0){
    s <- sqrt(i)+1
    for (prime in primes[primes > 0]){
      if ((i%%prime == 0)){
        flag <- 1
        break
      }
      if (prime > s){
        break
      }
    }
    if (flag == 0){
      primes[prime.position] <- i
      prime.position<-prime.position+1
    }
  }
}

answer <- tail(primes, 1)

So how do the two run?  Well, on my work computer (which is a little slower than my home computer–also don’t tell my boss!), the original “bad R user” code runs in 1.852 seconds.  The “better” solution with pre-allocation runs in 11.100 seconds.  That is a 9.248 second differential, which amounts to a nearly 500% increase in run time!

Of course, there are plenty of times when it’s objectively better to pre-allocate; but as it turns out, this isn’t one of them.  The likely culprit isn’t the pre-allocation itself, though: it’s the primes[primes > 0] subset inside the while loop, which rescans (and copies from) the entire 10001-slot vector on every single pass, and that swamps whatever we saved by not growing the vector.
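
For what it’s worth, here’s a sketch (mine, not code from this post, and untimed, so take the performance hand-waving with salt) of a pre-allocated variant that avoids the full-vector rescan by tracking how many slots are actually filled:

# Pre-allocated variant: keep a counter of how many primes are filled in,
# and only ever look at primes[1:n.found] instead of primes[primes > 0].
primes <- numeric(10001)
primes[1:2] <- c(2, 3)
n.found <- 2
i <- 3

while (n.found < 10001){
  flag <- 0
  i <- i+2
  if (i%%6 == 3){
    flag <- 1
  }
  if (flag == 0){
    s <- sqrt(i)+1
    for (prime in primes[1:n.found]){
      if ((i%%prime == 0)){
        flag <- 1
        break
      }
      if (prime > s){
        break
      }
    }
    if (flag == 0){
      n.found <- n.found+1
      primes[n.found] <- i
    }
  }
}

answer <- primes[10001]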


Project Euler in R: Problem 6

Problem:  Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.

Commentary:  I taught a variety of freshman math courses at my university for roughly 5 years.  As such, this problem is very near and dear to my heart.

I can’t tell you how many kids show up at university thinking they’re hot shits because they took Calc 1 in high school, but don’t know that

a^2+b^2\neq (a+b)^2

This is just one of many adorable things freshmen tend to believe, along with all their ridiculous political beliefs and thinking they’ll somehow change the world in any appreciable way.  Adorable.

Actually, students thinking that the sum of the squares is equal to the square of the sum is so common, that it has its own name.  We call it the “freshman dream”.  I’ve also heard it called “Baby’s Binomial Theorem”, which is way funnier, but much less commonly heard.  And since I guess I have to mention it, there are number-like structures out there where the freshman dream is true (fields of characteristic 2 for my algebros), but the real numbers certainly aren’t one of them.

What’s especially funny about this misunderstanding is that it’s literally almost never true (for real numbers).  Pick your favorite two numbers–they don’t even have to be integers.  They can even be the same number.  Unless one (or both) of them is zero, then tada, the sum of the squares isn’t equal to the square of the sum.
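
To see why, it only takes one line.  Expanding the square gives

(a+b)^2 = a^2+2ab+b^2

so the only way a^2+b^2 can equal (a+b)^2 is if 2ab=0, which forces a=0 or b=0 (over the real numbers, anyway).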

So what other crazy (mathematical) things do university students tend to come up with in freshman math class?  Just on the order of basic arithmetic, it’s almost shocking how little mastery they have over the subject.  And we’re not talking about “complicated” things like the limit, integration, or any other worthless calculus crap.  This is arithmetic.  Our university students don’t understand arithmetic.

Just off the top of my head, these are some of the most common misunderstandings, not counting the one mentioned above.  Keep in mind, none of what follows is true; what’s more, they’re not true for practically any numbers you throw at them (try it yourself!).

\sqrt{a^2+b^2}=a+b

-(a+b)=-a+b

\frac{a}{c}+\frac{b}{d} = \frac{a+b}{c+d}

\frac{a+b}{a}=b

\frac{a}{a+b}=\frac{1}{b}

And that’s not even getting into logarithms or trigonometry, which none of them understand.

This has a lot to do with why I gave up teaching for the cushy office life.  I haven’t looked back since.

R Code:

ElapsedTime<-system.time({
##########################
answer <- abs(sum((1:100)^2)-sum((1:100))^2)
##########################
})[3]
ElapsedMins<-floor(ElapsedTime/60)
ElapsedSecs<-(ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))

Output:
The answer is:  25164150
Total elapsed time:  0 minutes and 0.000000 seconds


A Math Degree in the Kitchen

So today I was making some rice, and I decided based on calorie load that I wanted more than 1/3 cup of rice, but less than 1/2.  I decided that 5/12 cup was acceptable, but my only tools were a 1/4 cup scoop and a 1/3 cup scoop.

I didn’t have to think very hard about this; I instantly knew it was possible to get 5/12 cup of rice with these tools, because 4 and 3 are relatively prime–i.e., they share no common factors other than 1.  More on this in a minute, but first, how to do it.

It’s a very uncomplicated idea, and it should look very similar to the Die Hard 3 water puzzle.  In the film, John McClane has a 3 gallon jug and a 5 gallon jug, and wants to get exactly 4 gallons of water.  My rice puzzle is basically the same game.  To get my 5/12 cup of rice, I fill the 1/3 cup scoop and pour that rice into my rice cooker.  Then I again fill the 1/3 scoop and pour that (without spilling) into the 1/4 scoop.  How much remains in the 1/3 scoop?  Well, I had 4/12 cup of rice in it, and I took out 3/12 cup of rice.  So only 1/12 cup remains in the 1/3 cup scoop.  Add the 1/12 to the rice cooker, and since 1/3+1/12=5/12, we’re done.  And although that’s how I did it, I actually didn’t even need to get the rice cooker involved.  I can form 5/12 cups inside the scoops without having to use another vessel (though the thing holding all my rice is allowed).  I could have filled the 1/3 cup scoop, poured that into the 1/4 cup scoop, emptied the 1/4 cup scoop, poured the 1/12 cup in the 1/3 cup scoop into the 1/4 cup scoop, then filled the 1/3 cup scoop.  Tada!

I warned you that it was very uncomplicated; the above is really not very interesting on any level.  What is interesting is why this works, and how I knew so quickly that it must be possible to do it.  If you don’t know why, then all that stuff I just wrote might seem like a cute curiosity, when really there is a very basic underlying principle involved.  Instead of needing to ask “I wonder if it’s possible to get 1/6 cups of rice with those tools”, I can instantly say that the answer is “yes”.  I didn’t have to think about it at all.  Here’s how it works.

Instead of thinking of the scoops as 1/4 and 1/3 cup, I’ll do you a favor and eliminate the fractions for you.  Let’s think of them as 3 and 4 serving scoops (out of 12), respectively.  Believe it or not, we’re almost done.  As it turns out, if the greatest common divisor (written gcd) of these two values (in this case 3 and 4 from 1/4 cup and 1/3 cup scoops, respectively) is 1, then it’s possible to get any of 1/12, 2/12, 3/12, 4/12, 5/12, 6/12, or 7/12 cups of rice (maxing out at 3+4=7, because that’s the maximum amount of stuff our two scoops can hold).  Said another way, I can get 1, 2, 3, 4, 5, 6, or 7 servings (out of 12) in this fashion.  Why?  Because if gcd(3,4)=1 (and thank goodness that’s the case), then there exist integers x and y satisfying

3x+4y=1

But that isn’t what I wanted!  I wanted something like 3a+4b=5.  But this is easy.  Multiply the above equation on both sides by 5 and we get

5(3x+4y)=3(5x)+4(5y)=5

So just call 5x=a and 5y=b.  So it must be possible.
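
If you’d rather let R do the number theory, a quick brute-force check (a toy sketch of mine, not part of the original post) over small integer combinations of the two scoop sizes, measured in twelfths of a cup, turns up the workable recipes:

# Which small integer combinations of a 1/4 cup scoop (3 twelfths) and a
# 1/3 cup scoop (4 twelfths) land on exactly 5 twelfths?  Negative counts
# can be read as "pour that scoop's worth back out."
combos <- expand.grid(quarter.scoops=-5:5, third.scoops=-5:5)
combos[3*combos$quarter.scoops + 4*combos$third.scoops == 5, ]

Among the hits are two 1/3 scoops minus one 1/4 scoop (8/12 - 3/12 = 5/12), which is exactly the pour-off trick described above, and three 1/4 scoops minus one 1/3 scoop (9/12 - 4/12 = 5/12).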

And people say that pure math has no practical application.