Linear regression is a basic tool. It works on the assumption that there exists a linear relationship between the dependent and independent variable, also known as the explanatory variables and output. However, not all problems have such a linear relationship. In fact, many of the problems we see today are nonlinear in nature. A very basic example is our own decision making process which involves deciding an outcome based on various questions. For example, when we decide to have dinner, our thought process is not linear. It is based a combination of our tastes, our budget, our past experiences with a restaurant, alternatives available, weather conditions etc. There can be other simple nonlinear cases such as quadratic or exponential dependencies which are not too difficult to imagine. This is how non-linear regression came into practice – a powerful alternative to linear regression for nonlinear situations. Similar to linear regression, nonlinear regression draws a line through the set of available data points in such a way that the line fits to the data with the only difference that the line is not a straight line or in other words, not linear.

Want to know more about the latest trends in data? On November 25th-26th 2019, Data Natives conference brings together a global community of data-driven pioneers. Get your ticket now at a discounted Early Bird price!

Non-linear Regression – An Illustration

In R, we have lm() function for linear regression while nonlinear regression is supported by nls() function which is an abbreviation for nonlinear least squares function. To apply nonlinear regression, it is very important to know the relationship between the variables. Looking at the data, one should be able to determine the generalized equation of the model which will fit the data. This model is then specified as the ‘formula’ parameter in nls() function. The function then determines the coefficients of the parameters in the model. Let’s try linear and nonlinear regression models on an exponential data. I will use the runif() function to generate an exponential set of values for y. Here I will use x as a sequence from 0 to 100.

I will also use a set.seed() function so that the values are regenerated for you.

#set a seed value

set.seed(23)

#Generate x as 100 integers using seq function

x<-seq(0,100,1)

#Generate y as a*e^(bx)+c

y<-runif(1,0,20)*exp(runif(1,0.005,0.075)*x)+runif(101,0,5)

#How does our data look like? Lets plot it

plot(x,y)

This seems a fairly smooth non-linear plot. To illustrate the difference between linear and nonlinear models, let’s fit them both:

#Linear model

lin_mod=lm(y~x)

#Plotting the model

plot(x,y)

abline(lin_mod)

There is little overlap between the actual values and the fitted plot. Now let’s try the nonlinear model and specify the formula

nonlin_mod=nls(y~a*exp(b*x),start=list(a=13,b=0.1)) #a is the starting value and b is the exponential start

#This new plot can be made by using the lines() function

plot(x,y)

lines(x,predict(nonlin_mod),col=”red”)

This is a much better fit and clearly passes through most of the data. For more clarity, we will now calculate the errors for both the models

#Error calculation

error <- lin_mod\$residuals

lm_error <- sqrt(mean(error^2))   #5.960544

error2=y-predict(nonlin_mod)

nlm_error <- sqrt(mean(error2^2)) #1.527064

The linear model has more than twice the error than that of nonlinear one. This shows that the nonlinear model fits better for nonlinear data.

Understanding the nls() function

There are a few parameters that the nls() function requires. I used two parameters to define the model in the above illustration – the formula and the start parameters. Nonlinear function requires us to look at the data first and estimate the model to fit in. This estimated model is specified as the formula parameter. We can also specify the coefficients as variables to be estimated. The next step involves specifying the start parameter. This parameter specifies the starting values of the coefficients we used in the formula. Here we have ‘a’ and ‘b’ as the coefficients. I took ‘a’ as the nearest integer to minimum value of y (which is approximately 13.19) and ‘b’ as the increment for the exponent. Using these values, the nls() function determines the optimal values of ‘a’ and ‘b’. It is very important to set the right starting parameter values otherwise the model may give us absurd results or even fail. Let’s see what are the estimated values of ‘a’ and ‘b’ for this dataset:

nonlin_mod

Nonlinear regression model

model: y ~ a * exp(b * x)

data: parent.frame()

a        b

13.60391  0.01911

residual sum-of-squares: 235.5

Number of iterations to convergence: 15

Achieved convergence tolerance: 4.975e-07

The values for ‘a’ and ‘b’ estimated for this model are 13.60391 and 0.01911 respectively which are very close to those we provided as starting values. This shows that that the model estimated by the nls() function is y=13.60391*e^(0.01911*x). Further, the estimated values of ‘a’ and ‘b’ are very close to the starting values we provided. These results will remain the same if we keep ‘b’ as 0.01 or even 0.001 or keep ‘a’ as 10 or 100 or 1000. As long as the model is able to converge at the optimal estimation, some approximation is admissible. However, if the values of ‘a’ and ‘b’ are completely out of range, say 1 and 1, we get an error as the model fails. The right set of starting values need to be estimated by looking at the data before implementing the model.

Self-Starting Functions

The problem arises when one is beginning with nonlinear functions and does not know what value should be estimated for the parameters. To illustrate this problem, I will now use a non-linear dataset available in R. The Puromycin data shows the concentration and reaction rate for enzymatic reaction of Puromycin antibiotic. I will plot the data to understand the data and estimate the formula equation

attach(Puromycin)

plot(Puromycin\$conc,Puromycin\$rate)

This data is specific to biological reactions and can be estimated using the famous enzyme kinetics equation known as the Michaelis-Menten equation. For this, we will separate the dataset based on whether the state is “treated” or “untreated” and define a function for the equation

#Define a function to apply Michaelis-Menten equation

mm=function(conc,vmax,k) vmax*conc/(k+conc)

#Use the nls data over the first subset of treated data. I will set the starting values as 50 and 0.05

mm1=nls(rate~mm(conc,vmax,k),data=Puromycin,start=c(vmax=50,k=0.05),subset=state==”treated”)

#Use a similar function for the second subset of untreated data

mm2=nls(rate~mm(conc,vmax,k),data=Puromycin,start=c(vmax=50,k=0.05),subset=state==”untreated”)

Both the models, mm1 and mm2 make good estimations of the data and fit the model. However, it is hard to estimate the starting values looking at the plot of Puromycin conc. vs rate. The Puromycin concentration vs rate plot suggested that the minimum conc. on the x- axis is around 0.01 and the maximum rate (vmax) on the y-axis is around 200 yet I purposely used values which are very different from these estimations  so that the model will fit while converging slowly. In this case, I am taking a risk on the estimation ability of the model. This is where “self-starting” functions come into the picture. As the name suggests, a self-starting function does not need a starting value for the parameters and do estimate themselves. We can rewrite the above two functions using the SSmicmen function which is a self starting function for Michaelis-Menten equation. The new models are:

mm3=nls(rate~SSmicmen(conc,vmax,k),data=Puromycin,subset=state==”treated”)

mm4=nls(rate~SSmicmen(conc,vmax,k),data=Puromycin,subset=state==”untreated”)

Let us compare the corresponding models by calling the model variables in R. We will first look at the models with state=”treated” which are mm1 and mm3 and compare the vmax and k values. We will then compare the models with state=”untreated” which are mm2 and mm4.:

#Print the model summary and estimated parameters for mm1

mm1

Nonlinear regression model

model: rate ~ mm(conc, vmax, k)

data: Puromycin

vmax         k

212.68369   0.06412

residual sum-of-squares: 1195

Number of iterations to convergence: 7

Achieved convergence tolerance: 2.703e-06

#Print the model summary and estimated parameters for mm3

mm3

Nonlinear regression model

model: rate ~ SSmicmen(conc, vmax, k)

data: Puromycin

vmax         k

212.68371   0.06412

residual sum-of-squares: 1195

Number of iterations to convergence: 0

Achieved convergence tolerance: 1.937e-06

#Print the model summary and estimated parameters for mm2

mm2

Nonlinear regression model

model: rate ~ mm(conc, vmax, k)

data: Puromycin

vmax         k

160.28001   0.04771

residual sum-of-squares: 859.6

Number of iterations to convergence: 7

Achieved convergence tolerance: 2.039e-06

#Print the model summary and estimated parameters for mm4

mm4

Nonlinear regression model

model: rate ~ SSmicmen(conc, vmax, k)

data: Puromycin

vmax         k

160.28012   0.04771

residual sum-of-squares: 859.6

Number of iterations to convergence: 5

Achieved convergence tolerance: 3.942e-06

The corresponding models have estimated the same coefficients up to the third decimal. This shows that self-starting functions fairly well in place of functions where I need to define the start parameters. The big limitation of estimating the starting parameters can be avoided using the self-starting functions. R has many self-starting functions available. A list of the same can be obtained by using the apropos function:

apropos(“^SS”)

 “SSasymp”     “SSasympOff”  “SSasympOrig” “SSbiexp”     “SSD”

 “SSfol”       “SSfpl”       “SSgompertz”  “SSlogis”     “SSmicmen”

 “SSweibull”

With the exception of the SSD function, there are 10 self-starting functions here in R. The final step in the model. I have given a brief description of what all these functions are defined for (in alphabetical order)

1. SSasymp       asymptotic regression models
1. SSasympOff    asymptotic regression models with an offset
1. SSasympOrig   asymptotic regression models through the origin
1. SSbiexp       biexponential models
1. SSfol         first-order compartment models
1. SSfpl         four-parameter logistic models
1. SSgompertz    Gompertz growth models
1. SSlogis       logistic models
1. SSmicmen      Michaelis–Menten models
1. SSweibull     Weibull growth curve models

Goodness of Fit

As an additional verification step, I will also check the goodness of fit of the model. This can be done by looking that the correlation between the values predicted by the model and the actual y values.

#Goodness of fit for first nonlinear function

cor(y,predict(nonlin_mod)) #0.9976462

#Goodness of fit for treated values of Puromycin function

cor(subset(Puromycin\$rate,state==”treated”),predict(mm3)) #0.9817072

cor(subset(Puromycin\$rate,state==”treated”),predict(mm1)) #0.9817072

#Goodness of fit for untreated values of Puromycin function

cor(subset(Puromycin\$rate,state==”untreated”),predict(mm2)) #0.9699776

cor(subset(Puromycin\$rate,state==”untreated”),predict(mm4)) #0.9699777

All our models have a high correlation value which indicates that the values are very close to each other and accurate. The corresponding model summary and estimation parameters also show the same observation.

Summary

Regression is a fundamental technique to estimate the relationships among variables and nonlinear regression is a handy technique if that relationship is nonlinear. It is similar to linear regression and provides a powerful method to fit a nonlinear curve based on the estimated formula while minimizing the error using nonlinear least squares method. There are a variety of other nonlinear models available such as SVM and Decision trees. Nonlinear regression is a robust technique over such models because it provides a parametric equation to explain the data. As the models becomes complex, nonlinear regression becomes less accurate over the data. This article gives an overview of the basics of nonlinear regression and understand the concepts by application of the concepts in R. Here is the complete R code used in the article.

#set a seed value

set.seed(23)

#Generate x as 100 integers using seq function

x<-seq(0,100,1)

#Generate y as a*e^(bx)+c

y<-runif(1,0,20)*exp(runif(1,0.005,0.075)*x)+runif(101,0,5)

#How does our data look like? Lets plot it

plot(x,y)

#Linear model

lin_mod=lm(y~x)

#Plotting the model

plot(x,y)

abline(lin_mod)

nonlin_mod=nls(y~a*exp(b*x),start=list(a=13,b=0.1)) #a is the starting value and b is the exponential start

#This new plot can be made by using the lines() function

plot(x,y)

lines(x,predict(nonlin_mod),col=”red”)

#Error calculation

error <- lin_mod\$residuals

lm_error <- sqrt(mean(error^2))   #5.960544

error2=y-predict(nonlin_mod)

nlm_error <- sqrt(mean(error2^2)) #1.527064

nonlin_mod

attach(Puromycin)

plot(Puromycin\$conc,Puromycin\$rate)

#Define a function to apply Michaelis-Menten equation

mm=function(conc,vmax,k) vmax*conc/(k+conc)

#Use the nls data over the first subset of treated data. I will set the starting values as 50 and 0.05

mm1=nls(rate~mm(conc,vmax,k),data=Puromycin,start=c(vmax=50,k=0.05),subset=state==”treated”)

#Use a similar function for the second subset of untreated data

mm2=nls(rate~mm(conc,vmax,k),data=Puromycin,start=c(vmax=50,k=0.05),subset=state==”untreated”)

mm3=nls(rate~SSmicmen(conc,vmax,k),data=Puromycin,subset=state==”treated”)

mm4=nls(rate~SSmicmen(conc,vmax,k),data=Puromycin,subset=state==”untreated”)

#Print the model summary and estimated parameters for mm1

mm1

#Print the model summary and estimated parameters for mm3

mm3

#Print the model summary and estimated parameters for mm2

mm2

#Print the model summary and estimated parameters for mm4

mm4

#Print the names of all functions in R which start with SS

apropos(“^SS”)

#Goodness of fit for first nonlinear function

cor(y,predict(nonlin_mod)) #0.9976462

#Goodness of fit for treated values of Puromycin function

cor(subset(Puromycin\$rate,state==”treated”),predict(mm3)) #0.9817072

cor(subset(Puromycin\$rate,state==”treated”),predict(mm1)) #0.9817072

#Goodness of fit for untreated values of Puromycin function

cor(subset(Puromycin\$rate,state==”untreated”),predict(mm2)) #0.9699776

cor(subset(Puromycin\$rate,state==”untreated”),predict(mm4)) #0.9699777