R - Filter a data frame to retain rows that meet certain criteria


I have the following data frame that I am trying to filter. I want to retain the rows where at least one value in the row is greater than 0.5. Any help is appreciated. I tried the following, but the system hangs:

gbpre.mat <- as.matrix(gbpre)
ind <- apply(gbpre.mat, 1, function(gbpre.mat) any(gbpre > 0.5))

           ph1544_pre ph1545_pre ph1565_pre ph1571_pre ph1612_pre  ph1616_pre
bg00050873 0.88235087  0.6053853  0.6521263  0.2770632 0.82596713 0.635325831
bg00212031 0.01175069  0.1844859  0.4345596  0.2186097 0.03717635 0.670305781
bg00213748 0.64571987  0.7316865  0.4345596  0.5613724 0.81309068 0.900878028
bg00214611 0.04405524  0.7103071  0.6810916  0.6526317 0.03412550 0.008187867
bg00455876 0.72122206  0.1272784  0.2155168  0.4794622 0.70089805 0.668497074
bg01707559 0.03592823  0.3548602  0.2743443  0.2194279 0.57761264 0.061564411

The reason your definition of ind does not work is that, inside the function passed to apply, you are not using the argument of the function but rather the whole of gbpre. If the matrix is large, this can be very slow, because for each of the many rows of the matrix the entire large matrix is checked.

To be more specific, this is your definition:

ind <- apply(gbpre.mat, 1, function(gbpre.mat) any(gbpre > 0.5)) 

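You use apply on the rows, which is fine. Then you define a function of one argument. The argument is called gbpre.mat, which is possible, but I recommend that you don't use the same name as the variable you want to pass to the function, to avoid confusion. More importantly, the function never uses gbpre.mat, so the result of the function is independent of its input. This is not what you want.

So you should rather use the following: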
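To make the point concrete, here is a minimal sketch on a made-up 3 x 2 matrix called m (a hypothetical toy example, not the data from the question). Because the function body looks at the outer object rather than at its argument, every row gets the same answer, even though only the first row contains a value above 0.5.

```r
# Hypothetical toy matrix; only the first row has a value above 0.5
m <- matrix(c(0.9, 0.1,
              0.2, 0.3,
              0.4, 0.1), ncol = 2, byrow = TRUE)

# The argument `row` is never used; the body checks the whole matrix `m`
ignores_argument <- apply(m, 1, function(row) any(m > 0.5))
print(ignores_argument)
## [1] TRUE TRUE TRUE
```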

ind <- apply(gbpre.mat, 1, function(gb) any(gb > 0.5)) 
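The logical vector can then be used as a row index to do the actual filtering. The sketch below assumes a small made-up data frame called toy in place of gbpre; the names toy, s1, s2 and the probe_* row names are hypothetical.

```r
# Hypothetical toy data frame standing in for gbpre
toy <- data.frame(s1 = c(0.9, 0.2, 0.4),
                  s2 = c(0.1, 0.3, 0.7),
                  row.names = c("probe_a", "probe_b", "probe_c"))
toy.mat <- as.matrix(toy)

# TRUE for every row that has at least one value above 0.5
keep <- apply(toy.mat, 1, function(r) any(r > 0.5))

# Logical row index; the empty column index keeps all columns
filtered <- toy[keep, ]
print(rownames(filtered))
## [1] "probe_a" "probe_c"
```

If the data frame might have only one column, adding drop = FALSE to the subsetting call keeps the result a data frame.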

This works, but what thelatemail has suggested is much faster. Let me show this with an example. First, I create a large sample matrix:

set.seed(1435)
gbpre.mat <- matrix(runif(600000, 0, 0.7), ncol = 6)
head(gbpre.mat)
##            [,1]        [,2]       [,3]       [,4]         [,5]       [,6]
## [1,] 0.34588950 0.548891207 0.14621109 0.64827636 0.2132974880 0.08318449
## [2,] 0.08258421 0.504511182 0.15966061 0.65975977 0.0009340659 0.18353030
## [3,] 0.01970881 0.004321273 0.51373098 0.58779409 0.1166218414 0.55205101
## [4,] 0.16150403 0.134012891 0.19062268 0.68766140 0.4341565775 0.46083298
## [5,] 0.32099279 0.371436278 0.13317573 0.02674299 0.4670175053 0.47581938
## [6,] 0.50144544 0.579256903 0.03034916 0.56547615 0.0091638700 0.42943656

Now I use both ways to find the rows with at least one number larger than 0.5, and measure the time:

system.time(ind <- apply(gbpre.mat, 1, function(gb) any(gb > 0.5)))
##    user  system elapsed
##   0.218   0.008   0.228
system.time(ind2 <- rowSums(gbpre.mat > 0.5) > 0)
##    user  system elapsed
##   0.008   0.000   0.008

There is a clear winner here, and the results are identical:

identical(ind, ind2)
## [1] TRUE
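The rowSums approach can also be applied to a numeric data frame directly, without converting it to a matrix first, because comparing a data frame with a number returns a logical matrix. The following is only a sketch on a made-up data frame (toy, s1 and s2 are hypothetical names). The na.rm = TRUE argument is an addition for data with missing values, which the question does not mention.

```r
# Hypothetical toy data frame with one missing value
toy <- data.frame(s1 = c(0.9, 0.2, NA),
                  s2 = c(0.1, 0.3, 0.4))

# Comparing a numeric data frame with a number gives a logical matrix
above <- toy > 0.5

# na.rm = TRUE counts a missing comparison as "not above the threshold"
keep <- rowSums(above, na.rm = TRUE) > 0

# drop = FALSE keeps the result a data frame even with a single column
filtered <- toy[keep, , drop = FALSE]
print(nrow(filtered))
## [1] 1
```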

I want to add a clarification on why your code was so slow. Let me run your definition of ind on the first 600 rows of the matrix:

system.time(ind3 <- apply(gbpre.mat[1:600, ], 1, function(gb) any(gbpre.mat > 0.5)))
##    user  system elapsed
##   3.011   0.461   3.479

You can see that I use the whole matrix gbpre.mat inside the function. Running this on only 600 rows takes 3.5 seconds, so the calculation for the entire matrix of 100,000 rows would take on the order of ten minutes if the time scales linearly with the number of rows. And the result is wrong: it is a vector of TRUE only, because what is checked, many times over, is whether there is a single value larger than 0.5 somewhere in the entire matrix.

