R - Filter a data frame to retain rows that meet certain criteria
I have the following data frame that I am trying to filter. I want to retain the rows where at least one value in the row is greater than 0.5. Any help is appreciated. I tried the following, but my system hangs:
    gbpre.mat <- as.matrix(gbpre)
    ind <- apply(gbpre.mat, 1, function(gbpre.mat) any(gbpre > 0.5))

               ph1544_pre ph1545_pre ph1565_pre ph1571_pre ph1612_pre  ph1616_pre
    bg00050873 0.88235087  0.6053853  0.6521263  0.2770632 0.82596713 0.635325831
    bg00212031 0.01175069  0.1844859  0.4345596  0.2186097 0.03717635 0.670305781
    bg00213748 0.64571987  0.7316865  0.4345596  0.5613724 0.81309068 0.900878028
    bg00214611 0.04405524  0.7103071  0.6810916  0.6526317 0.03412550 0.008187867
    bg00455876 0.72122206  0.1272784  0.2155168  0.4794622 0.70089805 0.668497074
    bg01707559 0.03592823  0.3548602  0.2743443  0.2194279 0.57761264 0.061564411
The reason your definition of ind does not work is that, inside the function you pass to apply, you are not using the argument of the function but rather the whole of gbpre. If the matrix is large, this can be very slow, because for each of the many rows of the matrix the entire large matrix is checked.
To be more specific, this is your definition:
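    ind <- apply(gbpre.mat, 1, function(gbpre.mat) any(gbpre > 0.5))

You use apply on rows, which is fine. Then you define a function of one argument. The argument is called gbpre.mat, which is possible, but I recommend that you don't give it the same name as the variable you want to pass to the function. This avoids confusion. More importantly, the body of the function does not use gbpre.mat at all, so the result of the function is independent of its input. This is not what you want.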
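The effect is easy to reproduce on a small scale. The following is a minimal sketch using a made-up 3 x 2 matrix `m` (hypothetical, not the data from the question): the first function ignores its argument `x` and inspects the whole of `m`, while the second one actually uses `x`.

```r
# Hypothetical toy matrix, filled column by column:
# row 1 = (0.9, 0.3), row 2 = (0.1, 0.4), row 3 = (0.2, 0.1)
m <- matrix(c(0.9, 0.1, 0.2, 0.3, 0.4, 0.1), ncol = 2)

# The argument x is ignored; every call checks the whole of m
wrong <- apply(m, 1, function(x) any(m > 0.5))

# The argument x (one row per call) is actually used
right <- apply(m, 1, function(x) any(x > 0.5))

print(wrong)
print(right)
```

Only the first row contains a value above 0.5, yet `wrong` flags every row, because each call answers the same question about the whole matrix.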
So you should rather use the following:
    ind <- apply(gbpre.mat, 1, function(gb) any(gb > 0.5))

This works, but what thelatemail has suggested is much faster. Let me show this with an example. First, I create a large sample matrix:
    set.seed(1435)
    gbpre.mat <- matrix(runif(600000, 0, 0.7), ncol = 6)
    head(gbpre.mat)
    ##            [,1]        [,2]       [,3]       [,4]         [,5]       [,6]
    ## [1,] 0.34588950 0.548891207 0.14621109 0.64827636 0.2132974880 0.08318449
    ## [2,] 0.08258421 0.504511182 0.15966061 0.65975977 0.0009340659 0.18353030
    ## [3,] 0.01970881 0.004321273 0.51373098 0.58779409 0.1166218414 0.55205101
    ## [4,] 0.16150403 0.134012891 0.19062268 0.68766140 0.4341565775 0.46083298
    ## [5,] 0.32099279 0.371436278 0.13317573 0.02674299 0.4670175053 0.47581938
    ## [6,] 0.50144544 0.579256903 0.03034916 0.56547615 0.0091638700 0.42943656

Now I use both ways to find the rows where at least one number is larger than 0.5, and measure the time:
    system.time(ind <- apply(gbpre.mat, 1, function(gb) any(gb > 0.5)))
    ##    user  system elapsed
    ##   0.218   0.008   0.228
    system.time(ind2 <- rowSums(gbpre.mat > 0.5) > 0)
    ##    user  system elapsed
    ##   0.008   0.000   0.008

There is a clear winner here: the rowSums version is more than 25 times faster. And the results are identical:
    identical(ind, ind2)
    ## [1] TRUE

I want to add a clarification on why your code was so slow. Let me run your definition of ind on only the first 600 rows of the matrix:
    system.time(ind3 <- apply(gbpre.mat[1:600, ], 1, function(gb) any(gbpre.mat > 0.5)))
    ##    user  system elapsed
    ##   3.011   0.461   3.479

You see that I use the whole matrix gbpre.mat inside the function. Running this on only 600 rows takes about 3.5 seconds, so at that rate the calculation for all 100,000 rows of the matrix would take roughly ten minutes. And the result is wrong: it is a vector of TRUE only, because you checked many times whether there is a single value larger than 0.5 somewhere in the entire matrix.
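Once the logical index is computed, it can be used to subset the rows, which is the filtering step the question asks for. Here is a minimal sketch, assuming a small made-up data frame `df` that contains only numeric columns (the names `df`, `a`, `b` and `r1` to `r4` are hypothetical). Comparing a numeric data frame with `>` yields a logical matrix, so the conversion with as.matrix is not strictly needed here.

```r
# Hypothetical data frame with numeric columns only
df <- data.frame(a = c(0.9, 0.1, 0.2, 0.6),
                 b = c(0.3, 0.4, 0.1, 0.7),
                 row.names = c("r1", "r2", "r3", "r4"))

# df > 0.5 is a logical matrix; rowSums counts the TRUE values per row
keep <- rowSums(df > 0.5) > 0

# Retain only the rows with at least one value above 0.5
filtered <- df[keep, ]
print(filtered)
```

If the data may contain missing values, passing `na.rm = TRUE` to rowSums is one way to keep NA entries from turning the index itself into NA.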