
Adding SE-ResNext and ResNext / PyT

Przemek Strzelczyk 6 years ago
parent
commit
5562ab767a
93 changed files with 3490 additions and 692 deletions
  1. 5 0
      PyTorch/Classification/ConvNets/Dockerfile
  2. 0 0
      PyTorch/Classification/ConvNets/LICENSE
  3. 1000 0
      PyTorch/Classification/ConvNets/LOC_synset_mapping.json
  4. 51 0
      PyTorch/Classification/ConvNets/README.md
  5. 42 0
      PyTorch/Classification/ConvNets/checkpoint2model.py
  6. 94 0
      PyTorch/Classification/ConvNets/classify.py
  7. 20 0
      PyTorch/Classification/ConvNets/image_classification/__init__.py
  8. 99 17
      PyTorch/Classification/ConvNets/image_classification/dataloaders.py
  9. 29 2
      PyTorch/Classification/ConvNets/image_classification/logger.py
  10. 13 0
      PyTorch/Classification/ConvNets/image_classification/mixup.py
  11. 354 0
      PyTorch/Classification/ConvNets/image_classification/resnet.py
  12. 13 1
      PyTorch/Classification/ConvNets/image_classification/smoothing.py
  13. 35 6
      PyTorch/Classification/ConvNets/image_classification/training.py
  14. 29 0
      PyTorch/Classification/ConvNets/image_classification/utils.py
  15. 0 0
      PyTorch/Classification/ConvNets/img/.gitkeep
  16. BIN
      PyTorch/Classification/ConvNets/img/ACCvsFLOPS.png
  17. BIN
      PyTorch/Classification/ConvNets/img/LATvsTHR.png
  18. 52 18
      PyTorch/Classification/ConvNets/main.py
  19. 71 0
      PyTorch/Classification/ConvNets/multiproc.py
  20. 602 0
      PyTorch/Classification/ConvNets/resnet50v1.5/README.md
  21. BIN
      PyTorch/Classification/ConvNets/resnet50v1.5/img/loss_plot.png
  22. BIN
      PyTorch/Classification/ConvNets/resnet50v1.5/img/top1_plot.png
  23. BIN
      PyTorch/Classification/ConvNets/resnet50v1.5/img/top5_plot.png
  24. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_AMP_250E.sh
  25. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_AMP_50E.sh
  26. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_AMP_90E.sh
  27. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP16_250E.sh
  28. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP16_50E.sh
  29. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP16_90E.sh
  30. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP32_250E.sh
  31. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP32_50E.sh
  32. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP32_90E.sh
  33. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_AMP_250E.sh
  34. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_AMP_50E.sh
  35. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_AMP_90E.sh
  36. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP16_250E.sh
  37. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP16_50E.sh
  38. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP16_90E.sh
  39. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP32_250E.sh
  40. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP32_50E.sh
  41. 1 0
      PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP32_90E.sh
  42. 476 0
      PyTorch/Classification/ConvNets/resnext101-32x4d/README.md
  43. BIN
      PyTorch/Classification/ConvNets/resnext101-32x4d/img/ResNeXtArch.png
  44. BIN
      PyTorch/Classification/ConvNets/resnext101-32x4d/img/loss_plot.png
  45. BIN
      PyTorch/Classification/ConvNets/resnext101-32x4d/img/top1_plot.png
  46. BIN
      PyTorch/Classification/ConvNets/resnext101-32x4d/img/top5_plot.png
  47. 1 0
      PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_250E.sh
  48. 1 0
      PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_90E.sh
  49. 1 0
      PyTorch/Classification/ConvNets/resnext101-32x4d/training/FP32/DGX1_RNXT101-32x4d_FP32_250E.sh
  50. 1 0
      PyTorch/Classification/ConvNets/resnext101-32x4d/training/FP32/DGX1_RNXT101-32x4d_FP32_90E.sh
  51. 476 0
      PyTorch/Classification/ConvNets/se-resnext101-32x4d/README.md
  52. BIN
      PyTorch/Classification/ConvNets/se-resnext101-32x4d/img/SEArch.png
  53. BIN
      PyTorch/Classification/ConvNets/se-resnext101-32x4d/img/loss_plot.png
  54. BIN
      PyTorch/Classification/ConvNets/se-resnext101-32x4d/img/top1_plot.png
  55. BIN
      PyTorch/Classification/ConvNets/se-resnext101-32x4d/img/top5_plot.png
  56. 1 0
      PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_250E.sh
  57. 1 0
      PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_90E.sh
  58. 1 0
      PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/FP32/DGX1_SE-RNXT101-32x4d_FP32_250E.sh
  59. 1 0
      PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/FP32/DGX1_SE-RNXT101-32x4d_FP32_90E.sh
  60. 0 8
      PyTorch/Classification/RN50v1.5/Dockerfile
  61. 0 311
      PyTorch/Classification/RN50v1.5/README.md
  62. 0 4
      PyTorch/Classification/RN50v1.5/examples/RN50_FP16_1GPU.sh
  63. 0 4
      PyTorch/Classification/RN50v1.5/examples/RN50_FP16_4GPU.sh
  64. 0 4
      PyTorch/Classification/RN50v1.5/examples/RN50_FP16_8GPU.sh
  65. 0 4
      PyTorch/Classification/RN50v1.5/examples/RN50_FP16_EVAL.sh
  66. 0 3
      PyTorch/Classification/RN50v1.5/examples/RN50_FP16_INFERENCE_BENCHMARK.sh
  67. 0 4
      PyTorch/Classification/RN50v1.5/examples/RN50_FP32_1GPU.sh
  68. 0 4
      PyTorch/Classification/RN50v1.5/examples/RN50_FP32_4GPU.sh
  69. 0 4
      PyTorch/Classification/RN50v1.5/examples/RN50_FP32_8GPU.sh
  70. 0 4
      PyTorch/Classification/RN50v1.5/examples/RN50_FP32_EVAL.sh
  71. 0 3
      PyTorch/Classification/RN50v1.5/examples/RN50_FP32_INFERENCE_BENCHMARK.sh
  72. 0 7
      PyTorch/Classification/RN50v1.5/image_classification/__init__.py
  73. 0 271
      PyTorch/Classification/RN50v1.5/image_classification/resnet.py
  74. BIN
      PyTorch/Classification/RN50v1.5/img/DGX2_250_loss.png
  75. BIN
      PyTorch/Classification/RN50v1.5/img/DGX2_250_top1.png
  76. BIN
      PyTorch/Classification/RN50v1.5/img/DGX2_250_top5.png
  77. BIN
      PyTorch/Classification/RN50v1.5/img/training_accuracy.png
  78. BIN
      PyTorch/Classification/RN50v1.5/img/training_loss.png
  79. BIN
      PyTorch/Classification/RN50v1.5/img/validation_accuracy.png
  80. 0 0
      PyTorch/Classification/RN50v1.5/resnet50v1.5/README.md
  81. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP16_250E.sh
  82. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP16_50E.sh
  83. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP16_90E.sh
  84. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP32_250E.sh
  85. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP32_50E.sh
  86. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP32_90E.sh
  87. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP16_250E.sh
  88. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP16_50E.sh
  89. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP16_90E.sh
  90. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP32_250E.sh
  91. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP32_50E.sh
  92. 0 1
      PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP32_90E.sh
  93. 3 1
      README.md

+ 5 - 0
PyTorch/Classification/ConvNets/Dockerfile

@@ -0,0 +1,5 @@
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.07-py3
+FROM ${FROM_IMAGE_NAME}
+
+ADD . /workspace/rn50
+WORKDIR /workspace/rn50

+ 0 - 0
PyTorch/Classification/RN50v1.5/LICENSE → PyTorch/Classification/ConvNets/LICENSE


+ 1000 - 0
PyTorch/Classification/ConvNets/LOC_synset_mapping.json

@@ -0,0 +1,1000 @@
+["tench, Tinca tinca",
+"goldfish, Carassius auratus",
+"great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias",
+"tiger shark, Galeocerdo cuvieri",
+"hammerhead, hammerhead shark",
+"electric ray, crampfish, numbfish, torpedo",
+"stingray",
+"cock",
+"hen",
+"ostrich, Struthio camelus",
+"brambling, Fringilla montifringilla",
+"goldfinch, Carduelis carduelis",
+"house finch, linnet, Carpodacus mexicanus",
+"junco, snowbird",
+"indigo bunting, indigo finch, indigo bird, Passerina cyanea",
+"robin, American robin, Turdus migratorius",
+"bulbul",
+"jay",
+"magpie",
+"chickadee",
+"water ouzel, dipper",
+"kite",
+"bald eagle, American eagle, Haliaeetus leucocephalus",
+"vulture",
+"great grey owl, great gray owl, Strix nebulosa",
+"European fire salamander, Salamandra salamandra",
+"common newt, Triturus vulgaris",
+"eft",
+"spotted salamander, Ambystoma maculatum",
+"axolotl, mud puppy, Ambystoma mexicanum",
+"bullfrog, Rana catesbeiana",
+"tree frog, tree-frog",
+"tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui",
+"loggerhead, loggerhead turtle, Caretta caretta",
+"leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea",
+"mud turtle",
+"terrapin",
+"box turtle, box tortoise",
+"banded gecko",
+"common iguana, iguana, Iguana iguana",
+"American chameleon, anole, Anolis carolinensis",
+"whiptail, whiptail lizard",
+"agama",
+"frilled lizard, Chlamydosaurus kingi",
+"alligator lizard",
+"Gila monster, Heloderma suspectum",
+"green lizard, Lacerta viridis",
+"African chameleon, Chamaeleo chamaeleon",
+"Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis",
+"African crocodile, Nile crocodile, Crocodylus niloticus",
+"American alligator, Alligator mississipiensis",
+"triceratops",
+"thunder snake, worm snake, Carphophis amoenus",
+"ringneck snake, ring-necked snake, ring snake",
+"hognose snake, puff adder, sand viper",
+"green snake, grass snake",
+"king snake, kingsnake",
+"garter snake, grass snake",
+"water snake",
+"vine snake",
+"night snake, Hypsiglena torquata",
+"boa constrictor, Constrictor constrictor",
+"rock python, rock snake, Python sebae",
+"Indian cobra, Naja naja",
+"green mamba",
+"sea snake",
+"horned viper, cerastes, sand viper, horned asp, Cerastes cornutus",
+"diamondback, diamondback rattlesnake, Crotalus adamanteus",
+"sidewinder, horned rattlesnake, Crotalus cerastes",
+"trilobite",
+"harvestman, daddy longlegs, Phalangium opilio",
+"scorpion",
+"black and gold garden spider, Argiope aurantia",
+"barn spider, Araneus cavaticus",
+"garden spider, Aranea diademata",
+"black widow, Latrodectus mactans",
+"tarantula",
+"wolf spider, hunting spider",
+"tick",
+"centipede",
+"black grouse",
+"ptarmigan",
+"ruffed grouse, partridge, Bonasa umbellus",
+"prairie chicken, prairie grouse, prairie fowl",
+"peacock",
+"quail",
+"partridge",
+"African grey, African gray, Psittacus erithacus",
+"macaw",
+"sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita",
+"lorikeet",
+"coucal",
+"bee eater",
+"hornbill",
+"hummingbird",
+"jacamar",
+"toucan",
+"drake",
+"red-breasted merganser, Mergus serrator",
+"goose",
+"black swan, Cygnus atratus",
+"tusker",
+"echidna, spiny anteater, anteater",
+"platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus",
+"wallaby, brush kangaroo",
+"koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus",
+"wombat",
+"jellyfish",
+"sea anemone, anemone",
+"brain coral",
+"flatworm, platyhelminth",
+"nematode, nematode worm, roundworm",
+"conch",
+"snail",
+"slug",
+"sea slug, nudibranch",
+"chiton, coat-of-mail shell, sea cradle, polyplacophore",
+"chambered nautilus, pearly nautilus, nautilus",
+"Dungeness crab, Cancer magister",
+"rock crab, Cancer irroratus",
+"fiddler crab",
+"king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica",
+"American lobster, Northern lobster, Maine lobster, Homarus americanus",
+"spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish",
+"crayfish, crawfish, crawdad, crawdaddy",
+"hermit crab",
+"isopod",
+"white stork, Ciconia ciconia",
+"black stork, Ciconia nigra",
+"spoonbill",
+"flamingo",
+"little blue heron, Egretta caerulea",
+"American egret, great white heron, Egretta albus",
+"bittern",
+"crane",
+"limpkin, Aramus pictus",
+"European gallinule, Porphyrio porphyrio",
+"American coot, marsh hen, mud hen, water hen, Fulica americana",
+"bustard",
+"ruddy turnstone, Arenaria interpres",
+"red-backed sandpiper, dunlin, Erolia alpina",
+"redshank, Tringa totanus",
+"dowitcher",
+"oystercatcher, oyster catcher",
+"pelican",
+"king penguin, Aptenodytes patagonica",
+"albatross, mollymawk",
+"grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus",
+"killer whale, killer, orca, grampus, sea wolf, Orcinus orca",
+"dugong, Dugong dugon",
+"sea lion",
+"Chihuahua",
+"Japanese spaniel",
+"Maltese dog, Maltese terrier, Maltese",
+"Pekinese, Pekingese, Peke",
+"Shih-Tzu",
+"Blenheim spaniel",
+"papillon",
+"toy terrier",
+"Rhodesian ridgeback",
+"Afghan hound, Afghan",
+"basset, basset hound",
+"beagle",
+"bloodhound, sleuthhound",
+"bluetick",
+"black-and-tan coonhound",
+"Walker hound, Walker foxhound",
+"English foxhound",
+"redbone",
+"borzoi, Russian wolfhound",
+"Irish wolfhound",
+"Italian greyhound",
+"whippet",
+"Ibizan hound, Ibizan Podenco",
+"Norwegian elkhound, elkhound",
+"otterhound, otter hound",
+"Saluki, gazelle hound",
+"Scottish deerhound, deerhound",
+"Weimaraner",
+"Staffordshire bullterrier, Staffordshire bull terrier",
+"American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier",
+"Bedlington terrier",
+"Border terrier",
+"Kerry blue terrier",
+"Irish terrier",
+"Norfolk terrier",
+"Norwich terrier",
+"Yorkshire terrier",
+"wire-haired fox terrier",
+"Lakeland terrier",
+"Sealyham terrier, Sealyham",
+"Airedale, Airedale terrier",
+"cairn, cairn terrier",
+"Australian terrier",
+"Dandie Dinmont, Dandie Dinmont terrier",
+"Boston bull, Boston terrier",
+"miniature schnauzer",
+"giant schnauzer",
+"standard schnauzer",
+"Scotch terrier, Scottish terrier, Scottie",
+"Tibetan terrier, chrysanthemum dog",
+"silky terrier, Sydney silky",
+"soft-coated wheaten terrier",
+"West Highland white terrier",
+"Lhasa, Lhasa apso",
+"flat-coated retriever",
+"curly-coated retriever",
+"golden retriever",
+"Labrador retriever",
+"Chesapeake Bay retriever",
+"German short-haired pointer",
+"vizsla, Hungarian pointer",
+"English setter",
+"Irish setter, red setter",
+"Gordon setter",
+"Brittany spaniel",
+"clumber, clumber spaniel",
+"English springer, English springer spaniel",
+"Welsh springer spaniel",
+"cocker spaniel, English cocker spaniel, cocker",
+"Sussex spaniel",
+"Irish water spaniel",
+"kuvasz",
+"schipperke",
+"groenendael",
+"malinois",
+"briard",
+"kelpie",
+"komondor",
+"Old English sheepdog, bobtail",
+"Shetland sheepdog, Shetland sheep dog, Shetland",
+"collie",
+"Border collie",
+"Bouvier des Flandres, Bouviers des Flandres",
+"Rottweiler",
+"German shepherd, German shepherd dog, German police dog, alsatian",
+"Doberman, Doberman pinscher",
+"miniature pinscher",
+"Greater Swiss Mountain dog",
+"Bernese mountain dog",
+"Appenzeller",
+"EntleBucher",
+"boxer",
+"bull mastiff",
+"Tibetan mastiff",
+"French bulldog",
+"Great Dane",
+"Saint Bernard, St Bernard",
+"Eskimo dog, husky",
+"malamute, malemute, Alaskan malamute",
+"Siberian husky",
+"dalmatian, coach dog, carriage dog",
+"affenpinscher, monkey pinscher, monkey dog",
+"basenji",
+"pug, pug-dog",
+"Leonberg",
+"Newfoundland, Newfoundland dog",
+"Great Pyrenees",
+"Samoyed, Samoyede",
+"Pomeranian",
+"chow, chow chow",
+"keeshond",
+"Brabancon griffon",
+"Pembroke, Pembroke Welsh corgi",
+"Cardigan, Cardigan Welsh corgi",
+"toy poodle",
+"miniature poodle",
+"standard poodle",
+"Mexican hairless",
+"timber wolf, grey wolf, gray wolf, Canis lupus",
+"white wolf, Arctic wolf, Canis lupus tundrarum",
+"red wolf, maned wolf, Canis rufus, Canis niger",
+"coyote, prairie wolf, brush wolf, Canis latrans",
+"dingo, warrigal, warragal, Canis dingo",
+"dhole, Cuon alpinus",
+"African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus",
+"hyena, hyaena",
+"red fox, Vulpes vulpes",
+"kit fox, Vulpes macrotis",
+"Arctic fox, white fox, Alopex lagopus",
+"grey fox, gray fox, Urocyon cinereoargenteus",
+"tabby, tabby cat",
+"tiger cat",
+"Persian cat",
+"Siamese cat, Siamese",
+"Egyptian cat",
+"cougar, puma, catamount, mountain lion, painter, panther, Felis concolor",
+"lynx, catamount",
+"leopard, Panthera pardus",
+"snow leopard, ounce, Panthera uncia",
+"jaguar, panther, Panthera onca, Felis onca",
+"lion, king of beasts, Panthera leo",
+"tiger, Panthera tigris",
+"cheetah, chetah, Acinonyx jubatus",
+"brown bear, bruin, Ursus arctos",
+"American black bear, black bear, Ursus americanus, Euarctos americanus",
+"ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus",
+"sloth bear, Melursus ursinus, Ursus ursinus",
+"mongoose",
+"meerkat, mierkat",
+"tiger beetle",
+"ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle",
+"ground beetle, carabid beetle",
+"long-horned beetle, longicorn, longicorn beetle",
+"leaf beetle, chrysomelid",
+"dung beetle",
+"rhinoceros beetle",
+"weevil",
+"fly",
+"bee",
+"ant, emmet, pismire",
+"grasshopper, hopper",
+"cricket",
+"walking stick, walkingstick, stick insect",
+"cockroach, roach",
+"mantis, mantid",
+"cicada, cicala",
+"leafhopper",
+"lacewing, lacewing fly",
+"dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk",
+"damselfly",
+"admiral",
+"ringlet, ringlet butterfly",
+"monarch, monarch butterfly, milkweed butterfly, Danaus plexippus",
+"cabbage butterfly",
+"sulphur butterfly, sulfur butterfly",
+"lycaenid, lycaenid butterfly",
+"starfish, sea star",
+"sea urchin",
+"sea cucumber, holothurian",
+"wood rabbit, cottontail, cottontail rabbit",
+"hare",
+"Angora, Angora rabbit",
+"hamster",
+"porcupine, hedgehog",
+"fox squirrel, eastern fox squirrel, Sciurus niger",
+"marmot",
+"beaver",
+"guinea pig, Cavia cobaya",
+"sorrel",
+"zebra",
+"hog, pig, grunter, squealer, Sus scrofa",
+"wild boar, boar, Sus scrofa",
+"warthog",
+"hippopotamus, hippo, river horse, Hippopotamus amphibius",
+"ox",
+"water buffalo, water ox, Asiatic buffalo, Bubalus bubalis",
+"bison",
+"ram, tup",
+"bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis",
+"ibex, Capra ibex",
+"hartebeest",
+"impala, Aepyceros melampus",
+"gazelle",
+"Arabian camel, dromedary, Camelus dromedarius",
+"llama",
+"weasel",
+"mink",
+"polecat, fitch, foulmart, foumart, Mustela putorius",
+"black-footed ferret, ferret, Mustela nigripes",
+"otter",
+"skunk, polecat, wood pussy",
+"badger",
+"armadillo",
+"three-toed sloth, ai, Bradypus tridactylus",
+"orangutan, orang, orangutang, Pongo pygmaeus",
+"gorilla, Gorilla gorilla",
+"chimpanzee, chimp, Pan troglodytes",
+"gibbon, Hylobates lar",
+"siamang, Hylobates syndactylus, Symphalangus syndactylus",
+"guenon, guenon monkey",
+"patas, hussar monkey, Erythrocebus patas",
+"baboon",
+"macaque",
+"langur",
+"colobus, colobus monkey",
+"proboscis monkey, Nasalis larvatus",
+"marmoset",
+"capuchin, ringtail, Cebus capucinus",
+"howler monkey, howler",
+"titi, titi monkey",
+"spider monkey, Ateles geoffroyi",
+"squirrel monkey, Saimiri sciureus",
+"Madagascar cat, ring-tailed lemur, Lemur catta",
+"indri, indris, Indri indri, Indri brevicaudatus",
+"Indian elephant, Elephas maximus",
+"African elephant, Loxodonta africana",
+"lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens",
+"giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca",
+"barracouta, snoek",
+"eel",
+"coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch",
+"rock beauty, Holocanthus tricolor",
+"anemone fish",
+"sturgeon",
+"gar, garfish, garpike, billfish, Lepisosteus osseus",
+"lionfish",
+"puffer, pufferfish, blowfish, globefish",
+"abacus",
+"abaya",
+"academic gown, academic robe, judge's robe",
+"accordion, piano accordion, squeeze box",
+"acoustic guitar",
+"aircraft carrier, carrier, flattop, attack aircraft carrier",
+"airliner",
+"airship, dirigible",
+"altar",
+"ambulance",
+"amphibian, amphibious vehicle",
+"analog clock",
+"apiary, bee house",
+"apron",
+"ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin",
+"assault rifle, assault gun",
+"backpack, back pack, knapsack, packsack, rucksack, haversack",
+"bakery, bakeshop, bakehouse",
+"balance beam, beam",
+"balloon",
+"ballpoint, ballpoint pen, ballpen, Biro",
+"Band Aid",
+"banjo",
+"bannister, banister, balustrade, balusters, handrail",
+"barbell",
+"barber chair",
+"barbershop",
+"barn",
+"barometer",
+"barrel, cask",
+"barrow, garden cart, lawn cart, wheelbarrow",
+"baseball",
+"basketball",
+"bassinet",
+"bassoon",
+"bathing cap, swimming cap",
+"bath towel",
+"bathtub, bathing tub, bath, tub",
+"beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon",
+"beacon, lighthouse, beacon light, pharos",
+"beaker",
+"bearskin, busby, shako",
+"beer bottle",
+"beer glass",
+"bell cote, bell cot",
+"bib",
+"bicycle-built-for-two, tandem bicycle, tandem",
+"bikini, two-piece",
+"binder, ring-binder",
+"binoculars, field glasses, opera glasses",
+"birdhouse",
+"boathouse",
+"bobsled, bobsleigh, bob",
+"bolo tie, bolo, bola tie, bola",
+"bonnet, poke bonnet",
+"bookcase",
+"bookshop, bookstore, bookstall",
+"bottlecap",
+"bow",
+"bow tie, bow-tie, bowtie",
+"brass, memorial tablet, plaque",
+"brassiere, bra, bandeau",
+"breakwater, groin, groyne, mole, bulwark, seawall, jetty",
+"breastplate, aegis, egis",
+"broom",
+"bucket, pail",
+"buckle",
+"bulletproof vest",
+"bullet train, bullet",
+"butcher shop, meat market",
+"cab, hack, taxi, taxicab",
+"caldron, cauldron",
+"candle, taper, wax light",
+"cannon",
+"canoe",
+"can opener, tin opener",
+"cardigan",
+"car mirror",
+"carousel, carrousel, merry-go-round, roundabout, whirligig",
+"carpenter's kit, tool kit",
+"carton",
+"car wheel",
+"cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM",
+"cassette",
+"cassette player",
+"castle",
+"catamaran",
+"CD player",
+"cello, violoncello",
+"cellular telephone, cellular phone, cellphone, cell, mobile phone",
+"chain",
+"chainlink fence",
+"chain mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour",
+"chain saw, chainsaw",
+"chest",
+"chiffonier, commode",
+"chime, bell, gong",
+"china cabinet, china closet",
+"Christmas stocking",
+"church, church building",
+"cinema, movie theater, movie theatre, movie house, picture palace",
+"cleaver, meat cleaver, chopper",
+"cliff dwelling",
+"cloak",
+"clog, geta, patten, sabot",
+"cocktail shaker",
+"coffee mug",
+"coffeepot",
+"coil, spiral, volute, whorl, helix",
+"combination lock",
+"computer keyboard, keypad",
+"confectionery, confectionary, candy store",
+"container ship, containership, container vessel",
+"convertible",
+"corkscrew, bottle screw",
+"cornet, horn, trumpet, trump",
+"cowboy boot",
+"cowboy hat, ten-gallon hat",
+"cradle",
+"crane",
+"crash helmet",
+"crate",
+"crib, cot",
+"Crock Pot",
+"croquet ball",
+"crutch",
+"cuirass",
+"dam, dike, dyke",
+"desk",
+"desktop computer",
+"dial telephone, dial phone",
+"diaper, nappy, napkin",
+"digital clock",
+"digital watch",
+"dining table, board",
+"dishrag, dishcloth",
+"dishwasher, dish washer, dishwashing machine",
+"disk brake, disc brake",
+"dock, dockage, docking facility",
+"dogsled, dog sled, dog sleigh",
+"dome",
+"doormat, welcome mat",
+"drilling platform, offshore rig",
+"drum, membranophone, tympan",
+"drumstick",
+"dumbbell",
+"Dutch oven",
+"electric fan, blower",
+"electric guitar",
+"electric locomotive",
+"entertainment center",
+"envelope",
+"espresso maker",
+"face powder",
+"feather boa, boa",
+"file, file cabinet, filing cabinet",
+"fireboat",
+"fire engine, fire truck",
+"fire screen, fireguard",
+"flagpole, flagstaff",
+"flute, transverse flute",
+"folding chair",
+"football helmet",
+"forklift",
+"fountain",
+"fountain pen",
+"four-poster",
+"freight car",
+"French horn, horn",
+"frying pan, frypan, skillet",
+"fur coat",
+"garbage truck, dustcart",
+"gasmask, respirator, gas helmet",
+"gas pump, gasoline pump, petrol pump, island dispenser",
+"goblet",
+"go-kart",
+"golf ball",
+"golfcart, golf cart",
+"gondola",
+"gong, tam-tam",
+"gown",
+"grand piano, grand",
+"greenhouse, nursery, glasshouse",
+"grille, radiator grille",
+"grocery store, grocery, food market, market",
+"guillotine",
+"hair slide",
+"hair spray",
+"half track",
+"hammer",
+"hamper",
+"hand blower, blow dryer, blow drier, hair dryer, hair drier",
+"hand-held computer, hand-held microcomputer",
+"handkerchief, hankie, hanky, hankey",
+"hard disc, hard disk, fixed disk",
+"harmonica, mouth organ, harp, mouth harp",
+"harp",
+"harvester, reaper",
+"hatchet",
+"holster",
+"home theater, home theatre",
+"honeycomb",
+"hook, claw",
+"hoopskirt, crinoline",
+"horizontal bar, high bar",
+"horse cart, horse-cart",
+"hourglass",
+"iPod",
+"iron, smoothing iron",
+"jack-o'-lantern",
+"jean, blue jean, denim",
+"jeep, landrover",
+"jersey, T-shirt, tee shirt",
+"jigsaw puzzle",
+"jinrikisha, ricksha, rickshaw",
+"joystick",
+"kimono",
+"knee pad",
+"knot",
+"lab coat, laboratory coat",
+"ladle",
+"lampshade, lamp shade",
+"laptop, laptop computer",
+"lawn mower, mower",
+"lens cap, lens cover",
+"letter opener, paper knife, paperknife",
+"library",
+"lifeboat",
+"lighter, light, igniter, ignitor",
+"limousine, limo",
+"liner, ocean liner",
+"lipstick, lip rouge",
+"Loafer",
+"lotion",
+"loudspeaker, speaker, speaker unit, loudspeaker system, speaker system",
+"loupe, jeweler's loupe",
+"lumbermill, sawmill",
+"magnetic compass",
+"mailbag, postbag",
+"mailbox, letter box",
+"maillot",
+"maillot, tank suit",
+"manhole cover",
+"maraca",
+"marimba, xylophone",
+"mask",
+"matchstick",
+"maypole",
+"maze, labyrinth",
+"measuring cup",
+"medicine chest, medicine cabinet",
+"megalith, megalithic structure",
+"microphone, mike",
+"microwave, microwave oven",
+"military uniform",
+"milk can",
+"minibus",
+"miniskirt, mini",
+"minivan",
+"missile",
+"mitten",
+"mixing bowl",
+"mobile home, manufactured home",
+"Model T",
+"modem",
+"monastery",
+"monitor",
+"moped",
+"mortar",
+"mortarboard",
+"mosque",
+"mosquito net",
+"motor scooter, scooter",
+"mountain bike, all-terrain bike, off-roader",
+"mountain tent",
+"mouse, computer mouse",
+"mousetrap",
+"moving van",
+"muzzle",
+"nail",
+"neck brace",
+"necklace",
+"nipple",
+"notebook, notebook computer",
+"obelisk",
+"oboe, hautboy, hautbois",
+"ocarina, sweet potato",
+"odometer, hodometer, mileometer, milometer",
+"oil filter",
+"organ, pipe organ",
+"oscilloscope, scope, cathode-ray oscilloscope, CRO",
+"overskirt",
+"oxcart",
+"oxygen mask",
+"packet",
+"paddle, boat paddle",
+"paddlewheel, paddle wheel",
+"padlock",
+"paintbrush",
+"pajama, pyjama, pj's, jammies",
+"palace",
+"panpipe, pandean pipe, syrinx",
+"paper towel",
+"parachute, chute",
+"parallel bars, bars",
+"park bench",
+"parking meter",
+"passenger car, coach, carriage",
+"patio, terrace",
+"pay-phone, pay-station",
+"pedestal, plinth, footstall",
+"pencil box, pencil case",
+"pencil sharpener",
+"perfume, essence",
+"Petri dish",
+"photocopier",
+"pick, plectrum, plectron",
+"pickelhaube",
+"picket fence, paling",
+"pickup, pickup truck",
+"pier",
+"piggy bank, penny bank",
+"pill bottle",
+"pillow",
+"ping-pong ball",
+"pinwheel",
+"pirate, pirate ship",
+"pitcher, ewer",
+"plane, carpenter's plane, woodworking plane",
+"planetarium",
+"plastic bag",
+"plate rack",
+"plow, plough",
+"plunger, plumber's helper",
+"Polaroid camera, Polaroid Land camera",
+"pole",
+"police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria",
+"poncho",
+"pool table, billiard table, snooker table",
+"pop bottle, soda bottle",
+"pot, flowerpot",
+"potter's wheel",
+"power drill",
+"prayer rug, prayer mat",
+"printer",
+"prison, prison house",
+"projectile, missile",
+"projector",
+"puck, hockey puck",
+"punching bag, punch bag, punching ball, punchball",
+"purse",
+"quill, quill pen",
+"quilt, comforter, comfort, puff",
+"racer, race car, racing car",
+"racket, racquet",
+"radiator",
+"radio, wireless",
+"radio telescope, radio reflector",
+"rain barrel",
+"recreational vehicle, RV, R.V.",
+"reel",
+"reflex camera",
+"refrigerator, icebox",
+"remote control, remote",
+"restaurant, eating house, eating place, eatery",
+"revolver, six-gun, six-shooter",
+"rifle",
+"rocking chair, rocker",
+"rotisserie",
+"rubber eraser, rubber, pencil eraser",
+"rugby ball",
+"rule, ruler",
+"running shoe",
+"safe",
+"safety pin",
+"saltshaker, salt shaker",
+"sandal",
+"sarong",
+"sax, saxophone",
+"scabbard",
+"scale, weighing machine",
+"school bus",
+"schooner",
+"scoreboard",
+"screen, CRT screen",
+"screw",
+"screwdriver",
+"seat belt, seatbelt",
+"sewing machine",
+"shield, buckler",
+"shoe shop, shoe-shop, shoe store",
+"shoji",
+"shopping basket",
+"shopping cart",
+"shovel",
+"shower cap",
+"shower curtain",
+"ski",
+"ski mask",
+"sleeping bag",
+"slide rule, slipstick",
+"sliding door",
+"slot, one-armed bandit",
+"snorkel",
+"snowmobile",
+"snowplow, snowplough",
+"soap dispenser",
+"soccer ball",
+"sock",
+"solar dish, solar collector, solar furnace",
+"sombrero",
+"soup bowl",
+"space bar",
+"space heater",
+"space shuttle",
+"spatula",
+"speedboat",
+"spider web, spider's web",
+"spindle",
+"sports car, sport car",
+"spotlight, spot",
+"stage",
+"steam locomotive",
+"steel arch bridge",
+"steel drum",
+"stethoscope",
+"stole",
+"stone wall",
+"stopwatch, stop watch",
+"stove",
+"strainer",
+"streetcar, tram, tramcar, trolley, trolley car",
+"stretcher",
+"studio couch, day bed",
+"stupa, tope",
+"submarine, pigboat, sub, U-boat",
+"suit, suit of clothes",
+"sundial",
+"sunglass",
+"sunglasses, dark glasses, shades",
+"sunscreen, sunblock, sun blocker",
+"suspension bridge",
+"swab, swob, mop",
+"sweatshirt",
+"swimming trunks, bathing trunks",
+"swing",
+"switch, electric switch, electrical switch",
+"syringe",
+"table lamp",
+"tank, army tank, armored combat vehicle, armoured combat vehicle",
+"tape player",
+"teapot",
+"teddy, teddy bear",
+"television, television system",
+"tennis ball",
+"thatch, thatched roof",
+"theater curtain, theatre curtain",
+"thimble",
+"thresher, thrasher, threshing machine",
+"throne",
+"tile roof",
+"toaster",
+"tobacco shop, tobacconist shop, tobacconist",
+"toilet seat",
+"torch",
+"totem pole",
+"tow truck, tow car, wrecker",
+"toyshop",
+"tractor",
+"trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi",
+"tray",
+"trench coat",
+"tricycle, trike, velocipede",
+"trimaran",
+"tripod",
+"triumphal arch",
+"trolleybus, trolley coach, trackless trolley",
+"trombone",
+"tub, vat",
+"turnstile",
+"typewriter keyboard",
+"umbrella",
+"unicycle, monocycle",
+"upright, upright piano",
+"vacuum, vacuum cleaner",
+"vase",
+"vault",
+"velvet",
+"vending machine",
+"vestment",
+"viaduct",
+"violin, fiddle",
+"volleyball",
+"waffle iron",
+"wall clock",
+"wallet, billfold, notecase, pocketbook",
+"wardrobe, closet, press",
+"warplane, military plane",
+"washbasin, handbasin, washbowl, lavabo, wash-hand basin",
+"washer, automatic washer, washing machine",
+"water bottle",
+"water jug",
+"water tower",
+"whiskey jug",
+"whistle",
+"wig",
+"window screen",
+"window shade",
+"Windsor tie",
+"wine bottle",
+"wing",
+"wok",
+"wooden spoon",
+"wool, woolen, woollen",
+"worm fence, snake fence, snake-rail fence, Virginia fence",
+"wreck",
+"yawl",
+"yurt",
+"web site, website, internet site, site",
+"comic book",
+"crossword puzzle, crossword",
+"street sign",
+"traffic light, traffic signal, stoplight",
+"book jacket, dust cover, dust jacket, dust wrapper",
+"menu",
+"plate",
+"guacamole",
+"consomme",
+"hot pot, hotpot",
+"trifle",
+"ice cream, icecream",
+"ice lolly, lolly, lollipop, popsicle",
+"French loaf",
+"bagel, beigel",
+"pretzel",
+"cheeseburger",
+"hotdog, hot dog, red hot",
+"mashed potato",
+"head cabbage",
+"broccoli",
+"cauliflower",
+"zucchini, courgette",
+"spaghetti squash",
+"acorn squash",
+"butternut squash",
+"cucumber, cuke",
+"artichoke, globe artichoke",
+"bell pepper",
+"cardoon",
+"mushroom",
+"Granny Smith",
+"strawberry",
+"orange",
+"lemon",
+"fig",
+"pineapple, ananas",
+"banana",
+"jackfruit, jak, jack",
+"custard apple",
+"pomegranate",
+"hay",
+"carbonara",
+"chocolate sauce, chocolate syrup",
+"dough",
+"meat loaf, meatloaf",
+"pizza, pizza pie",
+"potpie",
+"burrito",
+"red wine",
+"espresso",
+"cup",
+"eggnog",
+"alp",
+"bubble",
+"cliff, drop, drop-off",
+"coral reef",
+"geyser",
+"lakeside, lakeshore",
+"promontory, headland, head, foreland",
+"sandbar, sand bar",
+"seashore, coast, seacoast, sea-coast",
+"valley, vale",
+"volcano",
+"ballplayer, baseball player",
+"groom, bridegroom",
+"scuba diver",
+"rapeseed",
+"daisy",
+"yellow lady's slipper, yellow lady-slipper, Cypripedium calceolus, Cypripedium parviflorum",
+"corn",
+"acorn",
+"hip, rose hip, rosehip",
+"buckeye, horse chestnut, conker",
+"coral fungus",
+"agaric",
+"gyromitra",
+"stinkhorn, carrion fungus",
+"earthstar",
+"hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa",
+"bolete",
+"ear, spike, capitulum",
+"toilet tissue, toilet paper, bathroom tissue"]

+ 51 - 0
PyTorch/Classification/ConvNets/README.md

@@ -0,0 +1,51 @@
+# Convolutional Networks for Image Classification in PyTorch
+
+In this repository you will find implementations of various image classification models.
+
+Detailed information on each model can be found here:
+
+| **Model** | **Link**|
+|:-:|:-:|
+| resnet50 | [README](./resnet50v1.5/README.md) |
+| resnext101-32x4d | [README](./resnext101-32x4d/README.md) |
+| se-resnext101-32x4d | [README](./se-resnext101-32x4d/README.md) |
+
+## Accuracy
+
+
+| **Model** | **AMP Top1** | **AMP Top5** | **FP32 Top1** | **FP32 Top5** |
+|:-:|:-:|:-:|:-:|:-:|
+| resnet50 | 78.46 | 94.15 | 78.50 | 94.11 |
+| resnext101-32x4d | 80.08 | 94.89 | 80.14 | 95.02 |
+| se-resnext101-32x4d | 81.01 | 95.52 | 81.12 | 95.54 |
+
+
+## Training Performance
+
+
+### NVIDIA DGX-1 (8x V100 16G)
+
+| **Model** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** |
+|:-:|:-:|:-:|:-:|
+| resnet50 | 6888.75 img/s | 2945.37 img/s | 2.34x |
+| resnext101-32x4d | 2384.85 img/s | 1116.58 img/s | 2.14x |
+| se-resnext101-32x4d | 2031.17 img/s | 977.45 img/s | 2.08x |
+
+### NVIDIA DGX-2 (16x V100 32G)
+
+| **Model** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** |
+|:-:|:-:|:-:|:-:|
+| resnet50 | 13443.82 img/s | 6263.41 img/s | 2.15x |
+| resnext101-32x4d | 4473.37 img/s | 2261.97 img/s | 1.98x |
+| se-resnext101-32x4d | 3776.03 img/s | 1953.13 img/s | 1.93x |
+
+
+## Model Comparison
+
+### Accuracy vs FLOPS
+![ACCvsFLOPS](./img/ACCvsFLOPS.png)
+
+Dot size indicates the number of trainable parameters.
+
+### Latency vs Throughput on different batch sizes
+![LATvsTHR](./img/LATvsTHR.png)
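The "Mixed Precision speedup" column in the tables above is simply the ratio of mixed-precision to FP32 throughput. A quick sanity check against the DGX-1 resnet50 row (numbers taken from the table):

```python
# "Mixed Precision speedup" is the ratio of mixed-precision to FP32
# throughput; values taken from the DGX-1 resnet50 row above.
amp_throughput = 6888.75   # img/s
fp32_throughput = 2945.37  # img/s
speedup = amp_throughput / fp32_throughput
print(f"{speedup:.2f}x")  # 2.34x
```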

+ 42 - 0
PyTorch/Classification/ConvNets/checkpoint2model.py

@@ -0,0 +1,42 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License  (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import torch
+
+
+def add_parser_arguments(parser):
+    parser.add_argument(
+        "--checkpoint-path", metavar="<path>", help="checkpoint filename"
+    )
+    parser.add_argument(
+        "--weight-path", metavar="<path>", help="name of file in which to store weights"
+    )
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Checkpoint to model weights converter")
+
+    add_parser_arguments(parser)
+    args = parser.parse_args()
+
+    checkpoint = torch.load(args.checkpoint_path)
+
+    model_state_dict = {
+        k[len("module.1.") :] if "module.1." in k else k: v
+        for k, v in checkpoint["state_dict"].items()
+    }
+
+    print(f"Loaded {checkpoint['arch']} : {checkpoint['best_prec1']}")
+
+    torch.save(model_state_dict, args.weight_path)
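The dict comprehension above can be illustrated with a dummy state dict (hypothetical keys and placeholder values, not real checkpoint contents): checkpoints saved from a wrapped model prefix every parameter name with `module.1.`, which must be stripped before the weights can be loaded into a bare model.

```python
# Minimal sketch of the key renaming performed by checkpoint2model.py.
state_dict = {
    "module.1.conv1.weight": "w",  # hypothetical placeholder values
    "module.1.fc.bias": "b",
    "unprefixed_key": "x",         # keys without the prefix pass through
}

clean = {
    k[len("module.1."):] if "module.1." in k else k: v
    for k, v in state_dict.items()
}

print(sorted(clean))  # ['conv1.weight', 'fc.bias', 'unprefixed_key']
```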

+ 94 - 0
PyTorch/Classification/ConvNets/classify.py

@@ -0,0 +1,94 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License  (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from PIL import Image
+import argparse
+import numpy as np
+import json
+import torch
+import torch.backends.cudnn as cudnn
+import torchvision.transforms as transforms
+import image_classification.resnet as models
+from image_classification.dataloaders import load_jpeg_from_file
+
+try:
+    from apex.fp16_utils import *
+    from apex import amp
+except ImportError:
+    raise ImportError(
+        "Please install apex from https://www.github.com/nvidia/apex to run this example."
+    )
+
+
+def add_parser_arguments(parser):
+    model_names = models.resnet_versions.keys()
+    model_configs = models.resnet_configs.keys()
+    parser.add_argument("--image-size", default="224", type=int)
+    parser.add_argument(
+        "--arch",
+        "-a",
+        metavar="ARCH",
+        default="resnet50",
+        choices=model_names,
+        help="model architecture: " + " | ".join(model_names) + " (default: resnet50)",
+    )
+    parser.add_argument(
+        "--model-config",
+        "-c",
+        metavar="CONF",
+        default="classic",
+        choices=model_configs,
+        help="model configs: " + " | ".join(model_configs) + " (default: classic)",
+    )
+    parser.add_argument("--weights", metavar="<path>", help="file with model weights")
+    parser.add_argument(
+        "--precision", metavar="PREC", default="FP16", choices=["AMP", "FP16", "FP32"]
+    )
+    parser.add_argument("--image", metavar="<path>", help="path to classified image")
+
+
+def main(args):
+    imgnet_classes = np.array(json.load(open("./LOC_synset_mapping.json", "r")))
+    model = models.build_resnet(args.arch, args.model_config, verbose=False)
+
+    if args.weights is not None:
+        weights = torch.load(args.weights)
+        model.load_state_dict(weights)
+
+    model = model.cuda()
+
+    if args.precision == "FP16":
+        model = network_to_half(model)
+
+    model.eval()
+
+    with torch.no_grad():
+        input = load_jpeg_from_file(args.image, cuda=True, fp16=args.precision!='FP32')
+
+        output = torch.nn.functional.softmax(model(input), dim=1).cpu().view(-1).numpy()
+        top5 = np.argsort(output)[-5:][::-1]
+
+        print(args.image)
+        for c, v in zip(imgnet_classes[top5], output[top5]):
+            print(f"{c}: {100*v:.1f}%")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="PyTorch ImageNet Classification")
+
+    add_parser_arguments(parser)
+    args = parser.parse_args()
+
+    cudnn.benchmark = True
+
+    main(args)
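The top-5 selection in `main` can be mirrored in pure Python (hypothetical softmax scores for illustration): `np.argsort` sorts ascending, so the last five indices, reversed, give the five most probable classes, most probable first.

```python
# Pure-Python sketch of the top-5 selection used in classify.py
# (np.argsort(output)[-5:][::-1]).
output = [0.05, 0.40, 0.12, 0.25, 0.02, 0.08, 0.10]  # hypothetical scores
top5 = sorted(range(len(output)), key=output.__getitem__)[-5:][::-1]
print(top5)  # [1, 3, 2, 6, 5]
```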

+ 20 - 0
PyTorch/Classification/ConvNets/image_classification/__init__.py

@@ -0,0 +1,20 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License  (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from . import logger
+from . import dataloaders
+from . import training
+from . import utils
+from . import mixup
+from . import resnet
+from . import smoothing

+ 99 - 17
PyTorch/Classification/RN50v1.5/image_classification/dataloaders.py → PyTorch/Classification/ConvNets/image_classification/dataloaders.py

@@ -1,10 +1,40 @@
+# Copyright (c) 2018-2019, NVIDIA CORPORATION
+# Copyright (c) 2017-      Facebook, Inc
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+#   list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright notice,
+#   this list of conditions and the following disclaimer in the documentation
+#   and/or other materials provided with the distribution.
+#
+# * Neither the name of the copyright holder nor the names of its
+#   contributors may be used to endorse or promote products derived from
+#   this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 import os
 import torch
 import numpy as np
 import torchvision.datasets as datasets
 import torchvision.transforms as transforms
+from PIL import Image
 
-DATA_BACKEND_CHOICES = ['pytorch']
+DATA_BACKEND_CHOICES = ['pytorch', 'syntetic']
 try:
     from nvidia.dali.plugin.pytorch import DALIClassificationIterator
     from nvidia.dali.pipeline import Pipeline
@@ -16,19 +46,48 @@ except ImportError:
     print("Please install DALI from https://www.github.com/NVIDIA/DALI to run this example.")
 
 
+def load_jpeg_from_file(path, cuda=True, fp16=False):
+    img_transforms = transforms.Compose(
+        [transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()]
+    )
+
+    img = img_transforms(Image.open(path))
+    with torch.no_grad():
+        # mean and std are not multiplied by 255 as they are in training script
+        # torch dataloader reads data into bytes whereas loading directly
+        # through PIL creates a tensor with floats in [0,1] range
+        mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
+        std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
+
+        if cuda:
+            mean = mean.cuda()
+            std = std.cuda()
+            img = img.cuda()
+        if fp16:
+            mean = mean.half()
+            std = std.half()
+            img = img.half()
+        else:
+            img = img.float()
+
+        input = img.unsqueeze(0).sub_(mean).div_(std)
+
+    return input
+
+
 class HybridTrainPipe(Pipeline):
     def __init__(self, batch_size, num_threads, device_id, data_dir, crop, dali_cpu=False):
         super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id)
         if torch.distributed.is_initialized():
-            local_rank = torch.distributed.get_rank()
+            rank = torch.distributed.get_rank()
             world_size = torch.distributed.get_world_size()
         else:
-            local_rank = 0
+            rank = 0
             world_size = 1
 
         self.input = ops.FileReader(
                 file_root = data_dir,
-                shard_id = local_rank,
+                shard_id = rank,
                 num_shards = world_size,
                 random_shuffle = True)
 
@@ -47,7 +106,7 @@ class HybridTrainPipe(Pipeline):
                                                       random_area=[0.08, 1.0],
                                                       num_attempts=100)
 
-        self.res = ops.Resize(device=dali_device, resize_x=crop, resize_y=crop, interp_type=types.INTERP_TRIANGULAR)
+        self.res = ops.Resize(device=dali_device, resize_x=crop, resize_y=crop)
         self.cmnp = ops.CropMirrorNormalize(device = "gpu",
                                             output_dtype = types.FLOAT,
                                             output_layout = types.NCHW,
@@ -70,15 +129,15 @@ class HybridValPipe(Pipeline):
     def __init__(self, batch_size, num_threads, device_id, data_dir, crop, size):
         super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id)
         if torch.distributed.is_initialized():
-            local_rank = torch.distributed.get_rank()
+            rank = torch.distributed.get_rank()
             world_size = torch.distributed.get_world_size()
         else:
-            local_rank = 0
+            rank = 0
             world_size = 1
 
         self.input = ops.FileReader(
                 file_root = data_dir,
-                shard_id = local_rank,
+                shard_id = rank,
                 num_shards = world_size,
                 random_shuffle = False)
 
@@ -104,7 +163,7 @@ class DALIWrapper(object):
     def gen_wrapper(dalipipeline, num_classes, one_hot):
         for data in dalipipeline:
             input = data[0]["data"]
-            target = data[0]["label"].squeeze().cuda().long()
+            target = torch.reshape(data[0]["label"], [-1]).cuda().long()
             if one_hot:
                 target = expand(num_classes, torch.float, target)
             yield input, target
@@ -121,16 +180,16 @@ class DALIWrapper(object):
 def get_dali_train_loader(dali_cpu=False):
     def gdtl(data_path, batch_size, num_classes, one_hot, workers=5, _worker_init_fn=None, fp16=False):
         if torch.distributed.is_initialized():
-            local_rank = torch.distributed.get_rank()
+            rank = torch.distributed.get_rank()
             world_size = torch.distributed.get_world_size()
         else:
-            local_rank = 0
+            rank = 0
             world_size = 1
 
         traindir = os.path.join(data_path, 'train')
 
         pipe = HybridTrainPipe(batch_size=batch_size, num_threads=workers,
-                device_id = local_rank,
+                device_id = rank % torch.cuda.device_count(),
                 data_dir = traindir, crop = 224, dali_cpu=dali_cpu)
 
         pipe.build()
@@ -144,18 +203,19 @@ def get_dali_train_loader(dali_cpu=False):
 def get_dali_val_loader():
     def gdvl(data_path, batch_size, num_classes, one_hot, workers=5, _worker_init_fn=None, fp16=False):
         if torch.distributed.is_initialized():
-            local_rank = torch.distributed.get_rank()
+            rank = torch.distributed.get_rank()
             world_size = torch.distributed.get_world_size()
         else:
-            local_rank = 0
+            rank = 0
             world_size = 1
 
         valdir = os.path.join(data_path, 'val')
 
         pipe = HybridValPipe(batch_size=batch_size, num_threads=workers,
-                device_id = local_rank,
+                device_id = rank % torch.cuda.device_count(),
                 data_dir = valdir,
                 crop = 224, size = 256)
+
         pipe.build()
         val_loader = DALIClassificationIterator(pipe, size = int(pipe.epoch_size("Reader") / world_size))
 
@@ -199,8 +259,8 @@ class PrefetchedWrapper(object):
 
         for next_input, next_target in loader:
             with torch.cuda.stream(stream):
-                next_input = next_input.cuda(async=True)
-                next_target = next_target.cuda(async=True)
+                next_input = next_input.cuda(non_blocking=True)
+                next_target = next_target.cuda(non_blocking=True)
                 if fp16:
                     next_input = next_input.half()
                     if one_hot:
@@ -280,3 +340,25 @@ def get_pytorch_val_loader(data_path, batch_size, num_classes, one_hot, workers=
             collate_fn=fast_collate)
 
     return PrefetchedWrapper(val_loader, num_classes, fp16, one_hot), len(val_loader)
+
+class SynteticDataLoader(object):
+    def __init__(self, fp16, batch_size, num_classes, num_channels, height, width, one_hot):
+        input_data = torch.empty(batch_size, num_channels, height, width).cuda().normal_(0, 1.0)
+        if one_hot:
+            input_target = torch.empty(batch_size, num_classes).cuda()
+            input_target[:, 0] = 1.0
+        else:
+            input_target = torch.randint(0, num_classes, (batch_size,))
+        input_target=input_target.cuda()
+        if fp16:
+            input_data = input_data.half()
+
+        self.input_data = input_data
+        self.input_target = input_target
+
+    def __iter__(self):
+        while True:
+            yield self.input_data, self.input_target
+
+def get_syntetic_loader(data_path, batch_size, num_classes, one_hot, workers=None, _worker_init_fn=None, fp16=False):
+    return SynteticDataLoader(fp16, batch_size, num_classes, 3, 224, 224, one_hot), -1
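The comment in `load_jpeg_from_file` about not multiplying mean/std by 255 can be checked numerically (hypothetical pixel values for illustration): `ToTensor()` yields floats in [0, 1], so the ImageNet statistics are used as-is, while a byte-based pipeline in [0, 255] would scale both statistics by 255 and reach the same normalized result.

```python
# Sketch of the normalization convention noted in load_jpeg_from_file.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

pixel = [0.5, 0.25, 0.75]  # hypothetical RGB pixel after ToTensor()
norm_float = [(p - m) / s for p, m, s in zip(pixel, mean, std)]

byte_pixel = [p * 255 for p in pixel]  # same pixel in a [0, 255] pipeline
norm_bytes = [(p - m * 255) / (s * 255) for p, m, s in zip(byte_pixel, mean, std)]

# Both conventions produce the same normalized values.
assert all(abs(a - b) < 1e-9 for a, b in zip(norm_float, norm_bytes))
```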

+ 29 - 2
PyTorch/Classification/RN50v1.5/image_classification/logger.py → PyTorch/Classification/ConvNets/image_classification/logger.py

@@ -1,3 +1,32 @@
+# Copyright (c) 2018-2019, NVIDIA CORPORATION
+# Copyright (c) 2017-      Facebook, Inc
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+#   list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright notice,
+#   this list of conditions and the following disclaimer in the documentation
+#   and/or other materials provided with the distribution.
+#
+# * Neither the name of the copyright holder nor the names of its
+#   contributors may be used to endorse or promote products derived from
+#   this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 import random
 import json
 from collections import OrderedDict
@@ -294,5 +323,3 @@ class StdOutBackend(object):
 
     def end(self):
         pass
-
-

+ 13 - 0
PyTorch/Classification/RN50v1.5/image_classification/mixup.py → PyTorch/Classification/ConvNets/image_classification/mixup.py

@@ -1,3 +1,16 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License  (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import torch
 import torch.nn as nn
 import numpy as np

+ 354 - 0
PyTorch/Classification/ConvNets/image_classification/resnet.py

@@ -0,0 +1,354 @@
+# Copyright (c) 2018-2019, NVIDIA CORPORATION
+# Copyright (c) 2017-      Facebook, Inc
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+#   list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright notice,
+#   this list of conditions and the following disclaimer in the documentation
+#   and/or other materials provided with the distribution.
+#
+# * Neither the name of the copyright holder nor the names of its
+#   contributors may be used to endorse or promote products derived from
+#   this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+import math
+import torch
+import torch.nn as nn
+import numpy as np
+
+__all__ = ['ResNet', 'build_resnet', 'resnet_versions', 'resnet_configs']
+
+# ResNetBuilder {{{
+
+class ResNetBuilder(object):
+    def __init__(self, version, config):
+        self.conv3x3_cardinality = 1 if 'cardinality' not in version.keys() else version['cardinality']
+        self.config = config
+
+    def conv(self, kernel_size, in_planes, out_planes, groups=1, stride=1):
+        conv = nn.Conv2d(
+                in_planes, out_planes,
+                kernel_size=kernel_size, groups=groups,
+                stride=stride, padding=int((kernel_size - 1)/2),
+                bias=False)
+
+        if self.config['nonlinearity'] == 'relu': 
+            nn.init.kaiming_normal_(conv.weight,
+                    mode=self.config['conv_init'],
+                    nonlinearity=self.config['nonlinearity'])
+
+        return conv
+
+    def conv3x3(self, in_planes, out_planes, stride=1):
+        """3x3 convolution with padding"""
+        c = self.conv(3, in_planes, out_planes, groups=self.conv3x3_cardinality, stride=stride)
+        return c
+
+    def conv1x1(self, in_planes, out_planes, stride=1):
+        """1x1 convolution with padding"""
+        c = self.conv(1, in_planes, out_planes, stride=stride)
+        return c
+
+    def conv7x7(self, in_planes, out_planes, stride=1):
+        """7x7 convolution with padding"""
+        c = self.conv(7, in_planes, out_planes, stride=stride)
+        return c
+
+    def conv5x5(self, in_planes, out_planes, stride=1):
+        """5x5 convolution with padding"""
+        c = self.conv(5, in_planes, out_planes, stride=stride)
+        return c
+
+    def batchnorm(self, planes, last_bn=False):
+        bn = nn.BatchNorm2d(planes)
+        gamma_init_val = 0 if last_bn and self.config['last_bn_0_init'] else 1
+        nn.init.constant_(bn.weight, gamma_init_val)
+        nn.init.constant_(bn.bias, 0)
+
+        return bn
+
+    def activation(self):
+        return self.config['activation']()
+
+# ResNetBuilder }}}
+
+# BasicBlock {{{
+class BasicBlock(nn.Module):
+    def __init__(self, builder, inplanes, planes, expansion, stride=1, downsample=None):
+        super(BasicBlock, self).__init__()
+        self.conv1 = builder.conv3x3(inplanes, planes, stride)
+        self.bn1 = builder.batchnorm(planes)
+        self.relu = builder.activation()
+        self.conv2 = builder.conv3x3(planes, planes*expansion)
+        self.bn2 = builder.batchnorm(planes*expansion, last_bn=True)
+        self.downsample = downsample
+        self.stride = stride
+
+    def forward(self, x):
+        residual = x
+
+        out = self.conv1(x)
+        if self.bn1 is not None:
+            out = self.bn1(out)
+
+        out = self.relu(out)
+
+        out = self.conv2(out)
+
+        if self.bn2 is not None:
+            out = self.bn2(out)
+
+        if self.downsample is not None:
+            residual = self.downsample(x)
+
+        out += residual
+        out = self.relu(out)
+
+        return out
+# BasicBlock }}}
+
+# SqueezeAndExcitation {{{
+class SqueezeAndExcitation(nn.Module):
+    def __init__(self, planes, squeeze):
+        super(SqueezeAndExcitation, self).__init__()
+        self.squeeze = nn.Linear(planes, squeeze)
+        self.expand = nn.Linear(squeeze, planes)
+        self.relu = nn.ReLU(inplace=True)
+        self.sigmoid = nn.Sigmoid()
+
+    def forward(self, x):
+        out = torch.mean(x.view(x.size(0), x.size(1), -1), 2)
+        out = self.squeeze(out)
+        out = self.relu(out)
+        out = self.expand(out)
+        out = self.sigmoid(out)
+        out = out.unsqueeze(2).unsqueeze(3)
+
+        return out
+
+# }}}
+
+# Bottleneck {{{
+class Bottleneck(nn.Module):
+    def __init__(self, builder, inplanes, planes, expansion, stride=1, se=False, se_squeeze=16, downsample=None):
+        super(Bottleneck, self).__init__()
+        self.conv1 = builder.conv1x1(inplanes, planes)
+        self.bn1 = builder.batchnorm(planes)
+        self.conv2 = builder.conv3x3(planes, planes, stride=stride)
+        self.bn2 = builder.batchnorm(planes)
+        self.conv3 = builder.conv1x1(planes, planes * expansion)
+        self.bn3 = builder.batchnorm(planes * expansion, last_bn=True)
+        self.relu = builder.activation()
+        self.downsample = downsample
+        self.stride = stride
+        self.squeeze = SqueezeAndExcitation(planes*expansion, se_squeeze) if se else None
+
+    def forward(self, x):
+        residual = x
+
+        out = self.conv1(x)
+        out = self.bn1(out)
+        out = self.relu(out)
+
+        out = self.conv2(out)
+        out = self.bn2(out)
+        out = self.relu(out)
+
+        out = self.conv3(out)
+        out = self.bn3(out)
+
+        if self.downsample is not None:
+            residual = self.downsample(x)
+
+        if self.squeeze is None:
+            out += residual
+        else:
+            out = torch.addcmul(residual, 1.0, out, self.squeeze(out))
+
+        out = self.relu(out)
+
+        return out
+
+def SEBottleneck(builder, inplanes, planes, expansion, stride=1, downsample=None):
+    return Bottleneck(builder, inplanes, planes, expansion, stride=stride, se=True, se_squeeze=16, downsample=downsample)
+# Bottleneck }}}
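The Squeeze-and-Excitation gating in `SEBottleneck` can be sketched numerically (toy scalars, and an identity MLP standing in for the squeeze/expand `Linear` layers, purely for illustration): pooled channel statistics pass through the MLP and a sigmoid, producing per-channel scales in (0, 1) that multiply the residual branch, i.e. `out = residual + scale * out` as in the `torch.addcmul` call above.

```python
import math

# Toy numeric sketch of Squeeze-and-Excitation channel gating.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

channel_means = [0.2, 1.5, -0.7]  # hypothetical pooled activations
scales = [sigmoid(max(m, 0.0)) for m in channel_means]  # ReLU, then sigmoid

residual = [1.0, 1.0, 1.0]
branch = [1.0, 2.0, 3.0]
out = [r + s * b for r, s, b in zip(residual, scales, branch)]
# scales lie in (0, 1), so each channel of the branch is attenuated
```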
+
+# ResNet {{{
+class ResNet(nn.Module):
+    def __init__(self, builder, block, expansion, layers, widths, num_classes=1000):
+        self.inplanes = 64
+        super(ResNet, self).__init__()
+        self.conv1 = builder.conv7x7(3, 64, stride=2)
+        self.bn1 = builder.batchnorm(64)
+        self.relu = builder.activation()
+        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
+        self.layer1 = self._make_layer(builder, block, expansion, widths[0], layers[0])
+        self.layer2 = self._make_layer(builder, block, expansion, widths[1], layers[1], stride=2)
+        self.layer3 = self._make_layer(builder, block, expansion, widths[2], layers[2], stride=2)
+        self.layer4 = self._make_layer(builder, block, expansion, widths[3], layers[3], stride=2)
+        self.avgpool = nn.AdaptiveAvgPool2d(1)
+        self.fc = nn.Linear(widths[3] * expansion, num_classes)
+
+    def _make_layer(self, builder, block, expansion, planes, blocks, stride=1):
+        downsample = None
+        if stride != 1 or self.inplanes != planes * expansion:
+            dconv = builder.conv1x1(self.inplanes, planes * expansion,
+                                    stride=stride)
+            dbn = builder.batchnorm(planes * expansion)
+            if dbn is not None:
+                downsample = nn.Sequential(dconv, dbn)
+            else:
+                downsample = dconv
+
+        layers = []
+        layers.append(block(builder, self.inplanes, planes, expansion, stride=stride, downsample=downsample))
+        self.inplanes = planes * expansion
+        for i in range(1, blocks):
+            layers.append(block(builder, self.inplanes, planes, expansion))
+
+        return nn.Sequential(*layers)
+
+    def forward(self, x):
+        x = self.conv1(x)
+        if self.bn1 is not None:
+            x = self.bn1(x)
+        x = self.relu(x)
+        x = self.maxpool(x)
+
+        x = self.layer1(x)
+        x = self.layer2(x)
+        x = self.layer3(x)
+        x = self.layer4(x)
+
+        x = self.avgpool(x)
+        x = x.view(x.size(0), -1)
+        x = self.fc(x)
+
+        return x
+# ResNet }}}
+
+resnet_configs = {
+        'classic' : {
+            'conv' : nn.Conv2d,
+            'conv_init' : 'fan_out',
+            'nonlinearity' : 'relu',
+            'last_bn_0_init' : False,
+            'activation' : lambda: nn.ReLU(inplace=True),
+            },
+        'fanin' : {
+            'conv' : nn.Conv2d,
+            'conv_init' : 'fan_in',
+            'nonlinearity' : 'relu',
+            'last_bn_0_init' : False,
+            'activation' : lambda: nn.ReLU(inplace=True),
+            },
+        'grp-fanin' : {
+            'conv' : nn.Conv2d,
+            'conv_init' : 'fan_in',
+            'nonlinearity' : 'relu',
+            'last_bn_0_init' : False,
+            'activation' : lambda: nn.ReLU(inplace=True),
+            },
+        'grp-fanout' : {
+            'conv' : nn.Conv2d,
+            'conv_init' : 'fan_out',
+            'nonlinearity' : 'relu',
+            'last_bn_0_init' : False,
+            'activation' : lambda: nn.ReLU(inplace=True),
+            },
+        }
+
+resnet_versions = {
+        'resnet18' : {
+            'net' : ResNet,
+            'block' : BasicBlock,
+            'layers' : [2, 2, 2, 2],
+            'widths' : [64, 128, 256, 512],
+            'expansion' : 1,
+            'num_classes' : 1000,
+            },
+         'resnet34' : {
+            'net' : ResNet,
+            'block' : BasicBlock,
+            'layers' : [3, 4, 6, 3],
+            'widths' : [64, 128, 256, 512],
+            'expansion' : 1,
+            'num_classes' : 1000,
+            },
+         'resnet50' : {
+            'net' : ResNet,
+            'block' : Bottleneck,
+            'layers' : [3, 4, 6, 3],
+            'widths' : [64, 128, 256, 512],
+            'expansion' : 4,
+            'num_classes' : 1000,
+            },
+        'resnet101' : {
+            'net' : ResNet,
+            'block' : Bottleneck,
+            'layers' : [3, 4, 23, 3],
+            'widths' : [64, 128, 256, 512],
+            'expansion' : 4,
+            'num_classes' : 1000,
+            },
+        'resnet152' : {
+            'net' : ResNet,
+            'block' : Bottleneck,
+            'layers' : [3, 8, 36, 3],
+            'widths' : [64, 128, 256, 512],
+            'expansion' : 4,
+            'num_classes' : 1000,
+            },
+        'resnext101-32x4d' : {
+            'net' : ResNet,
+            'block' : Bottleneck,
+            'cardinality' : 32,
+            'layers' : [3, 4, 23, 3],
+            'widths' : [128, 256, 512, 1024],
+            'expansion' : 2,
+            'num_classes' : 1000,
+            },
+        'se-resnext101-32x4d' : {
+            'net' : ResNet,
+            'block' : SEBottleneck,
+            'cardinality' : 32,
+            'layers' : [3, 4, 23, 3],
+            'widths' : [128, 256, 512, 1024],
+            'expansion' : 2,
+            'num_classes' : 1000,
+            },
+        }
+
+
+def build_resnet(version, config, verbose=True):
+    version = resnet_versions[version]
+    config = resnet_configs[config]
+
+    builder = ResNetBuilder(version, config)
+    if verbose:
+        print("Version: {}".format(version))
+        print("Config: {}".format(config))
+    model = version['net'](builder,
+                           version['block'],
+                           version['expansion'],
+                           version['layers'],
+                           version['widths'],
+                           version['num_classes'])
+
+    return model
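The `resnet_versions` / `resnet_configs` split means a model is fully specified by two dictionary lookups. A dependency-free sketch of that factory pattern (the names `MODEL_VERSIONS`, `MODEL_CONFIGS`, and `make_model` are illustrative stand-ins; the real `build_resnet` constructs `nn.Module`s via `ResNetBuilder`):

```python
# Illustrative sketch of the registry-driven factory used by build_resnet().
# A "version" fixes the architecture shape; a "config" fixes init/activation policy.
MODEL_VERSIONS = {
    'resnet50':         {'layers': [3, 4, 6, 3],  'widths': [64, 128, 256, 512],   'expansion': 4},
    'resnext101-32x4d': {'layers': [3, 4, 23, 3], 'widths': [128, 256, 512, 1024], 'expansion': 2},
}

MODEL_CONFIGS = {
    'classic': {'conv_init': 'fan_out'},
    'fanin':   {'conv_init': 'fan_in'},
}

def make_model(version, config):
    """Merge an architecture shape with an initialization policy."""
    v = MODEL_VERSIONS[version]
    c = MODEL_CONFIGS[config]
    # In the real code this is where ResNetBuilder wires up conv/bn/fc layers.
    return {**v, **c}

spec = make_model('resnext101-32x4d', 'fanin')
```

Adding a new architecture (as this commit does for ResNeXt and SE-ResNeXt) then amounts to registering one more dictionary entry, not touching the builder.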

+ 13 - 1
PyTorch/Classification/RN50v1.5/image_classification/smoothing.py → PyTorch/Classification/ConvNets/image_classification/smoothing.py

@@ -1,3 +1,16 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License  (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import torch
 import torch.nn as nn
 
@@ -23,4 +36,3 @@ class LabelSmoothing(nn.Module):
         smooth_loss = -logprobs.mean(dim=-1)
         loss = self.confidence * nll_loss + self.smoothing * smooth_loss
         return loss.mean()
-
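The hunk above shows the core of the `LabelSmoothing` loss: a weighted sum of the usual NLL term and a uniform-over-classes term. A dependency-free sketch for a single example (the real module operates on batched torch tensors; the function name here is illustrative):

```python
import math

def label_smoothing_loss(logits, target, smoothing=0.1):
    """loss = (1 - smoothing) * NLL(target) + smoothing * mean_k(-log p_k)."""
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - log_z for x in logits]
    nll_loss = -logprobs[target]                      # standard cross-entropy term
    smooth_loss = -sum(logprobs) / len(logprobs)      # uniform-label term
    confidence = 1.0 - smoothing
    return confidence * nll_loss + smoothing * smooth_loss

# With smoothing=0.0 this reduces to plain cross-entropy.
loss = label_smoothing_loss([2.0, 0.5, 0.1], 0, smoothing=0.1)
```

When the target class already has the largest logit, smoothing strictly increases the loss, which is the intended regularizing pressure against overconfident predictions.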

+ 35 - 6
PyTorch/Classification/RN50v1.5/image_classification/training.py → PyTorch/Classification/ConvNets/image_classification/training.py

@@ -1,3 +1,32 @@
+# Copyright (c) 2018-2019, NVIDIA CORPORATION
+# Copyright (c) 2017-      Facebook, Inc
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+#   list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright notice,
+#   this list of conditions and the following disclaimer in the documentation
+#   and/or other materials provided with the distribution.
+#
+# * Neither the name of the copyright holder nor the names of its
+#   contributors may be used to endorse or promote products derived from
+#   this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 import os
 import time
 import numpy as np
@@ -202,11 +231,11 @@ def get_train_step(model_and_loss, optimizer, fp16, use_amp = False, batch_size_
 def train(train_loader, model_and_loss, optimizer, lr_scheduler, fp16, logger, epoch, use_amp=False, prof=-1, batch_size_multiplier=1, register_metrics=True):
 
     if register_metrics and logger is not None:
-        logger.register_metric('train.top1', log.AverageMeter(), log_level = 0)
-        logger.register_metric('train.top5', log.AverageMeter(), log_level = 0)
+        logger.register_metric('train.top1', log.AverageMeter(), log_level = 1)
+        logger.register_metric('train.top5', log.AverageMeter(), log_level = 1)
         logger.register_metric('train.loss', log.AverageMeter(), log_level = 0)
         logger.register_metric('train.compute_ips', log.AverageMeter(), log_level=1)
-        logger.register_metric('train.total_ips', log.AverageMeter(), log_level=1)
+        logger.register_metric('train.total_ips', log.AverageMeter(), log_level=0)
         logger.register_metric('train.data_time', log.AverageMeter(), log_level=1)
         logger.register_metric('train.compute_time', log.AverageMeter(), log_level=1)
 
@@ -278,9 +307,9 @@ def validate(val_loader, model_and_loss, fp16, logger, epoch, prof=-1, register_
         logger.register_metric('val.top5',         log.AverageMeter(), log_level = 0)
         logger.register_metric('val.loss',         log.AverageMeter(), log_level = 0)
         logger.register_metric('val.compute_ips',  log.AverageMeter(), log_level = 1)
-        logger.register_metric('val.total_ips',    log.AverageMeter(), log_level = 1)
+        logger.register_metric('val.total_ips',    log.AverageMeter(), log_level = 0)
         logger.register_metric('val.data_time',    log.AverageMeter(), log_level = 1)
-        logger.register_metric('val.compute_time', log.AverageMeter(), log_level = 1)
+        logger.register_metric('val.compute_latency', log.AverageMeter(), log_level = 1)
 
     step = get_val_step(model_and_loss)
 
@@ -313,7 +342,7 @@ def validate(val_loader, model_and_loss, fp16, logger, epoch, prof=-1, register_
             logger.log_metric('val.compute_ips', calc_ips(bs, it_time - data_time))
             logger.log_metric('val.total_ips', calc_ips(bs, it_time))
             logger.log_metric('val.data_time', data_time)
-            logger.log_metric('val.compute_time', it_time - data_time)
+            logger.log_metric('val.compute_latency', it_time - data_time)
 
         end = time.time()
 

+ 29 - 0
PyTorch/Classification/RN50v1.5/image_classification/utils.py → PyTorch/Classification/ConvNets/image_classification/utils.py

@@ -1,3 +1,32 @@
+# Copyright (c) 2018-2019, NVIDIA CORPORATION
+# Copyright (c) 2017-      Facebook, Inc
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+#   list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright notice,
+#   this list of conditions and the following disclaimer in the documentation
+#   and/or other materials provided with the distribution.
+#
+# * Neither the name of the copyright holder nor the names of its
+#   contributors may be used to endorse or promote products derived from
+#   this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 import os
 import numpy as np
 import torch

+ 0 - 0
PyTorch/Classification/RN50v1.5/img/.gitkeep → PyTorch/Classification/ConvNets/img/.gitkeep


BIN
PyTorch/Classification/ConvNets/img/ACCvsFLOPS.png


BIN
PyTorch/Classification/ConvNets/img/LATvsTHR.png


+ 52 - 18
PyTorch/Classification/RN50v1.5/main.py → PyTorch/Classification/ConvNets/main.py

@@ -1,3 +1,32 @@
+# Copyright (c) 2018-2019, NVIDIA CORPORATION
+# Copyright (c) 2017-      Facebook, Inc
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+#   list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright notice,
+#   this list of conditions and the following disclaimer in the documentation
+#   and/or other materials provided with the distribution.
+#
+# * Neither the name of the copyright holder nor the names of its
+#   contributors may be used to endorse or promote products derived from
+#   this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 import argparse
 import os
 import shutil
@@ -41,7 +70,10 @@ def add_parser_arguments(parser):
     parser.add_argument('data', metavar='DIR',
                         help='path to dataset')
     parser.add_argument('--data-backend', metavar='BACKEND', default='dali-cpu',
-                        choices=DATA_BACKEND_CHOICES)
+                        choices=DATA_BACKEND_CHOICES,
+                        help='data backend: ' +
+                        ' | '.join(DATA_BACKEND_CHOICES) +
+                        ' (default: dali-cpu)')
 
     parser.add_argument('--arch', '-a', metavar='ARCH', default='resnet50',
                         choices=model_names,
@@ -58,17 +90,17 @@ def add_parser_arguments(parser):
                         help='number of data loading workers (default: 5)')
     parser.add_argument('--epochs', default=90, type=int, metavar='N',
                         help='number of total epochs to run')
-    parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
-                        help='manual epoch number (useful on restarts)')
     parser.add_argument('-b', '--batch-size', default=256, type=int,
                         metavar='N', help='mini-batch size (default: 256) per gpu')
 
     parser.add_argument('--optimizer-batch-size', default=-1, type=int,
-                        metavar='N', help='size of a total batch size, for simulating bigger batches')
+                        metavar='N', help='size of the total simulated batch; bigger batches are emulated using gradient accumulation')
 
     parser.add_argument('--lr', '--learning-rate', default=0.1, type=float,
                         metavar='LR', help='initial learning rate')
-    parser.add_argument('--lr-schedule', default='step', type=str, metavar='SCHEDULE', choices=['step','linear','cosine'])
+    parser.add_argument('--lr-schedule', default='step', type=str, metavar='SCHEDULE',
+                        choices=['step','linear','cosine'],
+                        help='Type of LR schedule: {}, {}, {}'.format('step','linear','cosine'))
 
     parser.add_argument('--warmup', default=0, type=int,
                         metavar='E', help='number of warmup epochs')
@@ -83,9 +115,9 @@ def add_parser_arguments(parser):
     parser.add_argument('--weight-decay', '--wd', default=1e-4, type=float,
                         metavar='W', help='weight decay (default: 1e-4)')
     parser.add_argument('--bn-weight-decay', action='store_true',
-                        help='use weight_decay on batch normalization learnable parameters, default: false)')
+                        help='use weight_decay on batch normalization learnable parameters, (default: false)')
     parser.add_argument('--nesterov', action='store_true',
-                        help='use nesterov momentum, default: false)')
+                        help='use nesterov momentum, (default: false)')
 
     parser.add_argument('--print-freq', '-p', default=10, type=int,
                         metavar='N', help='print frequency (default: 10)')
@@ -101,31 +133,28 @@ def add_parser_arguments(parser):
     parser.add_argument('--dynamic-loss-scale', action='store_true',
                         help='Use dynamic loss scaling.  If supplied, this argument supersedes ' +
                         '--static-loss-scale.')
-    parser.add_argument('--prof', type=int, default=-1,
+    parser.add_argument('--prof', type=int, default=-1, metavar='N',
                         help='Run only N iterations')
     parser.add_argument('--amp', action='store_true',
                         help='Run model AMP (automatic mixed precision) mode.')
 
-    parser.add_argument("--local_rank", default=0, type=int)
+    parser.add_argument("--local_rank", default=0, type=int, help='Local rank of python process. Set up by distributed launcher')
 
     parser.add_argument('--seed', default=None, type=int,
-                        help='random seed used for np and pytorch')
+                        help='random seed used for numpy and pytorch')
 
     parser.add_argument('--gather-checkpoints', action='store_true',
-                        help='Gather checkpoints throughout the training')
+                        help='Gather checkpoints throughout the training, without this flag only best and last checkpoints will be stored')
 
     parser.add_argument('--raport-file', default='experiment_raport.json', type=str,
                         help='file in which to store JSON experiment raport')
 
-    parser.add_argument('--final-weights', default='model.pth.tar', type=str,
-                        help='file in which to store final model weights')
-
     parser.add_argument('--evaluate', action='store_true', help='evaluate checkpoint/model')
     parser.add_argument('--training-only', action='store_true', help='do not evaluate')
 
-    parser.add_argument('--no-checkpoints', action='store_false', dest='save_checkpoints')
+    parser.add_argument('--no-checkpoints', action='store_false', dest='save_checkpoints', help='do not store any checkpoints, useful for benchmarking')
 
-    parser.add_argument('--workspace', type=str, default='./')
+    parser.add_argument('--workspace', type=str, default='./', metavar='DIR', help='path to directory where checkpoints will be stored')
 
 
 def main(args):
@@ -188,12 +217,13 @@ def main(args):
         else:
             print("=> no pretrained weights found at '{}'".format(args.resume))
 
+    start_epoch = 0
     # optionally resume from a checkpoint
     if args.resume:
         if os.path.isfile(args.resume):
             print("=> loading checkpoint '{}'".format(args.resume))
             checkpoint = torch.load(args.resume, map_location = lambda storage, loc: storage.cuda(args.gpu))
-            args.start_epoch = checkpoint['epoch']
+            start_epoch = checkpoint['epoch']
             best_prec1 = checkpoint['best_prec1']
             model_state = checkpoint['state_dict']
             optimizer_state = checkpoint['optimizer']
@@ -230,6 +260,10 @@ def main(args):
     elif args.data_backend == 'dali-cpu':
         get_train_loader = get_dali_train_loader(dali_cpu=True)
         get_val_loader = get_dali_val_loader()
+    elif args.data_backend == 'syntetic':
+        get_val_loader = get_syntetic_loader
+        get_train_loader = get_syntetic_loader
+
 
     train_loader, train_loader_len = get_train_loader(args.data, args.batch_size, 1000, args.mixup > 0.0, workers=args.workers, fp16=args.fp16)
     if args.mixup != 0.0:
@@ -283,7 +317,7 @@ def main(args):
         train_loader, val_loader, args.epochs,
         args.fp16, logger, should_backup_checkpoint(args), use_amp=args.amp,
         batch_size_multiplier = batch_size_multiplier,
-        start_epoch = args.start_epoch, best_prec1 = best_prec1, prof=args.prof,
+        start_epoch = start_epoch, best_prec1 = best_prec1, prof=args.prof,
         skip_training = args.evaluate, skip_validation = args.training_only,
         save_checkpoints=args.save_checkpoints and not args.evaluate, checkpoint_dir=args.workspace)
     exp_duration = time.time() - exp_start_time

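The `--optimizer-batch-size` flag above simulates a larger total batch by accumulating gradients over several micro-batches before each optimizer step. A framework-free sketch of the accounting (the helper name is illustrative; in the real script `batch_size_multiplier` plays this role):

```python
def accumulated_grad(micro_batch_grads, batch_size_multiplier):
    """Combine per-micro-batch mean gradients into one simulated-batch gradient.

    Summing the micro-batch means and dividing by the multiplier recovers the
    mean gradient of the full (simulated) batch, so one optimizer step with
    this value behaves like a step on the bigger batch.
    """
    assert len(micro_batch_grads) == batch_size_multiplier
    return sum(micro_batch_grads) / batch_size_multiplier

# Two micro-batches with mean gradients 1.0 and 3.0 act like one batch with mean 2.0.
g = accumulated_grad([1.0, 3.0], batch_size_multiplier=2)
```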
+ 71 - 0
PyTorch/Classification/RN50v1.5/multiproc.py → PyTorch/Classification/ConvNets/multiproc.py

@@ -1,3 +1,74 @@
+# From PyTorch:
+#
+# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2016-     Facebook, Inc            (Adam Paszke)
+# Copyright (c) 2014-     Facebook, Inc            (Soumith Chintala)
+# Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
+# Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
+# Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
+# Copyright (c) 2011-2013 NYU                      (Clement Farabet)
+# Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
+# Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
+# Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
+#
+# From Caffe2:
+#
+# Copyright (c) 2016-present, Facebook Inc. All rights reserved.
+#
+# All contributions by Facebook:
+# Copyright (c) 2016 Facebook Inc.
+#
+# All contributions by Google:
+# Copyright (c) 2015 Google Inc.
+# All rights reserved.
+#
+# All contributions by Yangqing Jia:
+# Copyright (c) 2015 Yangqing Jia
+# All rights reserved.
+#
+# All contributions from Caffe:
+# Copyright(c) 2013, 2014, 2015, the respective contributors
+# All rights reserved.
+#
+# All other contributions:
+# Copyright(c) 2015, 2016 the respective contributors
+# All rights reserved.
+#
+# Caffe2 uses a copyright model similar to Caffe: each contributor holds
+# copyright over their contributions to Caffe2. The project versioning records
+# all such contribution and copyright details. If a contributor wants to further
+# mark their specific copyright on a particular contribution, they should
+# indicate their copyright solely in the commit message of the change when it is
+# committed.
+#
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#
+# 3. Neither the names of Facebook, Deepmind Technologies, NYU, NEC Laboratories America
+#    and IDIAP Research Institute nor the names of its contributors may be
+#    used to endorse or promote products derived from this software without
+#    specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
 import sys
 import subprocess
 import os

+ 602 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/README.md

@@ -0,0 +1,602 @@
+# ResNet50 v1.5 For PyTorch
+
+## Table Of Contents
+* [Model overview](#model-overview)
+  * [Default configuration](#default-configuration)
+  * [Mixed precision training](#mixed-precision-training)
+    * [Enabling mixed precision](#enabling-mixed-precision)
+* [Setup](#setup)
+  * [Requirements](#requirements)
+* [Quick Start Guide](#quick-start-guide)
+* [Advanced](#advanced)
+* [Performance](#performance)
+  * [Benchmarking](#benchmarking)
+    * [Training performance benchmark](#training-performance-benchmark)
+    * [Inference performance benchmark](#inference-performance-benchmark)
+  * [Results](#results)
+    * [Training accuracy results](#training-accuracy-results)
+      * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
+      * [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32G))
+    * [Training performance results](#training-performance-results)
+      * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
+      * [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32G))
+    * [Inference performance results](#inference-performance-results)
+* [Release notes](#release-notes)
+  * [Changelog](#changelog)
+  * [Known issues](#known-issues)
+
+## Model overview
+The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).
+
+The difference between v1 and v1.5 is that, in the bottleneck blocks that require
+downsampling, v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.
+
+This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).
+
+The model is initialized as described in [Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification](https://arxiv.org/pdf/1502.01852.pdf)
+
+### Default configuration
+
+#### Optimizer
+
+This model trains for 90 epochs, with standard ResNet v1.5 setup:
+
+* SGD with momentum (0.875)
+
+* Learning rate = 0.256 for a batch size of 256; for other batch sizes we linearly
+scale the learning rate.
+
+* Learning rate schedule: we use a cosine LR schedule
+
+* For bigger batch sizes (512 and up) we use linear warmup of the learning rate
+during the first few epochs
+according to [Training ImageNet in 1 hour](https://arxiv.org/abs/1706.02677).
+Warmup length depends on total training length.
+
+* Weight decay: 3.0517578125e-05 (1/32768).
+
+* We do not apply WD on Batch Norm trainable parameters (gamma/bias)
+
+* Label Smoothing: 0.1
+
+* We train for:
+
+    * 50 Epochs -> configuration that reaches 75.9% top1 accuracy
+
+    * 90 Epochs -> 90 epochs is a standard for ResNet50
+
+    * 250 Epochs -> best possible accuracy.
+
+* For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
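The linear-scaling and warmup rules above can be written down directly. A sketch (the function name and the epoch-granularity of the warmup are illustrative; the real script computes its schedule elsewhere):

```python
# Weight decay stated above: 1/32768 is exactly 3.0517578125e-05.
WEIGHT_DECAY = 1 / 32768

def learning_rate(epoch, batch_size, warmup_epochs, base_lr=0.256, base_bs=256):
    """Linearly scale the base LR with batch size, with a linear warmup from near 0."""
    peak_lr = base_lr * batch_size / base_bs
    if warmup_epochs > 0 and epoch < warmup_epochs:
        # Ramp linearly so the schedule reaches peak_lr once warmup ends.
        return peak_lr * (epoch + 1) / (warmup_epochs + 1)
    return peak_lr

# Batch size 512 doubles the base LR of 0.256 to 0.512 after warmup.
lr = learning_rate(epoch=10, batch_size=512, warmup_epochs=5)
```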
+
+
+#### Data augmentation
+
+This model uses the following data augmentation:
+
+* For training:
+  * Normalization
+  * Random resized crop to 224x224
+    * Scale from 8% to 100%
+    * Aspect ratio from 3/4 to 4/3
+  * Random horizontal flip
+
+* For inference:
+  * Normalization
+  * Scale to 256x256
+  * Center crop to 224x224
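In torchvision terms the inference pipeline is a resize to 256x256 followed by a 224x224 center crop; the crop arithmetic alone can be checked without any imaging library (the helper name is illustrative):

```python
def center_crop_box(src, dst):
    """Top-left corner and size of a dst x dst center crop in a src x src image."""
    off = (src - dst) // 2
    return off, off, dst, dst

# A 256x256 -> 224x224 center crop starts at pixel (16, 16).
box = center_crop_box(256, 224)
```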
+
+#### Other training recipes
+
+This script does not target any specific benchmark.
+There are changes that others have made which can speed up convergence and/or increase accuracy.
+
+One of the more popular training recipes is provided by [fast.ai](https://github.com/fastai/imagenet-fast).
+
+The fast.ai recipe introduces many changes to the training procedure, one of which is progressive resizing of the training images.
+
+The first part of training uses 128px images, the middle part uses 224px images, and the last part uses 288px images.
+The final validation is performed on 288px images.
+
+The training script in this repository performs validation on 224px images, just as the original paper describes.
+
+These two approaches can't be directly compared, since the fast.ai recipe requires validation on 288px images,
+while this recipe keeps the original assumption that validation is done on 224px images.
+
+Using 288px images means that many more FLOPs are needed during inference to reach the same accuracy.
+
+### DALI
+
+For DGX2 configurations we use [NVIDIA DALI](https://github.com/NVIDIA/DALI),
+which speeds up data loading when the CPU becomes a bottleneck.
+DALI can also run on the CPU, where it still outperforms the PyTorch native data loader.
+
+Run training with `--data-backend dali-gpu` or `--data-backend dali-cpu` to enable DALI.
+For DGX1 we recommend `--data-backend dali-cpu`; for DGX2 we recommend `--data-backend dali-gpu`.
+
+
+### Mixed precision training
+
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+1.  Porting the model to use the FP16 data type where appropriate.
+2.  Adding loss scaling to preserve small gradient values.
+
+The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
+
+For information about:
+-   How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
+-   Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
+-   How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
+-   APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
+
+#### Enabling mixed precision
+
+Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from [APEX](https://github.com/NVIDIA/apex), which casts variables to half-precision upon retrieval
+while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients.
+In PyTorch, loss scaling can be easily applied by using the scale_loss() method provided by AMP. The scaling value can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.
+
+For an in-depth walkthrough of AMP, see the sample usage [here](https://github.com/NVIDIA/apex/tree/master/apex/amp#usage-and-getting-started). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage Tensor Core performance.
+
+To enable mixed precision, you can:
+- Import AMP from APEX, for example:
+
+  ```
+  from apex import amp
+  ```
+- Initialize an AMP handle, for example:
+
+  ```
+  amp_handle = amp.init(enabled=True, verbose=True)
+  ```
+- Wrap your optimizer with the AMP handle, for example:
+
+  ```
+  optimizer = amp_handle.wrap_optimizer(optimizer)
+  ```
+- Scale loss before backpropagation (assuming loss is stored in a variable called losses)
+  - Default backpropagate for FP32:
+
+    ```
+    losses.backward()
+    ```
+  - Scale loss and backpropagate with AMP:
+
+    ```
+    with optimizer.scale_loss(losses) as scaled_losses:
+       scaled_losses.backward()
+    ```
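The effect of loss scaling can be illustrated with plain numbers: gradients are computed on `scale * loss`, so they come out multiplied by `scale` (keeping small values representable in FP16), and must be divided back before the weight update. A framework-free sketch, assuming a fixed scale as passed via `--static-loss-scale`:

```python
SCALE = 256.0  # static loss scale, e.g. --static-loss-scale 256

def backward_with_scaling(grad_of_loss):
    """Round-trip a gradient through static loss scaling.

    backward() on (SCALE * loss) yields SCALE * grad; the optimizer must see
    the unscaled value, so we divide by SCALE before the step.
    """
    scaled_grad = SCALE * grad_of_loss  # what backward() produces on the scaled loss
    return scaled_grad / SCALE          # what the optimizer actually applies

g = backward_with_scaling(1e-7)  # a tiny gradient survives the round trip
```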
+
+## Setup
+
+### Requirements
+
+Ensure you meet the following requirements:
+
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PyTorch 19.05-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* (optional) NVIDIA Volta GPU (see section below) - for best training performance using mixed precision
+
+For more information about how to get started with NGC containers, see the
+following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
+DGX Documentation:
+* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
+* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
+* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
+
+## Quick Start Guide
+
+### 1. Clone the repository.
+```
+git clone https://github.com/NVIDIA/DeepLearningExamples
+cd DeepLearningExamples/PyTorch/Classification/ConvNets/
+```
+
+### 2. Download and preprocess the dataset.
+
+The ResNet50 v1.5 script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
+
+PyTorch can work directly on JPEGs, therefore, preprocessing/augmentation is not needed.
+
+1. Download the images from http://image-net.org/download-images
+
+2. Extract the training data:
+  ```bash
+  mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
+  tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
+  find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
+  cd ..
+  ```
+
+3. Extract the validation data and move the images to subfolders:
+  ```bash
+  mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
+  wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
+  ```
+
+The directory in which the `train/` and `val/` directories are placed is referred to as `<path to imagenet>` in this document.
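
After extraction, `train/` and `val/` should each contain one subdirectory per class (1000 synsets on the full dataset) with the JPEGs inside. The layout can be sanity-checked with a short stdlib-only script; the helper below is hypothetical and not part of the repository, and it uses a toy two-class layout for illustration.

```python
# Sanity-check the ImageNet directory layout the scripts expect:
# <path to imagenet>/train/<synset>/*.JPEG and the same under val/.
# Stdlib-only sketch with a toy two-class layout for illustration.
import os
import tempfile

def count_classes(split_dir):
    """Return the number of class subdirectories under train/ or val/."""
    return sum(1 for entry in os.scandir(split_dir) if entry.is_dir())

# Toy example with two fake synsets instead of the full 1000:
root = tempfile.mkdtemp()
for split in ("train", "val"):
    for synset in ("n01440764", "n01443537"):
        os.makedirs(os.path.join(root, split, synset))

assert count_classes(os.path.join(root, "train")) == 2
assert count_classes(os.path.join(root, "val")) == 2
# On a real ImageNet-1k layout both counts should equal 1000.
```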
+
+### 3. Build the RN50v1.5 PyTorch NGC container.
+
+```
+docker build . -t nvidia_rn50
+```
+
+### 4. Start an interactive session in the NGC container to run training/inference.
+```
+nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_rn50
+```
+
+### 5. Running training
+
+To run training for a standard configuration (DGX1V/DGX2V, AMP/FP16/FP32, 50/90/250 epochs),
+run one of the scripts in the `./resnet50v1.5/training` directory
+called `./resnet50v1.5/training/{DGX1, DGX2}_RN50_{AMP, FP16, FP32}_{50,90,250}E.sh`.
+
+Ensure ImageNet is mounted in the `/imagenet` directory.
+
+Example:
+    `bash ./resnet50v1.5/training/DGX1_RN50_FP16_250E.sh`
+   
+To run a non-standard configuration, use:
+
+* For 1 GPU
+    * FP32
+        `python ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 <path to imagenet>`
+    * FP16
+        `python ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 --fp16 --static-loss-scale 256 <path to imagenet>`
+    * AMP
+        `python ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 --amp --static-loss-scale 256 <path to imagenet>`
+
+* For multiple GPUs
+    * FP32
+        `python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 <path to imagenet>`
+    * FP16
+        `python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 --fp16 --static-loss-scale 256 <path to imagenet>`
+    * AMP
+        `python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 --amp --static-loss-scale 256 <path to imagenet>`
+
+Use `python ./main.py -h` to obtain the list of available options in the `main.py` script.
+
+### 6. Running inference
+
+To run inference on a checkpointed model, run:
+
+`python ./main.py --arch resnet50 --evaluate --epochs 1 --resume <path to checkpoint> -b <batch size> <path to imagenet>`
+
+## Advanced
+
+### Command-line options
+
+```
+usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
+               [--model-config CONF] [-j N] [--epochs N] [-b N]
+               [--optimizer-batch-size N] [--lr LR] [--lr-schedule SCHEDULE]
+               [--warmup E] [--label-smoothing S] [--mixup ALPHA]
+               [--momentum M] [--weight-decay W] [--bn-weight-decay]
+               [--nesterov] [--print-freq N] [--resume PATH]
+               [--pretrained-weights PATH] [--fp16]
+               [--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
+               [--prof N] [--amp] [--local_rank LOCAL_RANK] [--seed SEED]
+               [--gather-checkpoints] [--raport-file RAPORT_FILE] [--evaluate]
+               [--training-only] [--no-checkpoints] [--workspace DIR]
+               DIR
+
+PyTorch ImageNet Training
+
+positional arguments:
+  DIR                   path to dataset
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --data-backend BACKEND
+                        data backend: pytorch | dali-gpu | dali-cpu (default:
+                        pytorch)
+  --arch ARCH, -a ARCH  model architecture: resnet18 | resnet34 | resnet50 |
+                        resnet101 | resnet152 (default: resnet50)
+  --model-config CONF, -c CONF
+                        model configs: classic | fanin (default: classic)
+  -j N, --workers N     number of data loading workers (default: 5)
+  --epochs N            number of total epochs to run
+  -b N, --batch-size N  mini-batch size (default: 256) per gpu
+  --optimizer-batch-size N
+                        size of a total batch size, for simulating bigger
+                        batches using gradient accumulation
+  --lr LR, --learning-rate LR
+                        initial learning rate
+  --lr-schedule SCHEDULE
+                        Type of LR schedule: step, linear, cosine
+  --warmup E            number of warmup epochs
+  --label-smoothing S   label smoothing
+  --mixup ALPHA         mixup alpha
+  --momentum M          momentum
+  --weight-decay W, --wd W
+                        weight decay (default: 1e-4)
+  --bn-weight-decay     use weight_decay on batch normalization learnable
+                        parameters, (default: false)
+  --nesterov            use nesterov momentum, (default: false)
+  --print-freq N, -p N  print frequency (default: 10)
+  --resume PATH         path to latest checkpoint (default: none)
+  --pretrained-weights PATH
+                        load weights from here
+  --fp16                Run model fp16 mode.
+  --static-loss-scale STATIC_LOSS_SCALE
+                        Static loss scale, positive power of 2 values can
+                        improve fp16 convergence.
+  --dynamic-loss-scale  Use dynamic loss scaling. If supplied, this argument
+                        supersedes --static-loss-scale.
+  --prof N              Run only N iterations
+  --amp                 Run model AMP (automatic mixed precision) mode.
+  --local_rank LOCAL_RANK
+                        Local rank of python process. Set up by distributed
+                        launcher
+  --seed SEED           random seed used for numpy and pytorch
+  --gather-checkpoints  Gather checkpoints throughout the training, without
+                        this flag only best and last checkpoints will be
+                        stored
+  --raport-file RAPORT_FILE
+                        file in which to store JSON experiment raport
+  --evaluate            evaluate checkpoint/model
+  --training-only       do not evaluate
+  --no-checkpoints      do not store any checkpoints, useful for benchmarking
+  --workspace DIR       path to directory where checkpoints will be stored
+```
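
The `--optimizer-batch-size` option simulates a larger batch via gradient accumulation. The arithmetic behind it can be sketched as follows; this is a standalone illustration, and the actual logic lives in the training code and may differ in details.

```python
# How --optimizer-batch-size relates to per-GPU batch size and GPU count.
# Standalone sketch of the gradient-accumulation arithmetic.
def accumulation_steps(optimizer_batch_size, per_gpu_batch_size, num_gpus):
    """Number of forward/backward passes accumulated before each optimizer step."""
    global_batch = per_gpu_batch_size * num_gpus
    assert optimizer_batch_size % global_batch == 0, \
        "optimizer batch size should be divisible by the global batch size"
    return optimizer_batch_size // global_batch

# E.g. the DGX-1 configs use -b 256 on 8 GPUs with --optimizer-batch-size 2048:
assert accumulation_steps(2048, 256, 8) == 1   # no accumulation needed
# With only 2 GPUs the same optimizer batch requires accumulating 4 steps:
assert accumulation_steps(2048, 256, 2) == 4
```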
+
+## Performance
+
+### Benchmarking
+
+#### Training performance benchmark
+
+To benchmark training, run:
+
+* For 1 GPU
+    * FP32
+`python ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
+    * FP16
+`python ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --fp16 --static-loss-scale 256 <path to imagenet>`
+    * AMP
+`python ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 <path to imagenet>`
+* For multiple GPUs
+    * FP32
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
+    * FP16
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --fp16 --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+    * AMP
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+
+Each of these scripts will run 100 iterations and save the results in the `benchmark.json` file.
+
+#### Inference performance benchmark
+
+To benchmark inference, run:
+
+* FP32
+
+`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
+
+* FP16
+
+`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --fp16 <path to imagenet>`
+
+* AMP
+
+`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp <path to imagenet>`
+
+Each of these scripts will run 100 iterations and save the results in the `benchmark.json` file.
+
+
+
+### Results
+
+#### Training Accuracy Results
+
+##### NVIDIA DGX-1 (8x V100 16G)
+
+| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
+|:-:|:-:|:-:|
+| 50 | 76.25 +/- 0.04 | 76.26 +/- 0.07 |
+| 90 | 77.23 +/- 0.04 | 77.08 +/- 0.08 |
+| 250 | 78.42 +/- 0.04 | 78.30 +/- 0.16 |
+
+##### NVIDIA DGX-2 (16x V100 32G)
+
+| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
+|:-:|:-:|:-:|
+| 50 | 75.81 +/- 0.08 | 76.04 +/- 0.05 |
+| 90 | 77.10 +/- 0.06 | 77.23 +/- 0.04 |
+| 250 | 78.59 +/- 0.13 | 78.46 +/- 0.03 |
+
+
+
+##### Example plots (90 Epochs configuration on DGX1V)
+
+![ValidationLoss](./img/loss_plot.png)
+
+![ValidationTop1](./img/top1_plot.png)
+
+![ValidationTop5](./img/top5_plot.png)
+
+#### Training Performance Results
+
+##### NVIDIA DGX1-16G (8x V100 16G)
+
+| **# of GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
+|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 893.09 img/s | 380.44 img/s | 2.35x | 1.00x | 1.00x |
+| 8 | 6888.75 img/s | 2945.37 img/s | 2.34x | 7.71x | 7.74x |
+
+##### NVIDIA DGX1-32G (8x V100 32G)
+
+| **# of GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
+|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 849.63 img/s | 373.93 img/s | 2.27x | 1.00x | 1.00x |
+| 8 | 6614.15 img/s | 2911.22 img/s | 2.27x | 7.78x | 7.79x |
+
+##### NVIDIA DGX2 (16x V100 32G)
+
+| **# of GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
+|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 894.41 img/s | 402.23 img/s | 2.22x | 1.00x | 1.00x |
+| 16 | 13443.82 img/s | 6263.41 img/s | 2.15x | 15.03x | 15.57x |
+
+#### Training Time for 90 Epochs
+
+##### NVIDIA DGX-1 (8x V100 16G)
+
+| **# of GPUs** | **Mixed Precision training time** | **FP32 training time** |
+|:-:|:-:|:-:|
+| 1 | ~ 41 h | ~ 95 h |
+| 8 | ~ 7 h | ~ 14 h |
+
+##### NVIDIA DGX-2 (16x V100 32G)
+
+| **# of GPUs** | **Mixed Precision training time** | **FP32 training time** |
+|:-:|:-:|:-:|
+| 1 | ~ 41 h | ~ 90 h |
+| 16 | ~ 5 h | ~ 8 h |
+
+
+
+#### Inference Performance Results
+
+##### NVIDIA VOLTA V100 16G on DGX1V
+
+###### FP32 Inference Latency
+
+| **batch_size** | **FP32 50.0%** | **FP32 90.0%** | **FP32 99.0%** | **FP32 100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 6.91ms | 7.25ms | 10.92ms | 11.70ms |
+| 2 | 7.16ms | 7.41ms | 9.11ms | 14.58ms |
+| 4 | 7.29ms | 7.58ms | 10.09ms | 13.75ms |
+| 8 | 9.81ms | 10.46ms | 12.75ms | 15.36ms |
+| 16 | 15.76ms | 15.88ms | 16.63ms | 17.49ms |
+| 32 | 28.60ms | 28.71ms | 29.30ms | 30.81ms |
+| 64 | 53.68ms | 53.86ms | 54.23ms | 54.86ms |
+| 128 | 104.21ms | 104.68ms | 105.00ms | 106.19ms |
+| 256 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference Latency
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 8.25ms | 0.8x | 9.32ms | 0.8x | 12.79ms | 0.9x | 14.29ms | 0.8x |
+| 2 | 7.95ms | 0.9x | 8.75ms | 0.8x | 12.31ms | 0.7x | 15.92ms | 0.9x |
+| 4 | 8.52ms | 0.9x | 9.20ms | 0.8x | 10.60ms | 1.0x | 11.23ms | 1.2x |
+| 8 | 8.78ms | 1.1x | 9.31ms | 1.1x | 10.82ms | 1.2x | 12.54ms | 1.2x |
+| 16 | 8.77ms | 1.8x | 9.05ms | 1.8x | 12.81ms | 1.3x | 14.05ms | 1.2x |
+| 32 | 14.03ms | 2.0x | 14.14ms | 2.0x | 14.92ms | 2.0x | 15.06ms | 2.0x |
+| 64 | 25.91ms | 2.1x | 26.05ms | 2.1x | 26.17ms | 2.1x | 27.17ms | 2.0x |
+| 128 | 50.11ms | 2.1x | 50.28ms | 2.1x | 50.68ms | 2.1x | 51.43ms | 2.1x |
+| 256 | 96.70ms | N/A | 96.91ms | N/A | 97.14ms | N/A | 98.04ms | N/A |
+
+###### FP32 Inference throughput
+
+| **batch_size** | **FP32 @50.0%** | **FP32 @90.0%** | **FP32 @99.0%** | **FP32 @100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 140 img/s | 133 img/s | 89 img/s | 70 img/s |
+| 2 | 271 img/s | 259 img/s | 208 img/s | 132 img/s |
+| 4 | 531 img/s | 506 img/s | 325 img/s | 248 img/s |
+| 8 | 782 img/s | 729 img/s | 523 img/s | 513 img/s |
+| 16 | 992 img/s | 970 img/s | 832 img/s | 624 img/s |
+| 32 | 1101 img/s | 1093 img/s | 963 img/s | 871 img/s |
+| 64 | 1179 img/s | 1161 img/s | 1102 img/s | 1090 img/s |
+| 128 | 1220 img/s | 1213 img/s | 1159 img/s | 1148 img/s |
+| 256 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference throughput
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 118 img/s | 0.8x | 104 img/s | 0.8x | 76 img/s | 0.9x | 66 img/s | 0.9x |
+| 2 | 244 img/s | 0.9x | 220 img/s | 0.8x | 153 img/s | 0.7x | 123 img/s | 0.9x |
+| 4 | 455 img/s | 0.9x | 423 img/s | 0.8x | 367 img/s | 1.1x | 275 img/s | 1.1x |
+| 8 | 886 img/s | 1.1x | 836 img/s | 1.1x | 626 img/s | 1.2x | 471 img/s | 0.9x |
+| 16 | 1771 img/s | 1.8x | 1713 img/s | 1.8x | 1100 img/s | 1.3x | 905 img/s | 1.5x |
+| 32 | 2217 img/s | 2.0x | 1949 img/s | 1.8x | 1619 img/s | 1.7x | 1385 img/s | 1.6x |
+| 64 | 2416 img/s | 2.0x | 2212 img/s | 1.9x | 1993 img/s | 1.8x | 1985 img/s | 1.8x |
+| 128 | 2524 img/s | 2.1x | 2287 img/s | 1.9x | 2046 img/s | 1.8x | 1503 img/s | 1.3x |
+| 256 | 2626 img/s | N/A | 2149 img/s | N/A | 1533 img/s | N/A | 1346 img/s | N/A |
+
+##### NVIDIA T4
+
+###### FP32 Inference Latency
+
+| **batch_size** | **FP32 50.0%** | **FP32 90.0%** | **FP32 99.0%** | **FP32 100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 5.19ms | 5.65ms | 10.97ms | 14.06ms |
+| 2 | 5.39ms | 5.95ms | 9.81ms | 13.17ms |
+| 4 | 6.65ms | 7.34ms | 9.65ms | 13.26ms |
+| 8 | 10.13ms | 10.33ms | 13.87ms | 16.51ms |
+| 16 | 16.76ms | 17.15ms | 21.06ms | 25.66ms |
+| 32 | 31.02ms | 31.12ms | 32.41ms | 34.93ms |
+| 64 | 57.60ms | 57.84ms | 58.05ms | 59.69ms |
+| 128 | 110.91ms | 111.15ms | 112.16ms | 112.20ms |
+| 256 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference Latency
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 5.75ms | 0.9x | 5.92ms | 1.0x | 11.58ms | 0.9x | 15.17ms | 0.9x |
+| 2 | 5.66ms | 1.0x | 6.05ms | 1.0x | 11.52ms | 0.9x | 18.27ms | 0.7x |
+| 4 | 5.86ms | 1.1x | 6.33ms | 1.2x | 8.90ms | 1.1x | 11.89ms | 1.1x |
+| 8 | 6.43ms | 1.6x | 7.31ms | 1.4x | 12.41ms | 1.1x | 13.12ms | 1.3x |
+| 16 | 8.85ms | 1.9x | 9.86ms | 1.7x | 17.01ms | 1.2x | 19.01ms | 1.3x |
+| 32 | 15.42ms | 2.0x | 15.61ms | 2.0x | 18.66ms | 1.7x | 29.76ms | 1.2x |
+| 64 | 28.50ms | 2.0x | 28.69ms | 2.0x | 31.06ms | 1.9x | 34.26ms | 1.7x |
+| 128 | 54.82ms | 2.0x | 54.96ms | 2.0x | 55.27ms | 2.0x | 60.48ms | 1.9x |
+| 256 | 106.47ms | N/A | 106.62ms | N/A | 107.03ms | N/A | 111.27ms | N/A |
+
+###### FP32 Inference throughput
+
+| **batch_size** | **FP32 @50.0%** | **FP32 @90.0%** | **FP32 @99.0%** | **FP32 @100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 186 img/s | 171 img/s | 79 img/s | 57 img/s |
+| 2 | 359 img/s | 320 img/s | 122 img/s | 113 img/s |
+| 4 | 570 img/s | 529 img/s | 391 img/s | 204 img/s |
+| 8 | 757 img/s | 707 img/s | 479 img/s | 432 img/s |
+| 16 | 918 img/s | 899 img/s | 750 img/s | 615 img/s |
+| 32 | 1017 img/s | 1011 img/s | 932 img/s | 756 img/s |
+| 64 | 1101 img/s | 1096 img/s | 1034 img/s | 1015 img/s |
+| 128 | 1148 img/s | 1145 img/s | 1096 img/s | 874 img/s |
+| 256 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference throughput
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 169 img/s | 0.9x | 163 img/s | 1.0x | 79 img/s | 1.0x | 55 img/s | 1.0x |
+| 2 | 343 img/s | 1.0x | 311 img/s | 1.0x | 122 img/s | 1.0x | 107 img/s | 0.9x |
+| 4 | 662 img/s | 1.2x | 612 img/s | 1.2x | 430 img/s | 1.1x | 215 img/s | 1.1x |
+| 8 | 1207 img/s | 1.6x | 1055 img/s | 1.5x | 601 img/s | 1.3x | 384 img/s | 0.9x |
+| 16 | 1643 img/s | 1.8x | 1552 img/s | 1.7x | 908 img/s | 1.2x | 824 img/s | 1.3x |
+| 32 | 1919 img/s | 1.9x | 1674 img/s | 1.7x | 1393 img/s | 1.5x | 1021 img/s | 1.4x |
+| 64 | 2201 img/s | 2.0x | 1772 img/s | 1.6x | 1569 img/s | 1.5x | 1342 img/s | 1.3x |
+| 128 | 2311 img/s | 2.0x | 1833 img/s | 1.6x | 1261 img/s | 1.2x | 1107 img/s | 1.3x |
+| 256 | 2389 img/s | N/A | 1841 img/s | N/A | 1280 img/s | N/A | 1164 img/s | N/A |
+
+
+
+## Release notes
+
+### Changelog
+
+1. September 2018
+  * Initial release
+2. January 2019
+  * Added options Label Smoothing, fan-in initialization, skipping weight decay on batch norm gamma and bias.
+3. May 2019
+  * Cosine LR schedule
+  * MixUp regularization
+  * DALI support
+  * DGX2 configurations
+  * gradients accumulation
+4. July 2019
+  * DALI-CPU dataloader
+  * Updated README
+
+### Known issues
+
+There are no known issues with this model.
+
+

BIN
PyTorch/Classification/ConvNets/resnet50v1.5/img/loss_plot.png


BIN
PyTorch/Classification/ConvNets/resnet50v1.5/img/top1_plot.png


BIN
PyTorch/Classification/ConvNets/resnet50v1.5/img/top5_plot.png


+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_AMP_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_AMP_50E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_AMP_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP16_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 250 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP16_50E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 50

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP16_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 90

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP32_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP32_50E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX1_RN50_FP32_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_AMP_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_AMP_50E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_AMP_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP16_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 250 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP16_50E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 50

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP16_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 90

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP32_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP32_50E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/DGX2_RN50_FP32_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90

+ 476 - 0
PyTorch/Classification/ConvNets/resnext101-32x4d/README.md

@@ -0,0 +1,476 @@
+# ResNeXt101-32x4d For PyTorch
+
+## Table Of Contents
+* [Model overview](#model-overview)
+  * [Default configuration](#default-configuration)
+  * [Mixed precision training](#mixed-precision-training)
+    * [Enabling mixed precision](#enabling-mixed-precision)
+* [Setup](#setup)
+  * [Requirements](#requirements)
+* [Quick Start Guide](#quick-start-guide)
+* [Advanced](#advanced)
+* [Performance](#performance)
+  * [Benchmarking](#benchmarking)
+    * [Training performance benchmark](#training-performance-benchmark)
+    * [Inference performance benchmark](#inference-performance-benchmark)
+  * [Results](#results)
+    * [Training accuracy results](#training-accuracy-results)
+      * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
+      * [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32G))
+    * [Training performance results](#training-performance-results)
+      * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
+      * [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32G))
+    * [Inference performance results](#inference-performance-results)
+* [Release notes](#release-notes)
+  * [Changelog](#changelog)
+  * [Known issues](#known-issues)
+
+## Model overview
+
+The ResNeXt101-32x4d is a model introduced in [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/pdf/1611.05431.pdf) paper.
+
+It is based on the regular ResNet model, substituting 3x3 convolutions inside the bottleneck block with 3x3 grouped convolutions.
+
+![ResNextArch](./img/ResNeXtArch.png)
+
+_Image source: [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/pdf/1611.05431.pdf)_
+
+The ResNeXt101-32x4d model's cardinality equals 32 and its bottleneck width equals 4.
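
The parameter savings from grouped convolutions can be seen with a quick calculation. The sketch below uses the 128-channel bottleneck width of the first stage (32 groups x 4 channels); it illustrates the arithmetic only and is not code from the model definition.

```python
# Parameter count of the 3x3 convolution inside a bottleneck block,
# regular vs. grouped (cardinality 32, bottleneck width 4 -> 128 channels
# in the first stage). Standalone sketch of the arithmetic.
def conv3x3_params(in_ch, out_ch, groups=1):
    """Weights in a 3x3 convolution: k*k * (in_ch/groups) * out_ch."""
    return 3 * 3 * (in_ch // groups) * out_ch

dense = conv3x3_params(128, 128)               # plain ResNet bottleneck
grouped = conv3x3_params(128, 128, groups=32)  # ResNeXt 32x4d bottleneck

assert dense == 147456
assert grouped == 4608
assert dense // grouped == 32  # grouping by 32 cuts the 3x3 params 32x
```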
+
+### Default configuration
+
+#### Optimizer
+
+This model uses the SGD optimizer with momentum and the following hyperparameters:
+
+* momentum (0.875)
+
+* Learning rate = 0.256 for a batch size of 256; for other batch sizes we linearly
+scale the learning rate.
+
+* Learning rate schedule - we use cosine LR schedule
+
+* For bigger batch sizes (512 and up) we use linear warmup of the learning rate
+during the first few epochs
+according to [Training ImageNet in 1 hour](https://arxiv.org/abs/1706.02677).
+Warmup length depends on total training length.
+
+* Weight decay: 6.103515625e-05 (1/16384).
+
+* We do not apply weight decay on batch norm trainable parameters (gamma/bias).
+
+* Label Smoothing: 0.1
+
+* We train for:
+
+    * 90 Epochs -> 90 epochs is a standard for ImageNet networks
+
+    * 250 Epochs -> best possible accuracy.
+
+* For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
+
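The learning-rate policy above can be sketched as follows; this is an illustration only, and the exact implementation in `main.py` may differ in details.

```python
# Sketch of the LR policy described above: linear scaling with batch size,
# linear warmup, then a cosine schedule. Illustration only; the exact
# implementation in main.py may differ.
import math

BASE_LR, BASE_BATCH = 0.256, 256

def learning_rate(epoch, total_epochs, batch_size, warmup_epochs=0):
    peak_lr = BASE_LR * batch_size / BASE_BATCH     # linear scaling rule
    if epoch < warmup_epochs:                       # linear warmup
        return peak_lr * (epoch + 1) / warmup_epochs
    # cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# A 2048 global batch scales the peak LR to 0.256 * 2048/256 = 2.048:
assert abs(learning_rate(4, 90, 2048, warmup_epochs=5) - 2.048) < 1e-9
# The weight decay 6.103515625e-05 quoted above is exactly 1/16384:
assert 1.0 / 16384 == 6.103515625e-05
```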
+
+#### Data augmentation
+
+This model uses the following data augmentation:
+
+* For training:
+  * Normalization
+  * Random resized crop to 224x224
+    * Scale from 8% to 100%
+    * Aspect ratio from 3/4 to 4/3
+  * Random horizontal flip
+
+* For inference:
+  * Normalization
+  * Scale to 256x256
+  * Center crop to 224x224
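
The crop geometry above can be sketched numerically; this is a standalone illustration, and the scripts themselves use standard torchvision-style transforms.

```python
# Geometry of the inference pre-processing described above:
# scale to 256x256, then take a central 224x224 crop.
# Standalone sketch of the arithmetic only.
def center_crop_box(height, width, crop=224):
    """Return (top, left, bottom, right) of a centered crop window."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop

# For a 256x256 resized image the crop is offset by 16 pixels on each side:
assert center_crop_box(256, 256) == (16, 16, 240, 240)

# For training, the random resized crop samples 8%-100% of the image area
# (aspect ratio in [3/4, 4/3]); e.g. the smallest crop of a 500x375 image
# covers 0.08 * 500 * 375 = 15000 pixels.
assert abs(0.08 * 500 * 375 - 15000) < 1e-6
```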
+
+### DALI
+
+For DGX2 configurations we use [NVIDIA DALI](https://github.com/NVIDIA/DALI),
+which speeds up data loading when CPU becomes a bottleneck.
+DALI can also use CPU, and it outperforms the PyTorch native dataloader.
+
+Run training with `--data-backend dali-gpu` or `--data-backend dali-cpu` to enable DALI.
+For DGX1 we recommend `--data-backend dali-cpu`; for DGX2 we recommend `--data-backend dali-gpu`.
+
+
+### Mixed precision training
+
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+1.  Porting the model to use the FP16 data type where appropriate.
+2.  Adding loss scaling to preserve small gradient values.
+
+The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
+
+For information about:
+-   How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
+-   Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
+-   How to access and enable AMP for PyTorch, see the [APEX AMP documentation](https://nvidia.github.io/apex/amp.html).
+-   APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
+
+#### Enabling mixed precision
+
+Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from [APEX](https://github.com/NVIDIA/apex), which casts variables to half-precision upon retrieval,
+while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients.
+In PyTorch, loss scaling can be easily applied by using the `scale_loss()` method provided by AMP. The scaling value can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.
+
+For an in-depth walkthrough of AMP, check out the sample usage [here](https://github.com/NVIDIA/apex/tree/master/apex/amp#usage-and-getting-started). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage Tensor Core performance.
+
+To enable mixed precision, you can:
+- Import AMP from APEX, for example:
+
+  ```
+  from apex import amp
+  ```
+- Initialize an AMP handle, for example:
+
+  ```
+  amp_handle = amp.init(enabled=True, verbose=True)
+  ```
+- Wrap your optimizer with the AMP handle, for example:
+
+  ```
+  optimizer = amp_handle.wrap_optimizer(optimizer)
+  ```
+- Scale loss before backpropagation (assuming loss is stored in a variable called losses)
+  - Default backpropagate for FP32:
+
+    ```
+    losses.backward()
+    ```
+  - Scale loss and backpropagate with AMP:
+
+    ```
+    with optimizer.scale_loss(losses) as scaled_losses:
+       scaled_losses.backward()
+    ```
+
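The reason loss scaling is needed can be seen with a small numeric experiment: gradient values below FP16's smallest subnormal (about 6e-8) flush to zero, while the same values multiplied by a scale factor (such as the static scale of 256 used by the AMP commands in this guide) survive and can be unscaled after backpropagation. The following is an illustrative stdlib-only sketch, not part of the repository; AMP performs this scaling automatically:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision (FP16 storage)."""
    return struct.unpack('e', struct.pack('e', x))[0]

small_grad = 1e-8          # below FP16's smallest subnormal (~5.96e-8)
scale = 256                # static loss scale, as in --static-loss-scale 256

print(to_fp16(small_grad))          # 0.0 -> the gradient is lost in FP16
print(to_fp16(small_grad * scale))  # nonzero -> preserved; unscale after backward
```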
+## Setup
+
+### Requirements
+
+Ensure you meet the following requirements:
+
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* (optional) NVIDIA Volta GPU (see section below) - for best training performance using mixed precision
+
+For more information about how to get started with NGC containers, see the
+following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
+DGX Documentation:
+* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
+* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
+* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
+
+## Quick Start Guide
+
+### 1. Clone the repository.
+```
+git clone https://github.com/NVIDIA/DeepLearningExamples
+cd DeepLearningExamples/PyTorch/Classification/RNXT101-32x4d/
+```
+
+### 2. Download and preprocess the dataset.
+
+The ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
+
+PyTorch can work directly on JPEGs, therefore, preprocessing/augmentation is not needed.
+
+1. Download the images from http://image-net.org/download-images
+
+2. Extract the training data:
+  ```bash
+  mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
+  tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
+  find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
+  cd ..
+  ```
+
+3. Extract the validation data and move the images to subfolders:
+  ```bash
+  mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
+  wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
+  ```
+
+The directory in which the `train/` and `val/` directories are placed is referred to as `<path to imagenet>` in this document.
+
+### 3. Build the RNXT101-32x4d PyTorch NGC container.
+
+```
+docker build . -t nvidia_rnxt101-32x4d
+```
+
+### 4. Start an interactive session in the NGC container to run training/inference.
+```
+nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_rnxt101-32x4d
+```
+
+### 5. Running training
+
+To run training for a standard configuration (DGX1V/DGX2V, AMP/FP32, 90/250 Epochs),
+run one of the scripts in the `./resnext101-32x4d/training` directory
+called `./resnext101-32x4d/training/{DGX1, DGX2}_RNXT101-32x4d_{AMP, FP32}_{90,250}E.sh`.
+
+Ensure ImageNet is mounted in the `/imagenet` directory.
+
+Example:
+    `bash ./resnext101-32x4d/training/DGX1_RNXT101-32x4d_AMP_250E.sh`
+
+To run a non-standard configuration, use:
+
+* For 1 GPU
+    * FP32
+        `python ./main.py --arch resnext101-32x4d -c fanin --label-smoothing 0.1 <path to imagenet>`
+    * AMP
+        `python ./main.py --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --amp --static-loss-scale 256 <path to imagenet>`
+
+* For multiple GPUs
+    * FP32
+        `python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d -c fanin --label-smoothing 0.1 <path to imagenet>`
+    * AMP
+        `python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --amp --static-loss-scale 256 <path to imagenet>`
+
+Use `python ./main.py -h` to obtain the list of available options in the `main.py` script.
+
+### 6. Running inference
+
+To run inference on a checkpointed model, run:
+
+`python ./main.py --arch resnext101-32x4d --evaluate --epochs 1 --resume <path to checkpoint> -b <batch size> <path to imagenet>`
+
+## Advanced
+
+### Command-line options:
+
+```
+```
+
+## Performance
+
+### Benchmarking
+
+#### Training performance benchmark
+
+To benchmark training, run:
+
+* For 1 GPU
+    * FP32
+`python ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
+    * AMP
+`python ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 <path to imagenet>`
+* For multiple GPUs
+    * FP32
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
+    * AMP
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+
+Each of these scripts runs 100 iterations and saves the results in the `benchmark.json` file.
+
+#### Inference performance benchmark
+
+To benchmark inference, run:
+
+* FP32
+
+`python ./main.py --arch resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
+
+* AMP
+
+`python ./main.py --arch resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp <path to imagenet>`
+
+Each of these scripts runs 100 iterations and saves the results in the `benchmark.json` file.
+
+
+
+### Results
+
+#### Training Accuracy Results
+
+##### NVIDIA DGX-1 (8x V100 16G)
+
+| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
+|:-:|:-:|:-:|
+| 90 | 79.23 +/- 0.09 | 79.23 +/- 0.09 |
+| 250 | 79.92 +/- 0.13 | 80.06 +/- 0.06 |
+
+##### NVIDIA DGX-2 (16x V100 32G)
+
+No Data
+
+
+
+##### Example plots (90 Epochs configuration on DGX1V)
+
+![ValidationLoss](./img/loss_plot.png)
+
+![ValidationTop1](./img/top1_plot.png)
+
+![ValidationTop5](./img/top5_plot.png)
+
+#### Training Performance Results
+
+##### NVIDIA DGX1-16G (8x V100 16G)
+
+| **# of GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
+|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 313.43 img/s | 146.66 img/s | 2.14x | 1.00x | 1.00x |
+| 8 | 2384.85 img/s | 1116.58 img/s | 2.14x | 7.61x | 7.61x |
+
+##### NVIDIA DGX1-32G (8x V100 32G)
+
+| **# of GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
+|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 297.83 img/s | 143.27 img/s | 2.08x | 1.00x | 1.00x |
+| 8 | 2270.85 img/s | 1104.62 img/s | 2.06x | 7.62x | 7.71x |
+
+##### NVIDIA DGX2 (16x V100 32G)
+
+| **# of GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
+|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 308.42 img/s | 151.67 img/s | 2.03x | 1.00x | 1.00x |
+| 16 | 4473.37 img/s | 2261.97 img/s | 1.98x | 14.50x | 14.91x |
+
+#### Training Time for 90 Epochs
+
+##### NVIDIA DGX-1 (8x V100 16G)
+
+| **# of GPUs** | **Mixed Precision training time** | **FP32 training time** |
+|:-:|:-:|:-:|
+| 1 | ~ 114 h | ~ 242 h |
+| 8 | ~ 17 h | ~ 34 h |
+
+##### NVIDIA DGX-2 (16x V100 32G)
+
+| **# of GPUs** | **Mixed Precision training time** | **FP32 training time** |
+|:-:|:-:|:-:|
+| 1 | ~ 116 h | ~ 234 h |
+| 16 | ~ 10 h | ~ 18 h |
+
+
+
+#### Inference Performance Results
+
+##### NVIDIA VOLTA V100 16G on DGX1V
+
+###### FP32 Inference Latency
+
+| **batch_size** | **FP32 50.0%** | **FP32 90.0%** | **FP32 99.0%** | **FP32 100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 20.53ms | 23.41ms | 26.00ms | 28.14ms |
+| 2 | 21.94ms | 22.90ms | 26.59ms | 30.53ms |
+| 4 | 22.08ms | 24.96ms | 26.03ms | 26.91ms |
+| 8 | 24.03ms | 25.17ms | 28.52ms | 32.59ms |
+| 16 | 39.73ms | 40.01ms | 40.32ms | 44.05ms |
+| 32 | 73.53ms | 74.05ms | 74.26ms | 78.31ms |
+| 64 | 130.88ms | 131.38ms | 131.81ms | 134.32ms |
+| 128 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference Latency
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 22.64ms | 0.9x | 25.19ms | 0.9x | 26.63ms | 1.0x | 28.43ms | 1.0x |
+| 2 | 23.35ms | 0.9x | 25.11ms | 0.9x | 27.29ms | 1.0x | 27.95ms | 1.1x |
+| 4 | 22.35ms | 1.0x | 24.38ms | 1.0x | 25.92ms | 1.0x | 27.09ms | 1.0x |
+| 8 | 23.35ms | 1.0x | 26.45ms | 1.0x | 27.74ms | 1.0x | 28.22ms | 1.2x |
+| 16 | 24.77ms | 1.6x | 26.93ms | 1.5x | 28.73ms | 1.4x | 29.07ms | 1.5x |
+| 32 | 35.70ms | 2.1x | 35.96ms | 2.1x | 36.13ms | 2.1x | 36.36ms | 2.2x |
+| 64 | 63.40ms | 2.1x | 63.63ms | 2.1x | 63.96ms | 2.1x | 64.74ms | 2.1x |
+| 128 | 117.52ms | N/A | 118.02ms | N/A | 118.35ms | N/A | 118.43ms | N/A |
+
+###### FP32 Inference throughput
+
+| **batch_size** | **FP32 @50.0%** | **FP32 @90.0%** | **FP32 @99.0%** | **FP32 @100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 48 img/s | 42 img/s | 38 img/s | 35 img/s |
+| 2 | 90 img/s | 87 img/s | 74 img/s | 64 img/s |
+| 4 | 179 img/s | 158 img/s | 152 img/s | 147 img/s |
+| 8 | 329 img/s | 314 img/s | 279 img/s | 243 img/s |
+| 16 | 399 img/s | 395 img/s | 389 img/s | 361 img/s |
+| 32 | 433 img/s | 429 img/s | 423 img/s | 403 img/s |
+| 64 | 487 img/s | 485 img/s | 475 img/s | 436 img/s |
+| 128 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference throughput
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 44 img/s | 0.9x | 39 img/s | 0.9x | 37 img/s | 1.0x | 35 img/s | 1.0x |
+| 2 | 85 img/s | 0.9x | 78 img/s | 0.9x | 73 img/s | 1.0x | 71 img/s | 1.1x |
+| 4 | 177 img/s | 1.0x | 163 img/s | 1.0x | 153 img/s | 1.0x | 145 img/s | 1.0x |
+| 8 | 339 img/s | 1.0x | 299 img/s | 1.0x | 286 img/s | 1.0x | 282 img/s | 1.2x |
+| 16 | 640 img/s | 1.6x | 589 img/s | 1.5x | 551 img/s | 1.4x | 547 img/s | 1.5x |
+| 32 | 887 img/s | 2.0x | 879 img/s | 2.1x | 846 img/s | 2.0x | 731 img/s | 1.8x |
+| 64 | 1001 img/s | 2.1x | 996 img/s | 2.1x | 978 img/s | 2.1x | 797 img/s | 1.8x |
+| 128 | 1081 img/s | N/A | 1078 img/s | N/A | 1068 img/s | N/A | 767 img/s | N/A |
+
+##### NVIDIA T4
+
+###### FP32 Inference Latency
+
+| **batch_size** | **FP32 50.0%** | **FP32 90.0%** | **FP32 99.0%** | **FP32 100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 17.77ms | 19.21ms | 22.29ms | 24.47ms |
+| 2 | 17.83ms | 19.00ms | 22.51ms | 29.68ms |
+| 4 | 18.02ms | 18.88ms | 21.74ms | 26.49ms |
+| 8 | 26.14ms | 27.35ms | 28.93ms | 29.46ms |
+| 16 | 45.40ms | 45.72ms | 47.43ms | 48.93ms |
+| 32 | 79.07ms | 79.37ms | 81.83ms | 82.45ms |
+| 64 | 140.12ms | 140.73ms | 143.57ms | 149.46ms |
+| 128 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference Latency
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 18.99ms | 0.9x | 20.16ms | 1.0x | 26.21ms | 0.9x | 29.29ms | 0.8x |
+| 2 | 19.24ms | 0.9x | 19.77ms | 1.0x | 24.51ms | 0.9x | 30.18ms | 1.0x |
+| 4 | 18.81ms | 1.0x | 19.52ms | 1.0x | 21.95ms | 1.0x | 23.45ms | 1.1x |
+| 8 | 18.96ms | 1.4x | 21.12ms | 1.3x | 25.77ms | 1.1x | 28.05ms | 1.1x |
+| 16 | 23.27ms | 2.0x | 25.19ms | 1.8x | 27.31ms | 1.7x | 28.11ms | 1.7x |
+| 32 | 39.22ms | 2.0x | 39.43ms | 2.0x | 41.96ms | 2.0x | 44.25ms | 1.9x |
+| 64 | 71.70ms | 2.0x | 71.87ms | 2.0x | 72.78ms | 2.0x | 77.22ms | 1.9x |
+| 128 | 134.17ms | N/A | 134.40ms | N/A | 134.81ms | N/A | 135.26ms | N/A |
+
+###### FP32 Inference throughput
+
+| **batch_size** | **FP32 @50.0%** | **FP32 @90.0%** | **FP32 @99.0%** | **FP32 @100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 56 img/s | 51 img/s | 44 img/s | 40 img/s |
+| 2 | 111 img/s | 103 img/s | 87 img/s | 66 img/s |
+| 4 | 220 img/s | 210 img/s | 182 img/s | 149 img/s |
+| 8 | 301 img/s | 290 img/s | 275 img/s | 270 img/s |
+| 16 | 351 img/s | 348 img/s | 336 img/s | 325 img/s |
+| 32 | 402 img/s | 401 img/s | 389 img/s | 376 img/s |
+| 64 | 451 img/s | 446 img/s | 429 img/s | 398 img/s |
+| 128 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference throughput
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 52 img/s | 0.9x | 49 img/s | 1.0x | 38 img/s | 0.8x | 34 img/s | 0.8x |
+| 2 | 103 img/s | 0.9x | 100 img/s | 1.0x | 81 img/s | 0.9x | 65 img/s | 1.0x |
+| 4 | 210 img/s | 1.0x | 203 img/s | 1.0x | 181 img/s | 1.0x | 169 img/s | 1.1x |
+| 8 | 418 img/s | 1.4x | 375 img/s | 1.3x | 304 img/s | 1.1x | 283 img/s | 1.0x |
+| 16 | 673 img/s | 1.9x | 630 img/s | 1.8x | 581 img/s | 1.7x | 565 img/s | 1.7x |
+| 32 | 807 img/s | 2.0x | 800 img/s | 2.0x | 705 img/s | 1.8x | 681 img/s | 1.8x |
+| 64 | 887 img/s | 2.0x | 884 img/s | 2.0x | 799 img/s | 1.9x | 696 img/s | 1.7x |
+| 128 | 950 img/s | N/A | 948 img/s | N/A | 918 img/s | N/A | 779 img/s | N/A |
+
+
+
+## Release notes
+
+### Changelog
+
+1. October 2019
+  * Initial release
+
+### Known issues
+
+There are no known issues with this model.
+
+

BINARY
PyTorch/Classification/ConvNets/resnext101-32x4d/img/ResNeXtArch.png


BINARY
PyTorch/Classification/ConvNets/resnext101-32x4d/img/loss_plot.png


BINARY
PyTorch/Classification/ConvNets/resnext101-32x4d/img/top1_plot.png


BINARY
PyTorch/Classification/ConvNets/resnext101-32x4d/img/top5_plot.png


+ 1 - 0
PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05

+ 1 - 0
PyTorch/Classification/ConvNets/resnext101-32x4d/training/FP32/DGX1_RNXT101-32x4d_FP32_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/resnext101-32x4d/training/FP32/DGX1_RNXT101-32x4d_FP32_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05

+ 476 - 0
PyTorch/Classification/ConvNets/se-resnext101-32x4d/README.md

@@ -0,0 +1,476 @@
+# SE-ResNeXt101-32x4d For PyTorch
+
+## Table Of Contents
+* [Model overview](#model-overview)
+  * [Default configuration](#default-configuration)
+  * [Mixed precision training](#mixed-precision-training)
+    * [Enabling mixed precision](#enabling-mixed-precision)
+* [Setup](#setup)
+  * [Requirements](#requirements)
+* [Quick Start Guide](#quick-start-guide)
+* [Advanced](#advanced)
+* [Performance](#performance)
+  * [Benchmarking](#benchmarking)
+    * [Training performance benchmark](#training-performance-benchmark)
+    * [Inference performance benchmark](#inference-performance-benchmark)
+  * [Results](#results)
+    * [Training accuracy results](#training-accuracy-results)
+      * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
+      * [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32G))
+    * [Training performance results](#training-performance-results)
+      * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
+      * [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32G))
+    * [Inference performance results](#inference-performance-results)
+* [Release notes](#release-notes)
+  * [Changelog](#changelog)
+  * [Known issues](#known-issues)
+
+## Model overview
+
+The SE-ResNeXt101-32x4d is a [ResNeXt101-32x4d](https://arxiv.org/pdf/1611.05431.pdf)
+model with an added Squeeze-and-Excitation module, introduced
+in the [Squeeze-and-Excitation Networks](https://arxiv.org/pdf/1709.01507.pdf) paper.
+
+Squeeze-and-Excitation module architecture for ResNet-type models:
+
+![SEArch](./img/SEArch.png)
+
+_Image source: [Squeeze-and-Excitation Networks](https://arxiv.org/pdf/1709.01507.pdf)_
+
+### Default configuration
+
+#### Optimizer
+
+This model uses the SGD optimizer with momentum, with the following hyperparameters:
+
+* momentum (0.875)
+
+* Learning rate = 0.256 for batch size 256; for other batch sizes we linearly
+scale the learning rate.
+
+* Learning rate schedule - we use a cosine LR schedule.
+
+* For bigger batch sizes (512 and up) we use linear warmup of the learning rate
+during the first few epochs,
+according to [Training ImageNet in 1 hour](https://arxiv.org/abs/1706.02677).
+The warmup length depends on the total training length.
+
+* Weight decay: 6.103515625e-05 (1/16384).
+
+* We do not apply weight decay to Batch Norm trainable parameters (gamma/bias).
+
+* Label Smoothing: 0.1
+
+* We train for:
+
+    * 90 Epochs -> 90 epochs is a standard for ImageNet networks
+
+    * 250 Epochs -> best possible accuracy.
+
+* For 250-epoch training, we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
+
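Putting the learning-rate rules above together (linear scaling, linear warmup, cosine decay), the effective LR at a given epoch can be sketched as follows. This is an illustration only, not the repository's implementation (`main.py` implements the real schedule); `warmup_epochs=8` matches the 8-GPU training scripts:

```python
import math

def learning_rate(epoch, total_epochs, batch_size,
                  base_lr=0.256, base_batch=256, warmup_epochs=8):
    """Linear-scaling rule + linear warmup + cosine decay, per the settings above."""
    lr = base_lr * batch_size / base_batch            # linear scaling rule
    if epoch < warmup_epochs:                         # linear warmup
        return lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * lr * (1.0 + math.cos(math.pi * progress))  # cosine decay

# e.g. 8 GPUs x batch 128 = effective batch 1024 -> peak LR 1.024,
# matching the --lr 1.024 used in the training scripts
print(learning_rate(8, 90, 1024))
```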
+
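MixUp blends pairs of training examples and their labels. A plain-Python sketch of the idea, under the assumption of one-hot label vectors (the repository applies the same operation to batched tensors; `alpha=0.2` matches the 250-epoch scripts' `--mixup 0.2`):

```python
import random

def mixup_pair(x_a, x_b, y_a, y_b, alpha=0.2):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```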
+#### Data augmentation
+
+This model uses the following data augmentation:
+
+* For training:
+  * Normalization
+  * Random resized crop to 224x224
+    * Scale from 8% to 100%
+    * Aspect ratio from 3/4 to 4/3
+  * Random horizontal flip
+
+* For inference:
+  * Normalization
+  * Scale to 256x256
+  * Center crop to 224x224
+
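The random resized crop above follows the standard sampling loop: pick a target area between 8% and 100% of the image and an aspect ratio log-uniformly in [3/4, 4/3], retrying until the window fits. A sketch of that logic (illustrative; the scripts use the torchvision/DALI implementations, not this code):

```python
import math
import random

def sample_crop(height, width, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)):
    """Sample a (top, left, h, w) crop window per the training augmentation above."""
    area = height * width
    for _ in range(10):
        target_area = random.uniform(*scale) * area
        aspect = math.exp(random.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            return random.randint(0, height - h), random.randint(0, width - w), h, w
    # fallback: center crop of the shorter side
    s = min(height, width)
    return (height - s) // 2, (width - s) // 2, s, s
```

The cropped window is then resized to 224x224 before being fed to the network.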
+### DALI
+
+For DGX2 configurations we use [NVIDIA DALI](https://github.com/NVIDIA/DALI),
+which speeds up data loading when the CPU becomes a bottleneck.
+DALI can also run on the CPU, where it outperforms the native PyTorch dataloader.
+
+Run training with `--data-backend dali-gpu` or `--data-backend dali-cpu` to enable DALI.
+For DGX1 we recommend `--data-backend dali-cpu`; for DGX2 we recommend `--data-backend dali-gpu`.
+
+
+### Mixed precision training
+
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+1.  Porting the model to use the FP16 data type where appropriate.
+2.  Adding loss scaling to preserve small gradient values.
+
+The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
+
+For information about:
+-   How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
+-   Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
+-   How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
+-   APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
+
+#### Enabling mixed precision
+
+Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from [APEX](https://github.com/NVIDIA/apex), which casts variables to half-precision upon retrieval,
+while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients.
+In PyTorch, loss scaling can be easily applied by using the `scale_loss()` method provided by AMP. The scaling value can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.
+
+For an in-depth walkthrough of AMP, see the sample usage [here](https://github.com/NVIDIA/apex/tree/master/apex/amp#usage-and-getting-started). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage Tensor Core performance.
+
+To enable mixed precision, you can:
+- Import AMP from APEX, for example:
+
+  ```
+  from apex import amp
+  ```
+- Initialize an AMP handle, for example:
+
+  ```
+  amp_handle = amp.init(enabled=True, verbose=True)
+  ```
+- Wrap your optimizer with the AMP handle, for example:
+
+  ```
+  optimizer = amp_handle.wrap_optimizer(optimizer)
+  ```
+- Scale loss before backpropagation (assuming the loss is stored in a variable called `losses`):
+  - Default backpropagate for FP32:
+
+    ```
+    losses.backward()
+    ```
+  - Scale loss and backpropagate with AMP:
+
+    ```
+    with optimizer.scale_loss(losses) as scaled_losses:
+       scaled_losses.backward()
+    ```
+
+## Setup
+
+### Requirements
+
+Ensure you meet the following requirements:
+
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* (optional) NVIDIA Volta GPU (see section below) - for best training performance using mixed precision
+
+For more information about how to get started with NGC containers, see the
+following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
+DGX Documentation:
+* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
+* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
+* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
+
+## Quick Start Guide
+
+### 1. Clone the repository.
+```
+git clone https://github.com/NVIDIA/DeepLearningExamples
+cd DeepLearningExamples/PyTorch/Classification/SE-RNXT101-32x4d/
+```
+
+### 2. Download and preprocess the dataset.
+
+The SE-ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
+
+PyTorch can work directly on JPEGs, therefore, preprocessing/augmentation is not needed.
+
+1. Download the images from http://image-net.org/download-images
+
+2. Extract the training data:
+  ```bash
+  mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
+  tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
+  find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
+  cd ..
+  ```
+
+3. Extract the validation data and move the images to subfolders:
+  ```bash
+  mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
+  wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
+  ```
+
+The directory in which the `train/` and `val/` directories are placed is referred to as `<path to imagenet>` in this document.
+
+### 3. Build the SE-RNXT101-32x4d PyTorch NGC container.
+
+```
+docker build . -t nvidia_rnxt101-32x4d
+```
+
+### 4. Start an interactive session in the NGC container to run training/inference.
+```
+nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_rnxt101-32x4d
+```
+
+### 5. Running training
+
+To run training for a standard configuration (DGX1V/DGX2V, AMP/FP32, 90/250 Epochs),
+run one of the scripts in the `./se-resnext101-32x4d/training` directory
+called `./se-resnext101-32x4d/training/{DGX1, DGX2}_SE-RNXT101-32x4d_{AMP, FP32}_{90,250}E.sh`.
+
+Ensure ImageNet is mounted in the `/imagenet` directory.
+
+Example:
+    `bash ./se-resnext101-32x4d/training/DGX1_SE-RNXT101-32x4d_AMP_250E.sh`
+
+To run a non-standard configuration, use:
+
+* For 1 GPU
+    * FP32
+        `python ./main.py --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 <path to imagenet>`
+    * AMP
+        `python ./main.py --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --amp --static-loss-scale 256 <path to imagenet>`
+
+* For multiple GPUs
+    * FP32
+        `python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 <path to imagenet>`
+    * AMP
+        `python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --amp --static-loss-scale 256 <path to imagenet>`
+
+Use `python ./main.py -h` to obtain the list of available options in the `main.py` script.
+
+### 6. Running inference
+
+To run inference on a checkpointed model, run:
+
+`python ./main.py --arch se-resnext101-32x4d --evaluate --epochs 1 --resume <path to checkpoint> -b <batch size> <path to imagenet>`
+
+## Advanced
+
+### Command-line options:
+
+```
+```
+
+## Performance
+
+### Benchmarking
+
+#### Training performance benchmark
+
+To benchmark training, run:
+
+* For 1 GPU
+    * FP32
+`python ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
+    * AMP
+`python ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 <path to imagenet>`
+* For multiple GPUs
+    * FP32
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
+    * AMP
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+
+Each of these scripts runs 100 iterations and saves the results in the `benchmark.json` file.
+
+#### Inference performance benchmark
+
+To benchmark inference, run:
+
+* FP32
+
+`python ./main.py --arch se-resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
+
+* AMP
+
+`python ./main.py --arch se-resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp <path to imagenet>`
+
+Each of these scripts runs 100 iterations and saves the results in the `benchmark.json` file.
+
+
+
+### Results
+
+#### Training Accuracy Results
+
+##### NVIDIA DGX-1 (8x V100 16G)
+
+| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
+|:-:|:-:|:-:|
+| 90 | 80.03 +/- 0.10 | 79.86 +/- 0.13 |
+| 250 | 80.96 +/- 0.04 | 80.97 +/- 0.09 |
+
+##### NVIDIA DGX-2 (16x V100 32G)
+
+No Data
+
+
+
+##### Example plots (90 Epochs configuration on DGX1V)
+
+![ValidationLoss](./img/loss_plot.png)
+
+![ValidationTop1](./img/top1_plot.png)
+
+![ValidationTop5](./img/top5_plot.png)
+
+#### Training Performance Results
+
+##### NVIDIA DGX1-16G (8x V100 16G)
+
+| **# of GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
+|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 266.65 img/s | 128.23 img/s | 2.08x | 1.00x | 1.00x |
+| 8 | 2031.17 img/s | 977.45 img/s | 2.08x | 7.62x | 7.62x |
+
+##### NVIDIA DGX1-32G (8x V100 32G)
+
+| **# of GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
+|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 255.22 img/s | 125.13 img/s | 2.04x | 1.00x | 1.00x |
+| 8 | 1959.35 img/s | 963.21 img/s | 2.03x | 7.68x | 7.70x |
+
+##### NVIDIA DGX2 (16x V100 32G)
+
+| **# of GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
+|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 261.58 img/s | 130.85 img/s | 2.00x | 1.00x | 1.00x |
+| 16 | 3776.03 img/s | 1953.13 img/s | 1.93x | 14.44x | 14.93x |
+
+#### Training Time for 90 Epochs
+
+##### NVIDIA DGX-1 (8x V100 16G)
+
+| **# of GPUs** | **Mixed Precision training time** | **FP32 training time** |
+|:-:|:-:|:-:|
+| 1 | ~ 134 h | ~ 277 h |
+| 8 | ~ 19 h | ~ 38 h |
+
+##### NVIDIA DGX-2 (16x V100 32G)
+
+| **# of GPUs** | **Mixed Precision training time** | **FP32 training time** |
+|:-:|:-:|:-:|
+| 1 | ~ 137 h | ~ 271 h |
+| 16 | ~ 11 h | ~ 20 h |
+
+
+
+#### Inference Performance Results
+
+##### NVIDIA VOLTA V100 16G on DGX1V
+
+###### FP32 Inference Latency
+
+| **batch_size** | **FP32 50.0%** | **FP32 90.0%** | **FP32 99.0%** | **FP32 100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 28.92ms | 30.92ms | 34.65ms | 40.01ms |
+| 2 | 29.38ms | 31.30ms | 34.79ms | 40.65ms |
+| 4 | 28.97ms | 29.78ms | 33.90ms | 42.28ms |
+| 8 | 29.75ms | 32.73ms | 35.61ms | 39.83ms |
+| 16 | 44.52ms | 44.93ms | 46.90ms | 48.84ms |
+| 32 | 80.63ms | 81.28ms | 82.69ms | 85.82ms |
+| 64 | 142.57ms | 142.99ms | 145.01ms | 148.87ms |
+| 128 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference Latency
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 27.85ms | 1.0x | 29.75ms | 1.0x | 35.85ms | 1.0x | 42.54ms | 0.9x |
+| 2 | 28.37ms | 1.0x | 30.24ms | 1.0x | 37.07ms | 0.9x | 42.81ms | 0.9x |
+| 4 | 29.73ms | 1.0x | 31.39ms | 0.9x | 37.17ms | 0.9x | 43.78ms | 1.0x |
+| 8 | 30.19ms | 1.0x | 31.20ms | 1.0x | 34.46ms | 1.0x | 42.87ms | 0.9x |
+| 16 | 30.92ms | 1.4x | 32.48ms | 1.4x | 36.49ms | 1.3x | 42.76ms | 1.1x |
+| 32 | 40.61ms | 2.0x | 40.90ms | 2.0x | 43.67ms | 1.9x | 44.88ms | 1.9x |
+| 64 | 72.04ms | 2.0x | 72.29ms | 2.0x | 76.46ms | 1.9x | 77.46ms | 1.9x |
+| 128 | 130.12ms | N/A | 130.34ms | N/A | 131.12ms | N/A | 140.27ms | N/A |
+
+###### FP32 Inference throughput
+
+| **batch_size** | **FP32 @50.0%** | **FP32 @90.0%** | **FP32 @99.0%** | **FP32 @100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 34 img/s | 32 img/s | 29 img/s | 25 img/s |
+| 2 | 68 img/s | 63 img/s | 57 img/s | 48 img/s |
+| 4 | 137 img/s | 133 img/s | 117 img/s | 93 img/s |
+| 8 | 267 img/s | 243 img/s | 223 img/s | 198 img/s |
+| 16 | 357 img/s | 354 img/s | 331 img/s | 325 img/s |
+| 32 | 392 img/s | 389 img/s | 381 img/s | 361 img/s |
+| 64 | 444 img/s | 442 img/s | 434 img/s | 426 img/s |
+| 128 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference throughput
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 36 img/s | 1.0x | 33 img/s | 1.0x | 28 img/s | 1.0x | 23 img/s | 0.9x |
+| 2 | 70 img/s | 1.0x | 66 img/s | 1.0x | 53 img/s | 0.9x | 46 img/s | 1.0x |
+| 4 | 133 img/s | 1.0x | 126 img/s | 0.9x | 107 img/s | 0.9x | 90 img/s | 1.0x |
+| 8 | 263 img/s | 1.0x | 254 img/s | 1.0x | 226 img/s | 1.0x | 184 img/s | 0.9x |
+| 16 | 513 img/s | 1.4x | 488 img/s | 1.4x | 435 img/s | 1.3x | 369 img/s | 1.1x |
+| 32 | 781 img/s | 2.0x | 775 img/s | 2.0x | 723 img/s | 1.9x | 680 img/s | 1.9x |
+| 64 | 882 img/s | 2.0x | 878 img/s | 2.0x | 818 img/s | 1.9x | 777 img/s | 1.8x |
+| 128 | 978 img/s | N/A | 976 img/s | N/A | 969 img/s | N/A | 891 img/s | N/A |
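As a sanity check, the reported throughput should be close to `batch_size / latency`; a small pure-Python check against a few of the V100 mixed-precision numbers above (small deviations are expected, since latency and throughput are aggregated over many runs):

```python
# Reported throughput should roughly equal batch_size / latency.
# Latencies (seconds) and throughputs copied from the AMP tables above.
amp_latency_s = {1: 0.02785, 16: 0.03092, 32: 0.04061, 64: 0.07204}
amp_throughput = {1: 36, 16: 513, 32: 781, 64: 882}

for bs, lat in amp_latency_s.items():
    implied = bs / lat
    # Allow a few percent of tolerance between the two measurements.
    assert abs(implied - amp_throughput[bs]) / amp_throughput[bs] < 0.05
```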
+
+##### NVIDIA T4
+
+###### FP32 Inference Latency
+
+| **batch_size** | **FP32 @50.0%** | **FP32 @90.0%** | **FP32 @99.0%** | **FP32 @100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 23.73ms | 26.94ms | 33.03ms | 35.92ms |
+| 2 | 23.20ms | 24.53ms | 29.42ms | 36.79ms |
+| 4 | 23.82ms | 24.59ms | 27.57ms | 31.07ms |
+| 8 | 29.73ms | 30.51ms | 33.07ms | 34.98ms |
+| 16 | 48.49ms | 48.91ms | 51.01ms | 54.54ms |
+| 32 | 86.81ms | 87.15ms | 90.74ms | 90.89ms |
+| 64 | 155.01ms | 156.07ms | 164.74ms | 167.99ms |
+| 128 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference Latency
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 25.22ms | 0.9x | 26.10ms | 1.0x | 31.72ms | 1.0x | 34.56ms | 1.0x |
+| 2 | 25.18ms | 0.9x | 25.83ms | 0.9x | 33.07ms | 0.9x | 37.80ms | 1.0x |
+| 4 | 24.94ms | 1.0x | 25.58ms | 1.0x | 27.93ms | 1.0x | 30.55ms | 1.0x |
+| 8 | 26.29ms | 1.1x | 27.59ms | 1.1x | 32.69ms | 1.0x | 35.78ms | 1.0x |
+| 16 | 27.63ms | 1.8x | 28.36ms | 1.7x | 34.44ms | 1.5x | 39.55ms | 1.4x |
+| 32 | 44.43ms | 2.0x | 44.69ms | 1.9x | 47.99ms | 1.9x | 51.38ms | 1.8x |
+| 64 | 79.17ms | 2.0x | 79.40ms | 2.0x | 84.34ms | 2.0x | 84.64ms | 2.0x |
+| 128 | 147.41ms | N/A | 149.02ms | N/A | 151.90ms | N/A | 159.28ms | N/A |
+
+###### FP32 Inference Throughput
+
+| **batch_size** | **FP32 @50.0%** | **FP32 @90.0%** | **FP32 @99.0%** | **FP32 @100.0%** |
+|:-:|:-:|:-:|:-:|:-:|
+| 1 | 42 img/s | 37 img/s | 30 img/s | 27 img/s |
+| 2 | 86 img/s | 81 img/s | 68 img/s | 54 img/s |
+| 4 | 167 img/s | 161 img/s | 143 img/s | 128 img/s |
+| 8 | 267 img/s | 261 img/s | 240 img/s | 226 img/s |
+| 16 | 328 img/s | 325 img/s | 296 img/s | 289 img/s |
+| 32 | 367 img/s | 365 img/s | 350 img/s | 343 img/s |
+| 64 | 411 img/s | 408 img/s | 380 img/s | 373 img/s |
+| 128 | N/A | N/A | N/A | N/A |
+
+###### Mixed Precision Inference Throughput
+
+| **batch_size** | **AMP @50.0%** | **speedup** | **AMP @90.0%** | **speedup** | **AMP @99.0%** | **speedup** | **AMP @100.0%** | **speedup** |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| 1 | 39 img/s | 0.9x | 38 img/s | 1.0x | 31 img/s | 1.0x | 29 img/s | 1.0x |
+| 2 | 79 img/s | 0.9x | 77 img/s | 1.0x | 60 img/s | 0.9x | 52 img/s | 1.0x |
+| 4 | 159 img/s | 1.0x | 155 img/s | 1.0x | 142 img/s | 1.0x | 130 img/s | 1.0x |
+| 8 | 302 img/s | 1.1x | 288 img/s | 1.1x | 243 img/s | 1.0x | 222 img/s | 1.0x |
+| 16 | 575 img/s | 1.8x | 560 img/s | 1.7x | 458 img/s | 1.5x | 402 img/s | 1.4x |
+| 32 | 713 img/s | 1.9x | 708 img/s | 1.9x | 619 img/s | 1.8x | 549 img/s | 1.6x |
+| 64 | 804 img/s | 2.0x | 801 img/s | 2.0x | 712 img/s | 1.9x | 636 img/s | 1.7x |
+| 128 | 857 img/s | N/A | 855 img/s | N/A | 840 img/s | N/A | 783 img/s | N/A |
+
+
+
+## Release notes
+
+### Changelog
+
+1. October 2019
+  * Initial release
+
+### Known issues
+
+There are no known issues with this model.
+
+

BIN
PyTorch/Classification/ConvNets/se-resnext101-32x4d/img/SEArch.png


BIN
PyTorch/Classification/ConvNets/se-resnext101-32x4d/img/loss_plot.png


BIN
PyTorch/Classification/ConvNets/se-resnext101-32x4d/img/top1_plot.png


BIN
PyTorch/Classification/ConvNets/se-resnext101-32x4d/img/top5_plot.png


+ 1 - 0
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05

+ 1 - 0
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/FP32/DGX1_SE-RNXT101-32x4d_FP32_250E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

+ 1 - 0
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/FP32/DGX1_SE-RNXT101-32x4d_FP32_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05

+ 0 - 8
PyTorch/Classification/RN50v1.5/Dockerfile

@@ -1,8 +0,0 @@
-FROM nvcr.io/nvidia/pytorch:19.05-py3
-
-RUN git clone https://github.com/NVIDIA/apex \
-        && cd apex \
-        && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
-
-ADD . /workspace/rn50
-WORKDIR /workspace/rn50

+ 0 - 311
PyTorch/Classification/RN50v1.5/README.md

@@ -1,311 +0,0 @@
-# ResNet50 v1.5
-
-## The model
-The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).
-
-The difference between v1 and v1.5 is that, in the bottleneck blocks which require
-downsampling, v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.
-
-This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).
-
-The model is initialized as described in [Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification](https://arxiv.org/pdf/1502.01852.pdf)
-
-## Training procedure
-
-### Optimizer
-
-This model trains for 90 epochs, with standard ResNet v1.5 setup:
-
-* SGD with momentum (0.875)
-
-* Learning rate = 0.256 for batch size 256; for other batch sizes we linearly
-scale the learning rate.
-
-* Learning rate schedule - we use cosine LR schedule
-
-* For bigger batch sizes (512 and up) we use linear warmup of the learning rate
-during the first few epochs
-according to [Training ImageNet in 1 hour](https://arxiv.org/abs/1706.02677).
-Warmup length depends on total training length.
-
-* Weight decay: 3.0517578125e-05 (1/32768).
-
-* We do not apply WD on Batch Norm trainable parameters (gamma/bias)
-
-* Label Smoothing: 0.1
-
-* We train for:
-
-    * 50 Epochs -> configuration that reaches 75.9% top1 accuracy
-
-    * 90 Epochs -> 90 epochs is a standard for ResNet50
-
-    * 250 Epochs -> best possible accuracy.
-
-* For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
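MixUp blends pairs of examples and their labels with a coefficient drawn from a Beta distribution; a minimal sketch in plain Python, assuming one-hot targets (the actual implementation in this repository operates on GPU tensor batches):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    # lambda ~ Beta(alpha, alpha); alpha=0.2 matches the 250-epoch recipe
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    # targets are mixed with the same coefficient (one-hot targets assumed)
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```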
-
-
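The learning-rate recipe above (linear scaling with batch size, linear warmup, cosine decay) can be sketched in plain Python. The 5-epoch warmup default here is only illustrative, since the actual warmup length depends on the total training time:

```python
import math

def learning_rate(step, steps_per_epoch, epochs, batch_size,
                  base_lr=0.256, base_batch=256, warmup_epochs=5):
    """Linearly scaled LR with linear warmup followed by cosine decay."""
    peak_lr = base_lr * batch_size / base_batch  # linear scaling rule
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay from peak_lr towards 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# e.g. a global batch of 2048 gives a peak LR of 2.048,
# matching --lr 2.048 in the DGX-1 training scripts
```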
-### Data Augmentation
-
-This model uses the following data augmentation:
-
-* For training:
-  * Normalization
-  * Random resized crop to 224x224
-    * Scale from 8% to 100%
-    * Aspect ratio from 3/4 to 4/3
-  * Random horizontal flip
-
-* For inference:
-  * Normalization
-  * Scale to 256x256
-  * Center crop to 224x224
-
-### Other training recipes
-
-This script does not target any specific benchmark.
-There are changes that others have made which can speed up convergence and/or increase accuracy.
-
-One of the more popular training recipes is provided by [fast.ai](https://github.com/fastai/imagenet-fast).
-
-The fast.ai recipe introduces many changes to the training procedure, one of which is progressive resizing of the training images.
-
-The first part of training uses 128px images, the middle part uses 224px images, and the last part uses 288px images.
-The final validation is performed on 288px images.
-
-The training script in this repository performs validation on 224px images, just like the original paper described.
-
-These two approaches can't be directly compared, since the fast.ai recipe requires validation on 288px images,
-and this recipe keeps the original assumption that validation is done on 224px images.
-
-Using 288px images means that a lot more FLOPs are needed during inference to reach the same accuracy.
-
-
-# Setup
-## Requirements
-
-Ensure you meet the following requirements:
-
-* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 19.05-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
-* (optional) NVIDIA Volta GPU (see section below) - for best training performance using mixed precision
-
-For more information about how to get started with NGC containers, see the
-following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
-DGX Documentation:
-* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
-* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
-* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
-
-## Training using mixed precision with Tensor Cores
-
-### Hardware requirements
-Training with mixed precision on NVIDIA Tensor Cores requires an
-[NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)-based GPU.
-
-### Software changes
-
-For information about how to train using mixed precision, see the
-[Mixed Precision Training paper](https://arxiv.org/abs/1710.03740)
-and
-[Training With Mixed Precision documentation](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html).
-
-For PyTorch, mixed-precision support is easily added with NVIDIA's
-[APEX](https://github.com/NVIDIA/apex), a PyTorch extension that contains
-utility libraries, such as AMP, which require minimal network code changes to
-leverage Tensor Core performance.
-
-### DALI
-
-For DGX2 configurations we use [NVIDIA DALI](https://github.com/NVIDIA/DALI),
-which speeds up data loading when the CPU becomes a bottleneck.
-
-Run training with `--data-backend dali-gpu` to enable DALI.
-
-# Quick start guide
-
-## Getting the data
-
-The ResNet50 v1.5 script operates on ImageNet 1k, a widely popular image classification dataset from ILSVRC challenge.
-
-PyTorch can work directly on JPEGs; therefore, no separate preprocessing/augmentation step is needed.
-
-1. Download the images from http://image-net.org/download-images
-
-2. Extract the training data:
-  ```bash
-  mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
-  tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
-  find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
-  cd ..
-  ```
-
-3. Extract the validation data and move the images to subfolders:
-  ```bash
-  mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
-  wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
-  ```
-
-The directory in which the `train/` and `val/` directories are placed is referred to as `<path to imagenet>` in this document.
-
-## Running training
-
-To run training for a standard configuration (DGX1V/DGX2V, FP16/FP32, 50/90/250 Epochs),
-run one of the scripts in the `./resnet50v1.5/training` directory
-called `./resnet50v1.5/training/{DGX1, DGX2}_RN50_{FP16, FP32}_{50,90,250}E.sh`.
-
-Ensure imagenet is mounted in the `/data/imagenet` directory.
-
-To run a non-standard configuration use:
-
-* For 1 GPU
-    * FP32
-        `python ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 <path to imagenet>`
-    * FP16
-        `python ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 --fp16 --static-loss-scale 256 <path to imagenet>`
-
-* For multiple GPUs
-    * FP32
-        `python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 <path to imagenet>`
-    * FP16
-        `python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -c fanin --label-smoothing 0.1 --fp16 --static-loss-scale 256 <path to imagenet>`
-
-Use `python ./main.py -h` to obtain the list of available options in the `main.py` script.
-
-## Running inference
-
-To run inference on a checkpointed model run:
-
-`python ./main.py --arch resnet50 --evaluate --resume <path to checkpoint> -b <batch size> <path to imagenet>`
-
-## Benchmarking
-
-### Training performance
-
-To benchmark training, run:
-
-* For 1 GPU
-    * FP32
-`python ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
-    * FP16
-`python ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --fp16 --static-loss-scale 256 <path to imagenet>`
-* For multiple GPUs
-    * FP32
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
-    * FP16
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --fp16 --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
-
-Each of these scripts will run 100 iterations and save the results in the `benchmark.json` file.
-
-### Inference performance
-
-To benchmark inference, run:
-
-* FP32
-
-`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
-
-* FP16
-
-`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --fp16 <path to imagenet>`
-
-Each of these scripts will run 100 iterations and save the results in the `benchmark.json` file.
-
-## Training Accuracy Results
-
-### NVIDIA DGX1V (8x V100 16G)
-
-#### Accuracy
-
-| **# of epochs** | **mixed precision top1** | **FP32 top1**   |
-|:-----------------:|:------------------------:|:---------------:|
-| 50                | 76.25 +/- 0.04           | 76.26 +/- 0.07  |
-| 90                | 77.23 +/- 0.04           | 77.08 +/- 0.07  |
-| 250               | 78.42 +/- 0.04           | 78.30 +/- 0.16  |
-
-#### Training time for 90 epochs
-
-| **number of GPUs** | **mixed precision training time** | **FP32 training time** |
-|:------------------:|:---------------------------------:|:----------------------:|
-| 1                  | ~46h                              | ~90h                   |
-| 4                  | ~14h                              | ~26h                   |
-| 8                  | ~8h                               | ~14h                   |
-
-### NVIDIA DGX2V (16x V100 32G)
-
-#### Accuracy
-
-| **# of epochs** | **mixed precision top1** | **FP32 top1**   |
-|:-----------------:|:------------------------:|:---------------:|
-| 50                | 75.80 +/- 0.08           | 76.04 +/- 0.05  |
-| 90                | 77.10 +/- 0.06           | 77.23 +/- 0.04  |
-| 250               | 78.59 +/- 0.13           | 78.46 +/- 0.03  |
-
-#### Training time for 90 epochs
-
-| **number of GPUs** | **mixed precision training time** | **FP32 training time** |
-|:------------------:|:---------------------------------:|:----------------------:|
-| 2                  | ~24h                              | ~45h                   |
-| 8                  | ~8h                               | ~13h                   |
-| 16                 | ~4h                               | ~7h                    |
-
-
-### Example plots (250 Epochs configuration on DGX2)
-
-![TrainingLoss](./img/DGX2_250_loss.png)
-
-![ValidationTop1](./img/DGX2_250_top1.png)
-
-![ValidationTop5](./img/DGX2_250_top5.png)
-
-
-## Training Performance Results
-
-### NVIDIA DGX1V (8x V100 16G)
-
-| **number of GPUs** | **mixed precision img/s** | **FP32 img/s** | **mixed precision speedup** | **mixed precision weak scaling** | **FP32 weak scaling** |
-|:------------------:|:-------------------------:|:--------------:|:---------------------------:|:--------------------------------:|:---------------------:|
-| 1                  | 747.3                     | 363.1          | 2.06                        | 1.00                             | 1.00                  |
-| 4                  | 2886.9                    | 1375.5         | 2.1                         | 3.86                             | 3.79                  |
-| 8                  | 5815.8                    | 2857.9         | 2.03                        | 7.78                             | 7.87                  |
-
-### NVIDIA DGX2V (16x V100 32G)
-
-| **number of GPUs** | **mixed precision img/s** | **FP32 img/s** | **mixed precision speedup** |
-|:------------------:|:-------------------------:|:--------------:|:---------------------------:|
-| 16                 | 12086.1                   | 5578.2         | 2.16                        |
-
-
-## Inference Performance Results
-
-### NVIDIA VOLTA V100 16G on DGX1V
-
-| **batch size** | **mixed precision img/s** | **FP32 img/s** |
-|:--------------:|:-------------------------:|:--------------:|
-|       1 |   131.8 |   134.9 |
-|       2 |   248.7 |   260.6 |
-|       4 |   486.4 |   425.5 |
-|       8 |   908.5 |   783.6 |
-|      16 |  1370.6 |   998.9 |
-|      32 |  2287.5 |  1092.3 |
-|      64 |  2476.2 |  1166.6 |
-|     128 |  2615.6 |  1215.6 |
-|     256 |  2696.7 |  N/A    |
-
-# Changelog
-
-1. September 2018
-  * Initial release
-2. January 2019
-  * Added options Label Smoothing, fan-in initialization, skipping weight decay on batch norm gamma and bias.
-3. May 2019
-  * Cosine LR schedule
-  * MixUp regularization
-  * DALI support
-  * DGX2 configurations
-  * gradients accumulation
-
-
-# Known issues
-
-There are no known issues with this model.

+ 0 - 4
PyTorch/Classification/RN50v1.5/examples/RN50_FP16_1GPU.sh

@@ -1,4 +0,0 @@
-# This script launches ResNet50 training in FP16 on 1 GPUs using 256 batch size (256 per GPU)
-# Usage ./RN50_FP16_1GPU.sh <path to this repository> <additional flags>
-
-python $1/main.py -j5 -p 500 --arch resnet50 -c fanin --label-smoothing 0.1 -b 256 --lr 0.1 --epochs 90 --fp16 --static-loss-scale 256 $2 /data/imagenet

+ 0 - 4
PyTorch/Classification/RN50v1.5/examples/RN50_FP16_4GPU.sh

@@ -1,4 +0,0 @@
-# This script launches ResNet50 training in FP16 on 4 GPUs using 1024 batch size (256 per GPU)
-# Usage ./RN50_FP16_4GPU.sh <path to this repository> <additional flags>
-
-python $1/multiproc.py --nproc_per_node 4 $1/main.py -j5 -p 500 --arch resnet50 -c fanin --label-smoothing 0.1 -b 256 --lr 0.4 --warmup 5 --epochs 90 --fp16 --static-loss-scale 256 $2 /data/imagenet

+ 0 - 4
PyTorch/Classification/RN50v1.5/examples/RN50_FP16_8GPU.sh

@@ -1,4 +0,0 @@
-# This script launches ResNet50 training in FP16 on 8 GPUs using 2048 batch size (256 per GPU)
-# Usage ./RN50_FP16_8GPU.sh <path to this repository> <additional flags>
-
-python $1/multiproc.py --nproc_per_node 8 $1/main.py -j5 -p 500 --arch resnet50 -c fanin --label-smoothing 0.1 -b 256 --lr 0.8 --warmup 5 --epochs 90 --fp16 --static-loss-scale 256 $2 /data/imagenet

+ 0 - 4
PyTorch/Classification/RN50v1.5/examples/RN50_FP16_EVAL.sh

@@ -1,4 +0,0 @@
-# This script evaluates the ResNet50 model in FP16 using a 256 batch size on 1 GPU
-# Usage: ./RN50_FP16_EVAL.sh <path to this repository> <path to checkpoint>
-
-python $1/main.py -j5 -p 100 --arch resnet50 -b 256 --resume $2 --evaluate --fp16 /data/imagenet

+ 0 - 3
PyTorch/Classification/RN50v1.5/examples/RN50_FP16_INFERENCE_BENCHMARK.sh

@@ -1,3 +0,0 @@
-# This script launches ResNet50 inference benchmark in FP16 on 1 GPU with 256 batch size
-
-python ./main.py -j5 --arch resnet50 -b 256 --fp16 --benchmark-inference /data/imagenet

+ 0 - 4
PyTorch/Classification/RN50v1.5/examples/RN50_FP32_1GPU.sh

@@ -1,4 +0,0 @@
-# This script launches ResNet50 training in FP32 on 1 GPUs using 128 batch size (128 per GPU)
-# Usage ./RN50_FP32_1GPU.sh <path to this repository> <additional flags>
-
-python $1/main.py -j5 -p 500 --arch resnet50 -c fanin --label-smoothing 0.1 -b 128 --lr 0.05 --epochs 90 $2 /data/imagenet

+ 0 - 4
PyTorch/Classification/RN50v1.5/examples/RN50_FP32_4GPU.sh

@@ -1,4 +0,0 @@
-# This script launches ResNet50 training in FP32 on 4 GPUs using 512 batch size (128 per GPU)
-# Usage ./RN50_FP32_4GPU.sh <path to this repository> <additional flags>
-
-python $1/multiproc.py --nproc_per_node 4 $1/main.py -j5 -p 500 --arch resnet50 -c fanin --label-smoothing 0.1 -b 128 --lr 0.2 --warmup 5 --epochs 90 $2 /data/imagenet

+ 0 - 4
PyTorch/Classification/RN50v1.5/examples/RN50_FP32_8GPU.sh

@@ -1,4 +0,0 @@
-# This script launches ResNet50 training in FP32 on 8 GPUs using 1024 batch size (128 per GPU)
-# Usage ./RN50_FP32_8GPU.sh <path to this repository> <additional flags>
-
-python $1/multiproc.py --nproc_per_node 8 $1/main.py -j5 -p 500 --arch resnet50 -c fanin --label-smoothing 0.1 -b 128 --lr 0.4 --warmup 5 --epochs 90 $2 /data/imagenet

+ 0 - 4
PyTorch/Classification/RN50v1.5/examples/RN50_FP32_EVAL.sh

@@ -1,4 +0,0 @@
-# This script evaluates the ResNet50 model in FP32 using a 128 batch size on 1 GPU
-# Usage: ./RN50_FP32_EVAL.sh <path to this repository> <path to checkpoint>
-
-python $1/main.py -j5 -p 100 --arch resnet50 -b 128 --resume $2 --evaluate /data/imagenet

+ 0 - 3
PyTorch/Classification/RN50v1.5/examples/RN50_FP32_INFERENCE_BENCHMARK.sh

@@ -1,3 +0,0 @@
-# This script launches ResNet50 inference benchmark in FP32 on 1 GPU with 128 batch size
-
-python ./main.py -j5 --arch resnet50 -b 128 --benchmark-inference /data/imagenet

+ 0 - 7
PyTorch/Classification/RN50v1.5/image_classification/__init__.py

@@ -1,7 +0,0 @@
-from . import logger
-from . import dataloaders
-from . import training
-from . import utils
-from . import mixup
-from . import resnet
-from . import smoothing

+ 0 - 271
PyTorch/Classification/RN50v1.5/image_classification/resnet.py

@@ -1,271 +0,0 @@
-import math
-import torch
-import torch.nn as nn
-import numpy as np
-
-__all__ = ['ResNet', 'build_resnet', 'resnet_versions', 'resnet_configs']
-
-# ResNetBuilder {{{
-
-class ResNetBuilder(object):
-    def __init__(self, version, config):
-        self.config = config
-
-        self.L = sum(version['layers'])
-        self.M = version['block'].M
-
-    def conv(self, kernel_size, in_planes, out_planes, stride=1):
-        if kernel_size == 3:
-            conv = self.config['conv'](
-                    in_planes, out_planes, kernel_size=3, stride=stride,
-                    padding=1, bias=False)
-        elif kernel_size == 1:
-            conv = nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride,
-                             bias=False)
-        elif kernel_size == 5:
-            conv = nn.Conv2d(in_planes, out_planes, kernel_size=5, stride=stride,
-                             padding=2, bias=False)
-        elif kernel_size == 7:
-            conv = nn.Conv2d(in_planes, out_planes, kernel_size=7, stride=stride,
-                             padding=3, bias=False)
-        else:
-            return None
-
-        if self.config['nonlinearity'] == 'relu':
-            nn.init.kaiming_normal_(conv.weight,
-                    mode=self.config['conv_init'],
-                    nonlinearity=self.config['nonlinearity'])
-
-        return conv
-
-    def conv3x3(self, in_planes, out_planes, stride=1):
-        """3x3 convolution with padding"""
-        c = self.conv(3, in_planes, out_planes, stride=stride)
-        return c
-
-    def conv1x1(self, in_planes, out_planes, stride=1):
-        """1x1 convolution with padding"""
-        c = self.conv(1, in_planes, out_planes, stride=stride)
-        return c
-
-    def conv7x7(self, in_planes, out_planes, stride=1):
-        """7x7 convolution with padding"""
-        c = self.conv(7, in_planes, out_planes, stride=stride)
-        return c
-
-    def conv5x5(self, in_planes, out_planes, stride=1):
-        """5x5 convolution with padding"""
-        c = self.conv(5, in_planes, out_planes, stride=stride)
-        return c
-
-    def batchnorm(self, planes, last_bn=False):
-        bn = nn.BatchNorm2d(planes)
-        gamma_init_val = 0 if last_bn and self.config['last_bn_0_init'] else 1
-        nn.init.constant_(bn.weight, gamma_init_val)
-        nn.init.constant_(bn.bias, 0)
-
-        return bn
-
-    def activation(self):
-        return self.config['activation']()
-
-# ResNetBuilder }}}
-
-# BasicBlock {{{
-class BasicBlock(nn.Module):
-    M = 2
-    expansion = 1
-
-    def __init__(self, builder, inplanes, planes, stride=1, downsample=None):
-        super(BasicBlock, self).__init__()
-        self.conv1 = builder.conv3x3(inplanes, planes, stride)
-        self.bn1 = builder.batchnorm(planes)
-        self.relu = builder.activation()
-        self.conv2 = builder.conv3x3(planes, planes)
-        self.bn2 = builder.batchnorm(planes, last_bn=True)
-        self.downsample = downsample
-        self.stride = stride
-
-    def forward(self, x):
-        residual = x
-
-        out = self.conv1(x)
-        if self.bn1 is not None:
-            out = self.bn1(out)
-
-        out = self.relu(out)
-
-        out = self.conv2(out)
-
-        if self.bn2 is not None:
-            out = self.bn2(out)
-
-        if self.downsample is not None:
-            residual = self.downsample(x)
-
-        out += residual
-        out = self.relu(out)
-
-        return out
-# BasicBlock }}}
-
-# Bottleneck {{{
-class Bottleneck(nn.Module):
-    M = 3
-    expansion = 4
-
-    def __init__(self, builder, inplanes, planes, stride=1, downsample=None):
-        super(Bottleneck, self).__init__()
-        self.conv1 = builder.conv1x1(inplanes, planes)
-        self.bn1 = builder.batchnorm(planes)
-        self.conv2 = builder.conv3x3(planes, planes, stride=stride)
-        self.bn2 = builder.batchnorm(planes)
-        self.conv3 = builder.conv1x1(planes, planes * self.expansion)
-        self.bn3 = builder.batchnorm(planes * self.expansion, last_bn=True)
-        self.relu = builder.activation()
-        self.downsample = downsample
-        self.stride = stride
-
-    def forward(self, x):
-        residual = x
-
-        out = self.conv1(x)
-        out = self.bn1(out)
-        out = self.relu(out)
-
-        out = self.conv2(out)
-        out = self.bn2(out)
-        out = self.relu(out)
-
-        out = self.conv3(out)
-        out = self.bn3(out)
-
-        if self.downsample is not None:
-            residual = self.downsample(x)
-
-        out += residual
-
-        out = self.relu(out)
-
-        return out
-# Bottleneck }}}
-
-# ResNet {{{
-class ResNet(nn.Module):
-    def __init__(self, builder, block, layers, num_classes=1000):
-        self.inplanes = 64
-        super(ResNet, self).__init__()
-        self.conv1 = builder.conv7x7(3, 64, stride=2)
-        self.bn1 = builder.batchnorm(64)
-        self.relu = builder.activation()
-        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
-        self.layer1 = self._make_layer(builder, block, 64, layers[0])
-        self.layer2 = self._make_layer(builder, block, 128, layers[1], stride=2)
-        self.layer3 = self._make_layer(builder, block, 256, layers[2], stride=2)
-        self.layer4 = self._make_layer(builder, block, 512, layers[3], stride=2)
-        self.avgpool = nn.AdaptiveAvgPool2d(1)
-        self.fc = nn.Linear(512 * block.expansion, num_classes)
-
-    def _make_layer(self, builder, block, planes, blocks, stride=1):
-        downsample = None
-        if stride != 1 or self.inplanes != planes * block.expansion:
-            dconv = builder.conv1x1(self.inplanes, planes * block.expansion,
-                                    stride=stride)
-            dbn = builder.batchnorm(planes * block.expansion)
-            if dbn is not None:
-                downsample = nn.Sequential(dconv, dbn)
-            else:
-                downsample = dconv
-
-        layers = []
-        layers.append(block(builder, self.inplanes, planes, stride, downsample))
-        self.inplanes = planes * block.expansion
-        for i in range(1, blocks):
-            layers.append(block(builder, self.inplanes, planes))
-
-        return nn.Sequential(*layers)
-
-    def forward(self, x):
-        x = self.conv1(x)
-        if self.bn1 is not None:
-            x = self.bn1(x)
-        x = self.relu(x)
-        x = self.maxpool(x)
-
-        x = self.layer1(x)
-        x = self.layer2(x)
-        x = self.layer3(x)
-        x = self.layer4(x)
-
-        x = self.avgpool(x)
-        x = x.view(x.size(0), -1)
-        x = self.fc(x)
-
-        return x
-# ResNet }}}
-
-
-resnet_configs = {
-        'classic' : {
-            'conv' : nn.Conv2d,
-            'conv_init' : 'fan_out',
-            'nonlinearity' : 'relu',
-            'last_bn_0_init' : False,
-            'activation' : lambda: nn.ReLU(inplace=True),
-            },
-        'fanin' : {
-            'conv' : nn.Conv2d,
-            'conv_init' : 'fan_in',
-            'nonlinearity' : 'relu',
-            'last_bn_0_init' : False,
-            'activation' : lambda: nn.ReLU(inplace=True),
-            },
-        }
-
-resnet_versions = {
-        'resnet18' : {
-            'net' : ResNet,
-            'block' : BasicBlock,
-            'layers' : [2, 2, 2, 2],
-            'num_classes' : 1000,
-            },
-         'resnet34' : {
-            'net' : ResNet,
-            'block' : BasicBlock,
-            'layers' : [3, 4, 6, 3],
-            'num_classes' : 1000,
-            },
-         'resnet50' : {
-            'net' : ResNet,
-            'block' : Bottleneck,
-            'layers' : [3, 4, 6, 3],
-            'num_classes' : 1000,
-            },
-        'resnet101' : {
-            'net' : ResNet,
-            'block' : Bottleneck,
-            'layers' : [3, 4, 23, 3],
-            'num_classes' : 1000,
-            },
-        'resnet152' : {
-            'net' : ResNet,
-            'block' : Bottleneck,
-            'layers' : [3, 8, 36, 3],
-            'num_classes' : 1000,
-            },
-        }
-
-
-def build_resnet(version, config, model_state=None):
-    version = resnet_versions[version]
-    config = resnet_configs[config]
-
-    builder = ResNetBuilder(version, config)
-    print("Version: {}".format(version))
-    print("Config: {}".format(config))
-    model = version['net'](builder,
-                           version['block'],
-                           version['layers'],
-                           version['num_classes'])
-
-    return model

BIN
PyTorch/Classification/RN50v1.5/img/DGX2_250_loss.png


BIN
PyTorch/Classification/RN50v1.5/img/DGX2_250_top1.png


BIN
PyTorch/Classification/RN50v1.5/img/DGX2_250_top5.png


BIN
PyTorch/Classification/RN50v1.5/img/training_accuracy.png


BIN
PyTorch/Classification/RN50v1.5/img/training_loss.png


BIN
PyTorch/Classification/RN50v1.5/img/validation_accuracy.png


+ 0 - 0
PyTorch/Classification/RN50v1.5/resnet50v1.5/README.md


+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP16_250E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 256 --fp16 --static-loss-scale 128 --epochs 250 --mixup 0.2 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP16_50E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 256 --fp16 --static-loss-scale 128 --epochs 50 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP16_90E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 256 --fp16 --static-loss-scale 128 --epochs 90 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP32_250E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 128 --epochs 250 --mixup 0.2 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP32_50E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 128 --epochs 50 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX1_RN50_FP32_90E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 128 --epochs 90 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP16_250E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 256 --fp16 --static-loss-scale 128 --epochs 250 --mixup 0.2 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP16_50E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 256 --fp16 --static-loss-scale 128 --epochs 50 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP16_90E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 256 --fp16 --static-loss-scale 128 --epochs 90 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP32_250E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 128 --epochs 250 --mixup 0.2 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP32_50E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 128 --epochs 50 /data/imagenet

+ 0 - 1
PyTorch/Classification/RN50v1.5/resnet50v1.5/training/DGX2_RN50_FP32_90E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --data-backend pytorch --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace $1 -b 128 --epochs 90 /data/imagenet

+ 3 - 1
README.md

@@ -15,7 +15,9 @@ These examples, along with our NVIDIA deep learning software stack, are provided
 The examples are organized first by framework, such as TensorFlow, PyTorch, etc. and second by use case, such as computer vision, natural language processing, etc. We hope this structure enables you to quickly locate the example networks that best suit your needs. Here are the currently supported models:
 
 ### Computer Vision
-- __ResNet-50__ [[MXNet](https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5)] [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/RN50v1.5)] [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/RN50v1.5)]
+- __ResNet-50__ [[MXNet](https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5)] [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets)] [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/RN50v1.5)]
+- __ResNext__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets)]
+- __SE-ResNext__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets)]
 - __SSD__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Detection/SSD)] [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Detection/SSD)]
 - __Mask R-CNN__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN)]
 - __U-Net(industrial)__ [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Segmentation/UNet_Industrial)]