©1998 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. 
Chris Basoglu, Member, IEEE, and Yongmin Kim, Fellow, IEEE
Abstract: A one-quarter electrical and computer engineering course on advanced microprocessors and real-time image/video computing algorithms is described in this paper. It is aimed primarily at graduate students who have some background in computer architecture, signal and image processing, and experience in low-level programming of microprocessors. The course is designed to teach students not only about advanced microprocessors with instruction-level parallelism, e.g., superscalar and very long instruction word (VLIW) computer architectures, but also how to analyze and efficiently map various image and video computing algorithms onto these microprocessors. Emphasis is placed on laboratory and individual project work requiring in-depth study, implementation, testing, and verification, demanding intensive participation from both the students and the instructor. The course is continually updated and improved to reflect new technologies in this dynamically changing area and to incorporate student feedback. It has been offered for four quarters at the University of Washington.
As all forms of media such as film, audio, and video become digital, the need for the acquisition, processing, display, and communication of such multimedia data has been spurring rapid technology developments. By taking advantage of these new technologies, many new applications have become possible, and existing applications are strengthened by improving cost/performance ratios. Digital video, desktop video teleconferencing, machine vision, medical imaging, and digital cameras are several such examples. Image and video computing algorithms and their fast implementations are among the core enabling technologies for such applications. Because of the vast market potential, consumer products employing real-time digital video computing have been generating a great deal of excitement among manufacturers and users. The real-time aspect and consumer-level focus of such systems require high computational power at very low cost and optimal implementations of key algorithms. In prototyping and designing these systems, a programmable approach provides flexibility and adaptability to new applications and changing requirements, a definite advantage over specialized hardwired solutions.
There is an ever-increasing need in industry for students who can efficiently implement image and video computing algorithms on commercially available microprocessors. We feel that most computer engineering and image processing courses, including ours at the University of Washington, do not sufficiently involve the students in actual real-time algorithm implementation. Most courses use high-level image and data processing tools like Khoros and Matlab to illustrate various fundamental concepts in image and video computing [1], [2]. While the understanding gained from such high-level tools is valuable, these students lack the training and hands-on experience necessary to implement computationally intensive image/video computing algorithms in real-time systems.
Those courses which do cover the implementation aspects of compute-intensive algorithms typically focus on parallel processing and distributed computing on massively parallel machines [3]-[5]. Many people in academia and industry have noted the lack of practical experience afforded to engineering students in this field, and some have gone as far as to make massively parallel machines available for student laboratories [5]. While these parallel machines have provided the students with useful hands-on experience in parallel programming, the system and maintenance costs are very high, and they are not the machines commonly used in industry for implementing digital image and video systems. Others have used transputer networks [6]-[8], software simulators [9], [10], or "pencil and paper" experiments [11] to provide a parallel programming environment for their students.
Several recent commercially available general-purpose processors and digital signal processors (DSPs) have been designed for the demanding computing needs of processing digital video, images, graphics, and audio to enable new consumer-level applications [13]. These new processors employ instruction-level parallelism, which includes the superscalar and very long instruction word (VLIW) computer architectures. Instruction-level parallelism allows multiple CPU operations to be initiated in a single clock cycle. This is done by having multiple execution units on-chip and/or by partitioning the ALU of a particular execution unit into multiple independent pieces (e.g., a 32-bit ALU split into two 16-bit ALUs). The Intel Pentium Pro, Hewlett-Packard PA-RISC 8000, Sun Microsystems UltraSPARC, and Texas Instruments TMS320C80 are all recently developed superscalar/VLIW processors designed to meet the increasing computing requirements due to the proliferation of multimedia. Table 1 lists various characteristics of these advanced microprocessors.
Table 1. Current commercially available superscalar/VLIW microprocessors and DSPs.

Processor             Clock frequency  Operations     Partitioned  Multiplies     Die size  Process
                                       issued/cycle   addition     issued/cycle            technology
Intel Pentium Pro     200 MHz          3              No           1              195 mm^2  0.35 µm
Motorola PowerPC 604  150 MHz          4              No           1              196 mm^2  0.5 µm
HP PA-RISC 8000       200 MHz          4              Yes          2              345 mm^2  0.5 µm
Sun UltraSPARC I      200 MHz          4              Yes          2              315 mm^2  0.5 µm
TI TMS320C80          50 MHz           18             Yes          5              342 mm^2  0.6 µm
Several new courses have been developed to cover such architectures and processors [8], [14], with more emphasis on the hardware aspects. However, we believe that real-time implementations require a systems approach encompassing software, hardware, and their integration. Thus, we have taken a balanced approach in our course, requiring that programmers understand the underlying hardware that executes their software.
The course is divided into two parts: three hours of lectures per week and an open laboratory, in which the students work on the laboratory exercises, homework problems, and an individual project for an average of 10 to 20 hours a week. The laboratory exercises are designed to exemplify the topics covered in class. An individual project is a full implementation of an image or video computing algorithm of medium complexity on a real multimedia system. We require that the student's project execute as efficiently as possible on the given hardware.

Fig. 1. The course syllabus.
The syllabus of the course as most recently offered is shown in Fig. 1, which lists the major topics covered. The course starts with a brief overview of the current status and future trends of image and video computing. The computational power required in systems capable of supporting real-time image and video computing is analyzed. We have previously developed two systems for real-time multimedia processing at the University of Washington: the MediaStation 5000 (MS5000) and the Programmable Ultrasound Image Processor (PUIP) [15], [16]. These systems are used as real-life examples of the topics taught in class and serve as target platforms for the students' projects.
The general theory and applications of image and video computing algorithms are then covered. This includes the algorithms and various tasks required for JPEG, H.320, and MPEG encoding and decoding [17]; a prerequisite course on the fundamentals of image processing covers these aspects in more detail. These algorithms are then used as examples to illustrate efficient implementation on the different processor types; e.g., students are taught how to take advantage of the data independence within each algorithm by utilizing the superscalar or VLIW processor's ability to perform multiple operations concurrently. Furthermore, the problem of implementing these algorithms in fixed-point arithmetic and our solution to this problem are also covered.
Fig. 2. A Cooley-Tukey FFT butterfly flowgraph and its computational and I/O requirements.
One of the algorithms covered is the Cooley-Tukey FFT butterfly, shown in Fig. 2. On a processor with no instruction-level parallelism, an FFT butterfly takes 15 cycles because there are 15 basic operations in the flowgraph. Successive butterflies are independent of each other; thus, on a processor with instruction-level parallelism, independent butterflies can be computed in parallel if the execution order of the operations in the butterfly is changed. For example, a three-issue superscalar/VLIW processor capable of simultaneously issuing one multiply, one add/subtract, and one load/store could perform an add/subtract operation for the current butterfly, a multiply operation for the previous butterfly, and a load operation for the next butterfly. If these operations are repeated in a loop, three consecutive butterflies in the same decomposition stage can be computed in parallel. We have found that these kinds of examples are effective in teaching instruction-level parallelism and parallel programming, especially when students see significant speedups in algorithms that they are familiar with.
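The data dependences inside one butterfly can be sketched in C as follows (a functional sketch with our own variable names; the exact operation count on a given machine also depends on the loads and stores):

```c
#include <assert.h>

/* One radix-2 decimation-in-time FFT butterfly on the complex samples
 * (ar,ai) and (br,bi) with twiddle factor (wr,wi), written out as
 * scalar operations to expose the independent multiplies and
 * add/subtracts that a superscalar/VLIW core can overlap across
 * consecutive butterflies. */
static void fft_butterfly(float *ar, float *ai, float *br, float *bi,
                          float wr, float wi)
{
    /* Twiddle multiply: four multiplies and two add/subtracts */
    float tr = *br * wr - *bi * wi;
    float ti = *br * wi + *bi * wr;
    /* Combine: four add/subtracts */
    *br = *ar - tr;
    *bi = *ai - ti;
    *ar = *ar + tr;
    *ai = *ai + ti;
}
```

Because each call touches a disjoint pair of samples, calls on different butterflies are independent, which is what allows the multiplies of one butterfly to be scheduled alongside the add/subtracts of its predecessor and the loads of its successor.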
Next, the hardware aspects of modern superscalar and VLIW computer architectures are covered, and the advantages and disadvantages they confer on real-time algorithm implementations are discussed. Concepts such as instruction pipelining, instruction buffering, out-of-order dispatch, out-of-order execution, reservation stations, dynamic branch prediction, and register renaming are introduced with examples. Some of the issues that heavily impact system performance are then discussed, e.g., the data flows that arise when an on-chip instruction cache, data cache, and/or DMA-controlled SRAM is present. Another example is a review of various external memory types (SRAM, DRAM, VRAM, EDO DRAM, SDRAM, and Rambus DRAM). Also, multitasking operating systems and their use for task management, communication, and synchronization (threaded applications, messages, software ports, and semaphores) are presented from the user's point of view.
The PowerPC 604 microprocessor is used as a real-world example of a superscalar microprocessor [18]. It contains two independent single-cycle integer units, one complex multi-cycle integer unit, a load/store unit, a branch unit, and a floating-point unit. The two single-cycle units execute all integer operations except multiplications and divisions, which are handled by the complex multi-cycle integer unit. The PowerPC 604 can dispatch up to four instructions per clock period to four of the six on-chip execution units. To explain how algorithms can be efficiently implemented on such an architecture, all the stages of the PowerPC 604's six-stage instruction pipeline (Fetch, Decode, Dispatch, Execute, Completion, and Writeback) are discussed in detail. Sample code segments from several image/video computing algorithms are used to show efficient coding on this architecture.
Fig. 3. The block diagram of the TMS320C80 multiprocessor.
The Texas Instruments TMS320C80 is used as an example of both a RISC processor and a VLIW processor. The TMS320C80 is a single-chip multiprocessor designed for high-speed image processing and real-time multimedia applications [19]. For example, the processor is capable of performing real-time MPEG-1 compression and decompression [15]. Fig. 3 shows the block diagram of the TMS320C80. The TMS320C80 contains a RISC processor called the Master Processor (MP), four Advanced Digital Signal Processors (ADSPs; each ADSP is a DSP with a VLIW architecture), and a programmable DMA controller called the Transfer Controller (TC). The MP has a floating-point unit which can issue floating-point instructions on every cycle. Each ADSP has a 16-bit fixed-point multiplier, a three-input 32-bit ALU, a branch unit, and two load/store units. The programmable DMA controller supports various types of complex address calculations and data transfers.
All five on-chip processors in the TMS320C80 are capable of executing multiple operations per cycle. Each ADSP can execute one 16-bit multiplication (partitionable into two 8-bit multiplications), one 32-bit arithmetic/logical operation (partitionable into two 16-bit or four 8-bit operations), and two load/store operations in the same cycle. All fixed-point operations, including multiplications, are performed in a single cycle. The MP is capable of executing a floating-point addition and multiplication as well as a load/store operation in parallel. Assuming that the MP executes two operations and each ADSP executes four parallel operations, a total of 18 concurrent operations per cycle can be sustained in the TMS320C80. For example, in executing the FFT butterfly in Fig. 2, each ADSP takes only 6 cycles to perform the 15 basic operations; with all four ADSPs working in parallel, the TMS320C80 can thus compute an FFT butterfly in 1.5 cycles. For the two-dimensional FFT of a 512×512 complex input image, the execution time of the butterfly tight loop on the TMS320C80 running at 50 MHz can be computed as:
(1.5 cycles/butterfly) × (256 × 9 butterflies/row) × (2 × 512 rows) × (20 ns/cycle) = 70.8 ms        (Eq. 1)
The actual measured execution time of the 2-D FFT on a 512×512 image is 75.0 ms, i.e., an overhead of only 5.9%.
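The arithmetic behind Eq. 1 can be reproduced mechanically; the following C fragment is simply a sanity check of the constants quoted above (the function name is ours):

```c
#include <assert.h>
#include <math.h>

/* Sanity check of Eq. 1: 1.5 cycles per butterfly, 9 stages of 256
 * butterflies for each 512-point row transform (log2(512) = 9), a row
 * pass plus a column pass over the 512x512 image, and 20 ns per cycle
 * (50 MHz clock). */
static double fft2d_estimate_ms(void)
{
    double cycles = 1.5 * (256.0 * 9.0) * (2.0 * 512.0);
    return cycles * 20.0 * 1e-6;   /* cycles x 20 ns, expressed in ms */
}
```

Evaluating the function gives 70.8 ms, matching Eq. 1.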
Because the TMS320C80's DMA controller is powerful and can be programmed very flexibly, data transfers can be matched to the needs of a specific algorithm. In this way, students can experiment with different data flows and select the ones that provide the best performance for their algorithms. This additional burden of data-flow management, however, also makes low-level programming of the TMS320C80 challenging. The University of Washington Image Computing Library (UWICL), with over 150 image processing functions, was developed for use in TMS320C80-based systems; it is introduced and made available to the students as a set of examples of efficiently implemented algorithms [20]. It includes low-level functions ranging from convolution and morphology to wavelets, the FFT, arithmetic coding, and warping. The UWICL's highly modularized design and good documentation make its use straightforward: in developing a high-level algorithm or application, students can use existing routines from the library and implement only those modules that are not available.
Five pre-designed laboratory assignments introduce actual algorithm and application implementations and provide hands-on experience to the students. For software development, the Texas Instruments TMS320C80 compiler, assembler, simulator, and debugger are made available to the students. The debugger interface allows the students to inspect the contents of all registers and memory locations (on-chip and off-chip) on a cycle-by-cycle basis. For software verification, Khoros and Matlab are used. Two MS5000 systems allow the students to run their programs on real hardware. Because the MS5000 is a real high-performance multimedia system incorporating various design trade-offs to balance performance and cost, the students are required to understand these trade-offs (e.g., the external memory types and bus widths used) and take them into consideration when working on their assignments.
The first assignment is intended to familiarize the students with the various superscalar and VLIW processors on the market today. The students are given a list of processors and pointers to their Internet sites where detailed technical information can be found. Each student must choose a processor, write a report on it, and then present the findings to the class in a short talk. During the presentations, the instructor tries to point out both the common and unique architectural features of the processors.
The second assignment covers the development environment in the laboratory. The students must compile and assemble an existing routine, execute it on both the simulator and the hardware, and then verify the correctness of the routine. This verification has to be performed by developing and running a functionally equivalent routine in Khoros, Matlab, and C, and comparing the outputs of these three environments against the output of the TMS320C80 using another C program that computes the mean absolute difference and mean square error.
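The comparison step amounts to only a few lines of C. The following is a minimal sketch with our own function and buffer names, not the course's actual harness:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Compare a reference output (e.g., from Matlab or Khoros) against the
 * TMS320C80 output: mean absolute difference (MAD) and mean square
 * error (MSE) over n pixels. */
static void compare_images(const unsigned char *ref, const unsigned char *dut,
                           size_t n, double *mad, double *mse)
{
    double sum_abs = 0.0, sum_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)ref[i] - (double)dut[i];
        sum_abs += fabs(d);
        sum_sq  += d * d;
    }
    *mad = sum_abs / (double)n;
    *mse = sum_sq / (double)n;
}
```

A MAD and MSE of zero indicate bit-exact agreement; small nonzero values typically point to fixed-point rounding differences rather than outright bugs.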
The third assignment addresses data flows. The students have to implement a jumble game: an image is divided into a series of small rectangles, which are then jumbled randomly. One rectangle is chosen to be left out and is painted black. The game is solved by moving the rectangles around until the original image is regenerated. An example jumbled output from this assignment is shown in Fig. 4. This assignment is designed to teach the students how to efficiently input and output two-dimensional data and how to build, in C and assembly language, an application that responds interactively to user inputs.
Fig. 4. An output from the jumble game assignment.
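The random jumbling itself can be done on the tile indices rather than on the pixels. A minimal C sketch (function and array names are ours) using a Fisher-Yates shuffle:

```c
#include <assert.h>
#include <stdlib.h>

/* Fill perm[0..n-1] with a uniform random permutation of the tile
 * indices (Fisher-Yates shuffle). perm[i] then names the source tile
 * to draw at screen position i; one index can be reserved for the
 * black, left-out rectangle. */
static void jumble_tiles(int *perm, int n)
{
    for (int i = 0; i < n; i++)
        perm[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);  /* modulo bias is negligible here */
        int tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }
}
```

Keeping the jumble as an index permutation means only the final blit moves pixel data, which is where the efficient two-dimensional I/O taught in this assignment comes in.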
In the fourth assignment, the students have to build the same jumble game as before, but now each rectangle can be rotated by 90 degrees as well as moved. In rotating a rectangle, they are required to divide it into four horizontal or vertical slices so that the four ADSPs of the TMS320C80 can each process one slice in parallel. The processing on each ADSP must be made as parallel as possible by utilizing instruction-level parallelism. Furthermore, each slice must be divided into small sub-slices so that double buffering can be used to reduce the overall execution time by hiding the I/O time behind the processing time.
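The double-buffering pattern can be sketched in C as follows. The dma_start()/dma_wait() routines below are synchronous stand-ins of our own for the TMS320C80 Transfer Controller requests, so the skeleton is runnable as-is; on the real hardware the posted transfer proceeds while process_subslice() runs:

```c
#include <assert.h>
#include <string.h>

#define SUBSLICE_BYTES 64
#define MAX_SUBSLICES  8

/* Hypothetical stand-ins for Transfer Controller requests. On real
 * hardware dma_start() would post a packet transfer and dma_wait()
 * would block on its completion; here they copy synchronously. */
static unsigned char g_src[MAX_SUBSLICES][SUBSLICE_BYTES];
static void dma_start(unsigned char *dst, int k)
{
    memcpy(dst, g_src[k], SUBSLICE_BYTES);
}
static void dma_wait(void) { /* already complete in this stub */ }

static long g_sum;  /* toy "processing": accumulate pixel values */
static void process_subslice(const unsigned char *p)
{
    for (int i = 0; i < SUBSLICE_BYTES; i++) g_sum += p[i];
}

/* Double-buffered loop: while sub-slice k is processed out of buf[cur],
 * sub-slice k+1 is fetched into buf[1-cur], hiding I/O behind compute. */
static void process_slices(int nslices)
{
    unsigned char buf[2][SUBSLICE_BYTES];
    int cur = 0;
    dma_start(buf[cur], 0);            /* prefetch the first sub-slice */
    for (int k = 0; k < nslices; k++) {
        dma_wait();                    /* sub-slice k is now resident */
        if (k + 1 < nslices)
            dma_start(buf[1 - cur], k + 1);
        process_subslice(buf[cur]);
        cur = 1 - cur;                 /* swap the two buffers */
    }
}
```

The key design point is that the fetch of sub-slice k+1 is issued before, not after, the processing of sub-slice k, so the transfer and the computation overlap in time.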
The final assignment is devoted to partitioned arithmetic operations. In implementing a function for image thresholding, the students must use partitioned arithmetic so that four 8-bit pixels can be processed in parallel by each ADSP, as shown in Fig. 5. Four pixels at a time are subtracted from the threshold value. Four 1-bit flags store whether a borrow condition has occurred in each 8-bit subtraction; these flags are then used to output either 0 or 255. Since four output values are generated every two instruction cycles per ADSP, eight pixels can be processed per cycle with all four ADSPs working in parallel.
Fig. 5. The use of partitioned arithmetic to implement image thresholding.
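A functional C model of this partitioned thresholding step is shown below. It is a portable emulation of our own making: the real ADSP code performs one partitioned subtract and reuses the saved per-byte borrow bits, with no per-byte loop.

```c
#include <assert.h>
#include <stdint.h>

/* Threshold four packed 8-bit pixels at once: each output byte is 255
 * if subtracting that pixel from the threshold would borrow (i.e., the
 * pixel exceeds the threshold), else 0. This models what one ADSP does
 * with a single partitioned subtract plus its stored borrow flags. */
static uint32_t threshold4(uint32_t pixels, uint8_t thresh)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t p = (pixels >> (8 * i)) & 0xFF;
        /* borrow in (thresh - p) occurs exactly when p > thresh */
        uint8_t v = (p > thresh) ? 255 : 0;
        out |= (uint32_t)v << (8 * i);
    }
    return out;
}
```

On the ADSP, the four comparisons and the 0/255 selection cost two instructions per packed word, which is where the eight-pixels-per-cycle figure for four ADSPs comes from.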
The course culminates in the projects on image and video computing algorithm implementations. Each student works individually on his/her own project. In the first few weeks of the quarter, each student is asked to identify a suitable image or video computing algorithm to implement and to obtain the instructor's approval. Each student is expected to do independent research to find an interesting and appropriate algorithm; neither existing UWICL library functions nor previous student projects are allowed. Students needing help in selecting a project are encouraged to interact with the instructor.
A written project proposal of about five pages is required from each student during the fifth week. The proposal must cover the theory behind the algorithm, how the algorithm will be mapped efficiently onto the TMS320C80, the expected performance, and the proposed schedule. Each student gives a 15-minute presentation on the proposal to the class. The students then work on their projects until the last day of the final-exam (eleventh) week.
The projects to date have been quite extensive in scope and variety. Some became the subject of further research [21]. A few sample projects are as follows:
Because of the number of hours the students need to spend in the laboratory and the limited laboratory resources, course enrollment has been limited to a maximum of 12 students per quarter. We have found it important that the students be adequately prepared for the course. Thus, we ask prospective students to apply for the course; they are screened based on their prerequisite background and their likelihood of success before they are allowed to register.
Students get excited towards the end of the five laboratories, after obtaining some hands-on experience and seeing even their novice code run much faster than on high-end workstations and PCs. After completing the course, quite a few students have been attracted to research on some of the topics discussed in the course. At the same time, we have found that industry is very eager to recruit these students upon course completion. This is because the course bridges the gap between advanced microprocessors on the one hand and image/video computing algorithms and their real-time implementations on the other, by specifically teaching the students how to analyze and optimally implement various image and video computing algorithms on processors with instruction-level parallelism, which can otherwise be very challenging. Many pieces of equipment and software development tools were donated by industry. Most students who successfully passed the course commented that while it may have been their most "intense" academic experience to date, the unique knowledge, experience, and insight they gained from the course were extremely valuable, and their efforts were worthwhile.
[1] G. W. Donohoe and P. F. Valdez, "Teaching digital image processing with Khoros," IEEE Transactions on Education, vol. 39, pp. 137-142, 1996.
[2] J. F. Arnold and M. C. Cavenor, "A practical course in digital video communications based on MATLAB," IEEE Transactions on Education, vol. 39, pp. 127-136, 1996.
[3] W. E. Toll, "Decision points in the introduction of parallel processing into the undergraduate curriculum," SIGCSE Bulletin, vol. 27, no. 1, pp. 136-140, 1995.
[4] C. H. Nevison, "Parallel computing in the undergraduate curriculum," Computer, vol. 28, no. 12, pp. 51-56, 1995.
[5] R. J. Duckworth, "Introducing parallel processing concepts using the MASPAR MP-1 computer," SIGCSE Bulletin, vol. 26, no. 1, pp. 353-356, 1994.
[6] N. C. Schaller and A. T. Kitchen, "Experiences in teaching parallel computing: five years later," SIGCSE Bulletin, vol. 27, no. 3, pp. 15-20, 1995.
[7] F. C. Berry, "An undergraduate parallel processing laboratory," IEEE Transactions on Education, vol. 38, pp. 306-311, 1995.
[8] V. L. Narasimhan, "A new course on supercomputers and parallel architectures," IEEE Transactions on Education, vol. 38, pp. 340-345, 1995.
[9] B. S. Elenbogen, "Parallel and distributed algorithms laboratory assignments in Joyce/Linda," SIGCSE Bulletin, vol. 28, no. 1, pp. 14-17, 1996.
[10] E. A. Lee, "Computing and signal processing: An experimental multidisciplinary course," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 6, pp. 45-48, 1994.
[11] L. Yang and L. Jin, "Integrating parallel algorithms with parallel machine models," SIGCSE Bulletin, vol. 27, no. 1, pp. 131-135, 1995.
[12] C. Basoglu, W. Lee, and Y. Kim, "An efficient FFT algorithm for superscalar and VLIW processor architectures," Real-Time Imaging, submitted, Feb. 1996.
[13] P. Pirsch, N. Demassieux, and W. Gehrke, "VLSI architectures for video compression: A survey," Proceedings of the IEEE, vol. 83, pp. 220-245, 1995.
[14] W. E. Mattis, "An advanced microprocessor course with a design component," SIGCSE Bulletin, vol. 27, no. 4, pp. 60-64, 1995.
[15] W. Lee, Y. Kim, R. J. Gove, and C. J. Read, "MediaStation 5000: Integrating video and audio," IEEE Multimedia, vol. 1, no. 2, pp. 50-61, 1994.
[16] Y. Kim, J. Kim, C. Basoglu, and T. C. Winter, "Programmable ultrasound imaging using multimedia technologies: a next-generation ultrasound machine," IEEE Transactions on Information Technology in Biomedicine, vol. 1, pp. 19-29, 1997.
[17] D. LeGall, "MPEG: A video compression standard for multimedia applications," Communications of the ACM, vol. 34, no. 4, pp. 46-58, 1991.
[18] S. P. Song, M. Denman, and J. Chang, "The PowerPC 604 RISC microprocessor," IEEE Micro, vol. 14, no. 5, pp. 8-17, 1994.
[19] K. Guttag, R. J. Gove, and J. R. Van Aken, "A single-chip multiprocessor for multimedia: The MVP," IEEE Computer Graphics and Applications, vol. 12, no. 6, pp. 53-64, 1992.
[20] J. Kim and Y. Kim, "UWICL: A multi-layered parallel image computing library for single-chip multiprocessor-based time-critical systems," Real-Time Imaging, vol. 2, pp. 187-199, 1996.
[21] S. D. Pathak, Y. Kim, and J. Kim, "Efficient implementation of facet models on a multimedia system," Optical Engineering, vol. 35, pp. 1739-1745, 1996.
[22] P. Mattson, C. Basoglu, and Y. Kim, "Interactive image morphing on a single-chip microprocessor using a multi-layered parallel image computing library," Real-Time Imaging, in press, 1998.
Chris Basoglu
Equator Technologies, Inc.
520 Pike Street, Suite 900
Seattle, WA 98101
Phone: (206) 812-1290, Ext. 210
Fax: (206) 812-1285
Email: boz@equator.com
Yongmin Kim
Department of Electrical Engineering, Box 352500
University of Washington
Seattle, WA 98195-2500
Phone: (206) 685-2271
Fax: (206) 543-3842
Email: kim@ee.washington.edu
Yongmin Kim (S'79-M'82-SM'87-F'96) is a Professor of Electrical Engineering and an Adjunct Professor of Bioengineering, Radiology, and Computer Science and Engineering at the University of Washington. His research interests are in algorithms and systems for multimedia, image processing, computer graphics, medical imaging, high-performance programmable processor architecture, and modeling and simulation. Dr. Kim holds a B.S. in electronics engineering from Seoul National University, and an M.S. and a Ph.D. in electrical and computer engineering from the University of Wisconsin. He received the Early Career Achievement Award from the IEEE Engineering in Medicine and Biology Society in 1988.
He is a member of the Editorial Board of the IEEE Transactions on Biomedical Engineering, the IEEE Transactions on Information Technology in Biomedicine, the IEEE Press series in Biomedical Engineering, the Journal of Multimedia Tools and Applications, Telemedicine Journal, and Annual Reviews of Biomedical Engineering. His group has filed 29 invention disclosures and 16 patents, and 17 commercial licenses have been signed. He is a Fellow of the IEEE and a Fellow of the American Institute for Medical & Biological Engineering.
Chris Basoglu was born in Istanbul, Turkey, on January 8, 1971. He received a B.S. degree in electrical engineering from the University of Washington, Seattle, WA, and completed his Ph.D. degree in electrical engineering at the same university under the direction of Professor Yongmin Kim in September 1997.
His research interests include optimal algorithm mapping to processor architectures supporting various types of parallelism, ultrasound processing algorithm design, and DSP algorithm design. He has taught both university graduate students and industry professionals image processing algorithms, modern microprocessor architectures, and methods to optimally map performance-critical algorithms.